LISTSERV 16.5 - CODE4LIB Archives

On 4/18/2012 11:09 AM, Doran, Michael D wrote:
> I don't believe that is the case. Take UTF-8 out of the picture, and consider the MARC-8 character set with its escape sequences and combining characters. A character such as an "n" with a tilde would consist of two bytes. The Greek small letter alpha, if invoked in accordance with ANSI X3.41, would consist of five bytes (two bytes for the initial escape sequence, a byte for the character, and then two bytes for the escape sequence returning to the default character set).

ISO 2709 doesn't care how many bytes your characters are. The directory
entries, offsets, and other lengths count bytes, not characters. (Which
was, in my opinion, the _right_ decision, for once with MARC!)

How bytes translate into characters is not a concern of ISO 2709.

Most encodings beyond 7-bit ASCII have chars that take more than one
byte, either sometimes or always. This is true of MARC8 (some chars),
UTF8 (some chars), and UTF16 (all chars). (It is not true of Latin-1,
though, for instance, I don't think.)
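
A quick way to check those byte counts, as a sketch using only Python's
built-in codecs. Python's standard library has no MARC-8 codec, so MARC-8
is left out here; "utf-16-le" is used so the byte-order mark doesn't
inflate the count. The function name encoded_length is just for
illustration.

```python
def encoded_length(ch, encoding):
    """Return how many bytes one character takes in the given encoding."""
    return len(ch.encode(encoding))


n_plain = "n"
n_tilde = "\u00f1"  # precomposed n with tilde
alpha = "\u03b1"    # Greek small letter alpha

# UTF-8: one byte for ASCII, more for everything else.
utf8_lengths = [encoded_length(c, "utf-8") for c in (n_plain, n_tilde, alpha)]

# UTF-16: at least two bytes for every character.
utf16_lengths = [encoded_length(c, "utf-16-le") for c in (n_plain, n_tilde, alpha)]

# Latin-1: always one byte, but it cannot represent alpha at all.
latin1_lengths = [encoded_length(c, "latin-1") for c in (n_plain, n_tilde)]
try:
    alpha.encode("latin-1")
    alpha_in_latin1 = True
except UnicodeEncodeError:
    alpha_in_latin1 = False

print(utf8_lengths, utf16_lengths, latin1_lengths, alpha_in_latin1)
```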

ISO 2709 doesn't care what char encodings you use, and there's no
standard ISO 2709 way to determine what char encodings are used for
_data_ in the MARC record. ISO 2709 does say that _structural_ elements
like field names, subfield names, the directory itself, separator chars,
etc., all need to be (essentially, over-simplifying) 7-bit ASCII. The
actual data itself is application dependent: 2709 doesn't care, and 2709
doesn't give any standard cross-2709 way to determine it.
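
As an example of what "application dependent" looks like in practice:
MARC 21 (one application of ISO 2709) flags the encoding in leader
position 09, where "a" means UCS/Unicode and a blank means MARC-8. The
sketch below assumes a MARC 21 record and treats "a" as UTF-8; other
2709-based formats need not follow this convention, and the function
name is hypothetical.

```python
def marc21_declared_encoding(record):
    """Read the MARC 21 character coding scheme flag (leader position 09).

    This is a MARC 21 convention, not something ISO 2709 itself defines.
    """
    leader = record[:24]  # the leader is always 24 bytes
    if len(leader) < 24:
        raise ValueError("record is shorter than a 24-byte leader")
    flag = leader[9:10]
    if flag == b"a":
        return "utf-8"    # UCS/Unicode
    if flag == b" ":
        return "marc-8"
    return "unknown"      # not a value MARC 21 defines


# Build dummy 24-byte leaders that differ only in position 09.
unicode_leader = bytearray(b" " * 24)
unicode_leader[9:10] = b"a"
marc8_leader = bytearray(b" " * 24)

print(marc21_declared_encoding(bytes(unicode_leader)))  # utf-8
print(marc21_declared_encoding(bytes(marc8_leader)))    # marc-8
```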

That is my conclusion at the moment, helped by all of you in this
thread, thanks!