On 4/18/2012 11:09 AM, Doran, Michael D wrote:
> I don't believe that is the case.  Take UTF-8 out of the picture, and consider the MARC-8 character set with its escape sequences and combining characters.  A character such as an "n" with a tilde would consist of two bytes.  The Greek small letter alpha, if invoked in accordance with ANSI X3.41, would consist of five bytes (two bytes for the initial escape sequence, a byte for the character, and then two bytes for the escape sequence returning to the default character set).

ISO 2709 doesn't care how many bytes your characters take. The directory 
lengths and offsets count bytes, not characters (which was, in my 
opinion, the _right_ decision, for once with MARC!).

How bytes translate into characters is not a concern of ISO 2709.
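To make the byte-counting concrete, here is a rough sketch of building one directory entry. It assumes the usual MARC 21 entry layout (3-character tag, 4-digit field length, 5-digit starting position) and UTF-8 data; `directory_entry` and the sample title are made up for illustration, not taken from any library.

```python
FIELD_TERMINATOR = b"\x1e"
SUBFIELD_DELIMITER = "\x1f"

def directory_entry(tag: str, field_bytes: bytes, start: int) -> str:
    """Hypothetical helper: one 12-character directory entry (MARC 21 style layout)."""
    # The length is len() of the *bytes*, terminator included, never of the decoded text.
    return f"{tag}{len(field_bytes):04d}{start:05d}"

title = "Espa\u00f1ol"  # 7 characters, but 8 bytes in UTF-8
text = "10" + SUBFIELD_DELIMITER + "a" + title  # indicators, delimiter, code, data
field = text.encode("utf-8") + FIELD_TERMINATOR

print(len(text) + 1, "characters incl. terminator")
print(len(field), "bytes incl. terminator")
print(directory_entry("245", field, 0))
```

The character count and the byte count differ by one here, and it is the byte count that ends up in the directory.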

Most encodings beyond 7-bit ASCII have chars that are more than one 
byte, either sometimes or always. This is true of MARC-8 (some chars), 
UTF-8 (some chars), and UTF-16 (all chars). (It is not true of Latin-1, 
though, which is always one byte per char.)
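A quick way to check the per-character byte counts, as a sketch only. Python's standard library has no MARC-8 codec, so that one is left out; `byte_len` is just an illustrative helper, and `utf-16-be` is used so no byte-order mark gets counted.

```python
def byte_len(ch: str, encoding: str) -> int:
    """Number of bytes one character occupies in the given encoding."""
    return len(ch.encode(encoding))

# "n", n with tilde (precomposed U+00F1), Greek small letter alpha (U+03B1)
for ch in ("n", "\u00f1", "\u03b1"):
    for enc in ("utf-8", "utf-16-be", "latin-1"):
        try:
            print(repr(ch), enc, byte_len(ch, enc))
        except UnicodeEncodeError:
            # Latin-1 has no Greek alpha at all
            print(repr(ch), enc, "not representable")
```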

ISO 2709 doesn't care what char encodings you use, and there's no 
standard ISO 2709 way to determine what char encodings are used for 
_data_ in the MARC record. ISO 2709 does say that _structural_ elements 
like field tags, subfield codes, the directory itself, separator chars, 
etc., all need to be (essentially, over-simplifying) 7-bit ASCII. The 
encoding of the actual data is application-dependent: 2709 doesn't care, 
and it doesn't give any standard cross-2709 way to determine it.
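Here is roughly what that split between structure and data looks like when reading a record. This is a sketch under assumptions: it hard-codes the MARC 21 conventions (24-byte leader, base address in leader bytes 12-16, 12-byte directory entries), and both `split_fields` and the hand-built `SAMPLE` record are invented for illustration.

```python
# Hand-built one-field record; the data bytes happen to be UTF-8.
SAMPLE = (
    b"00051nam a2200037 a 4500"      # leader: record length 51, base address 37
    b"245001300000" b"\x1e"          # directory: tag 245, length 13, start 0
    b"10\x1faEspa\xc3\xb1ol" b"\x1e"  # field: indicators, $a, data, terminator
    b"\x1d"                          # record terminator
)

def split_fields(record: bytes) -> list[tuple[str, bytes]]:
    """Hypothetical reader: return (tag, raw field bytes) without decoding the data."""
    # Structural parts are safe to read as ASCII.
    base = int(record[12:17].decode("ascii"))
    directory = record[24:base - 1]  # drop the terminator that ends the directory
    fields = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12].decode("ascii")
        tag, length, start = entry[:3], int(entry[3:7]), int(entry[7:12])
        # Offsets and lengths are in bytes, so slice the bytes directly.
        raw = record[base + start : base + start + length - 1]
        fields.append((tag, raw))
    return fields

for tag, raw in split_fields(SAMPLE):
    # Nothing in the structure says which of these decodings is the right one.
    print(tag, repr(raw.decode("utf-8")), repr(raw.decode("latin-1")))
```

The reader never has to know the data encoding to find the fields; picking the right decoding afterwards is up to the application.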

That is my conclusion at the moment, helped by all of you in this 
thread, thanks!