In practice it seems to mean UTF-8. At least, I've only ever seen UTF-8, and I can't imagine the code that processes this stuff being safe for UTF-16 or UTF-32. All of the offsets are byte-oriented, and there's too much legacy code that makes assumptions about null-terminated strings.
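
To make "byte-oriented" concrete, here is a rough Python sketch, not taken from any real MARC library. It builds a tiny record with the usual 24-byte leader and 12-byte directory entries, then reads it back by slicing bytes. The helper names `build_record` and `parse_record` are made up for illustration, and decoding as UTF-8 when leader byte 09 is `a` is an assumption based on what shows up in practice, not something the leader itself guarantees.

```python
FT = b"\x1e"  # field terminator
RT = b"\x1d"  # record terminator


def build_record(fields):
    """Build a minimal record from (tag, text) pairs (hypothetical helper).

    Field text is encoded as UTF-8; lengths and offsets in the
    directory are counted in bytes, not characters.
    """
    directory = b""
    data = b""
    for tag, text in fields:
        body = text.encode("utf-8") + FT
        # Directory entry: tag (3), field length (4), start offset (5)
        directory += f"{tag}{len(body):04d}{len(data):05d}".encode("ascii")
        data += body
    base = 24 + len(directory) + 1       # leader + directory + terminator
    total = base + len(data) + 1         # + record terminator
    # Leader byte 09 is set to 'a' (UCS/Unicode)
    leader = f"{total:05d}nam a22{base:05d} a 4500".encode("ascii")
    return leader + directory + FT + data + RT


def parse_record(raw):
    """Read leader byte 09 and pull fields out by byte offset."""
    coding = chr(raw[9])                 # 'a' = UCS/Unicode, ' ' = MARC-8
    base = int(raw[12:17])               # base address of data
    directory = raw[24:base - 1]         # drop the directory terminator
    fields = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag = entry[0:3].decode("ascii")
        length = int(entry[3:7])
        start = int(entry[7:12])
        chunk = raw[base + start:base + start + length - 1]
        # Assumption: 'a' in leader 09 means UTF-8 in practice
        fields.append((tag, chunk.decode("utf-8")))
    return coding, fields


sample = [("001", "ocm12345"), ("245", "10\x1faCaf\u00e9 r\u00e9sum\u00e9")]
record = build_record(sample)
coding, parsed = parse_record(record)
```

The accented characters each take two bytes in UTF-8, so the directory length for the 245 field is larger than its character count. Code that decoded the whole record first and then sliced by character position would land in the wrong place for any field that follows non-ASCII text.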

-Tod

On Apr 17, 2012, at 6:55 PM, Jonathan Rochkind wrote:

> Okay, forget XML for a moment, let's just look at marc 'binary'.
> 
> First, for Anglophone-centric MARC21.
> 
> The LC docs don't actually say quite what I thought about leader byte 09, used to advertise encoding:
> 
> 
> a - UCS/Unicode
> Character coding in the record makes use of characters from the Universal Coded Character Set (UCS) (ISO 10646), or Unicode™, an industry subset.
> 
> 
> 
> That doesn't say UTF-8. It says UCS or "Unicode". What does that actually mean?  Does it mean UTF-8, or does it mean UTF-16 (closer to what used to be called "UCS" I think?).  Whatever it actually means, do people violate it in the wild?
> 
> 
> 
> Now we get to non-Anglophone-centric MARC, all of which I think is ISO 2709?  That's a standard which of course is not open access, so I can't get a copy to see what it says.
> 
> But leader 09 being used for encoding -- is that Marc21 specific, or is it true of any ISO-2709?  Marc8 and "unicode" being the only valid encodings can't be true of any ISO-2709, right?
> 
> Is there a generic ISO-2709 way to deal with this, or not so much?