> -----Original Message-----
> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
> Jonathan Rochkind
> Sent: Tuesday, April 17, 2012 19:55
> To: [log in to unmask]
> Subject: [CODE4LIB] more on MARC char encoding: Now we're about
> ISO_2709 and MARC21
>
> Okay, forget XML for a moment, let's just look at marc 'binary'.
>
> First, for Anglophone-centric MARC21.
>
> The LC docs don't actually say quite what I thought about leader byte
> 09, used to advertise encoding:
>
>
> a - UCS/Unicode
> Character coding in the record makes use of characters from the
> Universal Coded Character Set (UCS) (ISO 10646), or Unicode™, an
> industry subset.
>
>
> That doesn't say UTF-8. It says UCS or "Unicode". What does that
> actually mean? Does it mean UTF-8, or does it mean UTF-16 (closer to
> what used to be called "UCS" I think?). Whatever it actually means, do
> people violate it in the wild?
>

First, UCS and Unicode basically mean the same thing. Second, UTF-8,
UTF-16, and UTF-32 are encoding forms for UCS/Unicode. The MARC
documentation does actually say that MARC binary records *must* be
encoded in UTF-8 when LDR/09 has the value 'a'. You need to refer to the
appropriate standards for this information and these definitions:

<http://www.loc.gov/marc/specifications/speccharucs.html#implementation>
Unicode specifies three encoding forms, of which only one, UTF-8 (UCS
Transformation Format 8), is authorized for use in MARC 21 records.

<http://www.unicode.org/glossary/#UCS>
UCS. Acronym for Universal Character Set, which is specified by
International Standard ISO/IEC 10646, which is equivalent in repertoire
to the Unicode Standard.

<http://www.unicode.org/glossary/#unicode_encoding_form>
Unicode Encoding Form. A character encoding form that assigns each
Unicode scalar value to a unique code unit sequence. The Unicode
Standard defines three Unicode encoding forms: UTF-8, UTF-16, and
UTF-32. (See definition D79 in Section 3.9, Unicode Encoding Forms.)

<http://www.unicode.org/glossary/#UTF_8>
UTF-8. A multibyte encoding for text that represents each Unicode
character with 1 to 4 bytes, and which is backward-compatible with
ASCII. UTF-8 is the predominant form of Unicode in web pages. More
technically: (1) The UTF-8 encoding form. (2) The UTF-8 encoding scheme.
(3) “UCS Transformation Format 8,” defined in Annex D of ISO/IEC
10646:2003, technically equivalent to the definitions in the Unicode
Standard.

<http://www.unicode.org/glossary/#UTF_16>
UTF-16. A multibyte encoding for text that represents each Unicode
character with 2 or 4 bytes; it is not backward-compatible with ASCII.
It is the internal form of Unicode in many programming languages, such
as Java, C#, and JavaScript, and in many operating systems. More
technically: (1) The UTF-16 encoding form. (2) The UTF-16 encoding
scheme. (3) “Transformation format for 16 planes of Group 00,” defined
in Annex C of ISO/IEC 10646:2003; technically equivalent to the
definitions in the Unicode Standard.

Andy
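
P.S. On the "do people violate it in the wild?" question, one way to
find out is to test records directly. Below is a minimal Python sketch,
not a definitive implementation. It assumes a single ISO 2709 record
held in memory as bytes, and the function names declared_encoding and
violates_declared_utf8 are hypothetical, made up for illustration. It
only reads LDR/09 and checks whether a record flagged 'a' is well-formed
UTF-8; it does not attempt to decode MARC-8.

```python
LEADER_LENGTH = 24  # the MARC 21 / ISO 2709 leader is a fixed 24 bytes


def declared_encoding(record: bytes) -> str:
    """Report the encoding that LDR/09 advertises (hypothetical helper)."""
    if len(record) < LEADER_LENGTH:
        raise ValueError("record is shorter than the 24-byte leader")
    flag = record[9:10]  # leader position 09, zero-based
    if flag == b"a":
        return "utf-8"   # 'a' = UCS/Unicode, which MARC 21 restricts to UTF-8
    if flag == b" ":
        return "marc-8"  # blank = MARC-8
    return "unknown"     # any other value is not defined for LDR/09


def violates_declared_utf8(record: bytes) -> bool:
    """True if LDR/09 is 'a' but the record is not well-formed UTF-8."""
    if declared_encoding(record) != "utf-8":
        return False
    try:
        # The leader, directory, and field delimiters are ASCII, so
        # strictly decoding the whole record is a valid check.
        record.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False
```

A record that passes this check is only known to be well-formed UTF-8.
Text that was wrongly converted before being stored (double encoding,
for example) would still decode cleanly, so this catches one class of
violation, not all of them.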