LISTSERV 16.5 - CODE4LIB Archives

> -----Original Message-----
> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
> Jonathan Rochkind
> Sent: Tuesday, April 17, 2012 19:55
> To: [log in to unmask]
> Subject: [CODE4LIB] more on MARC char encoding: Now we're about
> ISO_2709 and MARC21
> 
> Okay, forget XML for a moment, let's just look at marc 'binary'.
> 
> First, for Anglophone-centric MARC21.
> 
> The LC docs don't actually say quite what I thought about leader byte
> 09, used to advertise encoding:
> 
> 
> a - UCS/Unicode
> Character coding in the record makes use of characters from the
> Universal Coded Character Set (UCS) (ISO 10646), or Unicode™, an
> industry subset.
> 
> 
> 
> That doesn't say UTF-8. It says UCS or "Unicode". What does that
> actually mean?  Does it mean UTF-8, or does it mean UTF-16 (closer to
> what used to be called "UCS" I think?).  Whatever it actually means, do
> people violate it in the wild?
> 
First, UCS and Unicode mean basically the same thing. Second, UTF-8, UTF-16, and UTF-32 are encoding forms for UCS/Unicode. The MARC documentation does actually say that MARC binary records *must* be encoded in UTF-8 when LDR/09 has the value 'a'.

You need to refer to the appropriate standards for this information and definitions:

<http://www.loc.gov/marc/specifications/speccharucs.html#implementation>
Unicode specifies three encoding forms, of which only one, UTF-8 (UCS Transformation Format 8), is authorized for use in MARC 21 records.
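
To make the LDR/09 rule concrete, here is a minimal Python sketch of how one might check whether a binary record that claims 'a' really decodes as UTF-8. The function names are made up for illustration, and the check is deliberately crude: it validates the whole record as bytes rather than parsing the directory and fields, so treat it as a rough screen for "violations in the wild", not a full MARC parser.

```python
LEADER_LENGTH = 24  # a MARC 21 leader is always 24 bytes


def declared_encoding(record: bytes) -> str:
    """Map leader byte 09 to the encoding it advertises (hypothetical helper)."""
    if len(record) < LEADER_LENGTH:
        raise ValueError("record is shorter than a 24-byte MARC leader")
    flag = record[9:10]
    if flag == b"a":
        return "utf-8"   # LDR/09 = 'a': UCS/Unicode, which MARC 21 restricts to UTF-8
    if flag == b" ":
        return "marc-8"  # LDR/09 = blank: MARC-8
    raise ValueError(f"unexpected LDR/09 value: {flag!r}")


def honors_utf8_claim(record: bytes) -> bool:
    """True if the record says 'a' in LDR/09 and its bytes are valid UTF-8."""
    if declared_encoding(record) != "utf-8":
        return False
    try:
        record.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False
```

A record whose leader says 'a' but whose data contains a byte sequence that is not valid UTF-8 (stray Latin-1 or MARC-8 bytes, for example) would make honors_utf8_claim return False.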

<http://www.unicode.org/glossary/#UCS>
UCS. Acronym for Universal Character Set, which is specified by International Standard ISO/IEC 10646, which is equivalent in repertoire to the Unicode Standard.

<http://www.unicode.org/glossary/#unicode_encoding_form>
Unicode Encoding Form. A character encoding form that assigns each Unicode scalar value to a unique code unit sequence. The Unicode Standard defines three Unicode encoding forms: UTF-8, UTF-16, and UTF-32. (See definition D79 in Section 3.9, Unicode Encoding Forms.)

<http://www.unicode.org/glossary/#UTF_8>
UTF-8. A multibyte encoding for text that represents each Unicode character with 1 to 4 bytes, and which is backward-compatible with ASCII. UTF-8 is the predominant form of Unicode in web pages. More technically: (1) The UTF-8 encoding form. (2) The UTF-8 encoding scheme. (3) “UCS Transformation Format 8,” defined in Annex D of ISO/IEC 10646:2003, technically equivalent to the definitions in the Unicode Standard.

<http://www.unicode.org/glossary/#UTF_16>
UTF-16. A multibyte encoding for text that represents each Unicode character with 2 or 4 bytes; it is not backward-compatible with ASCII. It is the internal form of Unicode in many programming languages, such as Java, C#, and JavaScript, and in many operating systems. More technically: (1) The UTF-16 encoding form. (2) The UTF-16 encoding scheme. (3) “Transformation format for 16 planes of Group 00,” defined in Annex C of ISO/IEC 10646:2003; technically equivalent to the definitions in the Unicode Standard.
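
The practical difference between the two encoding forms is easy to see by encoding a few characters both ways. This is only an illustrative Python sketch; the helper name is invented, and "utf-16-be" is chosen here simply to avoid a byte order mark in the output.

```python
def encoded_lengths(ch: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))


samples = {
    "A": "LATIN CAPITAL LETTER A (U+0041)",
    "\u00e9": "LATIN SMALL LETTER E WITH ACUTE (U+00E9)",
    "\u20ac": "EURO SIGN (U+20AC)",
    "\U0001D11E": "MUSICAL SYMBOL G CLEF (U+1D11E)",
}

for ch, name in samples.items():
    utf8_len, utf16_len = encoded_lengths(ch)
    print(f"{name}: UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")

# ASCII text is byte-for-byte identical in UTF-8 but not in UTF-16.
print("A".encode("utf-8") == b"A")         # True
print("A".encode("utf-16-be") == b"\x00A")  # True
```

The same code point takes 1 to 4 bytes in UTF-8 and 2 or 4 bytes in UTF-16, which is why software has to know which encoding form a record uses before it can read the data correctly.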

Andy