> Jonathan Rochkind
> Sent: Tuesday, April 17, 2012 19:55
> Subject: [CODE4LIB] more on MARC char encoding: Now we're about
> ISO_2709 and MARC21
> The LC docs don't actually say quite what I thought about leader byte
> 09, used to advertise encoding:
> a - UCS/Unicode
> Character coding in the record makes use of characters from the
> Universal Coded Character Set (UCS) (ISO 10646), or Unicode™, an
> industry subset.
> That doesn't say UTF-8. It says UCS or "Unicode". What does that
> actually mean? Does it mean UTF-8, or does it mean UTF-16 (closer to
> what used to be called "UCS" I think?). Whatever it actually means, do
> people violate it in the wild?
First, UCS and Unicode basically mean the same thing. Second, UTF-8, UTF-16, and UTF-32 are encoding forms for UCS/Unicode. The MARC documentation does actually say that MARC binary records *must* be encoded in UTF-8 when LDR/09 has the value 'a'.
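As a rough illustration of how you might check this in the wild, here is a minimal Python sketch. It assumes you already have one raw ISO 2709 record as a bytes object. The function names (`declared_encoding`, `is_valid_utf8`) are made up for this example and are not part of any MARC library. It only looks at leader position 09 and tests whether the bytes decode as UTF-8; it does not attempt MARC-8 conversion.

```python
LEADER_LENGTH = 24  # the MARC 21 leader is the first 24 bytes of a record


def declared_encoding(record: bytes) -> str:
    """Return the encoding advertised by leader position 09 (hypothetical helper)."""
    if len(record) < LEADER_LENGTH:
        raise ValueError("record is shorter than the 24-byte MARC leader")
    flag = record[9:10]
    if flag == b"a":
        return "utf-8"    # 'a' = UCS/Unicode, which for MARC 21 means UTF-8
    if flag == b" ":
        return "marc-8"   # blank = MARC-8
    return "unknown"      # any other value is not defined for this position


def is_valid_utf8(record: bytes) -> bool:
    """True if the whole record decodes cleanly as UTF-8."""
    try:
        record.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def violates_leader(record: bytes) -> bool:
    """True if LDR/09 says 'a' but the bytes are not valid UTF-8."""
    return declared_encoding(record) == "utf-8" and not is_valid_utf8(record)
```

Note that a record passing `is_valid_utf8` is not proof that it was meant as UTF-8 (pure ASCII passes too), so this only catches the clear-cut violations.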
You need to refer to the appropriate standards for this information and the definitions:
Unicode specifies three encoding forms, of which only one, UTF-8 (UCS Transformation Format 8), is authorized for use in MARC 21 records.
UCS. Acronym for Universal Character Set, which is specified by International Standard ISO/IEC 10646, which is equivalent in repertoire to the Unicode Standard.
Unicode Encoding Form. A character encoding form that assigns each Unicode scalar value to a unique code unit sequence. The Unicode Standard defines three Unicode encoding forms: UTF-8, UTF-16, and UTF-32. (See definition D79 in Section 3.9, Unicode Encoding Forms.)
UTF-8. A multibyte encoding for text that represents each Unicode character with 1 to 4 bytes, and which is backward-compatible with ASCII. UTF-8 is the predominant form of Unicode in web pages. More technically: (1) The UTF-8 encoding form. (2) The UTF-8 encoding scheme. (3) “UCS Transformation Format 8,” defined in Annex D of ISO/IEC 10646:2003, technically equivalent to the definitions in the Unicode Standard.
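To make the definitions above concrete, here is a small Python sketch showing the same character in each of the three encoding forms, and the 1 to 4 byte range of UTF-8. The helper name `encoding_forms` and the sample characters are my own choices for illustration. The big-endian codec variants are used only so that no byte order mark is added to the output.

```python
def encoding_forms(ch: str) -> dict:
    """Return the hex bytes of one character in UTF-8, UTF-16, and UTF-32."""
    return {
        "utf-8": ch.encode("utf-8").hex(),
        "utf-16": ch.encode("utf-16-be").hex(),
        "utf-32": ch.encode("utf-32-be").hex(),
    }


# One sample per UTF-8 length: 1, 2, 3, and 4 bytes.
samples = ["A", "\u00e9", "\u20ac", "\U0001d11e"]

for ch in samples:
    forms = encoding_forms(ch)
    utf8_len = len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X}  utf-8 bytes: {utf8_len}  {forms}")

# ASCII text is byte-for-byte identical in UTF-8.
assert "MARC 21".encode("utf-8") == "MARC 21".encode("ascii")
```

The same code point gets a different byte sequence in each form, which is why "UCS/Unicode" alone does not tell you what is on disk, and why the MARC 21 specification has to name UTF-8 explicitly.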