LISTSERV 16.5 - CODE4LIB Archives

On 4/18/2012 10:33 AM, Karen Coyle wrote:
> UTF-8 was the MARC standard from the beginning:
>
> http://www.loc.gov/marc/marbi/1998/98-18.html

Thank you Karen!

Who wants to try to get LC to update the docs at:

http://www.loc.gov/marc/bibliographic/bdleader.html
and
http://www.loc.gov/marc/bibliographic/concise/bdleader.html

accordingly?  They just say "UCS/Unicode", which is vague, and even 
implies the legacy "UCS" encoding (which is a backwards-compatible 
version of what became UTF-16) instead of UTF-8.

Standards documentation, treat them like they matter if you want them to 
matter!

Jonathan

>
> The first proposals were a character mapping between Unicode and MARC-8
> and didn't mention the character encodings, thus the term "UCS" which
> was a common term for Unicode at that time. (see:
> http://www.loc.gov/marc/marbi/1996/96-10.html). But when it got down to
> brass tacks, it was UTF-8, and left open the possibility of UTF-16
> (which was still a viable rival to UTF-8 at the time, as I recall.)
> UTF-16 had the advantage of every character being of uniform length, but
> it also did not cover all of the characters of interest to libraries.
>
> The decision was also made to use byte count rather than character count
> in the directory. This was influenced by the UTF-8 decision.
>
> kc
>
> On 4/18/12 7:04 AM, Jonathan Rochkind wrote:
>> On 4/18/2012 6:04 AM, Tod Olson wrote:
>>> It has to mean UTF-8. ISO 2709 is very byte-oriented, from the
>>> directory structure to the byte-offsets in the fixed fields. The
>>> values in these places all assume 8-bit character data, it's
>>> completely baked in to the file format.
>>
>> I'm not sure that follows. One could certainly have UTF-16 in a MARC
>> record, and still count bytes to get a directory structure and byte
>> offsets. (In some ways it'd be easier since every char would be two
>> bytes).
>>
>> In fact, I worry that the standard may pre-date UTF-8, with its
>> reference to "UCS" --- if I understand things right, at one point there
>> was only one Unicode encoding, called "UCS", which is basically a
>> backwards-compatible subset of what became UTF-16.
>>
>> So I worry the standard really "means" UCS/UTF-16.
>>
>> But if in fact records in the wild with the 'u' value are far more
>> likely to be UTF-8... well it's certainly not the first time the MARC21
>> standard was useless/ignored as a standard in answering such questions.
>
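
To illustrate the byte-count versus character-count point discussed in this
thread, here is a minimal Python sketch. It rests on assumptions that are not
in the messages above: the heading is a made-up sample, directory_entry is a
hypothetical helper, and the field length counts only the data plus a field
terminator (a real MARC field also carries indicators and subfield
delimiters). It is an illustration, not a MARC writer.

```python
# Made-up sample heading (assumption, not from any real record).
# Escapes are used so the string is unambiguously precomposed.
heading = "Dvo\u0159\u00e1k, Anton\u00edn"

FIELD_TERMINATOR = b"\x1e"  # ISO 2709 field terminator

char_count = len(heading)                        # Unicode code points
utf8_field = heading.encode("utf-8") + FIELD_TERMINATOR
utf16_field = heading.encode("utf-16-be") + FIELD_TERMINATOR


def directory_entry(tag: str, field_bytes: bytes, start: int) -> str:
    """Hypothetical helper: build a 12-character directory entry as
    3-character tag + 4-digit field length + 5-digit start position,
    with the length counted in bytes, not characters."""
    return f"{tag}{len(field_bytes):04d}{start:05d}"


entry = directory_entry("100", utf8_field, 0)

# A character outside the Basic Multilingual Plane takes two 16-bit
# code units in UTF-16, so UTF-16 is not fixed-width either.
supplementary = "\U00020000"
supplementary_utf16_len = len(supplementary.encode("utf-16-be"))

print(char_count, len(utf8_field), len(utf16_field))
print(entry)
print(supplementary_utf16_len)
```

Counting bytes works for either encoding, as noted above, but the byte count
and the character count diverge as soon as the data leaves ASCII, which is why
the directory has to commit to one of the two.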