On 4/18/2012 10:33 AM, Karen Coyle wrote:
> UTF-8 was the MARC standard from the beginning:
>
> http://www.loc.gov/marc/marbi/1998/98-18.html

Thank you Karen!

Who wants to try to get LC to update the docs at:

http://www.loc.gov/marc/bibliographic/bdleader.html
and
http://www.loc.gov/marc/bibliographic/concise/bdleader.html

accordingly? They just say "UCS/Unicode", which is vague, and even
implies the legacy "UCS" encoding (which is a backwards-compatible
version of what became UTF-16) instead of UTF-8.

Standards documentation: treat it like it matters if you want it to
matter!

Jonathan

>
> The first proposals were a character mapping between Unicode and MARC-8
> and didn't mention the character encodings, thus the term "UCS" which
> was a common term for Unicode at that time. (see:
> http://www.loc.gov/marc/marbi/1996/96-10.html). But when it got down to
> brass tacks, it was UTF-8, and left open the possibility of UTF-16
> (which was still a viable rival to UTF-8 at the time, as I recall.)
> UTF-16 had the advantage of every character being of uniform length, but
> it also did not cover all of the characters of interest to libraries.
>
> The decision was also made to use byte count rather than character count
> in the directory. This was influenced by the UTF-8 decision.
>
> kc
>
> On 4/18/12 7:04 AM, Jonathan Rochkind wrote:
>> On 4/18/2012 6:04 AM, Tod Olson wrote:
>>> It has to mean UTF-8. ISO 2709 is very byte-oriented, from the
>>> directory structure to the byte-offsets in the fixed fields. The
>>> values in these places all assume 8-bit character data, it's
>>> completely baked in to the file format.
>>
>> I'm not sure that follows. One could certainly have UTF-16 in a MARC
>> record, and still count bytes to get a directory structure and byte
>> offsets. (In some ways it'd be easier since every char would be two
>> bytes).
>>
>> In fact, I worry that the standard may pre-date UTF-8, with its
>> reference to "UCS" --- if I understand things right, at one point there
>> was only one Unicode encoding, called "UCS", which is basically a
>> backwards-compatible subset of what became UTF-16.
>>
>> So I worry the standard really "means" UCS/UTF-16.
>>
>> But if in fact records in the wild with the 'u' value are far more
>> likely to be UTF-8... well it's certainly not the first time the MARC21
>> standard was useless/ignored as a standard in answering such questions.
>
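
A footnote on the byte-count versus character-count point discussed in this thread: the difference is easy to see with a few lines of Python. This is only an illustrative sketch, not anything taken from the MARC documentation; the helper name `field_lengths` and the sample strings are made up for the example.

```python
# Illustrative sketch: compare character count with encoded byte count.
# `field_lengths` and the sample strings are hypothetical, not from any MARC spec.

def field_lengths(text):
    """Return the code point count and the encoded size of `text` in bytes."""
    return {
        "chars": len(text),                             # Unicode code points
        "utf8_bytes": len(text.encode("utf-8")),        # variable width, 1-4 bytes each
        "utf16_bytes": len(text.encode("utf-16-be")),   # big-endian, no byte order mark
    }

# A name with two non-ASCII letters (Z and z with caron, written as escapes).
name = "\u017di\u017eek, Slavoj"
print(field_lengths(name))

# A character outside the Basic Multilingual Plane (musical G clef).
# UTF-16 needs a surrogate pair here, so it is not strictly two bytes per
# character; only the older UCS-2 form is fixed-width.
clef = "\U0001D11E"
print(field_lengths(clef))
```

For the first string the character count and the UTF-8 byte count differ, which is why a byte-oriented directory has to be computed from the encoded bytes rather than from the string length, whichever encoding is used.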