LISTSERV 16.5 - CODE4LIB Archives

UTF-8 was the MARC standard from the beginning:

http://www.loc.gov/marc/marbi/1998/98-18.html

The first proposals defined a character mapping between Unicode and MARC-8 
and didn't mention character encodings, hence the term "UCS", which 
was a common term for Unicode at that time (see: 
http://www.loc.gov/marc/marbi/1996/96-10.html). But when it got down to 
brass tacks, the choice was UTF-8, with the possibility of UTF-16 left open 
(UTF-16 was still a viable rival to UTF-8 at the time, as I recall). 
UTF-16 had the advantage of every character being of uniform length, but 
it also did not cover all of the characters of interest to libraries.

The decision was also made to use byte count rather than character count 
in the directory. This was influenced by the UTF-8 decision.
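
To make the byte-versus-character distinction concrete, here is a minimal Python sketch. It assumes the usual 12-character directory entry layout (3-character tag, 4-digit field length, 5-digit starting position); the function name directory_entry and the sample title field are hypothetical, chosen only for illustration, and this is not taken from any particular MARC library.

```python
FIELD_TERMINATOR = "\x1e"   # MARC field terminator
SUBFIELD_DELIMITER = "\x1f" # MARC subfield delimiter


def directory_entry(tag: str, field_text: str, start: int) -> str:
    """Build a directory entry whose length is counted in UTF-8 bytes.

    Hypothetical helper: tag (3 chars) + field length (4 digits)
    + starting position (5 digits).
    """
    encoded = (field_text + FIELD_TERMINATOR).encode("utf-8")
    return f"{tag}{len(encoded):04d}{start:05d}"


# Hypothetical sample field: two blank indicators, subfield $a, a title
# containing one non-ASCII character.
field = "  " + SUBFIELD_DELIMITER + "a" + "Études"

char_count = len(field + FIELD_TERMINATOR)
byte_count = len((field + FIELD_TERMINATOR).encode("utf-8"))

print(char_count)  # number of characters, including the terminator
print(byte_count)  # number of UTF-8 bytes; larger because of the accented letter
print(directory_entry("245", field, 0))
```

The two counts differ as soon as the field holds any character outside ASCII, so software that computes directory lengths from character counts will produce offsets that drift for UTF-8 records.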

kc

On 4/18/12 7:04 AM, Jonathan Rochkind wrote:
> On 4/18/2012 6:04 AM, Tod Olson wrote:
>> It has to mean UTF-8. ISO 2709 is very byte-oriented, from the
>> directory structure to the byte-offsets in the fixed fields. The
>> values in these places all assume 8-bit character data, it's
>> completely baked in to the file format.
>
> I'm not sure that follows. One could certainly have UTF-16 in a Marc
> record, and still count bytes to get a directory structure and byte
> offsets. (In some ways it'd be easier since every char would be two bytes).
>
> In fact, I worry that the standard may pre-date UTF-8, with its
> reference to "UCS" --- if I understand things right, at one point there
> was only one Unicode encoding, called "UCS", which is basically a
> backwards-compatible subset of what became UTF-16.
>
> So I worry the standard really "means" UCS/UTF-16.
>
> But if in fact records in the wild with the 'u' value are far more
> likely to be UTF-8... well it's certainly not the first time the MARC21
> standard was useless/ignored as a standard in answering such questions.

-- 
Karen Coyle
http://kcoyle.net
ph: 1-510-540-7596
m: 1-510-435-8234
skype: kcoylenet