When MARC was created, characters and bytes were 1-to-1 because MARC
used only ASCII, so there was no distinction at the outset. Some
positions are still limited to ASCII characters (Leader, fixed fields,
subfield codes, etc.).
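
To make the character/byte distinction concrete, here is a minimal
Python sketch. The function names and the sample leader string are made
up for illustration, and it assumes (as I understand MARC21 practice for
UTF-8 records) that lengths in the directory are counted in bytes rather
than characters:

```python
def char_and_byte_lengths(value: str) -> tuple:
    """Return (character count, UTF-8 byte count) for a field value."""
    return len(value), len(value.encode("utf-8"))


def leader_encoding(leader: str) -> str:
    """Interpret MARC21 leader position 09 (index 9, counting from 0)."""
    code = leader[9]
    if code == "a":
        return "UCS/Unicode"
    if code == " ":
        return "MARC-8"
    return "unknown"


# Pure ASCII: one byte per character, so the two counts agree.
print(char_and_byte_lengths("Moby Dick"))            # (9, 9)

# One precomposed e-acute (U+00E9) takes two bytes in UTF-8.
print(char_and_byte_lengths("Les Mis\u00e9rables"))  # (14, 15)

# Hypothetical 24-character leader with 'a' in position 09.
print(leader_encoding("00714cam a2200205 a 4500"))   # UCS/Unicode
```

For ASCII-only data the two counts are identical, which is why the
question never came up originally.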
kc
On 4/18/12 7:20 AM, Huwig,Steve wrote:
> I could be mistaken (never having had the pleasure of reading it), but
> isn't ISO-2709 specified as a fixed number of characters, and any
> conflation of characters and 8-bit bytes is on the part of users and
> implementations?
>
> I think ISO 2709 might not know from bytes, only characters.
>
>> -----Original Message-----
>> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
>> Doran, Michael D
>> Sent: Wednesday, April 18, 2012 10:05 AM
>> To: [log in to unmask]
>> Subject: Re: [CODE4LIB] more on MARC char encoding: Now we're about
>> ISO_2709 and MARC21
>>
>> Hi Tod,
>>
>> I'm not understanding how UTF-8 would be considered 8-bit character
>> data (other than the ASCII-range of the Unicode repertoire, natch). I
>> don't think ISO 2709 knows from characters, only bytes.
>>
>> -- Michael
>>
>> # Michael Doran, Systems Librarian
>> # University of Texas at Arlington
>> # 817-272-5326 office
>> # 817-688-1926 mobile
>> # [log in to unmask]
>> # http://rocky.uta.edu/doran/
>>
>>
>>> -----Original Message-----
>>> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
>>> Tod Olson
>>> Sent: Wednesday, April 18, 2012 5:04 AM
>>> To: [log in to unmask]
>>> Subject: Re: [CODE4LIB] more on MARC char encoding: Now we're about
>>> ISO_2709 and MARC21
>>>
>>> It has to mean UTF-8. ISO 2709 is very byte-oriented, from the directory
>>> structure to the byte-offsets in the fixed fields. The values in these
>>> places all assume 8-bit character data, it's completely baked in to the
>>> file format.
>>>
>>> -Tod
>>>
>>> On Apr 17, 2012, at 6:55 PM, Jonathan Rochkind wrote:
>>>
>>>> Okay, forget XML for a moment, let's just look at marc 'binary'.
>>>>
>>>> First, for Anglophone-centric MARC21.
>>>>
>>>> The LC docs don't actually say quite what I thought about leader byte
>>>> 09, used to advertise encoding:
>>>>
>>>>
>>>> a - UCS/Unicode
>>>> Character coding in the record makes use of characters from the
>>>> Universal Coded Character Set (UCS) (ISO 10646), or Unicode(tm), an
>>>> industry subset.
>>>>
>>>>
>>>>
>>>> That doesn't say UTF-8. It says UCS or "Unicode". What does that
>>>> actually mean? Does it mean UTF-8, or does it mean UTF-16 (closer to
>>>> what used to be called "UCS" I think?). Whatever it actually means, do
>>>> people violate it in the wild?
>>>>
>>>>
>>>>
>>>> Now we get to non-Anglophone centric marc. I think all of which is
>>>> ISO_2709? A standard which of course is not open access, so I can't get
>>>> it to see what it says.
>>>>
>>>> But leader 09 being used for encoding -- is that Marc21 specific, or is
>>>> it true of any ISO-2709? Marc8 and "unicode" being the only valid
>>>> encodings can't be true of any ISO-2709, right?
>>>>
>>>> Is there a generic ISO-2709 way to deal with this, or not so much?
--
Karen Coyle
[log in to unmask] http://kcoyle.net
ph: 1-510-540-7596
m: 1-510-435-8234
skype: kcoylenet