On 4/18/2012 12:08 PM, Jonathan Rochkind wrote:
> On 4/18/2012 11:09 AM, Doran, Michael D wrote:
>> I don't believe that is the case.  Take UTF-8 out of the picture, and 
>> consider the MARC-8 character set with its escape sequences and 
>> combining characters.  A character such as an "n" with a tilde would 
>> consist of two bytes.  The Greek small letter alpha, if invoked in 
>> accordance with ANSI X3.41, would consist of five bytes (two bytes 
>> for the initial escape sequence, a byte for the character, and then 
>> two bytes for the escape sequence returning to the default character 
>> set).
>
> ISO 2709 doesn't care how many bytes your characters are. The 
> directory and offsets and other things count bytes, not characters. 
> (which was, in my opinion, the _right_ decision, for once with marc!)
>
> How bytes translate into characters is not a concern of ISO 2709.
>
> The majority of non-7-bit-ASCII encodings will have chars that are 
> more than one byte, either sometimes or always. This is true of MARC8 
> (some chars), UTF8 (some chars), and UTF16 (all chars), all of them. 
> (It is not true of Latin-1 though, for instance, I don't think).
>
> ISO 2709 doesn't care what char encodings you use, and there's no 
> standard ISO 2709 way to determine what char encodings are used for 
> _data_ in the MARC record. ISO 2709 does say that _structural_ 
> elements like field names, subfield names, the directory itself, 
> separator chars, etc, all need to be (essentially, over-simplifying) 
> 7-bit-ASCII. The actual data itself is application dependent, 2709 
> doesn't care, and 2709 doesn't give any standard cross-2709 way to 
> determine it.
>
> That is my conclusion at the moment, helped by all of you all in this 
> thread, thanks!
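
(To make Michael's byte counts concrete, here is what those sequences
look like as Java byte arrays. The exact code points -- 0xE4 for the
ANSEL combining tilde, ESC 'g' and ESC 's' to enter and leave the Greek
symbols set -- are from my reading of the MARC-8 tables, so treat them
as illustrative rather than authoritative.)

    public class Marc8Bytes {
        public static void main(String[] args) {
            // n with tilde: in MARC-8 the ANSEL combining tilde (0xE4)
            // precedes the base letter, so one character = two bytes.
            byte[] nTilde = { (byte) 0xE4, 'n' };

            // Greek small letter alpha via the ANSI X3.41 escape
            // technique: ESC 'g' invokes the Greek symbols set, 0x61
            // is alpha there, and ESC 's' returns to the default Basic
            // Latin set -- five bytes for one character.
            byte[] alpha = { 0x1B, 'g', 0x61, 0x1B, 's' };

            // ISO 2709 directory entries record *byte* lengths and
            // offsets, so these count as 2 and 5, not 1 and 1.
            System.out.println("n-tilde: " + nTilde.length + " bytes");
            System.out.println("alpha:   " + alpha.length + " bytes");
        }
    }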

The conclusion that I came to in the work I have done on marc4j (which 
is used heavily by SolrMarc) is that for any significant processing of 
MARC records, the only solution that makes sense is to translate the 
record data into Unicode characters as it is being read in. Of course, 
as you and others have stated, determining what the data actually is, 
in order to correctly translate it to Unicode, is no easy task. The 
leader byte that merely indicates "is UTF-8" or "is not UTF-8" is wrong 
often enough in the real world that it is of little value when it 
indicates "is UTF-8", and of even less value when it indicates "is not 
UTF-8".

Significant portions of the code I've added to marc4j deal with trying 
to determine what the encoding of the data actually is, and with 
translating it correctly into Unicode even when the data is incorrectly 
encoded.
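
To give the flavor of those heuristics (a much-simplified sketch, not
the actual marc4j logic): look at what leader position 9 claims, then
sanity-check the claim against the bytes themselves.

    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.Charset;
    import java.nio.charset.CodingErrorAction;

    public class GuessEncoding {
        // In MARC 21, leader position 9 is 'a' for Unicode and blank
        // for MARC-8 -- but as noted above, it is frequently wrong.
        static String guess(byte[] record) {
            boolean leaderSaysUtf8 = record[9] == 'a';
            // Note: pure-ASCII data passes this test and is also valid
            // MARC-8, which is part of why detection is only a guess.
            if (isValidUtf8(record)) {
                return "UTF-8";
            }
            return leaderSaysUtf8 ? "claims UTF-8, but isn't"
                                  : "probably MARC-8";
        }

        static boolean isValidUtf8(byte[] bytes) {
            try {
                Charset.forName("UTF-8").newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(bytes));
                return true;
            } catch (CharacterCodingException e) {
                return false;
            }
        }
    }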

You also argued in another message that cataloger entry tools should 
give feedback to help the cataloger avoid creating errors. I agree. I 
think one possible step towards this would be for the editor to work in 
Unicode, irrespective of the format in which the underlying system 
expects the data. If the underlying system expects MARC-8, then the 
"save as" process should be able to translate the data into MARC-8 on 
output.
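
marc4j already has the pieces for that last step; the "save as" end
would look something like this (again a sketch -- the converter class
names are from memory, so verify against the marc4j you have):

    import java.io.FileOutputStream;
    import org.marc4j.MarcStreamWriter;
    import org.marc4j.converter.impl.UnicodeToAnsel;
    import org.marc4j.marc.Record;

    public class SaveAsMarc8 {
        // Write a record that was edited in Unicode back out in MARC-8
        // (ANSEL) for an underlying system that still expects MARC-8.
        static void save(Record record, String filename) throws Exception {
            MarcStreamWriter writer =
                new MarcStreamWriter(new FileOutputStream(filename));
            writer.setConverter(new UnicodeToAnsel());
            writer.write(record);
            writer.close();
        }
    }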

-Robert Haschart