LISTSERV 16.5 - CODE4LIB Archives

It's hilarious that Terry has had to deal with enough ugliness in Marc 
encodings that he can recognize 0xC2 off the bat as the Marc8 
character it represents!  I am in awe, as well as sympathy.

If the record is in Marc8, then you need to know if Perl MARC::Batch can 
handle Marc8.  If it's supposed to be able to handle it, you need to 
figure out why it's not. (Does the leader byte say UTF-8 even though the 
record is really Marc8?)
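
For what it's worth, here is a rough sketch of that check. It is in 
Python rather than Perl and uses only the standard library, so it says 
nothing about what MARC::Batch does internally. It assumes you already 
have the raw bytes of a single record in hand. In MARC 21, leader 
position 09 is "a" for Unicode and blank for Marc8. The function names 
and the sample record are made up for illustration.

```python
# Sketch only: compare what the leader claims with what the bytes look like.
# All names here are hypothetical; this is not part of any MARC library.

def leader_says_utf8(record: bytes) -> bool:
    """MARC 21 leader position 09: b'a' means Unicode, blank means Marc8."""
    return record[9:10] == b"a"


def decodes_as_utf8(record: bytes) -> bool:
    """True if the whole record is valid UTF-8 (plain ASCII also passes)."""
    try:
        record.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def diagnose(record: bytes) -> str:
    says_utf8 = leader_says_utf8(record)
    valid_utf8 = decodes_as_utf8(record)
    if says_utf8 and not valid_utf8:
        return "leader claims UTF-8 but bytes are not valid UTF-8 (probably Marc8)"
    if says_utf8 and valid_utf8:
        return "leader claims UTF-8 and bytes decode as UTF-8"
    if valid_utf8:
        return "leader claims Marc8 but bytes decode as UTF-8 (ASCII-only, or leader is wrong)"
    return "leader claims Marc8 and bytes are not valid UTF-8 (consistent with Marc8)"


# Made-up record fragment: a 24-byte leader with 'a' in position 09,
# followed by a lone 0xC2 byte, which is not valid UTF-8 on its own.
sample = b"00000nam a2200000 a 4500" + b"\xc2 1999"
print(diagnose(sample))
```

A record that lands in the first case is the kind that would produce 
an error like the one Eric is seeing: the leader promises UTF-8, but 
the data contains a bare 0xC2.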

If MARC::Batch can't handle Marc8, you need to convert to UTF-8 first. 
The only software package I know of that can convert to and from Marc8 
encoding is Java Marc4J, but I wouldn't be shocked if there was 
something in Perl to do it. (But yes, as you can tell by the name, 
"Marc8" is a character encoding used ONLY in Marc; nobody but library 
people write software for dealing with it.)

On 4/6/2011 5:01 PM, Reese, Terry wrote:
> I'd echo Jonathan's question -- the 0xC2 code is the sound recording marker in MARC-8.  I'd guess the file isn't in UTF8.
>
> --TR
>
>> -----Original Message-----
>> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
>> Jonathan Rochkind
>> Sent: Wednesday, April 06, 2011 1:28 PM
>> To: [log in to unmask]
>> Subject: Re: [CODE4LIB] utf8 "\xC2" does not map to Unicode
>>
>> I am not familiar with that Perl module. But I'm more familiar than I'd want
>> with char encoding in Marc.
>>
>> I don't recognize the byte 0xC2 (there are some bytes I became pathetically
>> familiar with in past debugging, but I've forgotten em), but the first things to
>> look at:
>>
>> 1. Is your Marc file encoded in Marc8 or UTF-8?  I'm betting Marc8.
>> Theoretically there is a Marc leader byte that tells you whether it's
>> Marc8 or UTF-8, but the leader byte is often wrong in real world records.  Is it
>> wrong?
>>
>> 2. Does Perl MARC::Batch  have a function to convert from Marc8 to
>> UTF-8?   If so, how does it decide whether to convert? Is it trying to
>> do that?  Is it assuming that the leader byte of the record accurately
>> identifies the encoding, and if so, is the leader byte wrong?   Is it
>> trying to convert from Marc8 to UTF-8, when the source was UTF-8 in the
>> first place?  Or is it assuming the source was UTF-8 in the first place, when in
>> fact it was Marc8?
>>
>> Not the answer you wanted; maybe someone else will have that. Debugging
>> char encoding is hands down the most annoying kind of debugging I ever do.
>>
>> On 4/6/2011 4:13 PM, Eric Lease Morgan wrote:
>>> Ack! While using the venerable Perl MARC::Batch module I get the
>>> following error while trying to read a MARC record:
>>>
>>>     utf8 "\xC2" does not map to Unicode
>>>
>>> This is a real pain, and I'm hoping someone here can help me either: 1) trap
>>> this error allowing me to move on, or 2) figure out how to open the file
>>> "correctly".