LISTSERV 16.5 - CODE4LIB Archives

yaz-marcdump does a really good job of charset and format conversion for MARC records, and is blindingly fast.

But yaz-marcdump seems to think there are a lot of separators in the wrong place and bad indicator data, whether treating the records as UTF-8 or MARC-8.  The leaders in the records say they are UTF-8, but looking at the data, the byte sequences that Jon G. noticed remind me of UTF-8 data that was UTF-8-encoded a second time.  I wonder if they got re-encoded in transmission somewhere along the way.  Maybe just in the download from zoila.
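
For anyone who wants to test the double-encoding hunch, here is a rough Python sketch. It is not what yaz-marcdump does internally. The function names are made up for illustration, and it assumes the second encoding pass treated the original bytes as Latin-1, which is the usual way this happens but only a guess for these records. Leader position 9 is "a" for UCS/Unicode and blank for MARC-8.

```python
def leader_says_utf8(record: bytes) -> bool:
    """True if leader position 9 is 'a' (UCS/Unicode); blank means MARC-8."""
    return record[9:10] == b"a"


def undo_double_utf8(data: bytes):
    """Return repaired bytes if data looks like UTF-8 that was encoded
    a second time via Latin-1 (an assumption), otherwise None."""
    try:
        text = data.decode("utf-8")
        once = text.encode("latin-1")  # fails if any char is above U+00FF
        once.decode("utf-8")           # fails if the result is not valid UTF-8
    except (UnicodeDecodeError, UnicodeEncodeError):
        return None
    # Pure ASCII round-trips unchanged, so it tells us nothing.
    return once if once != data else None


if __name__ == "__main__":
    good = "é".encode("utf-8")                       # b'\xc3\xa9'
    doubled = good.decode("latin-1").encode("utf-8")  # b'\xc3\x83\xc2\xa9'
    print(undo_double_utf8(doubled) == good)
    print(undo_double_utf8(good))
```

This is a heuristic and can give false positives on short strings, so it is best run over a whole field or record. If the stray bytes came through Windows-1252 rather than Latin-1, the intermediate encoding would need to change accordingly.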

-Tod

On Apr 6, 2011, at 4:11 PM, Jonathan Rochkind wrote:

> That's hilarious, that Terry has had to do enough ugliness with Marc 
> encodings that he indeed can recognize 0xC2 off the bat as the Marc8 
> encoding it represents!  I am in awe, as well as sympathy.
> 
> If the record is in Marc8, then you need to know if Perl MARC::Batch can 
> handle Marc8.  If it's supposed to be able to handle it, you need to 
> figure out why it's not. (leader byte says UTF-8 even though it's really 
> Marc8?).
> 
> If MARC::Batch can't handle Marc8, you need to convert to UTF-8 first. 
> The only software package I know of that can convert from and to Marc8 
> encoding is Java Marc4J, but I wouldn't be shocked if there was 
> something in Perl to do it. (But yes, as you can tell by the name, 
> "Marc8" is a character encoding ONLY used in Marc, nobody but library 
> people write software for dealing with it).
> 
> On 4/6/2011 5:01 PM, Reese, Terry wrote:
>> I'd echo Jonathan's question -- the 0xC2 code is the sound recording copyright mark in MARC-8.  I'd guess the file isn't in UTF-8.
>> 
>> --TR
>> 
>>> -----Original Message-----
>>> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
>>> Jonathan Rochkind
>>> Sent: Wednesday, April 06, 2011 1:28 PM
>>> To: [log in to unmask]
>>> Subject: Re: [CODE4LIB] utf8 "\xC2" does not map to Unicode
>>> 
>>> I am not familiar with that Perl module. But I'm more familiar than I'd want
>>> with char encoding in Marc.
>>> 
>>> I don't recognize the byte 0xC2 (there are some bytes I became pathetically
>>> familiar with in past debugging, but I've forgotten em), but the first things to
>>> look at:
>>> 
>>> 1. Is your Marc file encoded in Marc8 or UTF-8?  I'm betting Marc8.
>>> Theoretically there is a Marc leader byte that tells you whether it's
>>> Marc8 or UTF-8, but the leader byte is often wrong in real world records.  Is it
>>> wrong?
>>> 
>>> 2. Does Perl MARC::Batch  have a function to convert from Marc8 to
>>> UTF-8?   If so, how does it decide whether to convert? Is it trying to
>>> do that?  Is it assuming that the leader byte of the record accurately
>>> identifies the encoding, and if so, is the leader byte wrong?   Is it
>>> trying to convert from Marc8 to UTF-8, when the source was UTF-8 in the
>>> first place?  Or is it assuming the source was UTF-8 in the first place, when in
>>> fact it was Marc8?
>>> 
>>> Not the answer you wanted; maybe someone else will have that. Debugging
>>> char encoding is hands down the most annoying kind of debugging I ever do.
>>> 
>>> On 4/6/2011 4:13 PM, Eric Lease Morgan wrote:
>>>> Ack! While using the venerable Perl MARC::Batch module I get the
>>> following error while trying to read a MARC record:
>>>>    utf8 "\xC2" does not map to Unicode
>>>> 
>>>> This is a real pain, and I'm hoping someone here can help me either: 1) trap
>>> this error allowing me to move on, or 2) figure out how to open the file
>>> "correctly".

Tod Olson <[log in to unmask]>
Systems Librarian
University of Chicago Library