yaz-marcdump does a really good job of charset and format conversion for MARC records, and is blindingly fast.
But yaz-marcdump seems to think there are a lot of separators in the wrong place and bad indicator data, whether treating the records as UTF-8 or MARC-8. The leaders in the records say they are UTF-8, but looking at the data, the byte sequences that Jon G. noticed remind me of UTF-8 data that was UTF-8-encoded a second time. I wonder if they got re-encoded in transmission somewhere along the way. Maybe just in the download from zoila.
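For what it's worth, here is a rough Python sketch of how one might test the double-encoding hunch on a chunk of field data. It assumes the second pass treated the original UTF-8 bytes as Latin-1 (Windows-1252 is just as plausible, and would need a different codec), and the function names are made up for illustration, not taken from any MARC library. The leader check relies only on MARC 21 leader position 09, where "a" means UCS/Unicode and a blank means MARC-8.

```python
def leader_says_utf8(record: bytes) -> bool:
    """MARC 21 leader position 09: b'a' = UCS/Unicode, blank = MARC-8."""
    return record[9:10] == b"a"


def looks_double_encoded(raw: bytes) -> bool:
    """Heuristic: True if raw is valid UTF-8 whose decoded characters,
    taken back to bytes via Latin-1 (an assumption about the culprit),
    form a different, also valid, UTF-8 string."""
    try:
        once = raw.decode("utf-8")
        twice = once.encode("latin-1").decode("utf-8")
    except (UnicodeDecodeError, UnicodeEncodeError):
        # Not UTF-8 at all, or not reversible through Latin-1.
        return False
    # Pure ASCII round-trips unchanged, so it is not flagged.
    return twice != once


def undo_double_encoding(raw: bytes) -> bytes:
    """Strip one layer of UTF-8 encoding; only call this after
    looks_double_encoded(raw) has returned True."""
    return raw.decode("utf-8").encode("latin-1")


if __name__ == "__main__":
    good = "é".encode("utf-8")                    # b'\xc3\xa9'
    bad = good.decode("latin-1").encode("utf-8")  # encoded a second time
    print(looks_double_encoded(good), looks_double_encoded(bad))
    print(undo_double_encoding(bad) == good)
```

This is only a heuristic: a record that legitimately contains a sequence like "Ã©" would be flagged too, so it is better for eyeballing a sample than for bulk repair. Doubly encoded data does tend to be full of 0xC2 and 0xC3 bytes, which would fit what people are seeing.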
-Tod
On Apr 6, 2011, at 4:11 PM, Jonathan Rochkind wrote:
> That's hilarious, that Terry has had to do enough ugliness with Marc
> encodings that he indeed can recognize 0xC2 off the bat as the Marc8
> encoding it represents! I am in awe, as well as sympathy.
>
> If the record is in Marc8, then you need to know if Perl MARC::Batch can
> handle Marc8. If it's supposed to be able to handle it, you need to
> figure out why it's not. (leader byte says UTF-8 even though it's really
> Marc8?).
>
> If MARC::Batch can't handle Marc8, you need to convert to UTF-8 first.
> The only software package I know of that can convert from and to Marc8
> encoding is Java Marc4J, but I wouldn't be shocked if there was
> something in Perl to do it. (But yes, as you can tell by the name,
> "Marc8" is a character encoding ONLY used in Marc, nobody but library
> people write software for dealing with it).
>
> On 4/6/2011 5:01 PM, Reese, Terry wrote:
>> I'd echo Jonathan's question -- the 0xC2 code is the sound recording marker in MARC-8. I'd guess the file isn't in UTF8.
>>
>> --TR
>>
>>> -----Original Message-----
>>> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
>>> Jonathan Rochkind
>>> Sent: Wednesday, April 06, 2011 1:28 PM
>>> To: [log in to unmask]
>>> Subject: Re: [CODE4LIB] utf8 "\xC2" does not map to Unicode
>>>
>>> I am not familiar with that Perl module. But I'm more familiar than I'd want
>>> with char encoding in Marc.
>>>
>>> I don't recognize the bytes 0xC2 (there are some bytes I became pathetically
>>> familiar with in past debugging, but I've forgotten em), but the first things to
>>> look at:
>>>
>>> 1. Is your Marc file encoded in Marc8 or UTF-8? I'm betting Marc8.
>>> Theoretically there is a Marc leader byte that tells you whether it's
>>> Marc8 or UTF-8, but the leader byte is often wrong in real world records. Is it
>>> wrong?
>>>
>>> 2. Does Perl MARC::Batch have a function to convert from Marc8 to
>>> UTF-8? If so, how does it decide whether to convert? Is it trying to
>>> do that? Is it assuming that the leader byte of the record accurately
>>> identifies the encoding, and if so, is the leader byte wrong? Is it
>>> trying to convert from Marc8 to UTF-8, when the source was UTF-8 in the
>>> first place? Or is it assuming the source was UTF-8 in the first place, when in
>>> fact it was Marc8?
>>>
>>> Not the answer you wanted, maybe someone else will have that. Debugging
>>> char encoding is hands down the most annoying kind of debugging I ever do.
>>>
>>> On 4/6/2011 4:13 PM, Eric Lease Morgan wrote:
>>>> Ack! While using the venerable Perl MARC::Batch module I get the
>>> following error while trying to read a MARC record:
>>>> utf8 "\xC2" does not map to Unicode
>>>>
>>>> This is a real pain, and I'm hoping someone here can help me either: 1) trap
>>> this error allowing me to move on, or 2) figure out how to open the file
>>> "correctly".
Tod Olson <[log in to unmask]>
Systems Librarian
University of Chicago Library