I'm not quite convinced that it's MARC-8 just because there's \xC2 ;).
Looking at a hex dump, I'm seeing a lot of what might be combining
characters.  The leader appears to have 'a' in the field that indicates
Unicode.  In the raw hex I'm seeing a lot of two-byte sequences
like: 756c 69c3 83c2 a872 (culi....r).  If I knew my UTF-8 better, I
could guess which combining diacritics these are.  Doing a lookup on
http://www.fileformat.info seems to indicate that this might be UTF-8:
C2 A8 is a 'DIAERESIS'.
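
If you want to check quickly whether a run like that is valid UTF-8 and
what it decodes to, something along these lines works (a quick sketch,
untested against your file; the bytes are just my transcription of the
dump above):

    use strict;
    use warnings;
    use Encode qw(decode);

    binmode STDOUT, ':encoding(UTF-8)';

    # the "...uli...r" run transcribed from the hex dump above
    my $bytes = "\x75\x6c\x69\xc3\x83\xc2\xa8\x72";

    # croak if the bytes aren't actually well-formed UTF-8
    my $text = decode('UTF-8', $bytes, Encode::FB_CROAK);

    # print each decoded code point
    printf "U+%04X  %s\n", ord($_), $_ for split //, $text;

If that decodes without croaking, the data is at least well-formed
UTF-8, and you can see exactly which code points those suspicious
bytes turn into.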

When debugging any encoding issue it's always good to know:

a) how the records were obtained,
b) how they've been manipulated before you touched them (basically,
how many times they may have been converted by some bungling process),
c) what encoding they claim to be now, and
d) what encoding they actually are, if any.
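
For (c), a MARC record carries its claimed encoding in leader position
09 (blank = MARC-8, 'a' = UCS/Unicode), so a few lines of Perl will
tell you what the file says about itself (quick sketch; substitute
your own file name for records.mrc):

    use strict;
    use warnings;

    # read just the first leader (24 bytes) straight off the file
    open my $fh, '<:raw', 'records.mrc' or die "records.mrc: $!";
    read($fh, my $leader, 24) == 24 or die "file too short to hold a leader\n";

    my $scheme = substr($leader, 9, 1);
    print $scheme eq 'a'
        ? "leader/09 = 'a' -- record claims Unicode\n"
        : "leader/09 = '$scheme' -- record claims MARC-8 (or something odd)\n";

That only tells you what the record claims, of course, which is exactly
why (d) is a separate question.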


It's been a while since I used MARC::Batch.  Is there any reason
you're using that instead of just using MARC::Record?  I'd try just
creating a MARC::Record object directly.
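
Something like this (untested, off the top of my head; again the file
name is made up): split the file on the MARC record terminator and feed
each chunk to MARC::Record yourself.

    use strict;
    use warnings;
    use MARC::Record;

    open my $fh, '<:raw', 'records.mrc' or die $!;
    local $/ = "\x1d";                    # MARC record terminator
    while (my $raw = <$fh>) {
        next unless length $raw > 24;     # skip trailing junk
        my $record = MARC::Record->new_from_usmarc($raw);
        print $record->title, "\n";
    }

That takes MARC::Batch out of the picture and lets you see which record,
if any, blows up.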

I've seen people do really bizarre things to break MARC files, such as
editing the raw binary (thus invalidating the leader and the directory,
since the byte counts were no longer right).
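
One cheap sanity check for that kind of damage (my own sketch, same
caveats as above about the file name) is to compare the record length
declared in leader bytes 00-04 with the bytes actually present:

    use strict;
    use warnings;

    open my $fh, '<:raw', 'records.mrc' or die $!;
    local $/ = "\x1d";                        # MARC record terminator
    my $n = 0;
    while (my $raw = <$fh>) {
        $n++;
        next unless length $raw > 24;
        my $declared = substr($raw, 0, 5);    # leader/00-04: logical record length
        my $actual   = length $raw;
        unless ($declared =~ /^\d{5}$/ && $declared == $actual) {
            print "record $n: leader says '$declared', actual length is $actual\n";
        }
    }

Any mismatches there point at hand-edited or otherwise mangled records.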

I hate to say it, but we still come across files that are no longer in
any encoding due to too many bad conversions.  It's possible these are
as well.

The enca tool (I haven't used it much) guesses this is UTF-8 mixed with
"non-text data".

Jon