>>> In Perl, how do I specify MARC-8 when reading (decoding) and writing
>>> (encoding) data?
>>
>> You can't. MARC-8 is a character set that is unknown to the operating system. Your best bet is to convert MARC-8-encoded records into UTF-8.
>
> /me throws his hands up in the air and screams!
>
> Okay. How do I go about converting MARC-8 encoded records into UTF-8? I know yaz-marcdump changes the encoding bit in MARC leaders. Does it also convert MARC-8 characters to UTF-8? (I guess I could simply try it and see what happens.)
>
I seem to remember that an older version of yaz-marcdump was a bit
buggy: with a certain combination of command-line options it would
change the leader but not actually convert the encoding. It's also
possible I was just working with a script that specified the encoding
change but not the leader.

I'd say get the most recent version of yaz (don't use anything in an
OS repository) and then follow the docs:
http://www.indexdata.com/yaz/doc/yaz-marcdump.html. The first example
is what you want:

    yaz-marcdump -f MARC-8 -t UTF-8 -o marc -l 9=97 marc21.raw >marc21.utf8.raw

The -f is the source encoding, the -t is the target encoding, and
-l 9=97 sets leader position 9 to "a" (97 is the decimal ASCII code
for "a"), which marks the record as Unicode.
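
If you want to double-check that the leader flag really got set after a
conversion, a rough Python sketch like the one below could do it. This
is only an illustration, not part of yaz: the function names
(coding_scheme, split_records, report) are made up for this example,
and it assumes the file is a plain concatenation of MARC 21 records,
each ending in the record terminator byte 0x1D.

```python
# Leader position 9 (counting from 0) is the character coding scheme:
# a blank means MARC-8, "a" means UCS/Unicode.
CODING_SCHEMES = {" ": "MARC-8", "a": "UCS/Unicode"}

RECORD_TERMINATOR = b"\x1d"


def coding_scheme(record: bytes) -> str:
    """Return the coding scheme named by leader position 9 of one record."""
    flag = chr(record[9])
    return CODING_SCHEMES.get(flag, f"unknown ({flag!r})")


def split_records(data: bytes) -> list:
    """Split a raw MARC file into records (assumes 0x1D ends each record)."""
    return [chunk + RECORD_TERMINATOR
            for chunk in data.split(RECORD_TERMINATOR) if chunk]


def report(path: str) -> None:
    """Print the leader flag for every record in the file at `path`."""
    with open(path, "rb") as handle:
        for number, record in enumerate(split_records(handle.read()), 1):
            print(number, coding_scheme(record))
```

Calling report("marc21.utf8.raw") on the output file from the command
above should then list UCS/Unicode for every record. Note that this only
inspects the flag; it does not prove the field data was converted.
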
I've typically found this is one of the easier ways to do the
character set conversion, although the various Perl modules (if they're
recent enough) should be able to handle the conversion as well through
the MARC::Charset library. Check the CPAN pages.
Jon Gorman
ps. For the love of all that is good, don't try to do anything in
Perl with the raw MARC record to do the encoding change yourself.
I've seen someone really screw records up because they altered
individual characters, which in turn led to different byte lengths.
This caused all sorts of insanity which meant really weird things
happened with MARC parsers that tried to follow the MARC directory
(which uses byte addresses to deal with variable fields).
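
To make that failure concrete, here is a small Python sketch (Python
only because it is compact; the same thing happens in Perl). It is a
toy, not a real MARC library: build_record and read_fields are
hypothetical helpers written for this illustration, and the record
holds just two made-up fields. It swaps the MARC-8 acute accent (0xE2
before the letter) for the UTF-8 combining acute (two bytes after the
letter) directly in the raw bytes, without touching the directory or
the record length in the leader.

```python
FT = b"\x1e"  # field terminator
RT = b"\x1d"  # record terminator


def build_record(fields):
    """Assemble a minimal MARC-style record from (tag, data) pairs."""
    directory, body = b"", b""
    for tag, data in fields:
        field = data + FT
        # Directory entry: 3-char tag, 4-digit length, 5-digit start offset.
        directory += f"{tag}{len(field):04d}{len(body):05d}".encode("ascii")
        body += field
    directory += FT
    base = 24 + len(directory)      # base address of data
    total = base + len(body) + 1    # +1 for the record terminator
    leader = f"{total:05d}nam  22{base:05d} a 4500".encode("ascii")
    return leader + directory + body + RT


def read_fields(record):
    """Read fields by trusting the byte offsets stored in the directory."""
    base = int(record[12:17])
    directory = record[24:base - 1]
    fields = {}
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag = entry[0:3].decode("ascii")
        length, start = int(entry[3:7]), int(entry[7:12])
        fields[tag] = record[base + start:base + start + length]
    return fields


record = build_record([
    ("100", b"1 \x1faBorges, Jos\xe2e"),  # MARC-8: accent byte, then "e"
    ("245", b"10\x1faTitle"),
])

# Naive in-place "conversion": the accented letter grows from 2 to 3 bytes,
# but the directory and the leader's record length are left stale.
broken = record.replace(b"\xe2e", "e\u0301".encode("utf-8"))

print(read_fields(record)["245"])   # field as stored
print(read_fields(broken)["245"])   # same offsets, now one byte off
print(int(broken[0:5]), len(broken))  # leader length vs. actual length
```

After the edit, the 245 field read through the directory starts one byte
early (on the previous field's terminator) and loses its own terminator,
and the record length in the leader no longer matches the actual byte
count. A converter that re-serializes the whole record, such as
yaz-marcdump, avoids this because it rebuilds the directory and leader.
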