Eric,
Thanks for posting your question, thereby eliciting some interesting comments and suggestions.
I took a quick look at the records, and it seems to me that at least part of the problem is with the record lengths (and probably the field lengths, too). When calculating record lengths for UTF-8-encoded records, one should count bytes, not characters. It looks like the stated length of at least some of your problem records is the number of characters rather than the number of bytes.
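A rough way to check this on a single record is sketched below. It assumes the record is held as a raw byte string and that the stated length is the five digits in leader positions 00-04; the sub name and the sample string are made up for illustration, and the sample is not a valid MARC record.

```perl
use strict;
use warnings;
use Encode qw(decode);

# For one raw record (a byte string), return three numbers:
# the length stated in leader positions 00-04, the actual byte count,
# and the character count after decoding the bytes as UTF-8.
sub record_lengths {
    my ($raw) = @_;
    my $stated     = substr($raw, 0, 5) + 0;
    my $byte_count = length($raw);
    my $char_count = length(decode('UTF-8', $raw));
    return ($stated, $byte_count, $char_count);
}

# Hypothetical stand-in: a five-digit length followed by "caf" and the
# two UTF-8 bytes for an e with an acute accent.
my ($stated, $byte_count, $char_count) = record_lengths("00009caf\xC3\xA9");
print "stated: $stated, bytes: $byte_count, characters: $char_count\n";
```

If the stated length agrees with the character count but not with the byte count, the length was probably computed on characters.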
Mike
-----Original Message-----
From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of Eric Lease Morgan
Sent: Thursday, October 07, 2010 1:53 PM
To: [log in to unmask]
Subject: Re: [CODE4LIB] unwanted (bogus) characters in marc
> How do I trap for unwanted (bogus) characters in MARC records?
Thank you for all of the interest, and a number of you have asked, "Show me a record." In that vein, I'm making the whole of my script available [1]. As long as you have the necessary Perl modules installed you should be able to run it. Be forewarned. In its present state it will create about 500 MARC records in your local directory.
When and if you run the script, and provided you have MARC::Record & friends installed, you should be able to run marcdump against the resulting files:
$ marcdump *.mrc
Most records seem just fine, but others make marcdump croak. Thus, the heart of my question lies in the following lines:
# save it
open MARC, " > $identifier.mrc" or die "Can't open $identifier.mrc: $!\n";
binmode MARC, ":utf8";
print MARC $response->content;
close MARC;
Are they sufficient for correctly saving the MARC records locally? Should/can I do some sort of check of $response->content before I call binmode and print? Should I use $response->decode_content instead?
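One possible alternative is sketched here, on the assumption that $response is an HTTP::Response and that its content is already a UTF-8-encoded byte string. In that case a ":utf8" layer would encode every byte above 0x7F a second time, so the file would hold more bytes than the record's stated length. The sub name save_raw is hypothetical.

```perl
use strict;
use warnings;

# Write a byte string to disk unchanged. The ':raw' layer applies no
# encoding, so bytes that are already UTF-8 are not encoded again.
sub save_raw {
    my ($path, $raw) = @_;
    open my $fh, '>:raw', $path or die "Can't open $path: $!\n";
    print {$fh} $raw;
    close $fh or die "Can't close $path: $!\n";
    return;
}

# Hypothetical usage inside the harvesting loop:
# save_raw("$identifier.mrc", $response->content);
```

Whether this applies depends on what the server actually sends, so it is worth comparing the file size on disk with the length stated in the leader.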
[1] script - http://infomotions.com/tmp/harvest.pl
--
Eric Lease Morgan
Hesburgh Libraries, University of Notre Dame
(574) 631-8604