On 11 April 2011 16:40, Jonathan Rochkind <[log in to unmask]> wrote:
> XML well-formedness and validity checks can't find badly encoded characters
> either -- char data that claims to be one encoding but is really another, or
> that has been double-encoded and now means something different than intended.
> There's really no way to catch that but heuristics. All of the
> marc-validating and well-formedness-checking in the world wouldn't prevent
> you from this problem, if people/software don't properly keep track of their
> encodings and keep mis-encoded chars out of the data.
Right. Double-encoding, or encoding one way while telling the record
you did it another way, is a data-level pilot error -- on a par with
the kind of error where someone means to type "you're" but types
"your". The error is not with the MARC record, but with the data
that's been put INTO the MARC records.
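For what it's worth, double-encoding is easy to demonstrate. Here is a
minimal Python sketch (using "café" as stand-in data, not real record
content): UTF-8 bytes misread as Latin-1 and re-encoded produce mojibake
that is still perfectly legal UTF-8 -- which is why no well-formedness
check will ever flag it:

```python
# Double-encoding in miniature: UTF-8 bytes misread as Latin-1,
# then re-encoded as UTF-8. No step raises an error, but the text
# is now wrong.
original = "café"
utf8_bytes = original.encode("utf-8")     # b'caf\xc3\xa9'
misread = utf8_bytes.decode("latin-1")    # 'cafÃ©' -- no error raised
double_encoded = misread.encode("utf-8")  # b'caf\xc3\x83\xc2\xa9'
print(double_encoded.decode("utf-8"))     # prints 'cafÃ©', not 'café'
```

The resulting record is byte-for-byte valid UTF-8; only a human (or a
heuristic) can tell the title is now garbage.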
> On 4/11/2011 11:31 AM, Eric Lease Morgan wrote:
>> On Apr 6, 2011, at 5:39 PM, Jon Gorman wrote:
>>> When debugging any encoding issue it's always good to know:
>>> a) how the records were obtained
>>> b) how they have been manipulated before you
>>> touch them (basically, how many times they
>>> may have been converted by some bungling)
>>> c) what encoding they claim to be now, and
>>> d) what encoding they actually are, if any
>> I'm making headway on my MARC records, but only through the use of brute force.
>> I used wget to retrieve the MARC records (as well as associated PDF and
>> text files) from the Internet Archive. The process resulted in 538 records.
>> I then used marcdump to look at the records individually. When it choked on
>> some weird character I renamed the offending file and re-examined the lot
>> again. Through this process my pile of records dwindled to 523. I then
>> concatenated the non-offending records into a single file, and I made them
>> available, again, at the URL above. Now, when I use marcdump it does not
>> crash and burn on tor.marc, but it does say there are 121 errors.
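The quarantine step described here can be partly automated. A rough
stdlib-only sketch (the function name and the UTF-8 assumption are mine,
purely illustrative): split raw MARC on the record terminator (0x1D) and
flag any record whose bytes will not decode:

```python
# Return the positions of records whose bytes fail to decode as UTF-8.
# 0x1D is the MARC record terminator.
def find_bad_records(raw: bytes) -> list:
    bad = []
    records = raw.rstrip(b"\x1d").split(b"\x1d")
    for i, rec in enumerate(records):
        try:
            rec.decode("utf-8")
        except UnicodeDecodeError:
            bad.append(i)
    return bad
```

Point it at the concatenated file and it tells you which records to pull
instead of renaming files one crash at a time.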
>> I did play a bit with yaz-marcdump to seemingly convert things from marc-8
>> to utf-8, but I'm not so sure it does what is expected. Does it actually
>> convert characters, or does it simply change a value in the leader of each
>> record? If the former, then how do I know it is not double-encoding things?
>> If the latter, then my resulting data set is still broken.
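My understanding is that yaz-marcdump does convert the character data
when asked to go from MARC-8 to UTF-8 (updating the leader is a separate
option), but the question is easy to settle empirically. A small Python
sketch compares what a record's leader *claims* (byte 9: blank = MARC-8,
"a" = UCS/Unicode) against what its bytes actually decode as:

```python
# Compare what a raw MARC record's leader claims against reality.
# Leader byte 9: b' ' means MARC-8, b'a' means UCS/Unicode.
def check_record(raw: bytes):
    claims_utf8 = raw[9:10] == b"a"
    try:
        raw.decode("utf-8")
        decodes_as_utf8 = True
    except UnicodeDecodeError:
        decodes_as_utf8 = False
    return claims_utf8, decodes_as_utf8
```

If the converter only flipped the leader, you will see (True, False):
the record claims Unicode but its bytes don't decode as UTF-8.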
>> Upon reflection, I think the validation of MARC records ought to be
>> exactly the same as the validation of XML. First they should be well-formed.
>> Leader. Directory. Bibliographic section. Complete with ASCII characters 29,
>> 30, and 31 in the proper locations. Second, they should validate. This means
>> fields where integers are expected should include integers. It means there
>> are characters in 245. Etc. Third, the data should be meaningful. The
>> characters in 245 should be titles. The characters in 020 should be ISBN
>> numbers (not an ISBN number followed by "(pbk)"). Etc. Finally, the data should be
>> accurate. The titles placed in 245 are the real titles. The author names are
>> the real author names. Etc. Validations #1-#3 can be done by computers.
>> Validation #4 is the work of humans.
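Level #1 is mechanical enough to sketch in a few lines. A partial
well-formedness check in Python (illustrative only -- it verifies the
stated record length, the trailing record terminator, and that the base
address sits just past the directory terminator; field-level checks are
omitted):

```python
# Partial MARC well-formedness check: leader arithmetic plus the
# ASCII 29/30 terminators mentioned above.
def is_well_formed(raw: bytes) -> bool:
    if len(raw) < 24:
        return False
    try:
        rec_len = int(raw[0:5])    # leader 00-04: record length
        base = int(raw[12:17])     # leader 12-16: base address of data
    except ValueError:
        return False
    if rec_len != len(raw):
        return False
    if raw[-1:] != b"\x1d":        # ASCII 29: record terminator
        return False
    return raw[base - 1:base] == b"\x1e"  # ASCII 30 ends the directory
```

A record that fails any of these tests should be quarantined before any
attempt at validation proper.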
>> If MARC records are not well-formed and do not validate according to the
>> standard, then, just as XML processors do with bad XML, they should be
>> rejected. Garbage in. Garbage out.