XML well-formedness and validity checks can't find badly encoded
characters either -- char data that claims to be one encoding but is
really another, or that has been double-encoded and now means something
different than intended.
There's really no way to catch that except with heuristics. All the
MARC validation and well-formedness checking in the world won't
protect you from this problem if people and software don't keep proper
track of their encodings and end up putting mis-encoded characters in
the data.
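
To make "heuristics" a little more concrete, here is a minimal Python
sketch of one such check. It only covers one common case: UTF-8 bytes
that were misread as Latin-1 and then encoded as UTF-8 a second time.
The function name is made up for illustration, and the Latin-1
assumption is mine; real data may involve Windows-1252 or MARC-8
instead, which this does not handle. It can also give false positives
on text that legitimately contains sequences like "Ã©".

```python
from typing import Optional


def undo_double_encoding(text: str) -> Optional[str]:
    """Return a repaired string if `text` looks double-encoded, else None.

    Heuristic only: assumes the wrong intermediate step was Latin-1.
    """
    try:
        # Undo the suspected wrong step: map each character back to one byte.
        raw = text.encode("latin-1")
    except UnicodeEncodeError:
        # Characters outside Latin-1 are present, so this pattern cannot apply.
        return None
    if raw.isascii():
        # Pure ASCII looks the same in all of these encodings.
        return None
    try:
        # If those bytes happen to be valid UTF-8, the text was probably
        # UTF-8 that got decoded with the wrong codec somewhere upstream.
        repaired = raw.decode("utf-8")
    except UnicodeDecodeError:
        return None
    return repaired if repaired != text else None


if __name__ == "__main__":
    for sample in ["cafÃ©", "café", "plain ascii"]:
        print(repr(sample), "->", repr(undo_double_encoding(sample)))
```

A None result only means this particular pattern was not found, not
that the data is correctly encoded.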
On 4/11/2011 11:31 AM, Eric Lease Morgan wrote:
> On Apr 6, 2011, at 5:39 PM, Jon Gorman wrote:
>> When debugging any encoding issue it's always good to know:
>> a) how the records were obtained
>> b) how have they been manipulated before you
>> touch them (basically, how many times may
>> they have been converted by some bungling
>> c) what encoding they claim to be now? and
>> d) what encoding they are, if any?
> I'm making headway on my MARC records, but only through the use of brute force.
> I used wget to retrieve the MARC records (as well as associated PDF and text files) from the Internet Archive. The process resulted in 538 records. I then used marcdump to look at the records individually. When it choked on some weird character I renamed the offending file and re-examined the lot again. Through this process my pile of records dwindled to 523. I then concatenated the non-offending records into a single file, and I made them available, again, at the URL above. Now, when I use marcdump it does not crash and burn on tor.marc, but it does say there are 121 errors.
> I did play a bit with yaz-marcdump to seemingly convert things from marc-8 to utf-8, but I'm not so sure it does what is expected. Does it actually convert characters, or does it simply change a value in the leader of each record? If the former, then how do I know it is not double-encoding things? If the latter, then my resulting data set is still broken.
> Upon reflection, I think the validation of MARC records ought to be exactly the same as the validation of XML. First they should be well-formed. Leader. Directory. Bibliographic section. Complete with ASCII characters 29, 30, and 31 in the proper locations. Second, they should validate. This means fields where integers are expected should include integers. It means there are characters in 245. Etc. Third, the data should be meaningful. The characters in 245 should be titles. The characters in 020 should be ISN numbers (not ISBN number and then "(pbk)"). Etc. Finally, the data should be accurate. The titles placed in 245 are the real titles. The author names are the real author names. Etc. Validations #1-#3 can be done by computers. Validation #4 is the work of humans.
> If MARC records are not well-formed and do not validate according to the standard, then, as with XML processors, they should not be used. Garbage in. Garbage out.
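
The "well-formed" step described above (leader, directory, terminators
in the right places) is mechanical enough to sketch. The Python below
is a rough illustration, not a full validator: it checks the record
length and base address in the leader, the 12-byte directory entries,
and the field and record terminators (ASCII 30 and 29). It also
compares leader position 09 (where "a" means Unicode) against whether
the bytes actually decode as UTF-8, which says nothing about
double-encoding. The function name and the message wording are my own.

```python
from typing import List

FIELD_TERMINATOR = 0x1E   # ASCII 30, ends the directory and every field
RECORD_TERMINATOR = 0x1D  # ASCII 29, ends the record
LEADER_LEN = 24
DIR_ENTRY_LEN = 12        # tag (3) + field length (4) + start offset (5)


def check_record(record: bytes) -> List[str]:
    """Return a list of structural problems found in one raw MARC record."""
    if len(record) < LEADER_LEN + 2:
        return ["record too short to hold a leader"]
    leader = record[:LEADER_LEN]
    if not (leader[0:5].isdigit() and leader[12:17].isdigit()):
        return ["record length or base address in leader is not numeric"]

    problems = []
    declared_len = int(leader[0:5])
    base = int(leader[12:17])
    if declared_len != len(record):
        problems.append(
            "leader says %d bytes, record has %d" % (declared_len, len(record)))
    if record[-1] != RECORD_TERMINATOR:
        problems.append("record does not end with a record terminator")
    if not (LEADER_LEN < base <= len(record)) \
            or record[base - 1] != FIELD_TERMINATOR:
        problems.append("directory is not terminated at the base address")
        return problems  # field offsets cannot be trusted past this point

    directory = record[LEADER_LEN:base - 1]
    if len(directory) % DIR_ENTRY_LEN:
        problems.append("directory length is not a multiple of 12")
    for i in range(0, len(directory) - DIR_ENTRY_LEN + 1, DIR_ENTRY_LEN):
        entry = directory[i:i + DIR_ENTRY_LEN]
        tag = entry[:3].decode("ascii", "replace")
        if not entry[3:].isdigit():
            problems.append("%s: directory entry is not numeric" % tag)
            continue
        length, start = int(entry[3:7]), int(entry[7:12])
        end = base + start + length
        if end > len(record) or record[end - 1] != FIELD_TERMINATOR:
            problems.append("%s: field does not end with a field terminator" % tag)

    # Leader/09 is only a claim about the encoding; compare it to the bytes.
    if leader[9:10] == b"a":
        try:
            record.decode("utf-8")
        except UnicodeDecodeError:
            problems.append("leader/09 claims UTF-8 but bytes are not valid UTF-8")
    return problems
```

An empty list means the record passed these particular checks and
nothing more. Whether the 245 holds a real title is still a job for a
person.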