On 4/6/2011 2:02 PM, Kyle Banerjee wrote:
> I'd go so far as to question the value of validating redundant data that
> theoretically has meaning but which are never supposed to vary. The 4 and
> the 5 simply repeat what is already known about the structure of the MARC
> record. Choking on stuff like this is like having a web browser ask you what
> to do with a page because it lacks a document type declaration.

Well, the problem is when the original Marc4J author took the spec at
its word, and actually _acted upon_ the '4' and the '5', changing how
the file was parsed if they were different, and throwing an exception
if either was a non-digit.

This actually happened, I'm not making this up!  Took me a while to debug.
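
To make the two behaviors concrete, here is a rough sketch in Python. It is
not Marc4J's actual code, and the names entry_map and split_directory_entry
are made up for this example. It assumes the usual layout: a 24-character
leader, with leader/20 giving the width of the length-of-field part of each
directory entry and leader/21 the width of the starting-position part.

```python
from typing import Tuple


def entry_map(leader: str, strict: bool = True) -> Tuple[int, int]:
    """Return (length-of-field width, starting-position width).

    strict=True takes leader/20-21 at their word, as the spec describes.
    strict=False ignores them and assumes the MARC 21 constants 4 and 5.
    """
    if len(leader) != 24:
        raise ValueError("a MARC leader is exactly 24 characters")
    if not strict:
        # Lenient reading: the bytes never vary in MARC 21, so skip them.
        return (4, 5)
    raw = leader[20:22]
    if not (raw.isascii() and raw.isdigit()):
        # Strict reading: a non-digit here is an error in the record.
        raise ValueError(f"non-digit entry map bytes: {raw!r}")
    return (int(raw[0]), int(raw[1]))


def split_directory_entry(entry: str, widths: Tuple[int, int]) -> Tuple[str, int, int]:
    """Split one directory entry into (tag, field length, start position)."""
    len_w, start_w = widths
    tag = entry[:3]
    length = int(entry[3:3 + len_w])
    start = int(entry[3 + len_w:3 + len_w + start_w])
    return tag, length, start
```

With a leader ending in "4500", the entry "245008700162" splits into tag 245,
length 87, start 162. If leader/20-21 said "54" instead, the strict reading
would split the same twelve characters into length 870, start 162. If they
were blanks, the strict reading raises and the lenient one carries on.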

So do you think he got it wrong?  How was he supposed to know he got it
wrong?  He wrote to the spec and took it at its word. Are you SURE there
aren't any Marc formats other than Marc21 out there that actually do use
these bytes with their intended meaning, instead of treating them as
fixed values? How was the Marc4J author supposed to be sure of that, or
even guess it might be the case, and know he'd be serving users better
by ignoring the spec here instead of following it?  What documents,
other than the actual specifications, should he have been looking at to
determine that he ought not to be taking those bytes at their word, but
just ignoring them?

Realizing that we have so much non-conformant data out there that we're
better off ignoring these bytes is something you can really only learn
through experience -- and something you can then later realize you're
wrong about too:

I.e.: I _thought_ I was writing only for Marc21, but then it turns out
I've got to accept records from Outer Weirdistan that are a kind of
legal Marc that actually uses those bytes for their intended meaning --
better go back and fix my entire software stack, involving various
proprietary and open source products from multiple sources, each of
which has undocumented behavior when it comes to these bytes. Maybe they
follow the spec or maybe they follow Kyle's advice, but they don't tell
me.  This is a mess.

Maybe this scenario is impossible, maybe there ARE NOT and NEVER HAVE
BEEN any Marc variants that actually use leader bytes 20-22 in this way --
how can I determine that?  I've just got to guess and hope for the
best.  The point of specifications in the first place is
interoperability, so we know that if all software and data conform to
the spec, then all software and data will interact in expected ways.
Once we start guessing at which parts of the spec we really ought to be
ignoring....

Again, I realize in the actual environment we've got, this is not a 
luxury we have. But it's a fault, not a benefit, to have lots of 
software everywhere behaving in non-compliant ways and creating invalid 
(according to the spec!) data.