On 4/6/2011 2:02 PM, Kyle Banerjee wrote:
> I'd go so far as to question the value of validating redundant data that
> theoretically has meaning but which are never supposed to vary. The 4 and
> the 5 simply repeat what is already known about the structure of the MARC
> record. Choking on stuff like this is like having a web browser ask you want
> to do with a page because it lacks a document type declaration.
Well, the problem is when the original Marc4J author took the spec at
its word, and actually _acted upon_ the '4' and the '5', changing file
semantics if they were different, and throwing an exception if it was a
This actually happened, I'm not making this up! Took me a while to debug.
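
To make the two behaviors concrete, here is a rough Python sketch. It is only an illustration, not Marc4J's actual code: the names entry_map, directory_entry_size and the strict flag are made up for this example, and raising on a non-digit is my assumption about what a spec-following reader would do. It assumes the usual layout where leader positions 20 and 21 give the widths of the length-of-field and starting-character-position parts of each directory entry, after a 3-character tag.

```python
# Widths every MARC21 record is supposed to carry in leader positions 20-21.
MARC21_ENTRY_MAP = (4, 5)


def entry_map(leader: str, strict: bool = True) -> tuple:
    """Return (length-of-field width, starting-position width).

    strict=True  : take the spec at its word and read leader[20:22].
    strict=False : ignore those bytes and assume the MARC21 values.
    """
    if len(leader) != 24:
        raise ValueError("a MARC leader must be exactly 24 characters")
    if not strict:
        return MARC21_ENTRY_MAP
    raw = leader[20:22]
    if any(ch not in "0123456789" for ch in raw):
        # Assumed behavior of a strict reader: refuse to guess.
        raise ValueError(f"non-digit entry map in leader: {raw!r}")
    return int(raw[0]), int(raw[1])


def directory_entry_size(leader: str, strict: bool = True) -> int:
    """Size of one directory entry: 3-character tag plus the two widths."""
    length_width, start_width = entry_map(leader, strict)
    return 3 + length_width + start_width


if __name__ == "__main__":
    good = "00714cam a2200205 a 4500"
    print(directory_entry_size(good))                 # 12
    odd = good[:20] + "5600"                          # hypothetical non-MARC21 variant
    print(directory_entry_size(odd))                  # 14, the directory is read differently
    print(directory_entry_size(odd, strict=False))    # 12, the bytes are silently ignored
```

Neither mode is safe on its own: the strict reader chokes or changes semantics on sloppy Marc21 data, and the lenient one misreads any record that really does use those bytes.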
So do you think he got it wrong? How was he supposed to know he got it
wrong? He wrote to the spec and took it at its word. Are you SURE there
aren't any Marc formats other than Marc21 out there that actually do use
these bytes with their intended meaning, instead of fixing them? How was
the Marc4J author supposed to be sure of that, or even guess it might be
the case, and know he'd be serving users better by ignoring the spec
here instead of following it? What documents instead of the actual
specifications should he have been looking at to determine that he ought
not to be taking those bytes at their words, but just ignoring them?
To realize that we have so much non-conformant data out there that we're
better off ignoring these bytes, is something you can really only learn
through experience -- and something you can then later realize you're
wrong on too:
I.e.: I _thought_ I was writing only for Marc21, but then it turns out
I've got to accept records from Outer Weirdistan that are a kind of
legal Marc that actually uses those bytes for their intended meaning --
better go back and fix my entire software stack, involving various
proprietary and open source products from multiple sources, each of
which has undocumented behavior when it comes to these bytes. Maybe they
follow the spec or maybe they follow Kyle's advice, but they don't tell
me. This is a mess.
Maybe this scenario is impossible, maybe there AREN'T and NEVER HAVE BEEN
any Marc variants that actually use leader bytes 20-22 in this way --
how can I determine that? I've just got to guess and hope for the
best. The point of specifications in the first place is for
inter-operability, so we know that if all software and data conforms to
the spec, then all software and data will interact in expected ways.
Once we start guessing at which parts of the spec we really ought to be
Again, I realize in the actual environment we've got, this is not a
luxury we have. But it's a fault, not a benefit, to have lots of
software everywhere behaving in non-compliant ways and creating invalid
(according to the spec!) data.