> You could substitute XML with e.g. Base64 encoding if it makes thinking
> about this stuff easier. For instance email clients often send binary files
> in Base64, but it doesn't mean the file is ruined, as the receiving email
> client can decode it back to the original binary.
A bit of an ironic statement, considering a regular, constant
complaint on several library-related mailing lists I'm on is that
emails are coming in "garbled" or need to be sent again in "plain
text". Without fail it's because the person is using a client that
won't or can't deal with Base64. Yes, silly in this day and age.
Perhaps I'm just jaded from working in libraries for too long, but
your examples assume some logical, consistent control throughout the
process of dealing with MARC data.
Let's think of this scenario instead:
You're using your vendor's ILS system. You stick some html tags into
a record. The vendor's ILS does some different stuff with it like
indexing it, storing the complete record for later retrieval, and
pulling data in the record into a semi-normalized scheme in a
database. Now the librarians who have just enough training to run
some reports for these systems start running them via Access and start
shifting the data around in Access, Excel, and Word. Then a little
while later they start raising alarms, either because they see the
markup in the record and wonder what's happening and how to remove it,
or because one of those tools treats that area as text, another as XML
content, and somewhere along the way it gets messed up.
The above is not really all that uncommon a scenario.
Or how about this scenario:
You add some html internally to a MARC record. You then add it to
your ILS system. A few years later you go to export and decide to do
it in MARCXML. Unknown to you, the ILS doesn't do a sane translation
process, but rather rebuilds the MARCXML from information in the
database that was put there by the original MARC. The code is
horribly set up and hackish, and for certain fields it does not bother
to escape what it's retrieving from the record. You then go to import
to your new ILS, which validates the MARCXML. It of course now croaks
because
you have something like
<marc:subfield code="a"><div class="foo">pretty</div>.
Would you count on having someone on the staff who will be able to fix
those MARCXML files? Or did you have someone like that and they
burned out? How long before the support contract on your old ILS
forces you to abandon it?
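To make that failure concrete, here's a minimal sketch in Python using
only the standard library. The helper names (wrap_subfield,
parse_subfield) are made up for illustration and are not from any ILS
or MARC toolkit. It assumes the exporter just pastes the stored value
between the subfield tags, with or without escaping it first.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Namespace used by MARCXML (MARC21 slim).
MARC_NS = "http://www.loc.gov/MARC21/slim"


def wrap_subfield(value, do_escape):
    """Hypothetical exporter: paste a stored value into a subfield element."""
    body = escape(value) if do_escape else value
    return (
        f'<marc:subfield xmlns:marc="{MARC_NS}" code="a">'
        f"{body}</marc:subfield>"
    )


def parse_subfield(fragment):
    """Return (text, number of child elements), or None if not well-formed."""
    try:
        root = ET.fromstring(fragment)
    except ET.ParseError:
        return None
    return (root.text, len(root))


tidy = '<div class="foo">pretty</div>'
sloppy = "one<br>two"  # an unclosed tag, common in hand-entered html

# Escaped: the original string comes back intact in both cases.
print(parse_subfield(wrap_subfield(tidy, True)))
print(parse_subfield(wrap_subfield(sloppy, True)))

# Unescaped but tidy: it parses, but the value has turned into a child
# element, which a schema expecting plain text in the subfield would reject.
print(parse_subfield(wrap_subfield(tidy, False)))

# Unescaped and sloppy: not even well-formed, so the parser gives up.
print(parse_subfield(wrap_subfield(sloppy, False)))
```

Note that the unescaped case fails in two different ways depending on
how tidy the html happened to be, which is part of why these problems
are hard to diagnose after the fact.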
Plus there are still unresolved questions. Let's take RSS as an
example of a format that has been abused with html in the past. If
you find "html" in RSS, you can't be sure it is valid or well-formed.
Frequently there's no way to know what version of html it is. Yes, in
the end you can just throw it at an html parser or a browser and hope
for the best, but we have to consider the input mechanisms here. Are
folks going to be entering the html by hand? Are there going to be
some sort of macros? Some sort of batch change process? Each has a
different level of risk of producing bad html. How much extra
processing are you going to want to do for each record each time you
might end up displaying it? What do you do with a mistake? Let your
parser decide, or their browser?
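As a rough illustration of the "throw it at a parser and hope" problem,
here's a small Python sketch, again standard library only. The names
TagCollector, lenient_tags, and is_well_formed_xml are hypothetical.
It shows a forgiving html parser accepting a snippet that a strict XML
parser rejects, so which tool sees the data first decides whether the
mistake is ever noticed.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TagCollector(HTMLParser):
    """Record start and end tags without caring whether they balance."""

    def __init__(self):
        super().__init__()
        self.opened = []
        self.closed = []

    def handle_starttag(self, tag, attrs):
        self.opened.append(tag)

    def handle_endtag(self, tag):
        self.closed.append(tag)


def lenient_tags(snippet):
    """Feed a snippet to the forgiving html parser; it does not raise."""
    collector = TagCollector()
    collector.feed(snippet)
    collector.close()
    return collector.opened, collector.closed


def is_well_formed_xml(snippet):
    """Strict check: does the snippet parse as XML inside a wrapper element?"""
    try:
        ET.fromstring(f"<root>{snippet}</root>")
    except ET.ParseError:
        return False
    return True


snippet = "<p>first<p>second <b>bold"  # nothing is ever closed

print(lenient_tags(snippet))        # the html parser shrugs and carries on
print(is_well_formed_xml(snippet))  # the strict parser refuses it
```

Running the strict check at input time would at least catch the mistake
while the person who made it is still around to fix it, but that's a
design choice, not something any particular ILS is known to do.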
It's great to say we should simply re-write our tools, but many of us
work with tools supplied by vendors. We may be trying to move to more
open tools and the like, but ultimately we're constrained by what our
upper management dictates.
There are both practical issues (untrustworthy systems) and more
abstract issues (how do we communicate which version? namespaces?
etc.) at play here. Ultimately I do agree that, if it could not be
avoided, I would try putting the html into the record itself. A gain
of better usability and functionality over a couple of years is
probably worth it, as the chance of a large issue later on is quite
small. (Higher chance of small issues, though.)
I mainly sent out this email, though, because I don't think the folks
who have been pointing out issues are confused. It's not that we
don't understand that it should be able to "round-trip" or that we
haven't played around with html in other data formats. I think we've
used enough software in the library world to not trust that all the
layers will work as they should.
Jon Gorman