LISTSERV 16.5 - CODE4LIB Archives

But here's my point. 

There is no way for a consumer of MARC records to know if the MARC 
records contain HTML or not.  If a downstream consumer wants to display 
MARC in an html environment, the consumer can either assume they contain 
html, and then end up displaying MARC _wrong_ if it has html special 
chars like < or > but does not have html. Or it can assume it does 
_not_ have HTML, and end up displaying escaped html tags to the user if 
it really DOES have html.  This really applies no matter what 
presentation format the downstream consumer wants to display in. Plain 
text?  Assume it is html, and strip out html tags, potentially 
accidentally stripping out actual information if it wasn't html but 
contained html special chars. Or assume it's not html and just plain 
text, and just display it, and show the user html tags.

There's no way for a downstream consumer of MARC records to know if data 
is in html or just plain text.  In general, I think this is because the 
assumption is it's always just plain text.  If you start putting html in 
there, there's no way for a downstream consumer to predict whether it's 
going to be html or not, because that's not part of the MARC standard to 
advertise that, so there's no way for a downstream consumer to reliably 
display it correctly.  You've put html in counting on your current local 
system being specifically configured to expect html in certain MARC 
fields. Fine. But as soon as you start distributing that MARC to 
downstream consumers, you've made things awfully confusing and 
unpredictable.
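
To make the dilemma concrete, here is a rough Python sketch. The function 
names and the two sample field values are made up for illustration, and 
the tag-stripping regex is deliberately naive; the point is only that 
each assumption gives a bad result for one of the two inputs.

```python
import html
import re


def render_assuming_plain_text(value: str) -> str:
    """Treat the field as plain text: escape it before putting it in a page."""
    return html.escape(value, quote=False)


def render_assuming_html(value: str) -> str:
    """Treat the field as HTML: pass it through to the page unchanged."""
    return value


def to_text_assuming_html(value: str) -> str:
    """Treat the field as HTML and strip anything that looks like a tag."""
    return re.sub(r"<[^>]*>", "", value)


# Two hypothetical field values; nothing in the record says which is which.
plain_field = "Values where x < 3 and y > 1"
html_field = "A <i>very</i> short summary"

# Wrong guess 1: the field was HTML, but we escaped it, so the user sees tags.
print(render_assuming_plain_text(html_field))

# Wrong guess 2: the field was plain text, but we stripped "tags" from it,
# so "< 3 and y >" is silently thrown away.
print(to_text_assuming_html(plain_field))

# Wrong guess 3: the field was plain text, but we emitted it as HTML,
# so the browser receives stray < and > characters as markup.
print(render_assuming_html(plain_field))
```

Whichever function the consumer picks, it is right for one of the two 
values and wrong for the other, and the record gives it nothing to 
choose by.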

Jonathan

Ere Maijala wrote:
> Jonathan Rochkind wrote:
>   
>> Ere Maijala wrote:
>>     
>>> That shouldn't be a problem as any sane OAI-PMH provider, unAPI or ATOM
>>> serializer would escape the contents. Things that resemble HTML tags
>>> could be present in MARC records without any HTML-in-MARC too.
>>>
>>>       
>> Sure, and then, if you have html tags in your marc, that system doing
>> the re-use is going to present content to users with escaped HTML in it,
>> which isn't desirable either!
>>     
>
> How the content is stored in the transport format is separate from how
> it is used. Whatever the re-using system does is not related to how the
> data was transferred to it. If it extracts the stuff from the XML, it
> will of course unescape the content, but what happens after that is up
> to the system and unrelated to the transport mechanism. So here is an
> example of the whole process:
>
> MARC with embedded HTML
> ->
> OAI-PMH provider escapes the MARC in some XML format
> ->
> OAI-PMH harvester (the re-using system) unescapes the data from the XML
> format
> ->
> Something is done with the data
>
> It's the same as if the source system stores the data internally in
> MARCXML. The content must be escaped so that it can be stored in MARCXML
> and doesn't mess up the markup, but when the system uses the data e.g. for
> display, it's first retrieved from XML and unescaped, and massaged to
> the desired display format only after that. If you use DOM to do the XML
> manipulation, all this will happen automatically. You just write and
> read strings and DOM manipulation takes care of escaping and unescaping.
>
> You could substitute XML with e.g. Base64 encoding if it makes thinking
> about this stuff easier. For instance email clients often send binary
> files in Base64, but it doesn't mean the file is ruined, as the
> receiving email client can decode it back to the original binary.
>
> --Ere
>
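
The escape/unescape round trip described in the quoted message can be 
sketched roughly as below, using Python's xml.dom.minidom as a stand-in 
for "DOM". The element name "subfield", the attribute "code", and the 
sample value are assumptions for illustration, not real MARCXML output 
from any particular provider.

```python
from xml.dom import minidom


def round_trip(value: str) -> tuple[str, str]:
    """Write a string into an XML text node, serialize, parse, read it back."""
    # Provider side: the DOM escapes < and > when it serializes the text node.
    doc = minidom.Document()
    subfield = doc.createElement("subfield")  # hypothetical element name
    subfield.setAttribute("code", "a")
    subfield.appendChild(doc.createTextNode(value))
    doc.appendChild(subfield)
    serialized = doc.toxml()

    # Harvester side: parsing unescapes the text node again.
    parsed = minidom.parseString(serialized)
    recovered = parsed.documentElement.firstChild.data
    return serialized, recovered


original = "A <i>very</i> short summary"
serialized, recovered = round_trip(original)
print(serialized)             # the markup characters appear as &lt; and &gt;
print(recovered == original)  # the harvester gets the original string back
```

The harvester ends up with exactly the string the provider started with. 
What the sketch does not give it is any indication of whether that 
string is meant as HTML or as plain text.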