> Jonathan Rochkind
> Sent: Tuesday, April 17, 2012 14:18
> Subject: Re: [CODE4LIB] MarcXML and char encodings
> 
> Okay, maybe here's another way to approach the question.
> 
> If I want to have a MarcXML document encoded in Marc8 -- what should it
> look like?  What should be in the XML declaration? What should be in
> the MARC header embedded in the XML?  Or is it not in fact legal at all?
> 
> If I want to have a MarcXML document encoded in UTF8, what should it
> look like? What should be in the XML declaration? What should be in the
> MARC header embedded in the XML?
> 
> If I want to have a MarcXML document with a char encoding that is
> _neither_ Marc8 nor UTF8, but something else generally legal for XML --
> is this legal at all? And if so, what should it look like? What should
> be in the XML declaration? What should be in the MARC header embedded
> in the XML?

Strictly speaking, you cannot have a MARC-XML document encoded in MARC-8. You can sort of do it, but the result is not standard. To answer your questions you have to refer to a variety of standards, starting with the XML specification:

<http://www.w3.org/TR/2008/REC-xml-20081126/#NT-EncodingDecl>
In an encoding declaration, the values "UTF-8", "UTF-16", "ISO-10646-UCS-2", and "ISO-10646-UCS-4" should be used for the various encodings and transformations of Unicode / ISO/IEC 10646, the values "ISO-8859-1", "ISO-8859-2", ... "ISO-8859-n" (where n is the part number) should be used for the parts of ISO 8859, and the values "ISO-2022-JP", "Shift_JIS", and "EUC-JP" should be used for the various encoded forms of JIS X-0208-1997. It is recommended that character encodings registered (as charsets) with the Internet Assigned Numbers Authority [IANA-CHARSETS], other than those just listed, be referred to using their registered names; other encodings should use names starting with an "x-" prefix. XML processors should match character encoding names in a case-insensitive way and should either interpret an IANA-registered name as the encoding registered at IANA for that name or treat it as unknown (processors are, of course, not required to support all IANA-registered encodings).

In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8. Note that since ASCII is a subset of UTF-8, ordinary ASCII entities do not strictly need an encoding declaration.
----

1) The above says that <?xml version="1.0" ?> means the same as <?xml version="1.0" encoding="utf-8" ?>. If you prefer, you can omit the XML declaration entirely, and the document is then assumed to be UTF-8 unless it starts with a BOM (Byte Order Mark), which distinguishes UTF-8 from UTF-16BE and UTF-16LE.
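To make point 1 concrete, here is a small Python sketch using the standard library's xml.etree.ElementTree. The element name and sample text are made up for illustration, and this only shows how one particular parser (expat, via Python) behaves; it is not a statement about every XML library.

```python
import xml.etree.ElementTree as ET

text = "Dvo\u0159\u00e1k"          # sample non-ASCII text (hypothetical)
body = f"<name>{text}</name>"      # hypothetical element name

# No XML declaration at all: UTF-8 is assumed.
no_decl = body.encode("utf-8")
# Declaration without an encoding: still UTF-8.
bare_decl = ('<?xml version="1.0"?>' + body).encode("utf-8")
# Declaration that names UTF-8 explicitly.
utf8_decl = ('<?xml version="1.0" encoding="utf-8"?>' + body).encode("utf-8")
# UTF-16: Python's "utf-16" codec prepends a BOM, which the parser detects.
utf16_bom = ('<?xml version="1.0" encoding="UTF-16"?>' + body).encode("utf-16")

# All four byte strings should parse to the same Unicode text.
parsed = [ET.fromstring(doc).text for doc in (no_decl, bare_decl, utf8_decl, utf16_bom)]
```

The bytes on disk differ, but once parsed the application sees the same Unicode string in every case.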

2) If you really wanted to encode the XML in MARC-8, you would need to give the encoding a name prefixed with "x-". MARC-8 isn't a registered character set (see <http://www.iana.org/assignments/character-sets>), so it cannot be named in the encoding declaration without that prefix. This implies that no standard XML library will know how to convert the MARC-8 characters into Unicode so that the XML DOM can be used. So unless you want to write your own MARC-8 <=> Unicode conversion routines and integrate them into your preferred XML library, it isn't going to work out of the box for anyone but yourself.
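As a rough illustration of point 2, the sketch below feeds a document that declares a made-up encoding name, "x-marc-8", to Python's standard XML parser. The name is only an assumption for the example (nothing registers or defines it), and the exact exception raised may vary between parsers and versions, so both plausible ones are caught.

```python
import xml.etree.ElementTree as ET

def parses(data: bytes) -> bool:
    """Return True if the standard parser accepts the document."""
    try:
        ET.fromstring(data)
        return True
    except (ET.ParseError, LookupError):
        # The parser has no decoder for the declared encoding.
        return False

# "x-marc-8" is a hypothetical private encoding name, not a registered one.
marc8_ok = parses(b'<?xml version="1.0" encoding="x-marc-8"?><record/>')
utf8_ok = parses(b'<?xml version="1.0" encoding="utf-8"?><record/>')
```

Here the UTF-8 document parses and the "x-marc-8" one is rejected before any content is read, which is what "not going to work out of the box" looks like in practice.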

When dealing with MARC-XML you should ignore the values in LDR/00-04, LDR/10, LDR/11, LDR/12-16, and LDR/20-23. If you look at the MARC-XML schema, you will note that the definition of leaderDataType specifies "[\d ]{5}" for LDR/00-04, "(2| )" for each of LDR/10 and LDR/11, "[\d ]{5}" for LDR/12-16, and "4500" or four spaces for LDR/20-23. The schema allows spaces in those positions because they are not relevant in the XML format, though they are very relevant in the binary format.
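A minimal sketch of what that means for a converter, in Python. It checks only the leader positions named above, using the patterns as quoted, and assumes the blank alternative for LDR/20-23 is four spaces. The function names and the sample leader are made up for illustration; this is not the full leaderDataType pattern.

```python
import re

# (start, end) are Python slice bounds for each leader position range.
# Assumption: the blank alternative for LDR/20-23 is four spaces.
BINARY_ONLY = {
    (0, 5): r"[\d ]{5}",       # LDR/00-04
    (10, 11): r"(2| )",        # LDR/10
    (11, 12): r"(2| )",        # LDR/11
    (12, 17): r"[\d ]{5}",     # LDR/12-16
    (20, 24): r"(4500| {4})",  # LDR/20-23
}

def binary_only_positions_ok(leader: str) -> bool:
    """Check the binary-only leader positions against the quoted patterns."""
    if len(leader) != 24:
        return False
    return all(re.fullmatch(p, leader[s:e]) for (s, e), p in BINARY_ONLY.items())

def blank_binary_only_positions(leader: str) -> str:
    """Replace the binary-only positions with spaces, leaving the rest alone."""
    chars = list(leader)
    for start, end in BINARY_ONLY:
        chars[start:end] = " " * (end - start)
    return "".join(chars)

sample = "00714cam a2200205 a 4500"  # made-up leader for illustration
blanked = blank_binary_only_positions(sample)
```

Both the original leader and the blanked one pass the check, which is the point: in MARC-XML those positions carry no information a reader should rely on.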

You should probably ignore LDR/09 as well, since most MARC to MARC-XML converters do not change this value to 'a', although many converters do change it when converting binary MARC between MARC-8 and UTF-8. The only valid character set for MARC-XML is Unicode, and it *should* be encoded in UTF-8 in Unicode normalization form D (NFD). That said, most XML libraries will not notice the difference if the document is instead encoded as UTF-16BE or UTF-16LE (still in NFD), since XML libraries work with Unicode internally.
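For what it's worth, here is a small Python sketch of the NFD point. The element name and title are made up, and it only demonstrates the behaviour of Python's standard library: text is normalized to NFD before serializing, and the parsed result is the same whether the bytes were UTF-8 or UTF-16.

```python
import unicodedata
import xml.etree.ElementTree as ET

title = "Caf\u00e9"                        # precomposed e-acute (NFC form)
nfd = unicodedata.normalize("NFD", title)  # "Cafe" followed by U+0301

# Hypothetical element name; a real MARC-XML record has more structure.
doc = f"<subfield>{nfd}</subfield>"

utf8_bytes = doc.encode("utf-8")    # the encoding MARC-XML should use
utf16_bytes = doc.encode("utf-16")  # BOM-prefixed UTF-16, for comparison

from_utf8 = ET.fromstring(utf8_bytes).text
from_utf16 = ET.fromstring(utf16_bytes).text

# The parser does not renormalize, so the text is still in NFD either way.
still_nfd = unicodedata.normalize("NFD", from_utf8) == from_utf8
```

Note that the parser hands back whatever normalization form it was given, so a producer has to normalize to NFD itself before writing the record.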

I could have sworn that this information was specified on LC's site at one point in time, but I'm having trouble finding the documentation.


Hope this helps, Andy.