There are probably a couple of answers to that.
The XML rules define which character set is in use. The "encoding"
declaration in the <?xml ...?> header is where you find out which
character set a given document claims to use.
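
As a rough illustration of reading that declaration, here is a minimal Python sketch. The function name declared_encoding is hypothetical, and the sketch assumes the document is in an ASCII-compatible encoding (it does not handle UTF-16), so treat it as a starting point rather than a full detector.

```python
import re

# Matches the encoding declaration in an XML header, e.g.
# <?xml version="1.0" encoding="ISO-8859-1"?>
_XML_DECL = re.compile(
    rb'^<\?xml[^>]*?\sencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def declared_encoding(raw: bytes):
    """Return the encoding named in the XML declaration, or None if absent."""
    # Skip a UTF-8 byte order mark if one is present.
    if raw.startswith(b"\xef\xbb\xbf"):
        raw = raw[3:]
    match = _XML_DECL.match(raw)
    return match.group(1).decode("ascii") if match else None
```

A result of None means the document carries no encoding declaration, so the caller has to fall back on a default.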
I've always gone under the assumption that if an encoding wasn't
specified, then UTF-8 is in effect and that has always worked for me.
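As far as I can tell, the XML standard itself says the same: without an
encoding declaration or byte order mark, a document is taken to be UTF-8.
US-ASCII is the default only for XML served as text/xml with no charset
parameter (RFC 3023).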
But, ignoring the encoding, the original MarcXML rules were the same as
the MARC-21 rules for character repertoire, and you were supposed to
restrict yourself to characters that could be mapped back into MARC-8.
I don't know if that rule is still in force, but everyone ignores it.
I hope that helps!
Ralph
-----Original Message-----
From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
Jonathan Rochkind
Sent: Tuesday, April 17, 2012 12:35 PM
To: [log in to unmask]
Subject: MarcXML and char encodings
I know how char encodings work in MARC ISO binary -- the encoding can
legally be either Marc8 or UTF8 (nothing else). The encoding of a
record is specified in its header. In the wild, specified encodings are
frequently wrong, or data includes weird mixed encodings. Okay!
But what's going on with MarcXML? What are the legal encodings for
MarcXML? Only Marc8 and UTF8, or anything that can be expressed in
XML? The MARC header is (or can be) present in MarcXML -- trust the
MARC header, or trust the XML doctype char encoding?
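
To make the two competing signals concrete, here is a hedged Python sketch that pulls the coding scheme out of the MARC leader in a MarcXML record. It assumes the standard MARC21 slim namespace and relies on leader position 09 ('a' for UCS/Unicode, blank for MARC-8). The function name leader_coding_scheme is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by MARC21 slim (MarcXML) records.
MARC_NS = "{http://www.loc.gov/MARC21/slim}"

def leader_coding_scheme(marcxml: bytes):
    """Return what leader position 09 claims, or None if there is no usable leader."""
    root = ET.fromstring(marcxml)
    # The root may be <record> or <collection>; take the first leader found.
    leader = next(root.iter(MARC_NS + "leader"), None)
    if leader is None or leader.text is None or len(leader.text) < 10:
        return None
    # Leader/09: 'a' means UCS/Unicode, blank means MARC-8.
    return {"a": "UCS/Unicode", " ": "MARC-8"}.get(leader.text[9], "unknown")
```

Comparing this value with the encoding the XML parser actually used shows whether the two signals agree for a given record. It does not say which one to trust.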
What's the legal thing to do? What's actually found 'in the wild' with
MarcXML?
Can anyone advise?
Jonathan