Hi Ralph,
> But, ignoring the encoding, the original MarcXML rules were the same as
> the MARC-21 rules for character repertoire and you were supposed to
> restrict yourself to characters that could be mapped back into MARC-8.
> I don't know if that rule is still in force, but everyone ignores it.
That rule no longer applies per the December 2007 revision of the MARC 21 Specifications:
"To facilitate the movement of records between MARC-8
and Unicode environments, it was recommended for an
initial period that the use of Unicode be restricted
to a repertoire identical in extent to the MARC-8
repertoire. [...] however, such a restriction is no
longer appropriate. The full UCS repertoire, as currently
defined at the Unicode web site, is valid for encoding
MARC 21 records subject only to the constraints described
[in the current MARC 21 Specifications]."
-- from MARC 21 Specifications (revised December 2007) [1]
-- Michael
[1] http://www.loc.gov/marc/specifications/speccharucs.html
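
For anyone who wants to check what a given MarcXML file claims about itself, here is a rough Python sketch, not a definitive implementation. It reads the two places an encoding can be declared: the XML declaration, and Leader position 09 (where "a" means UCS/Unicode and blank means MARC-8). The function names are my own invention for illustration. It assumes an ASCII-compatible byte stream and falls back to UTF-8 when no encoding is declared, which is how XML 1.0 treats a document with no declaration and no UTF-16 byte-order mark.

```python
import re
import xml.etree.ElementTree as ET

# Namespace used by MARCXML ("MARC 21 slim") records.
MARC_NS = "http://www.loc.gov/MARC21/slim"


def xml_declared_encoding(data: bytes) -> str:
    """Return the encoding named in the XML declaration, or UTF-8 if absent."""
    if data.startswith(b"\xef\xbb\xbf"):  # skip a UTF-8 byte-order mark
        data = data[3:]
    m = re.match(
        rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']',
        data,
    )
    return m.group(1).decode("ascii") if m else "UTF-8"


def leader_coding_scheme(data: bytes) -> str:
    """Report what Leader/09 of the first record says about the encoding."""
    root = ET.fromstring(data)
    leader = root.find(f".//{{{MARC_NS}}}leader")
    if leader is None or leader.text is None or len(leader.text) < 10:
        return "unknown"
    code = leader.text[9]
    if code == "a":
        return "UCS/Unicode"
    if code == " ":
        return "MARC-8"
    return "unknown"
```

If the two disagree (say, a UTF-8 XML declaration with a blank Leader/09), the XML parser will still decode the bytes according to the XML declaration, so the Leader value is best treated as a hint about where the record came from.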
> -----Original Message-----
> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
> LeVan,Ralph
> Sent: Tuesday, April 17, 2012 12:51 PM
> To: [log in to unmask]
> Subject: Re: [CODE4LIB] MarcXML and char encodings
>
> There are probably a couple of answers to that.
>
> XML rules define what character set is used. The "encoding" attribute on
> the <?xml?> header is where you find out what character set is being
> used.
>
> I've always gone under the assumption that if an encoding wasn't
> specified, then UTF-8 is in effect and that has always worked for me.
> It turns out the standard says US-ASCII is the default encoding.
>
> But, ignoring the encoding, the original MarcXML rules were the same as
> the MARC-21 rules for character repertoire and you were supposed to
> restrict yourself to characters that could be mapped back into MARC-8.
> I don't know if that rule is still in force, but everyone ignores it.
>
> I hope that helps!
>
> Ralph
>
> -----Original Message-----
> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
> Jonathan Rochkind
> Sent: Tuesday, April 17, 2012 12:35 PM
> To: [log in to unmask]
> Subject: MarcXML and char encodings
>
> I know how char encodings work in MARC ISO binary -- the encoding can
> legally be either Marc8 or UTF8 (nothing else). The encoding of a
> record is specified in its header. In the wild, specified encodings are
> frequently wrong, or data includes weird mixed encodings. Okay!
>
> But what's going on with MarcXML? What are the legal encodings for
> MarcXML? Only Marc8 and UTF8, or anything that can be expressed in
> XML? The MARC header is (or can be) present in MarcXML -- trust the
> MARC header, or trust the char encoding in the XML declaration?
>
> What's the legal thing to do? What's actually found 'in the wild' with
> MarcXML?
>
> Can anyone advise?
>
> Jonathan