Hi Ralph,

> But, ignoring the encoding, the original MarcXML rules were the same as
> the MARC-21 rules for character repertoire and you were supposed to
> restrict yourself to characters that could be mapped back into MARC-8.
> I don't know if that rule is still in force, but everyone ignores it.

That rule no longer applies per the December 2007 revision of the MARC 21
Specifications:

"To facilitate the movement of records between MARC-8 and Unicode
environments, it was recommended for an initial period that the use of
Unicode be restricted to a repertoire identical in extent to the MARC-8
repertoire. [...] however, such a restriction is no longer appropriate.
The full UCS repertoire, as currently defined at the Unicode web site, is
valid for encoding MARC 21 records subject only to the constraints
described [in the current MARC 21 Specifications]."
-- from MARC 21 Specifications (revised December 2007) [1]

-- Michael

[1] http://www.loc.gov/marc/specifications/speccharucs.html

> -----Original Message-----
> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
> LeVan,Ralph
> Sent: Tuesday, April 17, 2012 12:51 PM
> To: [log in to unmask]
> Subject: Re: [CODE4LIB] MarcXML and char encodings
>
> There are probably a couple of answers to that.
>
> XML rules define what character set is used. The "encoding" attribute on
> the <?xml?> header is where you find out what character set is being
> used.
>
> I've always gone under the assumption that if an encoding wasn't
> specified, then UTF-8 is in effect and that has always worked for me.
> It turns out the standard says US-ASCII is the default encoding.
>
> But, ignoring the encoding, the original MarcXML rules were the same as
> the MARC-21 rules for character repertoire and you were supposed to
> restrict yourself to characters that could be mapped back into MARC-8.
> I don't know if that rule is still in force, but everyone ignores it.
>
> I hope that helps!
>
> Ralph
>
> -----Original Message-----
> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
> Jonathan Rochkind
> Sent: Tuesday, April 17, 2012 12:35 PM
> To: [log in to unmask]
> Subject: MarcXML and char encodings
>
> I know how char encodings work in MARC ISO binary -- the encoding can
> legally be either Marc8 or UTF8 (nothing else). The encoding of a
> record is specified in its header. In the wild, specified encodings are
> frequently wrong, or data includes weird mixed encodings. Okay!
>
> But what's going on with MarcXML? What are the legal encodings for
> MarcXML? Only Marc8 and UTF8, or anything that can be expressed in
> XML? The MARC header is (or can be) present in MarcXML -- trust the
> MARC header, or trust the XML doctype char encoding?
>
> What's the legal thing to do? What's actually found 'in the wild' with
> MarcXML?
>
> Can anyone advise?
>
> Jonathan
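
For anyone who wants to compare the two encoding signals discussed in this thread (the XML declaration and the MARC leader), here is a rough Python sketch. It is only an illustration, not part of any MARC library: the function names are made up for this example. It assumes that, absent a byte order mark, an encoding declaration, or transport-level information such as an HTTP charset, XML 1.0 expects an entity to be UTF-8, and that MARC 21 Leader/09 is "a" for UCS/Unicode and blank for MARC-8. UTF-32 and other non-ASCII-compatible encodings are not handled.

```python
import codecs
import re

# Matches an XML declaration and captures the value of its encoding attribute.
_XML_DECL = re.compile(
    rb'^<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_xml_encoding(data: bytes) -> str:
    """Return the encoding a MarcXML byte string claims for itself.

    Hypothetical helper: looks at the byte order mark first, then at the
    XML declaration, and falls back to UTF-8 when neither is present.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    match = _XML_DECL.match(data)
    if match:
        return match.group(1).decode("ascii")
    return "utf-8"


def leader_claims_unicode(leader: str) -> bool:
    """Return True if Leader/09 (the character coding scheme) is 'a'."""
    return len(leader) > 9 and leader[9] == "a"
```

When the two signals disagree (say, a Latin-1 declaration with Leader/09 set to "a"), the XML declaration is what an XML parser will actually use to decode the bytes, so the leader value is probably best treated as a hint about the record's origin rather than as the encoding of the file.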