Hi Sophie,
> To better understand the character encoding issue, can anybody
> point me to some resources, or a list of UTF8-encoded data that is
> not in the MARC8 character set?
That question doesn't lend itself to an easy answer. The full MARC-8 repertoire (when you include all of the alternate character sets) has over 16,000 characters, while the latest version of Unicode has a repertoire of more than 110,000 characters. So a list of UTF-8 encoded data not in the MARC-8 character set would be a pretty long list.
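That said, a practical first step is simply to find out which non-ASCII characters actually occur in your own data and then look those up in the MARC-8 code tables. Here is a rough sketch of that idea in Python (offered only as an illustration; point it at any UTF-8 text file, such as a MarcEdit mnemonic .mrk file):

    import sys
    import unicodedata

    # List every distinct non-ASCII character in a UTF-8 file, with its
    # code point and Unicode name, so each one can be checked by hand
    # against the MARC-8 code tables.
    seen = set()
    with open(sys.argv[1], encoding="utf-8") as f:
        for ch in f.read():
            if ord(ch) > 127 and ch not in seen:
                seen.add(ch)
                print(f"U+{ord(ch):04X}  {ch}  {unicodedata.name(ch, 'UNKNOWN')}")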
For a more *general* understanding of character encoding issues, I would recommend the following resources:
For a quick library-centric overview, see the "Coded Character Sets: A Technical Primer for Librarians" web pages [1], which include a "Resources on the Web" page with an emphasis on library automation and the Internet environment [2].
For a good explanation of how character sets work in relational databases (as part of the broader topic of globalization/i18n), see the Oracle "Globalization Support Guide" [3].
For all the ins and outs of Unicode, see the book "Unicode Explained" by Jukka Korpela [4].
-- Michael
[1] http://rocky.uta.edu/doran/charsets/
[2] http://rocky.uta.edu/doran/charsets/resources.html
[3] http://docs.oracle.com/cd/B19306_01/server.102/b14225/toc.htm
[4] http://www.amazon.com/gp/product/059610121X/
# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# [log in to unmask]
# http://rocky.uta.edu/doran/
> -----Original Message-----
> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
> Deng, Sai
> Sent: Friday, April 20, 2012 8:55 AM
> To: [log in to unmask]
> Subject: Re: [CODE4LIB] more on MARC char encoding
>
> If a canned cleaner can be added in MarcEdit to deal with "smart
> quotes/values," that would be great! Besides the smart quotes, please
> consider other special characters, including chemistry and mathematics
> symbols (these are different types of special characters, right?). To
> better understand the character encoding issue, can anybody point me to
> some resources, or a list of UTF8-encoded data that is not in the MARC8
> character set? Thanks a lot.
> Sophie
>
> -----Original Message-----
> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
> Jonathan Rochkind
> Sent: Thursday, April 19, 2012 2:14 PM
> To: [log in to unmask]
> Subject: Re: [CODE4LIB] more on MARC char encoding
>
> Ah, thanks Terry.
>
> That canned cleaner in MarcEdit sounds potentially useful -- I'm in a
> continuing battle to keep the character encoding in our local MARC corpus
> clean.
>
> (The real blame here is on cataloger interfaces that let catalogers save
> data containing bytes that are illegal for the character set it's being
> saved as, and/or that display the data back to the cataloger using a
> translation that makes them show up as expected even though they are
> _wrong_ for that character set. Connexion is theoretically the Rolls-Royce
> of cataloger interfaces; does it do this? Gosh I hope not.)
>
> On 4/19/2012 2:20 PM, Reese, Terry wrote:
> > Actually -- the issue isn't one of MARC8 versus UTF8 (since this data
> > is being harvested from DSpace and is UTF8 encoded). It's actually an
> > issue with user-entered data -- specifically, smart quotes and the like.
> > These values obviously are not in the MARC8 character set, and they cause
> > problems for many who transform user-entered data from XML to MARC (smart
> > quotes tend to be inserted by default on Windows). If you are sticking
> > with a strictly UTF8-based system, there generally are no issues because
> > these are valid characters. If you move them into a system where the data
> > needs to be represented in MARC -- then you have more problems.
> >
> > We do a lot of harvesting, and because of that, we run into these types
> > of issues when moving data that is in UTF8, but has characters not
> > represented in MARC8, into Connexion and having some of that data
> > flattened. Given the wide range of data not in the MARC8 set that can
> > show up in UTF8, it's not a surprise that this would happen. My guess is
> > that you could add a template to your XSLT translation that attempts to
> > filter the most common forms of these "smart quotes/values" and replace
> > them with more standard values. Likewise, if there were a great enough
> > need, I could provide a canned cleaner in MarcEdit that could fix many of
> > the most common varieties of these "smart quotes/values".
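> >
> > To give a sense of what that filtering involves, here is a rough sketch
> > of the kind of replacement table I have in mind (Python is used here
> > purely as illustration -- the same mapping could live in an XSLT
> > template or in MarcEdit itself):
> >
> >     # Map common Windows "smart" punctuation to plain equivalents that
> >     # are in the MARC8 set. The table is illustrative, not exhaustive.
> >     SMART_MAP = {
> >         "\u2018": "'",    # left single quotation mark
> >         "\u2019": "'",    # right single quotation mark
> >         "\u201C": '"',    # left double quotation mark
> >         "\u201D": '"',    # right double quotation mark
> >         "\u2013": "-",    # en dash
> >         "\u2014": "--",   # em dash
> >         "\u2026": "...",  # horizontal ellipsis
> >     }
> >
> >     def flatten_smart_chars(text: str) -> str:
> >         """Replace smart punctuation with its plain equivalent."""
> >         return text.translate(str.maketrans(SMART_MAP))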
> >
> > --TR
> >
> > -----Original Message-----
> > From: Code for Libraries [mailto:[log in to unmask]] On Behalf
> > Of Jonathan Rochkind
> > Sent: Thursday, April 19, 2012 11:13 AM
> > To: [log in to unmask]
> > Subject: Re: [CODE4LIB] more on MARC char encoding
> >
> > If your records are really in MARC8 not UTF8, your best bet is to use a
> tool to convert them to UTF8 before hitting your XSLT.
> >
> > The open source 'yaz' command line tools can do it for Marc21.
> >
> > The Marc4J package can do it in Java, and probably works for any MARC
> > variant, not just Marc21.
> >
> > Char encoding issues are tricky. You might want to first figure out
> > whether your records really are in Marc8 (hence the problems), or whether
> > they illegally contain bad data or data in some other encoding (e.g.
> > Latin1).
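> >
> > A crude first check is just to see whether the raw bytes decode cleanly
> > as UTF-8 at all; if they don't, you're looking at Marc8, Latin1, or
> > outright corruption rather than real UTF-8. A quick sketch in Python
> > (the filename is only a placeholder):
> >
> >     # Try to decode the raw record bytes as UTF-8 and report the first
> >     # offending byte if that fails.
> >     with open("records.mrc", "rb") as f:
> >         data = f.read()
> >     try:
> >         data.decode("utf-8")
> >         print("decodes cleanly as UTF-8")
> >     except UnicodeDecodeError as e:
> >         print(f"not UTF-8 at byte {e.start}: {data[e.start:e.start + 8]!r}")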
> >
> > Char encoding is a tricky topic; you might want to do some reading on
> > it in general. The Unicode docs are pretty decent.
> >
> > On 4/19/2012 11:06 AM, Deng, Sai wrote:
> >> Hi list,
> >> I am a metadata librarian but not a programmer, so I'm sorry if my
> >> question seems naïve. We use an XSLT stylesheet in MarcEdit to transform
> >> some harvested DC records from DSpace to MARC, and then export them to
> >> OCLC.
> >> Some characters do not display correctly and need manual editing, for
> >> example:
> >> In MarcEditor | Transferred to OCLC | Edit in OCLC
> >> Bayes’ theorem | Bayes⁰́₉ theorem | Bayes' theorem
> >> ―it won‘t happen here‖ attitude | ⁰́₅it won⁰́₈t happen here⁰́₆ attitude | "it won't happen here" attitude
> >> “Generation Y” | ⁰́₋Generation Y⁰́₊ | "Generation Y"
> >> listeners‟ evaluations | listeners⁰́Ÿ evaluations | listeners' evaluations
> >> high school – from | high school ⁰́₃ from | high school – from
> >> Co₀․₅Zn₀․₅Fe₂O₄ | Co²́⁰⁰́Þ²́⁵Zn²́⁰⁰́Þ²́⁵Fe²́²O²́⁴ | Co0.5Zn0.5Fe2O4?
> >> μ | Îơ | μ
> >> Nafion® | Nafion℗ʼ | Nafion®
> >> Lévy | L©♭vy | Lévy
> >> 43±13.20 years | 43℗ł13.20 years | 43±13.20 years
> >> 12.6 ± 7.05 ft∙lbs | 12.6 ℗ł 7.05 ft⁸́₉lbs | 12.6 ± 7.05 ft•lbs
> >> ‘Pouring on the Pounds' | ⁰́₈Pouring on the Pounds' | 'Pouring on the Pounds'
> >> k-ε turbulence | k-Îæ turbulence | k-ε turbulence
> >> student—neither parents | student⁰́₄neither parents | student-neither parents
> >> Λ = M – {p1, p2,…,pκ} | Î₎ = M ⁰́₃ {p1, p2,⁰́Œ,pÎð} | ? (won’t save)
> >> M = (0, δ)x × Y | M = (0, Îþ)x ©₇ Y | ?
> >> 100° | 100℗ð | 100⁰
> >> (α ≥16º) | (Îł ⁹́Æ16℗ð) | (α>=16⁰)
> >> naïve | na©¯ve | naïve
> >>
> >> To deal with this, we normally replace a limited number of characters
> >> in MarcEditor first and then do the compiling and transfer. For example:
> >> replace ’ with ', “ with ", ” with ", and ‟ with '. I am not sure this is
> >> the right or most efficient way to solve the problem. I see that the XSLT
> >> stylesheet specifies encoding="UTF-8". Is there a systematic way to make
> >> the characters transform and display correctly? Thank you for your
> >> suggestions and feedback!
> >>
> >> Sophie
> >>
> >> -----Original Message-----
> >> From: Code for Libraries [mailto:[log in to unmask]] On Behalf
> >> Of Tod Olson
> >> Sent: Tuesday, April 17, 2012 10:13 PM
> >> To: [log in to unmask]
> >> Subject: Re: [CODE4LIB] more on MARC char encoding: Now we're about
> >> ISO_2709 and MARC21
> >>
> >> In practice it seems to mean UTF-8. At least I've only seen UTF-8, and
> I can't imagine the code that processes this stuff being safe for UTF-16
> or UTF-32. All of the offsets are byte-oriented, and there's too much
> legacy code that makes assumptions about null-terminated strings.
> >>
> >> -Tod
> >>
> >> On Apr 17, 2012, at 6:55 PM, Jonathan Rochkind wrote:
> >>
> >>> Okay, forget XML for a moment; let's just look at MARC 'binary'.
> >>>
> >>> First, for Anglophone-centric MARC21.
> >>>
> >>> The LC docs don't actually say quite what I thought about leader byte
> 09, used to advertise encoding:
> >>>
> >>>
> >>> a - UCS/Unicode
> >>> Character coding in the record makes use of characters from the
> Universal Coded Character Set (UCS) (ISO 10646), or Unicode™, an industry
> subset.
> >>>
> >>>
> >>>
> >>> That doesn't say UTF-8. It says UCS or "Unicode". What does that
> >>> actually mean? Does it mean UTF-8, or does it mean UTF-16 (closer to
> >>> what used to be called "UCS", I think)? Whatever it actually means, do
> >>> people violate it in the wild?
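> >>>
> >>> (For what it's worth, leader position 09 is easy enough to peek at
> >>> directly. A quick sketch in Python, assuming a file of raw binary MARC
> >>> records whose name is just a placeholder:
> >>>
> >>>     # Read the 24-byte leader of the first record and check position
> >>>     # 09: blank advertises MARC-8, 'a' advertises UCS/Unicode.
> >>>     with open("records.mrc", "rb") as f:
> >>>         leader = f.read(24)
> >>>     flag = chr(leader[9])
> >>>     print({" ": "MARC-8", "a": "UCS/Unicode"}.get(flag, f"unexpected: {flag!r}"))
> >>>
> >>> That at least tells you what the record claims, not whether the claim
> >>> is honest.)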
> >>>
> >>>
> >>>
> >>> Now we get to non-Anglophone-centric MARC, all of which I think is
> >>> ISO 2709? A standard which, of course, is not open access, so I can't
> >>> get a copy to see what it says.
> >>>
> >>> But leader 09 being used for encoding -- is that Marc21 specific, or
> is it true of any ISO-2709? Marc8 and "unicode" being the only valid
> encodings can't be true of any ISO-2709, right?
> >>>
> >>> Is there a generic ISO-2709 way to deal with this, or not so much?
> >>
> >