Eric and Mike wrote:

> > Maybe I should draw search results from MyLibrary and not swish-e to
> > display characters correctly? If I draw content from many global
> > sources, then how do I know what character set to use for display?
> >
>
> This is definitely the best thing to do.  Search the
> normalized data and display the original.  Also, if you
> store the documents UTF-8 encoded you won't need to worry
> about the character set, you just need to set the encoding
> for the page to UTF-8 and the browser will take care of the rest.
>

Bear in mind that even in UTF-8 there is more than one way to encode an
accented character. It can be precomposed (using a single character,
e.g. U+00E9 for lower-case e-acute: this is normalization form C) or
decomposed (using a base character and a non-spacing diacritic, e.g.
U+0065 and U+0301, lower-case e plus the acute accent: this is
normalization form D). If you're searching at the byte level, you have
to be sure that your index and your search term have been normalized the
same way or they won't match. I've found this FAQ useful for this stuff:
http://pipin.tmd.ns.ac.yu/unicode/unicode-faq.html. In a Java context,
we've used ICU4J (http://icu.sourceforge.net) to normalize stuff
(including stripping accents and normalizing case for different scripts)
for indexing and searching in UTF-8. There's also a C API, which could
presumably be incorporated into a Perl process, but no doubt there are
similar native Perl tools.
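
To make the byte-level mismatch concrete, here is a minimal sketch in
Python. Python is not one of the tools mentioned above; its standard
unicodedata module is used purely for illustration. The search_key
helper is a hypothetical name, and stripping accents by dropping
combining marks is a simplification that will not suit every script.

```python
import unicodedata

precomposed = "\u00e9"   # e-acute as a single code point (form C)
decomposed = "e\u0301"   # e followed by a combining acute accent (form D)

# The two strings render the same but have different UTF-8 bytes,
# so a byte-level comparison fails.
print(precomposed.encode("utf-8"))   # b'\xc3\xa9'
print(decomposed.encode("utf-8"))    # b'e\xcc\x81'
print(precomposed == decomposed)     # False


def search_key(text):
    """Hypothetical helper: build one comparison key for index and query."""
    # Decompose so accents become separate combining marks.
    parts = unicodedata.normalize("NFD", text)
    # Drop the combining marks (an assumption: accents are not significant).
    stripped = "".join(ch for ch in parts if not unicodedata.combining(ch))
    # Recompose what is left and fold case.
    return unicodedata.normalize("NFC", stripped).casefold()
```

If the same function is applied to both the indexed text and the search
term, search_key("Caf\u00e9") and search_key("CAFE\u0301") produce the
same key, so the two spellings match.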

In general I think we've got to include i18n from the beginning: pay
attention to character sets of incoming data, normalize as early in the
process as possible (especially if ANSEL is involved!), use
UTF-8-compliant tools, and be consistent. Deliver UTF-8 to the browser
(this site helps with the HTML:
http://ppewww.ph.gla.ac.uk/~flavell/charset/quick.html). This is still
not as easy as it ought to be but at least there are good open-source
tools out there.
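
As a rough sketch of that pipeline in Python (again only for
illustration): decode incoming bytes using their declared character set,
normalize once, and hand UTF-8 to the browser with a matching header.
The to_utf8_page helper and its parameters are hypothetical names, the
choice of form C for storage is an assumption, and ANSEL is not covered
here because it needs a separate converter.

```python
import unicodedata


def to_utf8_page(raw_bytes, source_charset):
    """Hypothetical helper: decode, normalize, and re-encode as UTF-8."""
    # Decode as early as possible; this may raise UnicodeDecodeError
    # if the bytes do not fit the declared character set.
    text = raw_bytes.decode(source_charset)
    # Pick one normalization form and use it everywhere (form C assumed).
    text = unicodedata.normalize("NFC", text)
    body = text.encode("utf-8")
    # Tell the browser what it is getting.
    headers = [("Content-Type", "text/html; charset=utf-8")]
    return headers, body


headers, body = to_utf8_page(b"caf\xe9", "latin-1")
```

Here a Latin-1 byte string comes out as UTF-8 bytes with the charset
declared in the Content-Type header.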

Peter

Peter Binkley
Digital Initiatives Technology Librarian
Information Technology Services
4-30 Cameron Library
University of Alberta Libraries
Edmonton, Alberta
Canada T6G 2J8
Phone: (780) 492-3743
Fax: (780) 492-9243