Binkley, Peter wrote:
>Bear in mind that even in UTF-8 there is more than one way to encode an
>accented character. It can be precomposed (using a single character,
>e.g. U+00E9 for lower-case e-acute: this is normalization form C) or
>decomposed (using a base character and a non-spacing diacritic, e.g.
>U+0065 and U+0301, lower-case e plus the acute accent: this is
>normalization form D). If you're searching at the byte level, you have
>to be sure that your index and your search term have been normalized the
>same way or they won't match. I've found this FAQ useful for this stuff:
>http://pipin.tmd.ns.ac.yu/unicode/unicode-faq.html. In a Java context,
>we've used ICU4J (http://icu.sourceforge.net) to normalize stuff
>(including stripping accents and normalizing case for different scripts)
>for indexing and searching in UTF-8. There's also a C API, which could
>presumably be incorporated into a Perl process, but no doubt there are
>similar native Perl tools.
>
>In general I think we've got to include i18n from the beginning: pay
>attention to character sets of incoming data, normalize as early in the
>process as possible (especially if ANSEL is involved!), use
>UTF-8-compliant tools, and be consistent. Deliver UTF-8 to the browser
>(this site helps with the html:
>http://ppewww.ph.gla.ac.uk/~flavell/charset/quick.html). This is still
>not as easy as it ought to be but at least there are good open-source
>tools out there.
>
>
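To make the quoted point about the two normalization forms concrete,
here is a minimal sketch using Python's standard unicodedata module.
The helper names to_nfc and to_nfd are only illustrative and are not
from the message above; the same idea should carry over to ICU4J or
Perl.

```python
import unicodedata

precomposed = "\u00e9"    # e-acute as a single code point (form C)
decomposed = "e\u0301"    # e followed by combining acute (form D)

# The two strings look the same on screen but differ as code points
# and therefore as UTF-8 bytes.
print(precomposed == decomposed)      # False
print(precomposed.encode("utf-8"))    # b'\xc3\xa9'
print(decomposed.encode("utf-8"))     # b'e\xcc\x81'


def to_nfc(text):
    """Return text in normalization form C (precomposed)."""
    return unicodedata.normalize("NFC", text)


def to_nfd(text):
    """Return text in normalization form D (decomposed)."""
    return unicodedata.normalize("NFD", text)


# After normalizing both sides the same way, the comparison succeeds.
print(to_nfc(precomposed) == to_nfc(decomposed))    # True
```

Whichever form is chosen, the index and the search term would both
need to go through the same function before a byte-level comparison.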
Wow, it looks like there are some Unicode experts in our midst. I am in
the middle of developing an international bibliographic database where
most of the titles are in languages other than EN-US.
Our database will store citations entered via a web form, since the
bibliography is in card format. I am using MySQL 4 because of the
Unicode support and collations. I normally use Postgres, but I figured
that for a database that will mainly be used for searching (very few
writes after the data has been populated) I'd give MySQL a try.
One feature we would like to offer is searching via the collations. For
example, if I enter the term francais, I would hope that any items
containing the term français would be returned. Is it correct to use
MySQL's collations for this? Does anyone have experience with this?
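
For comparison, here is a rough sketch of the behaviour I am hoping
for, done on the application side in Python rather than through a
database collation. The function name fold_for_search is made up for
this example, and stripping every combining mark is an assumption that
may be too aggressive for some languages.

```python
import unicodedata


def fold_for_search(text):
    """Fold text to an accent-free, case-insensitive search key."""
    # Decompose so that accents become separate combining characters.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # casefold() is a more thorough lower() intended for matching.
    return stripped.casefold()


print(fold_for_search("français"))    # francais
print(fold_for_search("francais") == fold_for_search("FRANÇAIS"))    # True
```

If the same folding is applied to the stored titles and to the search
term, francais and français end up with the same key.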
I am still learning the uses of UTF-8 characters, so I am glad there are
so many of you who know so much about this on this list!
Andrew