Binkley, Peter wrote:
>Bear in mind that even in UTF-8 there is more than one way to encode an
>accented character. It can be precomposed (using a single character,
>e.g. U+00E9 for lower-case e-acute: this is normalization form C) or
>decomposed (using a base character and a non-spacing diacritic, e.g.
>U+0065 and U+0301, lower-case e plus the acute accent: this is
>normalization form D). If you're searching at the byte level, you have
>to be sure that your index and your search term have been normalized the
>same way or they won't match. I've found this FAQ useful for this stuff:
>http://pipin.tmd.ns.ac.yu/unicode/unicode-faq.html. In a Java context,
>we've used ICU4J (http://icu.sourceforge.net) to normalize stuff
>(including stripping accents and normalizing case for different scripts)
>for indexing and searching in UTF-8. There's also a C API, which could
>presumably be incorporated into a Perl process, but no doubt there are
>similar native Perl tools.
>
>In general I think we've got to include i18n from the beginning: pay
>attention to character sets of incoming data, normalize as early in the
>process as possible (especially if ANSEL is involved!), use
>UTF-8-compliant tools, and be consistent. Deliver UTF-8 to the browser
>(this site helps with the html:
>http://ppewww.ph.gla.ac.uk/~flavell/charset/quick.html). This is still
>not as easy as it ought to be but at least there are good open-source
>tools out there.

Wow, it looks like there are some Unicode experts in our midst. I am in the middle of developing an international bibliographic database where most of the titles are in languages other than EN-US. Our database will store citations entered via a web form, since the bibliography is in card format. I am using MySQL 4 because of its Unicode support and collations. I normally use Postgres, but I figured that for a database that will mainly be used for searching (very few writes after the data has been populated) I'd give MySQL a try.
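To make sure I follow the point about precomposed versus decomposed characters, here is a minimal sketch using Python's standard `unicodedata` module (my choice for illustration; the quoted message used ICU4J). The helper name `normalize_nfc` is hypothetical.

```python
import unicodedata

precomposed = "\u00e9"   # e-acute as a single code point (form C)
decomposed = "e\u0301"   # "e" followed by a combining acute accent (form D)

# The two strings look the same on screen but differ as code points,
# and therefore as UTF-8 bytes.
print(precomposed == decomposed)
print(precomposed.encode("utf-8"), decomposed.encode("utf-8"))


def normalize_nfc(text: str) -> str:
    """Return text in normalization form C (hypothetical helper)."""
    return unicodedata.normalize("NFC", text)


# After normalizing both sides the same way, the comparison succeeds.
print(normalize_nfc(precomposed) == normalize_nfc(decomposed))
```

If I understand correctly, the same normalization would have to be applied both when a citation is stored and when a search term comes in.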
One feature we would like to offer is searching via the collations. For example, if I enter the term francais, I would hope that any items containing the term français would be returned. Is it correct to use MySQL's collations for this? Does anyone have experience with this? I am still learning how to work with UTF-8 characters, so I am glad there are so many people on this list who know so much about it!

Andrew
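P.S. To show the behaviour I am after, here is a rough sketch in Python of an accent-insensitive match done in application code. It is only an illustration of the idea, not a description of how MySQL's collations work; the function names and the sample titles are made up.

```python
import unicodedata


def fold_for_search(text: str) -> str:
    """Strip combining marks and case so 'français' and 'Francais' compare equal."""
    decomposed = unicodedata.normalize("NFD", text)
    # Drop non-spacing marks such as the cedilla and the acute accent.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped).casefold()


def search(term: str, items: list[str]) -> list[str]:
    """Return the items whose folded form contains the folded search term."""
    key = fold_for_search(term)
    return [item for item in items if key in fold_for_search(item)]


# Made-up sample data, for illustration only.
titles = ["Le français moderne", "Francais pour tous", "English Grammar"]
print(search("francais", titles))
```

Folding this way treats every accented letter as its base letter, which may be too aggressive for languages where the accented form is a distinct letter.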