Ken--

You may find a reason to create a normalized "stealth" field, but I have a couple of suggestions that will probably help you avoid that scenario.

1) Read up a little on the Unicode Normalization Forms (http://unicode.org/reports/tr15/) and convert all your UTF-8 characters to the composed form (NFC). The standard for MARC data is the decomposed form (NFD), but this is a real pain to work with if you like things to sort nicely (at least in MySQL). One way to do the conversion is in Perl with Unicode::Normalize.

2) Use a collation other than utf8_bin (here's where you lost your case insensitivity, I think). Try utf8_unicode_ci (ci as in case insensitive).

I wish I had written down everything I learned about this stuff, but I didn't--and I keep having to go back and refresh my memory.

Mike

--
Michael Kreyche
Systems Librarian / Associate Professor
Libraries and Media Services
Kent State University
330-672-1918

> -----Original Message-----
> From: Code for Libraries [mailto:[log in to unmask]] On
> Behalf Of Ken Irwin
> Sent: Wednesday, December 16, 2009 1:26 PM
> To: [log in to unmask]
> Subject: Re: [CODE4LIB] character-sets for dummies?
>
> Hi all -- thanks for these fabulous replies. I'm learning a lot.
>
> Armed with a bit of new knowledge, I've done some tinkering.
> I think I've solved my original quandaries, and have opened
> new cans of worms. I have a few more specific questions:
>
> 1) It appears that once I switch my MySQL table over from a
> latin character set to UTF-8, it is no longer
> case-insensitive (this makes sense based on what I learned
> from the Joel on Software post). All of the scripting I've
> done until now takes advantage of the case insensitivity; is
> there an easy way to keep this case insensitive while in UTF-8?
>
> 2) Is there a good/easy way to make the database agnostic
> about diacritics, so that a search for "cafe" will also find "café"?
>
> The answers to both of these may be "convert data to some
> normalized A-Z field that never displays," but I can only
> imagine that normalizing even
> most-Roman-characters-with-diacritics to plain ASCII-style
> characters can be a daunting task.
>
> Any advice on these particulars?
>
> Thanks,
> Ken
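
For anyone without Perl handy, the NFD-to-NFC conversion from suggestion 1 and the "normalized A-Z field" idea from Ken's message can be sketched in Python with the standard unicodedata module. This is only an illustration under stated assumptions, not something taken from the thread: the function names to_nfc and fold_key are made up for the example, and the folded key is just one possible way to build a search-only field.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return text in the composed normalization form (NFC)."""
    return unicodedata.normalize("NFC", text)


def fold_key(text: str) -> str:
    """Build a hypothetical search key: no diacritics, no case distinctions."""
    decomposed = unicodedata.normalize("NFD", text)
    # Drop combining marks (the accents), keeping the base letters.
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()


# "café" as NFD data: "e" followed by U+0301 COMBINING ACUTE ACCENT.
decomposed_cafe = "cafe\u0301"
composed_cafe = to_nfc(decomposed_cafe)

print(len(decomposed_cafe), len(composed_cafe))  # 5 4
print(fold_key("Café") == fold_key("cafe"))      # True
```

Stripping combining marks only helps with letters that Unicode decomposes into a base letter plus an accent. Letters such as "ø" have no such decomposition and would pass through unchanged, so a real folded field would likely need an extra mapping table for them.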