Ken--
You may find a reason to create a normalized "stealth" field, but I have a couple of suggestions that will probably help you avoid that scenario.
1) Read up a little on the Unicode Normalization Forms (http://unicode.org/reports/tr15/) and convert all your UTF-8 characters to the composed form (NFC). The standard for MARC data is the decomposed form (NFD), but this is a real pain to work with if you like things to sort nicely (at least in MySQL). One way to do the conversion is in Perl with the Unicode::Normalize module.
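2) Use a collation other than utf8_bin (this is where you lost your case insensitivity, I think). Try utf8_unicode_ci (ci as in case insensitive). That collation also compares accented and unaccented letters as equal, so a search for "cafe" should find "café" as well.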
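For what it's worth, here is a minimal sketch of both ideas in Python rather than Perl (the standard unicodedata module stands in for Unicode::Normalize). The function names to_nfc and search_key are made up for illustration. search_key is only the fallback "stealth" field approach, assuming you ever do need one; it handles letters that decompose into a base letter plus combining marks, and nothing beyond that.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return text in the composed normalization form (NFC)."""
    return unicodedata.normalize("NFC", text)


def search_key(text: str) -> str:
    """Hypothetical helper: build a case- and diacritic-insensitive key.

    Decomposes to NFD, drops combining marks, then folds case.
    """
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


# 'e' followed by COMBINING ACUTE ACCENT, the decomposed (NFD) spelling
decomposed = "cafe\u0301"
composed = to_nfc(decomposed)

print(len(decomposed), len(composed))            # 5 4
print(composed == "caf\u00e9")                   # True
print(search_key("Café") == search_key("cafe"))  # True
```

With a _ci collation on the column you would normally only need the to_nfc step before loading the data; the key function is there in case the collation alone does not cover everything you need.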
I wish I had written down everything I learned about this stuff, but I didn't--and I keep having to go back and refresh my memory.
Mike
--
Michael Kreyche
Systems Librarian / Associate Professor
Libraries and Media Services
Kent State University
330-672-1918
> -----Original Message-----
> From: Code for Libraries [mailto:[log in to unmask]] On
> Behalf Of Ken Irwin
> Sent: Wednesday, December 16, 2009 1:26 PM
> To: [log in to unmask]
> Subject: Re: [CODE4LIB] character-sets for dummies?
>
> Hi all -- thanks for these fabulous replies. I'm learning a lot.
>
> Armed with a bit of new knowledge, I've done some tinkering.
> I think I've solved my original quandaries, and have opened
> new cans of worms. I have a few more specific questions:
>
> 1) It appears that once I switch my MySQL table over from a
> latin character set to UTF-8, it is no longer
> case-insensitive (this makes sense based on what I learned
> from the Joel on Software post). All of the scripting I've
> done until now takes advantage of the case insensitivity; is
> there an easy way to keep this case insensitive while in UTF-8?
>
> 2) Is there a good/easy way to make the database agnostic
> about diacritics, so that a search for "cafe" will also find "café"?
>
> The answers to both of these may be "convert data to some
> normalized A-Z field that never displays," but I can only
> imagine that normalizing even
> most-Roman-characters-with-diacritics to plain ASCII-style
> characters can be a daunting task.
>
> Any advice on these particulars?
>
> Thanks,
> Ken
>