LISTSERV 16.5 - CODE4LIB Archives

Ken--

You may find a reason to create a normalized "stealth" field, but I have a couple of suggestions that will probably help you avoid that scenario.

1) Read up a little on the Unicode Normalization Forms (http://unicode.org/reports/tr15/) and convert all your UTF-8 text to the composed form (NFC). The standard for MARC data is the decomposed form (NFD), but this is a real pain to work with if you like things to sort nicely (at least in MySQL). One way to do the conversion is in Perl with the Unicode::Normalize module.

2) Use a collation other than utf8_bin (this is where you lost your case insensitivity, I think). Try utf8_unicode_ci (ci as in case insensitive).
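
For illustration, here is a minimal sketch of the normalization step in Python rather than Perl (an assumption on my part; the standard-library unicodedata module stands in for Unicode::Normalize). The function names to_nfc and fold_key are hypothetical, and fold_key only shows what a normalized "stealth" key might look like if you do end up needing one.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return text in the composed form (NFC)."""
    return unicodedata.normalize("NFC", text)


def fold_key(text: str) -> str:
    """Hypothetical 'stealth' key: drop combining marks, then fold case."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


# 'e' followed by COMBINING ACUTE ACCENT: the decomposed (NFD) spelling
decomposed = "cafe\u0301"
composed = to_nfc(decomposed)

print(len(decomposed), len(composed))        # 5 4
print(composed == "caf\u00e9")               # True
print(fold_key("Café") == fold_key("cafe"))  # True
```

Stripping combining marks only handles letters that decompose into a base letter plus a mark; characters such as "ø" or "ß" have no such decomposition and would need their own mapping.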

I wish I had written down everything I learned about this stuff, but I didn't--and I keep having to go back and refresh my memory.

Mike
--
Michael Kreyche
Systems Librarian / Associate Professor
Libraries and Media Services 
Kent State University
330-672-1918

> -----Original Message-----
> From: Code for Libraries [mailto:[log in to unmask]] On 
> Behalf Of Ken Irwin
> Sent: Wednesday, December 16, 2009 1:26 PM
> To: [log in to unmask]
> Subject: Re: [CODE4LIB] character-sets for dummies?
> 
> Hi all -- thanks for these fabulous replies. I'm learning a lot. 
> 
> Armed with a bit of new knowledge, I've done some tinkering. 
> I think I've solved my original quandaries, and have opened 
> new cans of worms. I have a few more specific questions:
> 
> 1) It appears that once I switch my MySQL table over from a 
> latin character set to UTF-8, it is no longer 
> case-insensitive (this makes sense based on what I learned 
> from the Joel on Software post). All of the scripting I've 
> done until now takes advantage of the case insensitivity; is 
> there an easy way to keep this case insensitive while in UTF-8? 
> 
> 2) Is there a good/easy way to make the database agnostic 
> about diacritics, so that a search for "cafe" will also find "café"? 
> 
> The answers to both of these may be "convert data to some 
> normalized A-Z field that never displays," but I can only 
> imagine that normalizing even 
> most-Roman-characters-with-diacritics to plain ASCII-style 
> characters can be a daunting task.
> 
> Any advice on these particulars? 
> 
> Thanks,
> Ken
>