On 5/24/05, Eric Lease Morgan <[log in to unmask]> wrote:
> On May 23, 2005, at 6:27 PM, Steven C. Perkins wrote:
>
> > I did a search on indigenous.  The first item was a French article.
> > The display of diacritics was messed up.  I added French to the
> > languages in IE, but the display was still bad.  I don't know if this
> > is a WinXP problem or a problem with your page.  I did not see a
> > language encoding on your source.  Perhaps UTF-8 will fix this?  Or it
> > may be a problem from the document retrieved.
>
> Yes, I do not know how to handle the extended ASCII characters, and I
> am hoping someone here can point me in the right direction.
>
> As I said earlier, I use Net::OAI::Harvester to... harvest the data. I
> use MyLibrary to save the data to a MySQL database. I then write
> reports against the database in the form of a simple XML stream and
> feed the stream to swish-e for indexing. I know swish-e is unable to
> index multi-byte characters, and search results come directly from
> swish-e, not MyLibrary.
>

Will swish-e index the actual bytes of non-diacritic multibyte
characters?  If so, you can do what we do with Open-ILS (we use
Postgres' tsearch2 full-text indexing module).  When indexing data, we
strip it of diacritical combining characters using 's/\p{M}//go'.
When a search is submitted we do the same thing, because a linked
search may contain the diacritics, or the searching user may be typing
in a non-US locale.  This will search the simplified strings and "does
the right thing", at least with our data.  We display the original
document (or a portion thereof) so that multibyte characters are
displayed.
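
Here is a minimal sketch of the strip-and-search idea, written in Python
rather than as the Perl substitution above.  The names strip_marks and
matches are made up for illustration and are not part of Open-ILS,
swish-e, or tsearch2.  It also makes one assumption the description above
does not state: the text is decomposed (NFD) first, because \p{M} only
matches combining marks and a precomposed letter such as U+00E8 is not
one.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Rough equivalent of s/\\p{M}//g: drop every combining mark.

    Assumption: the text is decomposed (NFD) first, so that a precomposed
    letter becomes a base letter followed by a combining mark.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")  # Mn, Mc, Me
    )


def matches(query: str, document: str) -> bool:
    """Hypothetical search helper: simplify both sides the same way,
    then compare.  Lowercasing is an extra convenience, not something
    the approach above requires."""
    return strip_marks(query).lower() in strip_marks(document).lower()


original = "Les peuples indigènes"

# The simplified string is what gets indexed and searched ...
print(strip_marks(original))
# ... and either spelling of the query finds it.
print(matches("indigenes", original), matches("indigènes", original))
# The original string is what gets displayed.
print(original)
```

The point of applying the same function at index time and at query time
is that the two sides can never disagree about how a word is spelled.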

For scripts that are entirely outside ASCII (Arabic, Kanji, etc.) we
just index and search using the original bytes because they are not
matched by /\p{M}/.  In our testing this seems to work fine (of
course, we'd appreciate any tips on making this smarter).
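
One way to check that claim against your own data is to test whether a
string contains any characters in the Unicode mark categories at all.
This is a small Python sketch; has_combining_marks is a hypothetical
helper name, and the sample strings are only illustrations.

```python
import unicodedata


def has_combining_marks(text: str) -> bool:
    """True if any character is in a Unicode mark category (Mn, Mc, Me),
    i.e. would be matched by /\\p{M}/."""
    return any(unicodedata.category(ch).startswith("M") for ch in text)


# Kanji and unvocalized Arabic carry no marks, so stripping \p{M}
# leaves them exactly as they were.
print(has_combining_marks("\u6f22\u5b57"))              # False
print(has_combining_marks("\u0643\u062a\u0627\u0628"))  # False

# A decomposed Latin letter (e + U+0301) does carry a mark.
print(has_combining_marks("e\u0301"))                   # True

# Caveat: vocalized Arabic uses marks too (U+064E, FATHA, is category Mn),
# so those vowel signs would be stripped along with Latin diacritics.
print(has_combining_marks("\u0643\u064e"))              # True
```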

> Maybe I should draw search results from MyLibrary and not swish-e to
> display characters correctly? If I draw content from many global
> sources, then how do I know what character set to use for display?
>

This is definitely the best thing to do.  Search the normalized data
and display the original.  Also, if you store the documents UTF-8
encoded you won't need to worry about the character set; you just need
to set the encoding for the page to UTF-8 and the browser will take
care of the rest.
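
As a rough illustration of "set the encoding for the page to UTF-8",
here is a Python sketch that declares the encoding in both the HTTP
header and the markup, then encodes the body to match.  The function
name render_results_page and the page layout are assumptions made for
the example; they do not come from MyLibrary or swish-e.

```python
from html import escape


def render_results_page(hits):
    """Return (headers, body_bytes) for a search results page.

    The charset is declared twice, in the Content-Type header and in a
    <meta> tag, and the body is encoded as UTF-8 to match.
    """
    items = "\n".join(f"<li>{escape(hit)}</li>" for hit in hits)
    page = (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8">'
        "<title>Search results</title></head>\n"
        f"<body><ul>\n{items}\n</ul></body></html>\n"
    )
    headers = {"Content-Type": "text/html; charset=utf-8"}
    return headers, page.encode("utf-8")


headers, body = render_results_page(["Les peuples indigènes"])
print(headers["Content-Type"])
print(body.decode("utf-8"))
```

What matters is that the declared encoding and the actual bytes agree;
if the stored records are already UTF-8, nothing has to be guessed per
document.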

--
Mike Rylander
[log in to unmask]
GPLS -- PINES Development
Database Developer
http://open-ils.org