I've had sight of this and generated a dictionary from the LoC bib data. It's very fast and the suggestions are excellent, including multi-word corrections.
rob
Here's Martin's mail with the details - but I encourage you to join the group
----
But I'm sure you'd like to get your hands on it, so I packaged up a convenient way to quickly make a dictionary from your own data files, and run your own queries through it.
Grab this jar: http://groups.google.com/group/spelt/web/spelt.jar
Everything is included (Spelt and Lucene). The only requirement is that you have JDK 1.5 (5.0) -- it won't work with JDK 1.4.
The utility includes a fast-but-dumb text ripper that does a deep directory scan for textual files and pulls out all the words. It should be able to handle XML, HTML, and plain text files (provided they're in UTF-8 encoding). You can build a dictionary this way:
java -jar spelt.jar -build <your-src-dir> speltDictDir
If you run it on a big data set, I'd suggest giving it more RAM, like this:
java -Xmx750m -jar spelt.jar -build <your-src-dir> speltDictDir
You can run a set of test queries (e.g., http://groups.google.com/group/spelt/web/test.list) like this:
java -jar spelt.jar -test speltDictDir test.list
Finally, if you are curious about how this compares with the existing code in Lucene, you can add the -old flag just before "-build" or "-test". Warning: the build process with -old is about 35 times slower on my machine, so I'd suggest doing this on a small data set.
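For anyone curious what a "fast-but-dumb text ripper" of this kind might look like, here is a minimal Python sketch. It is not Spelt's code: the function names (extract_words, rip_directory), the list of file extensions, and the tag-stripping regex are all assumptions made for illustration. It walks a directory tree, reads each file as UTF-8, drops anything that looks like a markup tag, and counts the remaining words.

```python
import os
import re
from collections import Counter

# Assumption: anything between '<' and '>' is markup and can be discarded.
TAG_RE = re.compile(r"<[^>]+>")
# Runs of letters only (no digits or underscores), Unicode-aware.
WORD_RE = re.compile(r"[^\W\d_]+")
# Assumption: these extensions are what counts as a "textual file".
TEXT_EXTENSIONS = {".txt", ".xml", ".html", ".htm"}


def extract_words(text):
    """Strip tag-like spans, then return the words in lowercase."""
    return [w.lower() for w in WORD_RE.findall(TAG_RE.sub(" ", text))]


def rip_directory(src_dir):
    """Deep-scan src_dir and return a Counter of word frequencies."""
    counts = Counter()
    for root, _dirs, files in os.walk(src_dir):
        for name in files:
            if os.path.splitext(name)[1].lower() not in TEXT_EXTENSIONS:
                continue
            path = os.path.join(root, name)
            try:
                with open(path, encoding="utf-8") as fh:
                    counts.update(extract_words(fh.read()))
            except UnicodeDecodeError:
                # Not valid UTF-8: skip the file rather than guess.
                continue
    return counts
```

Calling rip_directory on your source directory returns word frequencies that could feed a word list; files that are not valid UTF-8 are skipped.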
-----Original Message-----
From: Code for Libraries on behalf of Jonathan Rochkind
Sent: Tue 03/04/2007 7:01 PM
To: [log in to unmask]
Subject: Re: [CODE4LIB] pspell aspell: make your own word lists/dictionaries
I haven't had time to look at it yet, but someone at Code4Lib conference
proposed a more sophisticated approach to spell checking that sounded
really interesting to me, and said he was going to share the code. I
hope to have time to investigate at some point.
Let's see if I can find it on the conference page.... yeah, it was
Martin Haye. You can watch his presentation here:
http://video.google.com/videoplay?docid=4028600349627496246&hl=en
Looks like he's *martin*.*haye*[at]gmail.com. During the lightning
talk, he said he didn't want to distribute the code separately but
wanted to include it in Lucene if possible---but later in the
conference, he said he had been convinced by the interest in it to
distribute the code as its own standalone thing, and planned to do
that presently.
If anyone is using or has explored Martin's code, please let us know
about your experience.
Jonathan
Kevin Kierans wrote:
> Has anyone created their own "dictionaries"
> for aspell? We've created blank delimited
> lists of words from our opac. One for title,
> one for subjects, and one for authors. (We're thinking
> of a series one as well)
>
> We would like to use
> one of these word lists to offer suggestions
> depending on which search the patron is making.
> We're assuming we can make better suggestions
> if the words come from our actual opac.
>
> We've got it working with the dictionary that
> comes with aspell, but having problems (we can't do it!)
> substituting our own "dictionaries."
>
> Does anyone have any experience/knowledge/hints/pointers
> they can share with us?
>
> We are using linux, php 5, aspell 0.50.5, and
> php -> pspell functions.
>
> Thanks,
> Kevin
> TNRD Library System, Kamloops, British Columbia, Canada
>
>
--
Jonathan Rochkind
Sr. Programmer/Analyst
The Sheridan Libraries
Johns Hopkins University
410.516.8886
rochkind (at) jhu.edu