I've had sight of this and generated a dictionary from the LoC bib data. It's very fast and the suggestions are excellent, including multi-word corrections.

rob

Here's Martin's mail with the details - but I encourage you to join the group

----

But I'm sure you'd like to get your hands on it, so I packaged up a convenient way to quickly make a dictionary from your own data files, and run your own queries through it. Grab this jar:

    http://groups.google.com/group/spelt/web/spelt.jar

Everything is included (Spelt and Lucene). The only requirement is that you have JDK 1.5 (5.0) -- it won't work with JDK 1.4.

The utility includes a fast-but-dumb text ripper that does a deep directory scan for textual files, and pulls out all the words. It should be able to handle XML, HTML, and plain text files (provided they're in UTF-8 encoding).

You can build a dictionary this way:

    java -jar spelt.jar -build <your-src-dir> speltDictDir

If you run it on a big data set, I'd suggest giving it more RAM, like this:

    java -Xmx750m -jar spelt.jar -build <your-src-dir> speltDictDir

You can run a set of test queries (e.g., http://groups.google.com/group/spelt/web/test.list) like this:

    java -jar spelt.jar -test speltDictDir test.list

Finally, if you are curious about how this compares with the existing code in Lucene, you can add the -old flag just before "-build" or "-test". Warning: with -old, the build process is about 35 times slower on my machine, so I'd suggest doing this on a small data set.

-----Original Message-----
From: Code for Libraries on behalf of Jonathan Rochkind
Sent: Tue 03/04/2007 7:01 PM
To: [log in to unmask]
Subject: Re: [CODE4LIB] pspell aspell: make your own word lists/dictionaries

I haven't had time to look at it yet, but someone at the Code4Lib conference proposed a more sophisticated approach to spell checking that sounded really interesting to me, and said he was going to share the code. I hope to have time to investigate at some point.
Let's see if I can find it on the conference page.... yeah, it was Martin Haye. You can watch his presentation here:

    http://video.google.com/videoplay?docid=4028600349627496246&hl=en

Looks like he's martin.haye[at]gmail.com.

During the lightning talk, he said he didn't want to distribute the code separately but wanted to include it in Lucene if possible -- but later in the conference, he said he had been convinced by the interest in it to distribute the code as its own standalone thing, and planned to do that presently.

If anyone is exploring or has explored Martin's code, please let us know about your experience.

Jonathan

Kevin Kierans wrote:
> Has anyone created their own "dictionaries"
> for aspell? We've created blank-delimited
> lists of words from our opac. One for title,
> one for subjects, and one for authors. (We're thinking
> of a series one as well.)
>
> We would like to use
> one of these word lists to offer suggestions
> depending on which search the patron is making.
> We're assuming we can make better suggestions
> if the words come from our actual opac.
>
> We've got it working with the dictionary that
> comes with aspell, but are having problems (we can't do it!)
> substituting our own "dictionaries."
>
> Does anyone have any experience/knowledge/hints/pointers
> they can share with us?
>
> We are using Linux, PHP 5, aspell 0.50.5, and
> PHP -> pspell functions.
>
> Thanks,
> Kevin
> TNRD Library System, Kamloops, British Columbia, Canada
>

--
Jonathan Rochkind
Sr. Programmer/Analyst
The Sheridan Libraries
Johns Hopkins University
410.516.8886
rochkind (at) jhu.edu
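
For anyone wondering what the "fast-but-dumb text ripper" described in Martin's mail might look like, here is a rough Python sketch. To be clear, this is not Spelt's code and I have not seen how Spelt does it; the function name rip_words, the tag-stripping regex, and the word pattern are all assumptions made up for illustration. It walks a directory tree, reads each file as UTF-8, throws away anything that looks like markup, and counts the words that remain.

```python
import os
import re
from collections import Counter

# Crude markup stripper: drops anything between < and >. (Assumption: good
# enough for simple XML/HTML; it does not handle comments or CDATA properly.)
TAG_RE = re.compile(r"<[^>]*>")

# A "word" is a run of letters, optionally followed by an apostrophe and
# more letters (so "they're" stays one word). Digits are ignored.
WORD_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)?")


def rip_words(src_dir):
    """Walk src_dir recursively and count lowercased words in UTF-8 files."""
    counts = Counter()
    for root, _dirs, files in os.walk(src_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                with open(path, encoding="utf-8") as fh:
                    text = fh.read()
            except (UnicodeDecodeError, OSError):
                continue  # skip binary or unreadable files
            text = TAG_RE.sub(" ", text)
            counts.update(w.lower() for w in WORD_RE.findall(text))
    return counts
```

Calling rip_words("some-dir").most_common(20) would list the twenty most frequent words found under that (hypothetical) directory. Keeping the counts rather than just the word set seemed worthwhile because word frequency is a natural signal for ranking suggestions, though whether Spelt uses it that way is a guess on my part.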