However, all of these oddities -- over eager stop-list, ignoring short
words, not counting words in more than half the rows -- can be sorted
out by configuration options.  I'm sorry I don't have them to hand,
but half an hour or so on Google should give you the information you
need to make MySQL act like a half-decent text-search engine.

 _/|_    ___________________________________________________________________
/o ) \/  Mike Taylor  <[log in to unmask]>  http://www.miketaylor.org.uk
)_v__/\  "The Pope claimed he'd been wrong in the past; this was a big
         surprise" -- Sting, "Jeremiah Blues"

Cloutman, David writes:
> It seems like there are a number of eccentricities like that. I saw
> somewhere in the documentation that if a word appears in more than half
> the rows, it isn't searched. Because of that, I'm only using MATCH to
> generate relevancy numbers. I'm doing boolean in the search terms. My
> queries are like:
>
> SELECT `path`, `mdate` FROM `rawtext` WHERE `rawText` LIKE '%book%' AND
> `rawText` LIKE '%lists%' ORDER BY MATCH (`rawText`) AGAINST ('book
> lists') DESC, `mdate` DESC
>
> (Yes, I have a table named rawtext with a column named rawText. Shoot
> me.)
>
> It's far from perfect, but I just need to index a few hundred documents,
> and it's for staff consumption. I would be curious to know how Postgres
> compares to MySQL in this area. I'm looking towards finding a long-term
> alternative to MySQL due to the Sun / Oracle merger, which I don't think
> will end well for MySQL.
>
> - David
>
> ---
> David Cloutman <[log in to unmask]>
> Electronic Services Librarian
> Marin County Free Library
>
> -----Original Message-----
> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
> Genny Engel
> Sent: Friday, May 29, 2009 12:41 PM
> To: [log in to unmask]
> Subject: Re: [CODE4LIB] MySQL Stop Words
>
>
> Once I read a study where the document collection to be indexed was in a
> narrow technical field, and the goal was to present a search that
> quickly isolated ONLY the most relevant documents. To this end, they
> stopworded everything that didn't sufficiently distinguish one document
> from another. Their stopword list comprised some 30,000 terms!
>
> If your goal, on the other hand, is to maximize recall at some expense
> of precision, beware of MySQL full-text MATCH because it dynamically
> computes new stopwords. Note this little side note in section 11.8.1 of
> the manual:
>
> For very small tables, word distribution does not adequately reflect
> their semantic value, and this model may sometimes produce bizarre
> results. For example, although the word "MySQL" is present in every row
> of the articles table shown earlier, a search for the word produces no
> results [ ... ] The search result is empty because the word "MySQL" is
> present in at least 50% of the rows. As such, it is effectively treated
> as a stopword. For large data sets, this is the most desirable behavior:
> A natural language query should not return every second row from a 1GB
> table. For small data sets, it may be less desirable.
>
>
> Genny Engel
> Sonoma County Library
> [log in to unmask]
> 707 545-0831 x581
> www.sonomalibrary.org
>
>
> >>> [log in to unmask] 05/29/09 11:26AM >>>
> In building a search function for some of our internal documents in PHP
> / MySQL, I took a look at the default list of MySQL English language
> stop words used in the natural language searching feature.
> The list is actually quite extensive, and goes well beyond the typical
> list of "to be" cognates, common prepositions, conjunctions, etc. It
> also includes a large number of keywords that librarians or academic
> users might want to search for. Here are a few examples:
>
> available
> appropriate
> course
> follow
> former
> novel
>
> There are quite a number of other stop words that I think are suspect.
> The full list of stop words is located here:
> http://dev.mysql.com/doc/refman/5.1/en/fulltext-stopwords.html
>
> I guess the point is that if you're building a library application that
> takes advantage of MySQL's fulltext searching features, you might want
> to customize your stop words list on your MySQL installation if you think
> your library users might want to search the word "novel".
>
> - David
>
> ---
> David Cloutman <[log in to unmask]>
> Electronic Services Librarian
> Marin County Free Library
>
> Email Disclaimer: http://www.co.marin.ca.us/nav/misc/EmailDisclaimer.cfm
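
For anyone picking this thread up later: the configuration options alluded to at the top can be sketched roughly as below. This is a minimal sketch, not a tested recipe. It assumes a MyISAM table with a FULLTEXT index on `rawText` (as David's MATCH query implies) and the MySQL 5.1 behaviour described in the manual pages linked in the thread. The stopword file path and the minimum length of 3 are illustrative values only. The 50% rule is, as far as I know, not a plain server variable in 5.1, so the sketch sidesteps it with boolean mode instead.

```sql
-- Server settings go in the [mysqld] section of my.cnf (restart required).
-- Both values below are illustrative assumptions, not recommendations:
--
--   [mysqld]
--   ft_min_word_len  = 3
--   ft_stopword_file = /etc/mysql/stopwords.txt   -- hypothetical path
--
-- Inspect the full-text settings currently in effect.
SHOW VARIABLES LIKE 'ft%';

-- After changing either setting, rebuild the FULLTEXT index so that
-- previously skipped words are actually indexed.
REPAIR TABLE `rawtext` QUICK;

-- Boolean mode does not apply the 50% threshold, so it can replace the
-- LIKE filters while MATCH in natural language mode still supplies a
-- relevance score for ordering.
SELECT `path`, `mdate`
FROM `rawtext`
WHERE MATCH (`rawText`) AGAINST ('+book +lists' IN BOOLEAN MODE)
ORDER BY MATCH (`rawText`) AGAINST ('book lists') DESC, `mdate` DESC;
```

Two caveats on the design choice: boolean mode matches whole indexed words, so it will not find "bookmark" the way LIKE '%book%' does, and it still honours the stopword list and minimum word length, so those settings matter either way.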