I once read a study in which the document collection to be indexed covered a narrow technical field, and the goal was a search that quickly isolated ONLY the most relevant documents. To this end, the authors stopworded every term that didn't sufficiently distinguish one document from another. Their stopword list comprised some 30,000 terms!
If your goal, on the other hand, is to maximize recall at some expense of precision, beware of MySQL full-text MATCH, because it effectively computes new stopwords on the fly. Note this aside in section 11.8.1 of the manual:
For very small tables, word distribution does not adequately reflect their semantic value, and this model may sometimes produce bizarre results. For example, although the word "MySQL" is present in every row of the articles table shown earlier, a search for the word produces no results [ ... ] The search result is empty because the word "MySQL" is present in at least 50% of the rows. As such, it is effectively treated as a stopword. For large data sets, this is the most desirable behavior: A natural language query should not return every second row from a 1GB table. For small data sets, it may be less desirable.
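To make the 50% rule concrete, here is a rough sketch in Python. It is not MySQL's implementation, only an illustration of the behavior the manual describes: any term present in at least half of the rows gets treated like a stopword. The function name, the tokenizing regex, and the sample rows are all made up for the example.

```python
import re
from collections import Counter


def effective_stopwords(rows, threshold=0.5):
    """Return the terms that appear in at least `threshold` of the rows.

    Illustration only: mimics the 50% rule quoted from the manual,
    not MySQL's actual full-text code.
    """
    doc_freq = Counter()
    for text in rows:
        # Count each term once per row (document frequency, not term frequency).
        doc_freq.update(set(re.findall(r"\w+", text.lower())))
    cutoff = threshold * len(rows)
    return {term for term, count in doc_freq.items() if count >= cutoff}


# Hypothetical sample rows, loosely modeled on a small "articles" table.
rows = [
    "MySQL Tutorial",
    "How To Use MySQL Well",
    "Optimizing MySQL",
    "MySQL Security",
]
common = effective_stopwords(rows)
```

With these rows, "mysql" lands in `common` because it occurs in every row, while "tutorial" and "security" do not. Running something like this over your own titles gives a quick idea of which search terms a natural language MATCH is likely to ignore.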
Sonoma County Library
707 545-0831 x581
>>> [log in to unmask] 05/29/09 11:26AM >>>
In building a search function for some of our internal documents in PHP
/ MySQL, I took a look at the default list of MySQL English language
stop words used in the natural language searching feature. The list is
actually quite extensive, and goes well beyond the typical list of "to
be" cognates, common prepositions, conjunctions, etc. It also includes a
large number of keywords that librarians or academic users might want to
search for. Here are a few examples:
There are quite a number of other stop words that I think are suspect.
The full list of stop words is located here:
I guess the point is that if you're building a library application that
takes advantage of MySQL's fulltext searching features, you might want
to customize your stop words list on your MySQL installation if you think
your library users might want to search the word "novel".
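One way to go about that customization, sketched in Python under some assumptions: start from a copy of the default list, drop the words your users should be able to search, and write the rest to a file. MySQL's `ft_stopword_file` server variable can then point at that file. The function names, the one-word-per-line layout, and the tiny word list below are hypothetical; check the manual for your MySQL version for the exact file format, and expect to restart the server and rebuild your FULLTEXT indexes after changing the setting.

```python
def build_custom_stopwords(default_words, keep_searchable):
    """Return the default stop words minus the terms users must be able to search."""
    keep = {word.lower() for word in keep_searchable}
    return sorted({word.lower() for word in default_words} - keep)


def write_stopword_file(words, path):
    """Write one stop word per line (assumed layout; verify against the manual)."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(words) + "\n")


# Hypothetical excerpt standing in for the full default list.
default_words = ["a", "the", "novel", "of"]
custom = build_custom_stopwords(default_words, keep_searchable=["novel"])
```

Here `custom` no longer contains "novel", so a file produced with `write_stopword_file(custom, ...)` would leave that word searchable once the server is configured to use it.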
David Cloutman
Electronic Services Librarian
Marin County Free Library