However, all of these oddities -- over-eager stop-list, ignoring short
words, not counting words in more than half the rows -- can be sorted
out by configuration options.  I'm sorry I don't have them to hand,
but half an hour or so on Google should give you the information you
need to make MySQL act like a half-decent text-search engine.
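
From memory, and assuming MyISAM FULLTEXT indexes on a 5.1-era server,
the two settings usually cited are ft_min_word_len and ft_stopword_file.
Both are read at server startup, so they go in the option file.  The
stopword path below is purely hypothetical; check the manual for your
version before relying on any of this.

```ini
[mysqld]
# Index words of 3 or more characters (the default minimum is 4).
ft_min_word_len = 3
# Use your own stopword list instead of the built-in one.
# Hypothetical path: one word per line in a file you maintain yourself.
ft_stopword_file = /etc/mysql/stopwords.txt
```

As far as I know there is no run-time option for the 50% rule itself.
The usual way around it is a boolean-mode search, which does not apply
that threshold (stopwords and the minimum word length still apply).  A
sketch using the table and column names from David's query:

```sql
-- Check the full-text settings the server is actually running with.
SHOW VARIABLES LIKE 'ft%';

-- After changing the settings above and restarting mysqld, rebuild the
-- FULLTEXT index so that it reflects the new values.
REPAIR TABLE `rawtext` QUICK;

-- Boolean mode ignores the 50% threshold; a leading + makes a word
-- mandatory.  The natural-language MATCH is kept only for ranking.
SELECT `path`, `mdate`
FROM `rawtext`
WHERE MATCH (`rawText`) AGAINST ('+book +lists' IN BOOLEAN MODE)
ORDER BY MATCH (`rawText`) AGAINST ('book lists') DESC, `mdate` DESC;
```

Note that this matches whole words only, whereas LIKE '%book%' also
matches substrings such as "bookshelf", so the two queries will not
return exactly the same rows.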

 _/|_	 ___________________________________________________________________
/o ) \/  Mike Taylor    <[log in to unmask]>    http://www.miketaylor.org.uk
)_v__/\  "The Pope claimed he'd been wrong in the past; this was a big
	 surprise" -- Sting, "Jeremiah Blues"



Cloutman, David writes:
 > It seems like there are a number of eccentricities like that. I saw
 > somewhere in the documentation that if a word appears in more than half
 > the rows, it isn't searched. Because of that, I'm only using MATCH to
 > generate relevancy numbers. I'm doing boolean in the search terms. My
 > queries are like:
 > 
 > SELECT `path`, `mdate` FROM `rawtext` WHERE `rawText` LIKE '%book%' AND
 > `rawText` LIKE '%lists%' ORDER BY MATCH (`rawText`) AGAINST ('book
 > lists') DESC, `mdate` DESC
 > 
 > (Yes, I have a table named rawtext with a column named rawText. Shoot
 > me.)
 > 
 > It's far from perfect, but I just need to index a few hundred documents,
 > and it's for staff consumption. I would be curious to know how Postgres
 > compares to MySQL in this area. I'm looking towards finding a long-term
 > alternative to MySQL due to the Sun / Oracle merger, which I don't think
 > will end well for MySQL.
 > 
 > - David
 > 
 > ---
 > David Cloutman <[log in to unmask]>
 > Electronic Services Librarian
 > Marin County Free Library 
 > 
 > -----Original Message-----
 > From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
 > Genny Engel
 > Sent: Friday, May 29, 2009 12:41 PM
 > To: [log in to unmask]
 > Subject: Re: [CODE4LIB] MySQL Stop Words
 > 
 > 
 > Once I read a study where the document collection to be indexed was in a
 > narrow technical field, and the goal was to present a search that
 > quickly isolated ONLY the most relevant documents.  To this end, they
 > stopworded everything that didn't sufficiently distinguish one document
 > from another.   Their stopword list comprised some 30,000 terms!
 >  
 > If your goal, on the other hand, is to maximize recall at some expense
 > of precision, beware of MySQL full-text MATCH because it dynamically
 > computes new stopwords.  See this little aside in section 11.8.1 of
 > the manual:
 >  
 > For very small tables, word distribution does not adequately reflect
 > their semantic value, and this model may sometimes produce bizarre
 > results. For example, although the word "MySQL" is present in every row
 > of the articles table shown earlier, a search for the word produces no
 > results [ ... ] The search result is empty because the word "MySQL" is
 > present in at least 50% of the rows. As such, it is effectively treated
 > as a stopword. For large data sets, this is the most desirable behavior:
 > A natural language query should not return every second row from a 1GB
 > table. For small data sets, it may be less desirable. 
 >  
 >  
 >  
 >  
 > Genny Engel
 > Sonoma County Library
 > [log in to unmask]
 > 707 545-0831 x581
 > www.sonomalibrary.org
 >  
 > 
 > 
 > >>> [log in to unmask] 05/29/09 11:26AM >>>
 > In building a search function for some of our internal documents in PHP
 > / MySQL, I took a look at the default list of MySQL English language
 > stop words used in the natural language searching feature. The list is
 > actually quite extensive, and goes well beyond the typical list of "to
 > be" forms, common prepositions, conjunctions, etc. It also includes a
 > large number of keywords that librarians or academic users might want to
 > search for. Here are a few examples:
 > 
 > available
 > appropriate
 > course
 > follow
 > former
 > novel
 > 
 > There are quite a number of other stop words that I think are suspect.
 > The full list of stop words is located here:
 > http://dev.mysql.com/doc/refman/5.1/en/fulltext-stopwords.html 
 > 
 > I guess the point is that if you're building a library application that
 > takes advantage of MySQL's fulltext searching features, you might want
 > to customize your stop words list on your MySQL installation if you think
 > your library users might want to search the word "novel".
 > 
 > - David
 > 
 > ---
 > David Cloutman <[log in to unmask]>
 > Electronic Services Librarian
 > Marin County Free Library 
 > 
 > Email Disclaimer: http://www.co.marin.ca.us/nav/misc/EmailDisclaimer.cfm