LISTSERV 16.5 - CODE4LIB Archives

It seems like there are a number of eccentricities like that. I saw
somewhere in the documentation that if a word appears in more than half
the rows, it isn't searched. Because of that, I'm only using MATCH to
generate relevance scores for sorting. The boolean matching itself is
done with LIKE clauses, one per search term. My queries look like:

SELECT `path`, `mdate`
FROM `rawtext`
WHERE `rawText` LIKE '%book%'
  AND `rawText` LIKE '%lists%'
ORDER BY MATCH (`rawText`) AGAINST ('book lists') DESC, `mdate` DESC

(Yes, I have a table named rawtext with a column named rawText. Shoot
me.)
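
For anyone who wants to assemble a query of that shape programmatically, here is a minimal sketch in Python rather than the PHP used in the actual application. The function names `build_search_query` and `escape_like` are hypothetical, and the `%s` placeholders assume a MySQL driver that uses the "format" parameter style (for example PyMySQL or MySQLdb). It only builds the SQL string and the parameter list; it does not connect to a database.

```python
def escape_like(term):
    """Escape LIKE wildcards so user input is matched literally.

    Assumes MySQL's default LIKE escape character, the backslash.
    """
    return term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")


def build_search_query(terms):
    """Return (sql, params) requiring every term and ranking by MATCH.

    Table and column names are taken from the query in the message above.
    """
    if not terms:
        raise ValueError("at least one search term is required")
    # One LIKE clause per term, ANDed together, so a term is never
    # silently dropped the way a full-text stopword would be.
    where = " AND ".join("`rawText` LIKE %s" for _ in terms)
    sql = (
        "SELECT `path`, `mdate` FROM `rawtext` "
        f"WHERE {where} "
        "ORDER BY MATCH (`rawText`) AGAINST (%s) DESC, `mdate` DESC"
    )
    # LIKE parameters come first, then the full phrase used for ranking.
    like_params = ["%" + escape_like(t) + "%" for t in terms]
    return sql, like_params + [" ".join(terms)]


sql, params = build_search_query(["book", "lists"])
```

Passing `sql` and `params` to the driver's `cursor.execute()` keeps the search terms out of the SQL string itself, which avoids quoting problems with user input.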

It's far from perfect, but I just need to index a few hundred documents,
and it's for staff consumption. I would be curious to know how Postgres
compares to MySQL in this area. I'm looking towards finding a long-term
alternative to MySQL due to the Sun / Oracle merger, which I don't think
will end well for MySQL.

- David

---
David Cloutman <[log in to unmask]>
Electronic Services Librarian
Marin County Free Library 

-----Original Message-----
From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
Genny Engel
Sent: Friday, May 29, 2009 12:41 PM
To: [log in to unmask]
Subject: Re: [CODE4LIB] MySQL Stop Words


Once I read a study where the document collection to be indexed was in a
narrow technical field, and the goal was to present a search that
quickly isolated ONLY the most relevant documents.  To this end, they
stopworded everything that didn't sufficiently distinguish one document
from another.   Their stopword list comprised some 30,000 terms!
 
If your goal, on the other hand, is to maximize recall at some expense
of precision, beware of MySQL full-text MATCH, because it dynamically
computes new stopwords.  See this side note in section 11.8.1 of the
manual:
 
For very small tables, word distribution does not adequately reflect
their semantic value, and this model may sometimes produce bizarre
results. For example, although the word "MySQL" is present in every row
of the articles table shown earlier, a search for the word produces no
results [ ... ] The search result is empty because the word "MySQL" is
present in at least 50% of the rows. As such, it is effectively treated
as a stopword. For large data sets, this is the most desirable behavior:
A natural language query should not return every second row from a 1GB
table. For small data sets, it may be less desirable. 
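
To make the 50% rule concrete, here is a rough Python sketch that reports which words in a small set of rows would cross that threshold. It is only an illustration of the behavior the manual describes, not MySQL's actual implementation: the function name `words_over_threshold`, the whitespace tokenization, and the sample rows are all assumptions made for the example.

```python
def words_over_threshold(rows, threshold=0.5):
    """Return the words present in at least `threshold` of the rows.

    Each row is counted once per word, no matter how often the word
    repeats inside that row.
    """
    if not rows:
        return set()
    counts = {}
    for row in rows:
        # Naive tokenization: lowercase and split on whitespace.
        for word in set(row.lower().split()):
            counts[word] = counts.get(word, 0) + 1
    return {w for w, n in counts.items() if n / len(rows) >= threshold}


# Hypothetical four-row table; "mysql" appears in three of the rows.
sample_rows = [
    "MySQL tutorial",
    "MySQL security",
    "optimizing MySQL",
    "Postgres comparison",
]
effective_stopwords = words_over_threshold(sample_rows)
```

With these sample rows, "mysql" is the only word that reaches the threshold, so a natural language search for it would come back empty even though three of the four rows contain it.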
 
 
 
 
Genny Engel
Sonoma County Library
[log in to unmask]
707 545-0831 x581
www.sonomalibrary.org
 


>>> [log in to unmask] 05/29/09 11:26AM >>>
In building a search function for some of our internal documents in PHP
/ MySQL, I took a look at the default list of MySQL English language
stop words used in the natural language searching feature. The list is
actually quite extensive, and goes well beyond the typical list of forms
of "to be", common prepositions, conjunctions, etc. It also includes a
large number of keywords that librarians or academic users might want to
search for. Here are a few examples:

available
appropriate
course
follow
former
novel

There are quite a number of other stop words that I think are suspect.
The full list of stop words is located here:
http://dev.mysql.com/doc/refman/5.1/en/fulltext-stopwords.html 

I guess the point is that if you're building a library application that
takes advantage of MySQL's fulltext searching features, you might want
to customize the stop word list on your MySQL installation if you think
your library users might want to search for the word "novel".
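
One way to produce a customized list is to start from the default words and drop the ones that should stay searchable. The Python sketch below does only that filtering step. The function names and the abbreviated sample list are hypothetical; the real default list is at the URL above. The resulting file would then be named in the server's `ft_stopword_file` setting, and existing FULLTEXT indexes need to be rebuilt after the server restarts with the new list.

```python
def custom_stopwords(default_words, keep_searchable):
    """Return default_words minus any word that should remain searchable."""
    keep = {w.lower() for w in keep_searchable}
    return [w for w in default_words if w.lower() not in keep]


def render_stopword_file(words):
    """Render one word per line, a simple layout for a stopword file."""
    return "\n".join(words) + "\n"


# Abbreviated, illustrative sample; not the complete default list.
sample_defaults = ["and", "available", "course", "former", "novel", "the"]

# Words that library users are likely to search for.
library_terms = ["available", "course", "novel"]

stopwords = custom_stopwords(sample_defaults, library_terms)
file_contents = render_stopword_file(stopwords)
```

Writing `file_contents` to disk is left out so the sketch has no side effects; the path to use depends on the installation.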

- David

---
David Cloutman <[log in to unmask]>
Electronic Services Librarian
Marin County Free Library 
