However, all of these oddities -- over-eager stop-list, ignoring short
words, not counting words in more than half the rows -- can be sorted
out by configuration options. I'm sorry I don't have them to hand,
but half an hour or so on Google should give you the information you
need to make MySQL act like a half-decent text-search engine.
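
For what it's worth, here is a rough sketch of the sort of settings I
mean, assuming MyISAM FULLTEXT indexes on a 5.1-era server.
ft_min_word_len and ft_stopword_file are real server variables; the
stopword file path is made up for illustration, so substitute your own.

```ini
[mysqld]
# Index words of 3 or more characters (the default minimum is 4).
ft_min_word_len = 3
# Use a custom stopword list instead of the built-in one.
# Hypothetical path: point this at a file you maintain yourself.
ft_stopword_file = /etc/mysql/ft_stopwords.txt
```

Both settings need a server restart, and existing FULLTEXT indexes have
to be rebuilt afterwards (REPAIR TABLE rawtext QUICK should do it for a
MyISAM table). As far as I know the 50% rule is not an option-file
setting; the usual way round it is to search IN BOOLEAN MODE, which
does not apply that threshold.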
_/|_ ___________________________________________________________________
/o ) \/ Mike Taylor <[log in to unmask]> http://www.miketaylor.org.uk
)_v__/\ "The Pope claimed he'd been wrong in the past; this was a big
surprise" -- Sting, "Jeremiah Blues"
Cloutman, David writes:
> It seems like there are a number of eccentricities like that. I saw
> somewhere in the documentation that if a word appears in more than half
> the rows, it isn't searched. Because of that, I'm only using MATCH to
> generate relevancy numbers. I'm doing boolean in the search terms. My
> queries are like:
>
> SELECT `path`, `mdate` FROM `rawtext` WHERE `rawText` LIKE '%book%' AND
> `rawText` LIKE '%lists%' ORDER BY MATCH (`rawText`) AGAINST ('book
> lists') DESC, `mdate` DESC
>
> (Yes, I have a table named rawtext with a column named rawText. Shoot
> me.)
>
> It's far from perfect, but I just need to index a few hundred documents,
> and it's for staff consumption. I would be curious to know how Postgres
> compares to MySQL in this area. I'm looking towards finding a long-term
> alternative to MySQL due to the Sun / Oracle merger, which I don't think
> will end well for MySQL.
>
> - David
>
> ---
> David Cloutman <[log in to unmask]>
> Electronic Services Librarian
> Marin County Free Library
>
> -----Original Message-----
> From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of
> Genny Engel
> Sent: Friday, May 29, 2009 12:41 PM
> To: [log in to unmask]
> Subject: Re: [CODE4LIB] MySQL Stop Words
>
>
> Once I read a study where the document collection to be indexed was in a
> narrow technical field, and the goal was to present a search that
> quickly isolated ONLY the most relevant documents. To this end, they
> stopworded everything that didn't sufficiently distinguish one document
> from another. Their stopword list comprised some 30,000 terms!
>
> If your goal, on the other hand, is to maximize recall at some expense
> of precision, beware of MySQL full-text MATCH because it dynamically
> computes new stopwords. Note this little side note in section 11.8.1 of
> the manual:
>
> For very small tables, word distribution does not adequately reflect
> their semantic value, and this model may sometimes produce bizarre
> results. For example, although the word "MySQL" is present in every row
> of the articles table shown earlier, a search for the word produces no
> results [ ... ] The search result is empty because the word "MySQL" is
> present in at least 50% of the rows. As such, it is effectively treated
> as a stopword. For large data sets, this is the most desirable behavior:
> A natural language query should not return every second row from a 1GB
> table. For small data sets, it may be less desirable.
>
>
>
>
> Genny Engel
> Sonoma County Library
> [log in to unmask]
> 707 545-0831 x581
> www.sonomalibrary.org
>
>
>
> >>> [log in to unmask] 05/29/09 11:26AM >>>
> In building a search function for some of our internal documents in PHP
> / MySQL, I took a look at the default list of MySQL English language
> stop words used in the natural language searching feature. The list is
> actually quite extensive, and goes well beyond the typical list of "to
> be" cognates, common prepositions, conjunctions, etc. It also includes a
> large number of keywords that librarians or academic users might want to
> search for. Here are a few examples:
>
> available
> appropriate
> course
> follow
> former
> novel
>
> There are quite a number of other stop words that I think are suspect.
> The full list of stop words is located here:
> http://dev.mysql.com/doc/refman/5.1/en/fulltext-stopwords.html
>
> I guess the point is that if you're building a library application that
> takes advantage of MySQL's fulltext searching features, you might want
> to customize your stop words list on your MySQL installation if you think
> your library users might want to search the word "novel".
>
> - David
>
> ---
> David Cloutman <[log in to unmask]>
> Electronic Services Librarian
> Marin County Free Library
>