I often recommend against stop word removal altogether. Is there any
reason you need to remove them?
The primary reason stop words get removed is to improve the performance
of queries that contain very common terms. If you are running into that,
Solr's CommonGramsFilter(Factory) is a good way to keep your stop words
while limiting the performance hit.
The HathiTrust folks have had success with the common grams capability.
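For what it's worth, a field type along these lines is roughly what I
mean. Treat it as a sketch rather than a tested config: the field type
name "text_commongrams", the choice of tokenizer, and the reuse of
stopwords.txt as the common-words list are all assumptions made for
illustration.

```xml
<!-- Hypothetical field type; the name and tokenizer choice are assumptions. -->
<fieldType name="text_commongrams" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Keeps every token and also emits bigrams for pairs that include a common word -->
    <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Query-side counterpart: prefers the bigrams over the bare common words -->
    <filter class="solr.CommonGramsQueryFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
```

Note there is no StopFilterFactory in either chain, so words like "no"
stay searchable. You would need to reindex after switching a field to a
type like this.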
On Nov 11, 2009, at 3:41 PM, Eric James wrote:
> Has anyone already given some thought to refining the Solr
> stopwords.txt for library collections, particularly finding aids?
> The words included in the out of the box stopwords.txt are of very
> questionable unimportance:
> <an and are as at be but by for if in into is it not of on or s such
> t that the their then there these they this to was will with>
> We were indexing a field id with "no." as one of its tokens (for
> number), but wanted a query with "no" (where the person did not add
> the period) to find the doc, but in actuality the "no" would get
> stripped by the StopFilterFactory. And thus we stumbled upon this
> list, and were a bit surprised by some of the inclusions (e.g. "will")
> and exclusions (e.g. "a").
> Eric James
> Yale University Libraries