Thanks, Erik. There is no specific reason for their removal; I think it was just that the StopFilterFactory is preconfigured in the analyzer chain for fieldType=text. We will do some performance testing with this filter removed.
BTW, a useful tool for deciding on appropriate stopwords is the schema browser, which can be found on the /solr/admin page. There you can see term frequencies for each field, sorted from highest frequency down, which helps weed out terms of little querying value.
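As a rough offline illustration of the same idea (this is not the schema browser itself), here is a minimal Python sketch that counts term frequencies over a handful of field values and lists the most frequent terms as stopword candidates. It assumes the field values are available as plain strings; the `top_terms` function, the simple tokenizer, and the sample records are all made up for illustration and do not reproduce Solr's analyzer chain.

```python
import re
from collections import Counter


def top_terms(docs, n=10):
    """Return the n most frequent lowercase tokens across docs.

    Hypothetical helper: a crude stand-in for a per-field term
    frequency listing, not Solr's own tokenization.
    """
    counts = Counter()
    for text in docs:
        # Split on anything that is not a letter or digit, so "no." -> "no".
        counts.update(re.findall(r"[a-z0-9]+", text.lower()))
    return counts.most_common(n)


# Made-up sample values standing in for an indexed field.
sample = [
    "Letter no. 12 of the collection",
    "Papers of the family, box no. 3",
    "The will of the estate",
]

for term, freq in top_terms(sample, 5):
    print(term, freq)
```

A high-frequency term like "no" in a list like this may still be one that people search for, so frequency alone only suggests candidates; it does not decide them.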
Eric
> Date: Thu, 12 Nov 2009 09:06:46 -0500
> From: [log in to unmask]
> Subject: Re: [CODE4LIB] solr | StopFilterFactory - stopwords.txt
> To: [log in to unmask]
>
> I often recommend against stop word removal altogether. Is there any
> reason you need to remove them?
>
> The primary reason stop words get removed is to increase performance
> of queries with very common terms. If you are encountering that,
> using Solr's CommonGramsFilter(Factory) is a good solution to keep
> your stop words and alleviate the performance degradation potential.
> The HathiTrust folks have had success with the common grams capability.
>
> Erik
>
>
> On Nov 11, 2009, at 3:41 PM, Eric James wrote:
>
> > Has anyone already given some thought to refining the solr
> > stopwords.txt for library collections, particularly finding aids?
> > Some of the words included in the out-of-the-box stopwords.txt are
> > of questionable unimportance:
> >
> > <an and are as at be but by for if in into is it not of on or s such
> > t that the their then there these they this to was will with>
> >
> >
> >
> > We were indexing a field id with "no." as one of its tokens (for
> > number), and wanted a query with "no" (where the person did not add
> > the period) to find the doc, but in actuality the "no" would get
> > stripped by the StopFilterFactory. And thus we stumbled upon this
> > list, and were a bit surprised by some of the inclusions (ex: "will")
> > and exclusions (ex: "a").
> >
> >
> >
> > Thanks,
> >
> > Eric James
> >
> > Yale University Libraries
> >