As for the Google Custom Search solution, I'd add that it sometimes
yields odd results: for instance, we indexed a site, and for a given
search term Google reported "about 16 results" (with 10 hits displayed
on the page); when we clicked through to page 2, it reported "about 12
results" (showing the two remaining hits). OK, it does say "about", but
it's still a bit strange that the system can't compute the actual number
of hits up front. (This happens when using label refinements.)
On the other hand, it's super easy to set up...
On 20/02/2013 20:33, Nathan Tallman wrote:
> @Jason and @Michele: I'd rather stay away from a Google solution. The
> reason being that they don't index everything. Our sitemap is submitted
> nightly and out of about 6000 URLs only 1500 are indexed. I can't make sure
> Google indexes the PDFs or be sure that they always will. (If I'm
> misunderstanding this, please let me know.)
> @Péter: The VuFind solution I mentioned is very similar to what you use
> here. It uses Aperture (although soon to use Tika instead) to grab the
> full-text and shoves everything inside a Solr index. The import is managed
> through a PHP script that crawls every URL on the sitemap. The only part I
> don't have is removing deleted, adding new, and updating changed
> webpages/files. I'm not sure how to rework the script to use a list of new
> files rather than the sitemap, but everything is on the same server so that
> should work.
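On the missing piece (removing deleted, adding new, and updating changed files), one possible approach is to keep the previous night's sitemap and compare it with the current one. Below is a minimal sketch in Python rather than PHP, assuming the sitemap follows the standard sitemaps.org format; the function names parse_sitemap and diff_sitemaps are hypothetical, and entries without a <lastmod> value are treated as changed because there is no way to tell otherwise.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return {url: lastmod or None} for every <url> entry in a sitemap."""
    root = ET.fromstring(xml_text)
    entries = {}
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        if loc:
            entries[loc] = lastmod.strip() if lastmod else None
    return entries


def diff_sitemaps(previous, current):
    """Compare two {url: lastmod} maps and return (added, removed, changed)."""
    added = sorted(set(current) - set(previous))
    removed = sorted(set(previous) - set(current))
    # Without a lastmod we cannot tell, so the URL is conservatively re-indexed.
    changed = sorted(
        url
        for url in set(current) & set(previous)
        if current[url] is None or current[url] != previous[url]
    )
    return added, removed, changed
```

The import script would then only fetch the added and changed URLs and delete the removed ones from the index, instead of crawling the whole sitemap.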
> On Wed, Feb 20, 2013 at 12:53 PM, Nathan Tallman <[log in to unmask]> wrote:
>> My institution is looking for ways to provide search across PDFs through
>> our website. Specifically, PDFs linked from finding aids. Ideally searching
>> within a collection's PDFs or possibly across all PDFs linked from all
>> finding aids.
>> We do not have a CMS or a digital repository. A digital repository is on
>> the horizon, but it's a ways out and we need to offer the search sooner.
>> I've looked into Swish-e but haven't had much luck getting anything off the
>> One way we know we can do this is through our discovery layer, VuFind, using
>> its ability to full-text index a website based on a sitemap (which would
>> include PDFs linked from finding aids). Facets could be created for
>> collections, and we may be able to create a search box on the finding aid
>> nav that searches specifically that collection.
>> But, I'm not sure how scalable that solution is. The indexing agent cannot
>> discern when a page was updated, so it has to re-scrape
>> everything, every night. The impetus collection is going to have over
>> 1000 PDFs. And that's to start. Creating the index will start to take a
>> long, long time.
>> Does anyone have any ideas or know of any useful tools for this project?
>> Doesn't have to be perfect, quick and dirty may work. (The OCR's dirty
>> anyway :-)
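On the nightly re-scrape concern: if the indexing agent cannot tell when a page was updated, one option is to remember a hash of each file's content and skip the expensive steps (text extraction and the index update) when the hash has not changed. This is only a sketch under assumptions: the names content_fingerprint and needs_reindex are hypothetical, and the seen dictionary stands in for whatever persistent store the real script would use.

```python
import hashlib


def content_fingerprint(data: bytes) -> str:
    """Return a SHA-256 hex digest of the raw file content."""
    return hashlib.sha256(data).hexdigest()


def needs_reindex(url, data, seen):
    """Return True if the content at url differs from what was last indexed.

    seen maps url -> digest of the content indexed previously; it is
    updated in place whenever new or changed content is found.
    """
    digest = content_fingerprint(data)
    if seen.get(url) == digest:
        return False  # identical content, skip extraction and indexing
    seen[url] = digest
    return True
```

Every file is still read each night, but hashing is cheap compared with extracting the full text of a PDF, so the run time should depend mostly on how many files actually changed.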
Agence Bibliographique de l'Enseignement Supérieur
227, avenue Professeur Jean Louis Viala
34193 Montpellier cedex 5
Tél : 33 (0)4 67 54 84 07
Fax : 33 (0)4 67 54 84 14