LISTSERV 16.5 - CODE4LIB Archives

Yes, Google Custom Search is not too bad, if your PDFs are sorted
meaningfully by directory, and if you submit a site map to Google for more
complete indexing.  You can use Xenu to make a site map, put the site map
online as a static XML file, and then use Google Webmaster Tools to pass
the location of the site map.  This helps Google to index your site more
completely.  Then you periodically recreate and update the site map.
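
If Xenu is not handy, the site map can also be generated with a short script. What follows is only a rough sketch, assuming the PDFs live under one directory on the web server. The function names (`pdf_entries`, `build_sitemap`), the directory path, and the example.org base URL are placeholders of mine, not anything from your setup. It walks the directory, records each PDF's modification date as `lastmod`, and writes the result in the sitemaps.org XML format.

```python
import os
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from urllib.parse import quote

# Namespace required by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def pdf_entries(doc_root, base_url):
    """Yield (url, lastmod) pairs for every PDF found under doc_root.

    doc_root and base_url are hypothetical: the local directory that
    holds the PDFs and the public URL that directory is served from.
    """
    for dirpath, _dirs, files in os.walk(doc_root):
        for name in sorted(files):
            if not name.lower().endswith(".pdf"):
                continue
            path = os.path.join(dirpath, name)
            # Turn the local path into a URL path relative to the base.
            rel = os.path.relpath(path, doc_root).replace(os.sep, "/")
            mtime = datetime.fromtimestamp(os.path.getmtime(path), tz=timezone.utc)
            url = base_url.rstrip("/") + "/" + quote(rel)
            yield url, mtime.strftime("%Y-%m-%d")


def build_sitemap(entries):
    """Return sitemap XML (as a string) for an iterable of (url, lastmod)."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, lastmod in entries:
        node = ET.SubElement(urlset, "url")
        ET.SubElement(node, "loc").text = url
        ET.SubElement(node, "lastmod").text = lastmod
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body
```

To use it, call something like `build_sitemap(pdf_entries("/var/www/pdfs", "http://example.org/pdfs"))`, save the returned string as sitemap.xml on the web server, and rerun it from cron whenever PDFs are added. A single sitemap file is limited to 50,000 URLs, which is far more than the collection sizes mentioned in this thread.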

For homegrown search, I would have recommended Swish-e, if you hadn't said
it was out of reach.

-Wilhelmina Randtke


On Wed, Feb 20, 2013 at 12:07 PM, Jason Griffey <[log in to unmask]> wrote:

> This might not fit your need exactly, but a Google Custom Search (
> http://www.google.com/cse/) should do the job. You can have the Custom
> Search only index a given directory, or only PDFs, whichever is more
> useful.
>
> Jason
>
>
> On Wed, Feb 20, 2013 at 12:53 PM, Nathan Tallman <[log in to unmask]>
> wrote:
>
> > My institution is looking for ways to provide search across PDFs through
> > our website. Specifically, PDFs linked from finding aids. Ideally
> > searching within a collection's PDFs or possibly across all PDFs linked
> > from all finding aids.
> >
> > We do not have a CMS or a digital repository. A digital repository is on
> > the horizon, but it's a ways out and we need to offer the search sooner.
> > I've looked into Swish-e but haven't had much luck getting anything off
> > the ground.
> >
> > One way we know we can do this is through our discovery layer, VuFind,
> > using its ability to full-text index a website based on a sitemap (which
> > would include PDFs linked from finding aids). Facets could be created for
> > collections, and we may be able to create a search box on the finding aid
> > nav that searches specifically that collection.
> >
> > But I'm not sure how scalable that solution is. The indexing agent cannot
> > discern when a page was updated, so it has to re-scrape everything, every
> > night. The impetus collection is going to have over 1000 PDFs. And that's
> > to start. Creating the index will start to take a long, long time.
> >
> > Does anyone have any ideas or know of any useful tools for this project?
> > Doesn't have to be perfect, quick and dirty may work. (The OCR's dirty
> > anyway :-)
> >
> > Thanks,
> > Nathan
> >
>