@Jason and @Michele: I'd rather stay away from a Google solution, because
Google doesn't index everything. Our sitemap is submitted
nightly and out of about 6000 URLs only 1500 are indexed. I can't make sure
Google indexes the PDFs or be sure that they always will. (If I'm
misunderstanding this, please let me know.)

@Péter: The VuFind solution I mentioned is very similar to what you use
here. It uses Aperture (although soon to use Tika instead) to grab the
full text and shoves everything into a Solr index. The import is managed
through a PHP script that crawls every URL on the sitemap. The only part I
don't have is removing deleted, adding new, and updating changed
webpages/files. I'm not sure how to rework the script to use a list of new
files rather than the sitemap, but everything is on the same server so that
should work.
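
A rough sketch of how the missing piece (finding new, changed, and deleted files between runs) might look when everything sits on the same server. It is in Python rather than the PHP of the existing import script, and every name in it (build_manifest, diff_manifests, the file patterns, the JSON manifest file) is hypothetical, not part of VuFind. The idea is to store a content hash for each file after each run and compare it with the hashes from the next run:

```python
import hashlib
import json
from pathlib import Path


def build_manifest(root, patterns=("*.pdf", "*.html", "*.xml")):
    """Map each matching file under root to a SHA-256 digest of its bytes.

    The patterns are an assumption; adjust them to whatever the site serves.
    """
    root = Path(root)
    manifest = {}
    for pattern in patterns:
        for path in root.rglob(pattern):
            if path.is_file():
                digest = hashlib.sha256(path.read_bytes()).hexdigest()
                manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def diff_manifests(old, new):
    """Return (added, changed, deleted) as sorted lists of relative paths."""
    added = sorted(p for p in new if p not in old)
    changed = sorted(p for p in new if p in old and new[p] != old[p])
    deleted = sorted(p for p in old if p not in new)
    return added, changed, deleted


def load_manifest(path):
    """Read the manifest saved by the previous run, or {} on the first run."""
    path = Path(path)
    return json.loads(path.read_text()) if path.exists() else {}


def save_manifest(path, manifest):
    """Persist this run's manifest so the next run can diff against it."""
    Path(path).write_text(json.dumps(manifest, indent=2, sort_keys=True))
```

A nightly job would load the old manifest, build a new one, diff the two, send only the added and changed paths to the indexer, remove the deleted ones from the index, and then save the new manifest. How those lists get handed to the import script or to Solr is left out here, since that depends on the local setup.
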
On Wed, Feb 20, 2013 at 12:53 PM, Nathan Tallman <[log in to unmask]> wrote:
> My institution is looking for ways to provide search across PDFs through
> our website. Specifically, PDFs linked from finding aids. Ideally searching
> within a collection's PDFs or possibly across all PDFs linked from all
> finding aids.
>
> We do not have a CMS or a digital repository. A digital repository is on
> the horizon, but it's a ways out and we need to offer the search sooner.
> I've looked into Swish-e but haven't had much luck getting anything off the
> ground.
>
> One way we know we can do this is through our discovery layer VuFind, using
> its ability to full-text index a website based on a sitemap (which would
> include PDFs linked from finding aids). Facets could be created for
> collections, and we may be able to create a search box on the finding aid
> nav that searches specifically that collection.
>
> But I'm not sure how scalable that solution is. The indexing agent cannot
> discern when a page was updated, so it has to re-scrape everything every
> night. The impetus collection is going to have over 1000 PDFs, and that's
> just to start. Creating the index will start to take a long, long time.
>
> Does anyone have any ideas or know of any useful tools for this project?
> It doesn't have to be perfect; quick and dirty may work. (The OCR's dirty
> anyway :-)
>
> Thanks,
> Nathan