On Wed, Feb 20, 2013 at 2:33 PM, Nathan Tallman <[log in to unmask]> wrote:
> @Péter: The VuFind solution I mentioned is very similar to what you use
> here. It uses Aperture (although soon to use Tika instead) to grab the
> full-text and shoves everything inside a solr index. The import is managed
> through a PHP script that crawls every URL on the sitemap. The only part I
> don't have is removing deleted, adding new, and updating changed
> webpages/files. I'm not sure how to rework the script to use a list of new
> files rather than the sitemap, but everything is on the same server so that
> should work.

A first step could be to record a timestamp each time a particular URL
is fetched. Then modify your PHP script to send an "If-Modified-Since"
header carrying that timestamp with each request. Assuming the target
server adheres to basic HTTP behavior, you'll get a 304 (Not Modified)
response for anything unchanged and therefore know you don't have to
re-index that particular item.
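
To make the conditional request concrete, here is a minimal sketch. It is
in Python rather than PHP, purely for illustration; the header logic should
carry over to whatever HTTP client the import script uses. The function
names and the `last_fetched_epoch` value (a Unix timestamp you would have
stored at the previous crawl) are my own assumptions, not anything VuFind
provides.

```python
import email.utils
import urllib.error
import urllib.request


def if_modified_since_header(last_fetched_epoch):
    """Format a stored Unix timestamp as an HTTP-date (always GMT)."""
    return email.utils.formatdate(last_fetched_epoch, usegmt=True)


def fetch_if_changed(url, last_fetched_epoch=None,
                     opener=urllib.request.urlopen):
    """Return the body bytes if the URL changed, or None on a 304.

    last_fetched_epoch is the hypothetical per-URL timestamp recorded at
    the previous crawl; None means "never fetched", so no header is sent.
    """
    request = urllib.request.Request(url)
    if last_fetched_epoch is not None:
        request.add_header("If-Modified-Since",
                           if_modified_since_header(last_fetched_epoch))
    try:
        with opener(request) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        # urllib reports 304 as an HTTPError; treat it as "unchanged".
        if err.code == 304:
            return None
        raise
```

A return value of None would mean "skip re-indexing this item"; anything
else gets handed to the extractor as before.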

(As an aside, could Google be ignoring items in your sitemap that it
thinks haven't changed?)

Maybe I'm misunderstanding, though. Does the sitemap you mention link
to HTML pages which then link to the PDFs, so that you have to parse
the HTML to get the PDF URLs? In that case, it still seems like
recording the last-fetched timestamps for the PDF URLs would be an
option. I know next to nothing about VuFind, so maybe the fetching
mechanism isn't exposed in a way that makes this possible. I'm
surprised it's not already baked in, frankly.
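
If the PDFs really are one hop away from the sitemap, the bookkeeping
might look something like the sketch below (again Python, for illustration
only). It pulls PDF links out of a fetched HTML page and compares them
with a stored mapping of URL to last-fetched timestamp, which also gives
you the new and deleted sets you said you were missing. `PdfLinkParser`,
`plan_crawl`, and the `last_fetched` mapping are hypothetical names of
mine; I'm assuming the PDFs are linked with plain `<a href>` tags ending
in ".pdf".

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkParser(HTMLParser):
    """Collect absolute URLs of <a href> links that point at PDFs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Ignore any query string when checking the extension.
        if href and href.lower().split("?")[0].endswith(".pdf"):
            self.pdf_urls.append(urljoin(self.base_url, href))


def plan_crawl(found_urls, last_fetched):
    """Split URLs into new, to-recheck, and deleted sets.

    last_fetched maps URL -> timestamp recorded at the previous crawl.
    """
    found = set(found_urls)
    known = set(last_fetched)
    return {
        "new": sorted(found - known),       # index these
        "recheck": sorted(found & known),   # conditional request
        "deleted": sorted(known - found),   # remove from the index
    }
```

Only the "recheck" URLs need the If-Modified-Since treatment; "new" ones
get fetched unconditionally and "deleted" ones get dropped from the index.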

One other thing that's confusing is the notion of "over 1000 PDFs"
taking a "long, long time". Even on fairly milquetoast hardware, I'd
expect Solr to be capable of extracting and indexing 1000 PDF
documents in 20-30 minutes, or roughly 1.2 to 1.8 seconds per document.