On Wed, Feb 20, 2013 at 2:33 PM, Nathan Tallman <[log in to unmask]> wrote:

> @Péter: The VuFind solution I mentioned is very similar to what you use
> here. It uses Aperture (although soon to use Tika instead) to grab the
> full text and shoves everything into a Solr index. The import is managed
> through a PHP script that crawls every URL on the sitemap. The only part I
> don't have is removing deleted, adding new, and updating changed
> webpages/files. I'm not sure how to rework the script to use a list of new
> files rather than the sitemap, but everything is on the same server, so
> that should work.

Nathan,

A first step could be to record a timestamp of when a particular URL is fetched. Then modify your PHP script to send an "If-Modified-Since" header with the request. Assuming the target server adheres to basic HTTP behavior, you'll get a 304 Not Modified response and therefore know you don't have to re-index that particular item. (As an aside, could Google be ignoring items in your sitemap that it thinks haven't changed?)

Maybe I'm misunderstanding, though. The sitemap you mention has links to HTML pages, which then link to the PDFs? So you have to parse the HTML to get the PDF URLs? In that case, it still seems like recording last-fetched timestamps for the PDF URLs would be an option. I know next to nothing about VuFind, so maybe the fetching mechanism isn't exposed in a way that makes this possible. I'm surprised it's not already baked in, frankly.

One other thing that's confusing is the notion of "over 1000 PDFs" taking a "long, long time". Even on fairly milquetoast hardware, I'd expect Solr to be capable of extracting and indexing 1,000 PDF documents in 20-30 minutes.

--jay
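
P.S. The timestamp-plus-If-Modified-Since idea above could look something like this. This is a minimal sketch in Python rather than PHP, just to show the shape of the logic; the URL is hypothetical, and a real script would persist the timestamps (e.g. in SQLite) instead of a dict:

```python
import time
import urllib.error
import urllib.request
from email.utils import formatdate

def conditional_headers(url, last_fetched):
    """Build headers for a conditional GET.

    last_fetched maps URL -> POSIX timestamp of the last successful fetch.
    """
    headers = {}
    if url in last_fetched:
        # If-Modified-Since takes an HTTP-date (RFC 1123 format, always GMT)
        headers["If-Modified-Since"] = formatdate(last_fetched[url], usegmt=True)
    return headers

def fetch_if_modified(url, last_fetched):
    """Return the response body, or None if the server answers 304."""
    req = urllib.request.Request(url, headers=conditional_headers(url, last_fetched))
    try:
        with urllib.request.urlopen(req) as resp:
            last_fetched[url] = time.time()
            return resp.read()
    except urllib.error.HTTPError as e:
        if e.code == 304:
            return None  # unchanged since last fetch; skip re-indexing
        raise
```

A `None` return is the signal to skip the Tika/Solr extraction step for that URL, which is where the real time savings would come from.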