On Fri, Feb 24, 2012 at 6:50 AM, Owen Stephens <[log in to unmask]> wrote:
> One obvious alternative that I keep coming back to is 'forget OAI-PMH, just crawl the web' ...

Owen,

I'd like to bring this back to your suggestion to just forget OAI-PMH
and crawl the web. I think that's probably the long-term way forward.
I've written some about how new aggregators might crawl the web for
content. [1] The content then available to you would be greater than
just that from sites which still happen to have an OAI-PMH gateway. As
long as the content is up on the Web in some form, you could then make
some use of it.

There are some problems, though, with crawling the web. For instance,
how do you determine which pages are the item-level resources you're
interested in? I think there are some simple solutions to that problem,
like URL templates and sitemaps. [2] These are things that will also
help make the sites you harvest from more discoverable through the
search engines; Google itself no longer uses OAI-PMH. You'd then be
asking folks to optimize for the Web, which would bring them added
benefits, rather than optimizing just for aggregators that happen to
know a site has an OAI-PMH gateway.
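
To make the sitemap idea concrete, here's a rough sketch in Python.
The sitemap URL and the item URL template are invented placeholders,
just to show the shape of the approach:

    # Fetch a sitemap and keep only URLs that match an item-level
    # URL template. Both URLs below are hypothetical examples.
    import re
    import urllib.request
    import xml.etree.ElementTree as ET

    SITEMAP_URL = "http://example.org/sitemap.xml"
    ITEM_PATTERN = re.compile(r"^http://example\.org/items/\d+$")
    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    with urllib.request.urlopen(SITEMAP_URL) as resp:
        tree = ET.parse(resp)

    item_urls = [loc.text.strip()
                 for loc in tree.findall(".//sm:loc", NS)
                 if ITEM_PATTERN.match(loc.text.strip())]
    print(len(item_urls), "item-level pages found")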

Another concern is curating which sites get crawled, and I think this
is where having a site with an API that holds information about
collections makes things easier. [3]
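
I imagine an aggregator consuming that kind of API as a seed list for
its crawler, something like this (the endpoint and JSON shape here are
invented for illustration; the real prototype may look different):

    # Pull a curated list of collections from a hypothetical JSON
    # endpoint and use the landing-page URLs as crawl seeds.
    import json
    import urllib.request

    API_URL = "http://capsys.herokuapp.com/collections.json"  # hypothetical

    with urllib.request.urlopen(API_URL) as resp:
        collections = json.load(resp)

    seeds = [c["url"] for c in collections if "url" in c]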

Despite the challenges, crawling sites is the Web way of doing things.
You can then use robots.txt and meta tags to determine what you may not
crawl or index. There's also a lot happening these days around open
crawl data, like the Common Crawl [4] and Web Data Commons [5]. I've
written a bit about how this new, open, big Web data could be used by
aggregators. [6] You may not have to crawl the Web yourself, but could
reuse crawl data instead.
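
Honoring those signals is straightforward. A sketch, with placeholder
URLs and a made-up user agent string:

    # Check robots.txt before fetching, then look for a noindex
    # meta robots tag before adding the page to an index.
    import urllib.request
    import urllib.robotparser
    from html.parser import HTMLParser

    class MetaRobots(HTMLParser):
        noindex = False
        def handle_starttag(self, tag, attrs):
            a = dict(attrs)
            if tag == "meta" and a.get("name", "").lower() == "robots":
                self.noindex = "noindex" in a.get("content", "").lower()

    rp = urllib.robotparser.RobotFileParser("http://example.org/robots.txt")
    rp.read()

    url = "http://example.org/items/1"  # hypothetical item page
    if rp.can_fetch("aggregator-bot", url):
        html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
        parser = MetaRobots()
        parser.feed(html)
        if not parser.noindex:
            pass  # safe to index this page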

In the short term, as a transition, you could probably begin by
merging the metadata-based approach of harvesting over OAI-PMH with
Web crawl data. Until more sites embed semantic markup in their pages,
having the fielded Dublin Core metadata can help you build better
applications than you might with just natural language processing of
the crawl data. But if all you have for a site is your crawl data,
then at least you can do something with it and not exclude it.
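
One simple way to do the merge: key the harvested Dublin Core records
by their URL-shaped dc:identifier values and join them against crawl
records for the same URLs. A sketch against a placeholder OAI-PMH
gateway (no resumption-token handling, just the first page of results):

    # Harvest oai_dc records and index them by URL identifiers so
    # crawl data for the same pages can be enriched with them.
    import urllib.request
    import xml.etree.ElementTree as ET

    OAI_URL = "http://example.org/oai?verb=ListRecords&metadataPrefix=oai_dc"
    NS = {"oai": "http://www.openarchives.org/OAI/2.0/",
          "dc": "http://purl.org/dc/elements/1.1/"}

    with urllib.request.urlopen(OAI_URL) as resp:
        tree = ET.parse(resp)

    records_by_url = {}
    for rec in tree.findall(".//oai:record", NS):
        titles = [t.text for t in rec.findall(".//dc:title", NS)]
        for ident in rec.findall(".//dc:identifier", NS):
            if ident.text and ident.text.startswith("http"):
                records_by_url[ident.text] = {"title": titles}

    # records_by_url.get(crawled_url) would then give you fielded
    # metadata alongside the crawled page content.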

If you do decide to go in the direction of crawling the Web, I'd be
interested in talking with you more about it.

Jason

[1] http://jronallo.github.com/blog/dpla-strawman-technical-proposal/
[2] http://jronallo.github.com/blog/solving-the-item-level-problem-on-the-web/
[3] Collection Achievements and Profiles System; background:
go.ncsu.edu/dplacaps, prototype: http://capsys.herokuapp.com/
[4] http://commoncrawl.org/
[5] http://page.mi.fu-berlin.de/muehleis/ccrdf/
[6] http://jronallo.github.com/blog/code4lib-2012-lightning-talk-that-wasnt/