On Fri, Feb 24, 2012 at 6:50 AM, Owen Stephens wrote:
> One obvious alternative that I keep coming back to is 'forget OAI-PMH, just crawl the web' ...

Owen, I'd like to bring this back to your suggestion to just forget OAI-PMH and crawl the web. I think that's probably the long-term way forward. I've written some about how new aggregators might crawl the web for content. [1] The content available to you would then be greater than just those sites which still happen to have an OAI-PMH gateway. As long as the content is up on the Web in some form, you could make some use of it.

There are some problems, though, with crawling the web. For instance, how do you determine which pages are the item-level resources you're interested in? I think there are some simple solutions to that problem, like URL templates and sitemaps. [2] These are things that will also help make the sites you harvest from more discoverable through the search engines. Google no longer uses OAI-PMH. You'd then be asking folks to optimize for the Web, which would bring them added benefits, rather than optimizing just for the aggregators that happen to know about their OAI-PMH gateway.

Another concern is curating which sites get crawled, and I think this is where having a site with an API that holds information about collections makes things easier. [3]

Despite the challenges, crawling sites is the Web way of doing things. You can then use robots.txt and meta tags to determine what you can't crawl or index. There's also a lot happening these days around open crawl data, like the Common Crawl [4] and Web Data Commons [5]. I've written a bit about how this new, open, big Web data could be used by aggregators. [6] You may not have to crawl the Web yourself, but could reuse crawl data.

In the short term, as a transition, you could probably begin by merging the metadata-based approach of harvesting over OAI-PMH with Web crawl data. Until more sites are embedding semantic markup on their pages, having the fielded Dublin Core metadata can help build better applications than you could with natural language processing of the crawl data alone. But if all you have for a site is your crawl data, then at least you can do something with it and not exclude that site. I've appended a few rough sketches below the links to make some of these ideas more concrete.

If you do decide to go in the direction of crawling the Web, I'd be interested in talking with you more about it.

Jason

[1] http://jronallo.github.com/blog/dpla-strawman-technical-proposal/
[2] http://jronallo.github.com/blog/solving-the-item-level-problem-on-the-web/
[3] Collection Achievements and Profiles System. Background: http://go.ncsu.edu/dplacaps; prototype: http://capsys.herokuapp.com/
[4] http://commoncrawl.org/
[5] http://page.mi.fu-berlin.de/muehleis/ccrdf/
[6] http://jronallo.github.com/blog/code4lib-2012-lightning-talk-that-wasnt/
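
A first sketch, on the item-level problem: if a site publishes a standard sitemap, an aggregator can pull the item-level URLs straight out of it. This is just a rough Python sketch; the sitemap URL is a made-up example.

    import urllib.request
    import xml.etree.ElementTree as ET

    # Standard sitemap namespace, per http://www.sitemaps.org/
    NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

    def item_urls(sitemap_url):
        """Yield the <loc> URL of every entry in a sitemap."""
        with urllib.request.urlopen(sitemap_url) as response:
            tree = ET.parse(response)
        for url in tree.iter(NS + "url"):
            loc = url.find(NS + "loc")
            if loc is not None and loc.text:
                yield loc.text.strip()

    # Hypothetical example sitemap:
    for url in item_urls("http://example.org/sitemap.xml"):
        print(url)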
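
Second, on politeness: Python's standard library already knows how to read robots.txt, so respecting crawl exclusions is only a few lines. A minimal sketch; the bot name and URLs here are hypothetical.

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("http://example.org/robots.txt")
    rp.read()  # fetch and parse the site's robots.txt

    # Check a specific URL against the rules for our (hypothetical) user agent
    if rp.can_fetch("ExampleAggregatorBot", "http://example.org/items/123"):
        print("OK to crawl")
    else:
        print("excluded by robots.txt")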
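
Finally, on the transitional OAI-PMH-plus-crawl idea: the fielded Dublin Core is easy enough to keep pulling alongside crawl data. A rough sketch of harvesting identifiers and titles over OAI-PMH (the endpoint is hypothetical, and real code would also follow resumptionTokens to page through the full set):

    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    OAI = "{http://www.openarchives.org/OAI/2.0/}"
    DC = "{http://purl.org/dc/elements/1.1/}"

    def list_records(base_url):
        """Yield (identifier, title) pairs from one ListRecords response."""
        query = urllib.parse.urlencode(
            {"verb": "ListRecords", "metadataPrefix": "oai_dc"})
        with urllib.request.urlopen(base_url + "?" + query) as response:
            tree = ET.parse(response)
        for record in tree.iter(OAI + "record"):
            identifier = record.find(OAI + "header/" + OAI + "identifier")
            title = record.find(".//" + DC + "title")
            yield (identifier.text if identifier is not None else None,
                   title.text if title is not None else None)

    for identifier, title in list_records("http://example.org/oai"):
        print(identifier, title)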