Agree that it is clear from this discussion that there are differing viewpoints and also very different requirements depending on the context and desired outcomes.
I think I said earlier in the thread - I'm not against niche solutions; they just make me want to double check that they are justified. For me I'd say the jury is still out on 'crawl' vs 'harvest' - it definitely needs more investigation and thought - and of course different problems require different solutions. It would be interesting to try to go through the case for OAI-PMH, especially specific examples where it has achieved something that would have been difficult/impossible to do with more general solutions. Not sure if that could be done here on list, or better/easier through other discussion - or both (possibly over that beer? :)
From the CORE project, any 'best practice' would be focussed on institutional research publication repositories, and it seems highly unlikely that we will make a recommendation on 'crawl' vs 'harvest' - we just won't have time to do enough work on this to understand the pros/cons of these even from our own singular perspective. I think any recommendations are more likely to be along the lines of: ensuring robots.txt is consistent with other policies; the impact of using splash pages as opposed to links to actual resources in the OAI-PMH feed; configuring access to embargoed papers (as per Raffaele's suggestion); how to deal with multi-part resources, etc. Anything coming out of the project would, of course, be just one project's recommendations for JISC to consider, no more than that.
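As a rough illustration of the robots.txt point, a check along the following lines could flag a repository whose robots.txt disallows crawlers from URLs that its own OAI-PMH feed advertises. This is only a sketch under assumptions: the robots.txt content, the example.org URLs and the "example-harvester" user agent are all made up, and it relies on Python's standard urllib.robotparser rather than anything CORE actually uses.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary repository (not a real site).
ROBOTS_TXT = """\
User-agent: *
Disallow: /bitstream/
Allow: /
"""

# Hypothetical URLs of the kind an OAI-PMH record might point to.
FEED_URLS = [
    "http://repository.example.org/handle/123/456",
    "http://repository.example.org/bitstream/123/456/paper.pdf",
]


def blocked_urls(robots_txt, urls, user_agent="example-harvester"):
    """Return the URLs that robots.txt disallows for the given user agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    for url in blocked_urls(ROBOTS_TXT, FEED_URLS):
        print("Advertised in feed but disallowed for crawlers:", url)
```

In a real survey the robots.txt text would be fetched from each repository and the URL list taken from its harvested records; both are hard-coded here to keep the example self-contained.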
Owen Stephens Consulting
Telephone: 0121 288 6936
On 1 Mar 2012, at 14:38, Ian Ibbotson wrote:
> Just wanted to say that, whilst I've been silent since my initial response,
> I'm not sure I agree with all the viewpoints presented here. From the point
> of view of (for example) CultureGrid, I'm not sure what has been done could
> have been pragmatically achieved solely with web crawling as it's described
> in this thread. Don't have a problem with anything that's been written here.
> It certainly represents a great cross-section of viewpoints. However, from a
> JISC discovery perspective, I don't want to contribute to any confirmation
> bias that we could dispose of pesky old OAI. I'd be interested in providing
> a counter-point to any "Best practice" document that suggested we could.
> On Thu, Mar 1, 2012 at 12:36 PM, Owen Stephens wrote:
>> Thanks Jason and Ed,
>> I suspect within this project we'll keep using OAI-PMH because we've got
>> tight deadlines and the other project strands (which do stuff with the
>> harvested content) need time from the developer. At the moment it looks
>> like we will probably combine OAI-PMH with web crawling (using Nutch) - so
>> use data from the
>> However, that said, one of the things we are meant to be doing is offering
>> recommendations or good practice guidelines back to the (repository)
>> community based on our experience. If we have time I would love to tackle
>> the questions (a)-(d) that you highlight here - perhaps especially (a) and
>> (c). Since this particular project is part of the wider JISC 'Discovery'
>> programme (http://discovery.ac.uk and tech principles at
>> - from which one of the main themes might be summarised as 'work with the
>> web' these questions are definitely relevant.
>> I need to look at Jason's stuff again as I think this definitely has
>> parallels with some of the Discovery work, as, of course, does some of the
>> recent discussion on here about the question of the indexing of library
>> catalogues by search engines.
>> Thanks again to all who have contributed to the discussion - very useful
>> Owen Stephens
>> Owen Stephens Consulting
>> Web: http://www.ostephens.com
>> Telephone: 0121 288 6936
>> On 1 Mar 2012, at 11:42, Ed Summers wrote:
>>> On Mon, Feb 27, 2012 at 12:15 PM, Jason Ronallo wrote:
>>>> I'd like to bring this back to your suggestion to just forget OAI-PMH
>>>> and crawl the web. I think that's probably the long-term way forward.
>>> I definitely had the same thoughts while reading this thread. Owen,
>>> are you forced to stay within the context of OAI-PMH because you are
>>> working with existing institutional repositories? I don't know if it's
>>> appropriate, or if it has been done before, but as part of your work
>>> it would be interesting to determine:
>>> a) how many IRs allow crawling (robots.txt or lack thereof)
>>> b) how many IRs support crawling with a sitemap
>>> c) how many IR HTML splashpages use the rel-license [1] pattern
>>> d) how many IRs support syndication (RSS/Atom) to publish changes
>>> If you could do this in a semi-automated way for the UK it would be
>>> great if you could then apply it to IRs around the world. It would
>>> also align really nicely with the sort of work that Jason has been
>>> doing around CAPS [2].
>>> It seems to me that there might be an opportunity to educate digital
>>> repository managers about better aligning their content w/ the Web ...
>>> instead of trying to cook up new standards. I imagine this is way out
>>> of scope for what you are currently doing--if so, maybe this can be
>>> your next grant :-)
>>> [1] http://microformats.org/wiki/rel-license
>>> [2] https://github.com/jronallo/capsys
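For questions (c) and (d) in Ed's list, a semi-automated check might look something like the sketch below, which scans one splash page for rel="license" links and for RSS/Atom feed autodiscovery links. It is a minimal illustration only: the sample HTML, the SplashPageScanner class and the scan function are hypothetical names invented here, and a real survey would also need to fetch pages, handle errors and respect robots.txt.

```python
from html.parser import HTMLParser

# MIME types commonly used for feed autodiscovery links.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class SplashPageScanner(HTMLParser):
    """Collect rel="license" links and RSS/Atom feed links from one page."""

    def __init__(self):
        super().__init__()
        self.license_links = []
        self.feed_links = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        # rel can hold several space-separated values, e.g. "license nofollow".
        rels = (attr.get("rel") or "").lower().split()
        if "license" in rels:
            self.license_links.append(href)
        if "alternate" in rels and (attr.get("type") or "").lower() in FEED_TYPES:
            self.feed_links.append(href)


def scan(html):
    """Return (license_links, feed_links) found in the given HTML text."""
    scanner = SplashPageScanner()
    scanner.feed(html)
    return scanner.license_links, scanner.feed_links


# Made-up splash page, for illustration only.
SAMPLE = """
<html><head>
<link rel="alternate" type="application/atom+xml" href="/feed.atom">
</head><body>
<a rel="license" href="http://creativecommons.org/licenses/by/3.0/">CC BY</a>
</body></html>
"""

if __name__ == "__main__":
    licenses, feeds = scan(SAMPLE)
    print("license links:", licenses)
    print("feed links:", feeds)
```

Counting how many repositories return a non-empty list for each check would give rough figures for (c) and (d); questions (a) and (b) would need the robots.txt and sitemap files instead.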