LISTSERV 16.5 - CODE4LIB Archives

Thanks Ian,

Agree that it is clear from this discussion that there are differing viewpoints and also very different requirements depending on the context and desired outcomes.

I think I said earlier in the thread - I'm not against niche solutions, they just make me want to double check that they are justified. For me I'd say the jury is still out on 'crawl' vs 'harvest' - but I think it definitely needs more investigation and thought - and of course different problems require different solutions. It would be interesting to try to go through the case for OAI-PMH, especially specific examples where it has achieved something that would have been difficult/impossible to do with more general solutions. Not sure if that could be done here on list, or better/easier through other discussion - or both (possibly over that beer? :)

From the CORE project, any 'best practice' would be focussed on institutional research publication repositories, and it seems highly unlikely that we would make a recommendation on 'crawl' vs 'harvest' - we just won't have time to do enough work on this to understand the pros/cons of these even from our own singular perspective. I think any recommendations are more along the lines of ensuring robots.txt is consistent with other policies; the impact of using splash pages as opposed to links to actual resources in the OAI-PMH feed; configuring access to embargoed papers (as per Raffaele's suggestion); how to deal with multi-part resources etc. Anything coming out of the project would, of course, be just one project's recommendations for JISC to consider, not more than that.
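
As an illustration of the robots.txt consistency point, here is a minimal Python sketch that flags URLs exposed in an OAI-PMH oai_dc record but disallowed by the repository's robots.txt. The host repository.example.ac.uk, the sample robots.txt and record, and the function names are all hypothetical; a real check would fetch both from the repository in question.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Dublin Core element namespace used by oai_dc records.
DC_IDENTIFIER = "{http://purl.org/dc/elements/1.1/}identifier"


def feed_urls(record_xml):
    """Return the http(s) dc:identifier values found in one oai_dc record."""
    root = ET.fromstring(record_xml)
    values = [(el.text or "").strip() for el in root.iter(DC_IDENTIFIER)]
    return [v for v in values if v.startswith(("http://", "https://"))]


def blocked_feed_urls(robots_txt, urls, user_agent="*"):
    """Return the URLs that robots.txt forbids user_agent to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if not parser.can_fetch(user_agent, u)]


# Hypothetical data: the feed links to a PDF that robots.txt disallows.
ROBOTS_TXT = """\
User-agent: *
Disallow: /bitstream/
"""

RECORD_XML = """\
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:identifier>http://repository.example.ac.uk/handle/123/456</dc:identifier>
  <dc:identifier>http://repository.example.ac.uk/bitstream/123/456/1/paper.pdf</dc:identifier>
</oai_dc:dc>
"""

if __name__ == "__main__":
    for url in blocked_feed_urls(ROBOTS_TXT, feed_urls(RECORD_XML)):
        print("In OAI-PMH feed but disallowed by robots.txt:", url)
```

In this made-up case the splash page is crawlable but the full-text link is not, which is the kind of mismatch between feed and crawl policy that a recommendation might cover.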

Cheers,

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: [log in to unmask]
Telephone: 0121 288 6936

On 1 Mar 2012, at 14:38, Ian Ibbotson wrote:

> Owen...
> 
> Just wanted to say that, whilst I've been silent since my initial response,
> I'm not sure I agree with all the viewpoints presented here. From the point
> of view of (for example) CultureGrid, I'm not sure what has been done could
> have been pragmatically achieved solely with web crawling as it's described
> in this thread. Don't have a problem with anything that's been written here.
> It certainly represents a great cross-section of viewpoints. However, from a
> JISC discovery perspective, I don't want to contribute to any confirmation
> bias that we could dispose of pesky old OAI. I'd be interested in providing
> a counter-point to any "Best practice" document that suggested we could.
> 
> Ian.
> 
> On Thu, Mar 1, 2012 at 12:36 PM, Owen Stephens <[log in to unmask]> wrote:
> 
>> Thanks Jason and Ed,
>> 
>> I suspect within this project we'll keep using OAI-PMH because we've got
>> tight deadlines and the other project strands (which do stuff with the
>> harvested content) need time from the developer. At the moment it looks
>> like we will probably combine OAI-PMH with web crawling (using nutch) - so
>> use data from the
>> 
>> However, that said, one of the things we are meant to be doing is offering
>> recommendations or good practice guidelines back to the (repository)
>> community based on our experience. If we have time I would love to tackle
>> the questions (a)-(d) that you highlight here - perhaps especially (a) and
>> (c). Since this particular project is part of the wider JISC 'Discovery'
>> programme (http://discovery.ac.uk and tech principles at
>> http://technicalfoundations.ukoln.info/guidance/technical-principles-discovery-ecosystem)
>> - from which one of the main themes might be summarised as 'work with the
>> web' these questions are definitely relevant.
>> 
>> I need to look at Jason's stuff again as I think this definitely has
>> parallels with some of the Discovery work, as, of course, does some of the
>> recent discussion on here about the question of the indexing of library
>> catalogues by search engines.
>> 
>> Thanks again to all who have contributed to the discussion - very useful.
>> 
>> Owen
>> 
>> Owen Stephens
>> Owen Stephens Consulting
>> Web: http://www.ostephens.com
>> Email: [log in to unmask]
>> Telephone: 0121 288 6936
>> 
>> On 1 Mar 2012, at 11:42, Ed Summers wrote:
>> 
>>> On Mon, Feb 27, 2012 at 12:15 PM, Jason Ronallo <[log in to unmask]> wrote:
>>>> I'd like to bring this back to your suggestion to just forget OAI-PMH
>>>> and crawl the web. I think that's probably the long-term way forward.
>>> 
>>> I definitely had the same thoughts while reading this thread. Owen,
>>> are you forced to stay within the context of OAI-PMH because you are
>>> working with existing institutional repositories? I don't know if it's
>>> appropriate, or if it has been done before, but as part of your work
>>> it would be interesting to determine:
>>> 
>>> a) how many IRs allow crawling (robots.txt or lack thereof)
>>> b) how many IRs support crawling with a sitemap
>>> c) how many IR HTML splashpages use the rel-license [1] pattern
>>> d) how many IRs support syndication (RSS/Atom) to publish changes
>>> 
>>> If you could do this in a semi-automated way for the UK it would be
>>> great if you could then apply it to IRs around the world. It would
>>> also align really nicely with the sort of work that Jason has been
>>> doing around CAPS [2].
>>> 
>>> It seems to me that there might be an opportunity to educate digital
>>> repository managers about better aligning their content w/ the Web ...
>>> instead of trying to cook up new standards. I imagine this is way out
>>> of scope for what you are currently doing--if so, maybe this can be
>>> your next grant :-)
>>> 
>>> //Ed
>>> 
>>> [1] http://microformats.org/wiki/rel-license
>>> [2] https://github.com/jronallo/capsys
>>
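
Ed's questions (a)-(d) above lend themselves to a semi-automated check. The following is a rough Python sketch, not anything used by the projects discussed: it takes a repository's robots.txt text and the HTML of one splash page and reports on each question. The `survey` function, the `SplashPageScanner` class, the result keys and the sample data are all hypothetical, and fetching the two inputs over HTTP is left out. An empty robots.txt string stands in for a missing file and is treated as allowing crawling.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# MIME types used for RSS/Atom feed autodiscovery in <link> elements.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class SplashPageScanner(HTMLParser):
    """Collect rel="license" links and RSS/Atom autodiscovery links."""

    def __init__(self):
        super().__init__()
        self.license_links = []
        self.feed_links = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        rels = (attr.get("rel") or "").lower().split()
        if "license" in rels:
            self.license_links.append(href)
        link_type = (attr.get("type") or "").lower()
        if tag == "link" and "alternate" in rels and link_type in FEED_TYPES:
            self.feed_links.append(href)


def survey(robots_txt, splash_html, page_url, user_agent="*"):
    """Answer questions (a)-(d) for one repository splash page."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    # Sitemap lines are read directly from the robots.txt text.
    sitemaps = [
        line.split(":", 1)[1].strip()
        for line in robots_txt.splitlines()
        if line.strip().lower().startswith("sitemap:")
    ]
    scanner = SplashPageScanner()
    scanner.feed(splash_html)
    return {
        "a_crawl_allowed": robots.can_fetch(user_agent, page_url),
        "b_sitemaps": sitemaps,
        "c_rel_license": scanner.license_links,
        "d_feeds": scanner.feed_links,
    }


# Hypothetical sample inputs for a single repository.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /admin/
Sitemap: http://repository.example.ac.uk/sitemap.xml
"""

SAMPLE_HTML = """\
<html><head>
<link rel="alternate" type="application/atom+xml" href="/feed/atom">
</head><body>
<a rel="license" href="http://creativecommons.org/licenses/by/3.0/">CC BY</a>
</body></html>
"""

if __name__ == "__main__":
    page = "http://repository.example.ac.uk/handle/123/456"
    for key, value in survey(SAMPLE_ROBOTS, SAMPLE_HTML, page).items():
        print(key, value)
```

Run over a list of repository base URLs, the per-repository results could be tallied to give the counts Ed asks for. A sitemap that exists but is not declared in robots.txt, or a feed linked only from the repository home page, would be missed by this sketch.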