I'm currently working on a project at The Open University in the UK called CORE (http://core-project.kmi.open.ac.uk/), which harvests metadata and full text from institutional repositories at UK universities and then analyses the text to calculate (and make openly available) a measure of 'semantic similarity' between papers. The idea is to enable discovery of similar items (or, I guess, dissimilar items if you wanted).

One of the questions this raises is what we are/aren't allowed to do in terms of harvesting full text. While I realise we could get into legal stuff here, for the moment we want to put that question to one side. Instead we want to compare what Google and other search engines do, and the mechanisms available to control them, with what we do and the equivalent mechanisms for controlling us - our starting point is that we don't feel we should be at a disadvantage to a web search engine in our harvesting and use of repository records.

Of course, Google and other crawlers can crawl the bits of the repository that are on the open web, and 'good' crawlers will obey the contents of robots.txt.
We use OAI-PMH, and while we often see (usually general and sometimes contradictory) statements about what we can/can't do with the contents of a repository (or a specific record), it feels like there isn't a nice simple mechanism for a repository to say "don't harvest this bit".

To take an example (at random-ish, and not to pick on anyone), the University of Southampton's repository has the following robots.txt:
User-agent: *
Sitemap: http://eprints.soton.ac.uk/sitemap.xml
Disallow: /cgi/
Disallow: /66183/
This essentially allows Google et al to crawl the whole repository, with the exception of a single paper.
However, the OAI-PMH interface allows the whole repository to be harvested, and there seems to be nothing special about that particular paper, and nothing to say "please don't harvest".
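(To make that concrete: the whole repository can be harvested with standard ListRecords requests against the OAI-PMH base URL - for an EPrints repository that's typically something along the lines of the request below, though I haven't checked Southampton's exact endpoint:

http://eprints.soton.ac.uk/cgi/oai2?verb=ListRecords&metadataPrefix=oai_dc

and nothing in that exchange corresponds to the Disallow lines in robots.txt.)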
Where there are statements via OAI-PMH about what is/isn't allowed, we find these are usually expressed as textual policies intended for human consumption, not designed to be (easily) machine readable.
We are left thinking it would be helpful if there was an equivalent of robots.txt for OAI-PMH interfaces. I've been asked to make a proposal for discussion, and I'd be interested in any ideas/comments code4lib people have. At the moment I'm wondering about:
a) a simple file like robots.txt which can allow/disallow harvesters (equivalent to User-agent), and allow/disallow records using a list of record ids (or set ids?) - see the rough sketch below
b) using the X-Robots-Tag HTTP response header - again, see the example below
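For (a), I'm imagining something along these lines - completely made-up syntax, reusing the robots.txt field names, with an EPrints-style OAI identifier and an invented set name just to give a flavour:

User-agent: *
Disallow: oai:eprints.soton.ac.uk:66183
Disallow-set: restricted-theses

User-agent: SomeHarvesterYouDontLike
Disallow: *

For (b), the repository would add the existing X-Robots-Tag header (already used to control web search engines at the level of individual resources) to the relevant responses, e.g. on the full-text download:

HTTP/1.1 200 OK
Content-Type: application/pdf
X-Robots-Tag: noindex, noarchive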
The X-Robots-Tag approach has the advantage of being an existing mechanism, but I wonder how fiddly it might be to implement, while a simple file in a known location might be easier. Anyway, thoughts appreciated - and alternatives to these of course. One obvious alternative that I keep coming back to is 'forget OAI-PMH, just crawl the web' ...
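For what it's worth, on the harvesting side honouring the header looks straightforward enough - a rough, untested sketch in Python using only the standard library, where the URL and the treatment of the directives are just my guesses at sensible behaviour rather than any agreed standard:

import urllib.request

def allowed_to_harvest(url):
    # Ask for headers only; X-Robots-Tag is set per-resource by the server
    request = urllib.request.Request(url, method='HEAD')
    with urllib.request.urlopen(request) as response:
        tag = response.headers.get('X-Robots-Tag', '')
    # Treat 'noindex' or 'none' as "please don't harvest this"
    directives = {d.strip().lower() for d in tag.split(',')}
    return not ({'noindex', 'none'} & directives)

# e.g. before fetching the full text of a harvested record (hypothetical URL):
if allowed_to_harvest('http://repository.example.org/12345/1/paper.pdf'):
    pass  # fetch and process the full text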
Thanks,
Owen
Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: [log in to unmask]
Telephone: 0121 288 6936