LISTSERV 16.5 - CODE4LIB Archives

On Mon, Feb 27, 2012 at 12:15 PM, Jason Ronallo <[log in to unmask]> wrote:
> I'd like to bring this back to your suggestion to just forget OAI-PMH
> and crawl the web. I think that's probably the long-term way forward.

I definitely had the same thoughts while reading this thread. Owen,
are you forced to stay within the context of OAI-PMH because you are
working with existing institutional repositories? I don't know if it's
appropriate, or if it has been done before, but as part of your work
it would be interesting to determine:

a) how many IRs allow crawling (robots.txt or lack thereof)
b) how many IRs support crawling with a sitemap
c) how many IR HTML splashpages use the rel-license [1] pattern
d) how many IRs support syndication (RSS/Atom) to publish changes

If you could do this in a semi-automated way for the UK it would be
great if you could then apply it to IRs around the world. It would
also align really nicely with the sort of work that Jason has been
doing around CAPS [2].

It seems to me that there might be an opportunity to educate digital
repository managers about better aligning their content w/ the Web ...
instead of trying to cook up new standards. I imagine this is way out
of scope for what you are currently doing--if so, maybe this can be
your next grant :-)

//Ed

[1] http://microformats.org/wiki/rel-license
[2] https://github.com/jronallo/capsys