On Feb 24, 2012, at 9:25 AM, Kyle Banerjee wrote:

>> 
>> One of the questions this raises is what we are/aren't allowed to do in
>> terms of harvesting full-text. While I realise we could get into legal
>> stuff here, at the moment we want to put that question to one side. Instead
>> we want to consider what Google, and other search engines, do, the
>> mechanisms available to control this, and what we do, and the equivalent
>> mechanisms - our starting point is that we don't feel we should be at a
>> disadvantage to a web search engine in our harvesting and use of repository
>> records.
>> 
>> Of course, Google and other crawlers can crawl the bits of the repository
>> that are on the open web, and 'good' crawlers will obey the contents of
>> robots.txt
>> We use OAI-PMH, and while we often see (usually general and sometimes
>> contradictory) statements about what we can/can't do with the contents of a
>> repository (or a specific record), it feels like there isn't a nice simple
>> mechanism for a repository to say "don't harvest this bit".
>> 
> 
> I would argue there is -- the whole point of OAI-PMH is to make stuff
> available for harvesting. If someone goes to the trouble of making things
> available via a protocol that exists only to make things harvestable and
> then doesn't want it harvested, you can dismiss them as being totally
> mental.

I see it like the people who request that their pages not be cached elsewhere: they want to make their objects 'discoverable', but they want to control access to those objects -- so it's one thing for a search engine to get a copy, but they don't want that search engine acting as an agent that distributes copies to others.

E.g., all of the journal publishers who charge access fees: they want you to find that they have a copy of the article you're interested in ... but they want to collect their $35 before you can read it.
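
As an aside, on the crawler side the "don't take this bit" control mentioned above is at least mechanical: a repository can disallow paths in robots.txt, and a well-behaved harvester can check that before fetching anything. A rough sketch in Python, with a made-up repository URL and user agent, would be something like:

    from urllib.robotparser import RobotFileParser

    # Hypothetical repository base URL -- substitute the real one.
    rp = RobotFileParser()
    rp.set_url("https://repository.example.edu/robots.txt")
    rp.read()

    url = "https://repository.example.edu/record/1234/fulltext.pdf"
    if rp.can_fetch("MyHarvester/1.0", url):
        print("robots.txt allows fetching", url)
    else:
        print("robots.txt disallows fetching", url)

OAI-PMH itself doesn't give you an equivalent per-record switch, which I think is the gap being described in the quoted message.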

In the case of scientific data, the problem is that to make stuff discoverable, we often have to perform some lossy transformation to fit some metadata standard, and those standards rarely have mechanisms for describing error (accuracy, precision, etc.).  You can do some science with the catalog records, but it's going to introduce some bias into your results, so you're typically better off getting the data from the archive.  (And sometimes they have nice, clean catalogs in FITS, VOTable, CDF, NetCDF, HDF, or whatever their discipline's preferred data format is.)
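
To make that concrete, here's a toy sketch of the kind of lossy mapping I mean. The field names are made up rather than taken from any particular standard, but the point is that a measurement's uncertainty has nowhere to go in the flat discovery record:

    # Toy example: a rich archival record vs. a flat "discovery" record.
    # Field names are hypothetical, not from any real metadata standard.

    archive_record = {
        "title": "Solar wind proton density, 2011 DOY 032",
        "density_cm3": 5.2,
        "density_uncertainty_cm3": 0.4,   # measurement error
        "instrument_mode": "burst",
        "calibration_version": "v3.1",
    }

    def to_discovery_record(rec):
        # The catalog schema only has title/description slots, so the
        # uncertainty, instrument mode, and calibration details either
        # get dropped or mashed into free text.
        return {
            "title": rec["title"],
            "description": "Proton density ~%s cm^-3" % rec["density_cm3"],
        }

    print(to_discovery_record(archive_record))

Anyone doing science from that catalog record has lost the error bars, which is why you go back to the archive's native files.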

...

Also, I don't know if things have changed in the last year, but I seem to remember someone mentioning at last year's RDAP (Research Data Access & Preservation) summit that Google had coordinated with some libraries for feeds from their catalogs, but was only interested in books, not other objects.

I don't know how other search engines might use data from OAI-PMH, or if they'd filter it because they didn't consider it to be information they cared about.

-Joe