> One of the questions this raises is what we are/aren't allowed to do in
> terms of harvesting full text. While I realise we could get into legal
> stuff here, at the moment we want to put that question to one side.
> Instead we want to consider what Google and other search engines do, the
> mechanisms available to control this, and what we do, and the equivalent
> mechanisms. Our starting point is that we don't feel we should be at a
> disadvantage to a web search engine in our harvesting and use of
> repository records.
>
> Of course, Google and other crawlers can crawl the bits of the repository
> that are on the open web, and 'good' crawlers will obey the contents of
> robots.txt.
>
> We use OAI-PMH, and while we often see (usually general and sometimes
> contradictory) statements about what we can/can't do with the contents of
> a repository (or a specific record), it feels like there isn't a nice
> simple mechanism for a repository to say "don't harvest this bit".

I would argue there is -- the whole point of OAI-PMH is to make stuff
available for harvesting. If someone goes to the trouble of making things
available via a protocol that exists only to make things harvestable and
then doesn't want them harvested, you can dismiss them as being totally
mental.

OAI-PMH runs on top of HTTP, so anything in robots.txt already applies --
i.e. if they want you to crawl metadata only but not download the objects
themselves because they don't want to deal with the load or bandwidth
charges, this should be indicated there for all crawlers.

kyle

--
----------------------------------------------------------
Kyle Banerjee
Digital Services Program Manager
Orbis Cascade Alliance
[log in to unmask] / 503.999.9787
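To illustrate the point about robots.txt applying to OAI-PMH harvesting: since the protocol runs over plain HTTP, a harvester can check a repository's robots.txt before fetching objects, just as a web crawler would. Here is a minimal sketch using Python's standard-library `urllib.robotparser`; the robots.txt rules, hostname, and URLs are hypothetical examples of a repository that permits metadata harvesting via its OAI-PMH endpoint but disallows bulk download of the objects themselves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical repository policy: allow the OAI-PMH endpoint,
# disallow crawling of the stored objects (full text, PDFs, etc.).
robots_txt = """\
User-agent: *
Allow: /oai
Disallow: /objects/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

agent = "*"  # a polite harvester would use its own User-agent string here
# Metadata harvesting via OAI-PMH is permitted:
print(rp.can_fetch(agent, "http://repo.example.org/oai?verb=ListRecords"))
# Downloading the objects themselves is not:
print(rp.can_fetch(agent, "http://repo.example.org/objects/123/fulltext.pdf"))
```

In practice the harvester would fetch `http://repo.example.org/robots.txt` with `rp.set_url(...)` and `rp.read()` instead of parsing a literal string, but the decision logic is the same.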