LISTSERV 16.5 - CODE4LIB Archives

On 26 Feb 2012, at 14:42, Godmar Back wrote:

> May I ask a side question and make a side observation regarding the
> harvesting of full text of the object to which an OAI-PMH record refers?
> 
> In general, is the idea to use the <dc:source>/text() element, treat it as
> a URL, and then expect to find the object there (provided that there was a
> suitable <dc:type> and <dc:format> element)?
> 
I think dc:identifier is usually used to provide a URL for the item being described. The examples at http://www.openarchives.org/OAI/openarchivesprotocol.html#dublincore follow this, and the UK E-Thesis schema (http://naca.central.cranfield.ac.uk/ethos-oai/2.0/oai-uketd.xml) does as well.

> Example: http://scholar.lib.vt.edu/theses/OAI/cgi-bin/index.pl allows the
> harvesting of ETD metadata.  Yet, its metadata reads:
> 
> <ListRecords>
>   ....
>   <metadata>
>     <dc>
>        <type>text</type>
>        <format>application/pdf</format>
>        <source>
> http://scholar.lib.vt.edu/theses/available/etd-3345131939761081/</source>
>    ....
> 
> 
> When one visits
> http://scholar.lib.vt.edu/theses/available/etd-3345131939761081/ however
> there is no 'text' document of type 'application/pdf' - rather, it's an
> HTML title page that embeds links to one or more PDF documents, such as
> http://scholar.lib.vt.edu/theses/available/etd-3345131939761081/unrestricted/Walker_1.pdf to
> Walker_5.pdf.
> 
> Is VT's ETD OAI implementation deficient, or is OAI-PMH simply not set up
> to allow the harvesting of full-text without what would basically amount to
> crawling the ETD title page, or other repository-specific mechanisms?

This issue is certainly not unique to VT - we've come across it as part of our project. While the OAI-PMH record may point directly at the PDF, it can also point to an intermediary page. This seems to be standard practice in some instances - I think because there is a desire, or even a requirement, that a user should see the intermediary page (which may contain rights information etc.) before viewing the full-text item. There may also be an issue where multiple files exist for the same item - maybe several data files and a PDF of the thesis attached to the same metadata record - as the metadata exposed via OAI-PMH may not describe each asset.
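
To make the "crawl the intermediary page" fallback concrete, here is a minimal Python sketch that pulls PDF links out of a landing page. The sample markup is invented for illustration (it is not the real VT page), and the names PdfLinkCollector and find_pdf_links are hypothetical helpers, not part of any repository software. A real harvester would also need to fetch the page and respect the repository's access rules.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkCollector(HTMLParser):
    """Collect absolute URLs of <a href> links whose path ends in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the landing page URL.
        absolute = urljoin(self.base_url, href)
        if urlparse(absolute).path.lower().endswith(".pdf"):
            self.pdf_links.append(absolute)


def find_pdf_links(html_text, base_url):
    """Return the PDF links found in html_text, in document order."""
    collector = PdfLinkCollector(base_url)
    collector.feed(html_text)
    return collector.pdf_links


# Hypothetical landing page markup, invented for this example.
SAMPLE_PAGE = """
<html><body>
  <h1>Thesis title page</h1>
  <a href="unrestricted/Walker_1.pdf">Part 1</a>
  <a href="unrestricted/Walker_2.pdf">Part 2</a>
  <a href="/copyright.html">Rights statement</a>
</body></html>
"""

BASE = "http://scholar.lib.vt.edu/theses/available/etd-3345131939761081/"

if __name__ == "__main__":
    for link in find_pdf_links(SAMPLE_PAGE, BASE):
        print(link)
```

This only looks at the file extension in the link, so it would miss PDFs served from URLs without a .pdf suffix.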

I suspect you'd see some specific approaches depending on the default settings in different packages. For example, this (highly truncated) record from Southampton (who use EPrints) differentiates the full-text link from the repository page by using dc:relation for the latter and dc:identifier for the former:

<record>
    <header>
      <identifier>oai:eprints.soton.ac.uk:66183</identifier>
    </header>
    <metadata>
      <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
        <dc:title>A methodology for developing high damping materials with application to noise reduction of railway track</dc:title>
        <dc:creator>Ahmad, Nazirah</dc:creator>
        <dc:format>application/pdf</dc:format>
        <dc:identifier>http://eprints.soton.ac.uk/66183/2451/P2503.pdf</dc:identifier>
        <dc:relation>http://eprints.soton.ac.uk/66183/</dc:relation>
      </oai_dc:dc>
    </metadata>
</record>
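
As a rough illustration of how a harvester might read such a record, the Python sketch below uses the standard library's ElementTree to pull out the dc:identifier and dc:relation values. The sample is a cut-down copy of the Southampton record (with the xsi:schemaLocation attribute dropped so the fragment parses on its own), and dc_values is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes as used in oai_dc records.
NS = {
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Cut-down copy of the Southampton record, trimmed for this example.
SAMPLE_RECORD = """
<record>
  <header><identifier>oai:eprints.soton.ac.uk:66183</identifier></header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:format>application/pdf</dc:format>
      <dc:identifier>http://eprints.soton.ac.uk/66183/2451/P2503.pdf</dc:identifier>
      <dc:relation>http://eprints.soton.ac.uk/66183/</dc:relation>
    </oai_dc:dc>
  </metadata>
</record>
"""


def dc_values(record_xml, element):
    """Return the text of every dc:<element> in the record, in order."""
    root = ET.fromstring(record_xml.strip())
    found = root.iterfind(f".//dc:{element}", NS)
    return [el.text.strip() for el in found if el.text]


if __name__ == "__main__":
    print("full text:", dc_values(SAMPLE_RECORD, "identifier"))
    print("repository page:", dc_values(SAMPLE_RECORD, "relation"))
```

Treating dc:identifier as the full-text link is only safe for repositories that follow this convention, so it would need checking per repository.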

By contrast, this one from Cambridge (DSpace) uses a single 'handle' as the identifier, which just links to the repository page. Also note that this 'item' actually consists of two files - a video and a transcript in MS Word:

<record>
	<header>
		<identifier>oai:www.dspace.cam.ac.uk:1810/29</identifier>
	</header>
	<metadata>
		<oai_dc:dc xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
			<dc:title>Interview with Professor Lucy Mair</dc:title>
			<dc:creator>Macfarlane, Alan</dc:creator>
			<dc:type>Video</dc:type>
			<dc:format>25088 bytes</dc:format>
			<dc:format>413196863 bytes</dc:format>
			<dc:format>application/msword</dc:format>
			<dc:format>application/octet-stream</dc:format>
			<dc:identifier>http://www.dspace.cam.ac.uk/handle/1810/29</dc:identifier>
			<dc:language>en_GB</dc:language>
		</oai_dc:dc>
	</metadata>
</record>
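
One possible heuristic for telling the two cases apart (an assumption on my part, not something OAI-PMH defines) is to guess a MIME type from the identifier URL and see whether it matches one of the declared dc:format values. The sketch below uses Python's standard mimetypes module; looks_like_direct_file is a hypothetical name.

```python
import mimetypes


def looks_like_direct_file(url, declared_formats):
    """Guess whether url points at a file rather than a landing page.

    Heuristic only: the URL's extension must map to a MIME type that
    also appears among the record's dc:format values.
    """
    guessed, _encoding = mimetypes.guess_type(url)
    return guessed is not None and guessed in declared_formats


if __name__ == "__main__":
    # Values copied from the two records quoted in this message.
    soton = looks_like_direct_file(
        "http://eprints.soton.ac.uk/66183/2451/P2503.pdf",
        ["application/pdf"],
    )
    cam = looks_like_direct_file(
        "http://www.dspace.cam.ac.uk/handle/1810/29",
        ["25088 bytes", "413196863 bytes",
         "application/msword", "application/octet-stream"],
    )
    print("Southampton identifier is a direct file:", soton)
    print("Cambridge identifier is a direct file:", cam)
```

When the check fails, as it does for the handle URL, the harvester is back to inspecting the repository page itself.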

> 
> On a related note, regarding rights. As a faculty member, I regularly sign
> ETD approval forms.  At Tech, students have three options to choose from:
> (a) open and immediate access, (b) restricted to VT for 1 year, (c)
> withhold access completely for 1 year for patent/security purposes.  The
> current form does not allow student authors to address whether the
> full-text of their dissertation may be harvested for the purposes of
> full-text indexing in such indexes as Google or Summon, nor does it allow
> them to restrict where copies are served from.  Similarly, the dc:rights
> section in the OAI-PMH records addresses copyright only.  In practice, Google
> crawls, indexes, and serves full-text copies of our dissertations.
> 


Of course, it is absolutely reasonable that some content either not be open or have an embargo period - in which case I'd expect it to either not be added to the repository, or added and protected by some security which prevents public access. I know that in some cases authors wish to delay release of the thesis in order to publish a book which may draw on the PhD research - and this can take several years, although different institutions set different limits on this. I also know of at least one case where a PhD contained information that was deemed so confidential, it was agreed never to release it (I wasn't allowed to know what the information was!)

In theory copyright could be seen as sufficient to cover the use of the full-text item by third parties - either Google is protected by fair use (in the US anyway) or not. Unfortunately (and this would certainly be true in the UK), the only way of really discovering whether you have a case against Google would be to take them to court. Google would say (as they did to the newspapers) "it's easy to request we don't index/cache your content - we obey robots.txt". Which sort of brings me back to the starting point of the project I'm working on. While two wrongs don't make a right, it seems to us that if repositories are not preventing Google (or others - notably CiteSeerX, which is in the business of crawling repositories: http://csxstatic.ist.psu.edu/about/crawler) from crawling/indexing/caching their content, then we hope that a non-profit, publicly funded service should feel able to do the same, in the interests of making the content of repositories more discoverable and more widely disseminated.
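
For what it's worth, honouring robots.txt in a harvester is straightforward. The Python sketch below uses the standard library's urllib.robotparser against an invented robots.txt (the rules, the repo.example.org host and the "ExampleHarvester" user agent are all made up for illustration) and parses the text directly, so it does not touch the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /theses/restricted/
"""


def build_parser(robots_txt):
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_harvest(parser, user_agent, url):
    """Return True if the rules allow user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_parser(SAMPLE_ROBOTS_TXT)
    agent = "ExampleHarvester"  # hypothetical user agent name
    for url in (
        "http://repo.example.org/theses/available/etd-1/thesis.pdf",
        "http://repo.example.org/theses/restricted/etd-2/thesis.pdf",
    ):
        print(url, "->", may_harvest(rules, agent, url))
```

This only covers the robots.txt convention. It says nothing about embargoes or rights statements, which would still need handling separately.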

Owen


Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Telephone: 0121 288 6936