On Feb 27, 2012, at 10:51 AM, Godmar Back wrote:
> On Mon, Feb 27, 2012 at 8:31 AM, Diane Hillmann wrote:
>> On Mon, Feb 27, 2012 at 5:25 AM, Owen Stephens wrote:

>>> This issue is certainly not unique to VT - we've come across this as part
>>> of our project. While the OAI-PMH record may point at the PDF, it can also
>>> point to an intermediary page. This seems to be standard practice in some
>>> instances - I think because there is a desire, or even requirement, that a
>>> user should see the intermediary page (which may contain rights information
>>> etc.) before viewing the full-text item. There may also be an issue where
>>> multiple files exist for the same item - maybe several data files and a PDF
>>> of the thesis attached to the same metadata record - as the metadata via
>>> OAI-PMH may not describe each asset.
>> This has been an issue since the early days of OAI-PMH, and many large
>> providers provide such intermediate pages (, for instance). The
>> other issue driving providers towards intermediate pages is that it allows
>> them to continue to derive statistics from usage of their materials, which
>> direct access URIs and multiple web caches don't.  For providers dependent
>> on external funding, this is a biggie.
> Why do you place direct access URI and multiple web caches into the same
> category? I follow your argument re: usage statistics for web caches, but
> as long as the item remains hosted in the repository direct access URIs
> should still be counted (provided proper cache-control headers are sent.)
> Perhaps it would require server-side statistics rather than client-based GA.

I'd agree -- if you can't get good statistics from direct linking, something's wrong with the methods you're using to collect usage information.  Google Analytics and similar tools might produce pretty reports, but they're really meant for tracking web sites, and they won't record anything when someone has JavaScript turned off, has specifically blacklisted the analytics server, or is retrieving anything that's not HTML.

You *really* need to analyze the server logs directly, as you can't be sure that all access is going to go through the intermediate 'landing pages' or that it'd be tracked even if it did.
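
For what it's worth, here is a minimal sketch of what counting downloads straight from the server logs could look like. It assumes Python and an Apache-style "combined" access log; the function name count_downloads and the ".pdf" filter are purely illustrative and don't come from any particular repository package. It counts only full successful GETs (status 200), so partial-content (206) responses and robot traffic would need their own handling.

```python
import re
from collections import Counter

# Assumption: Apache/NGINX "combined" log format, e.g.
# host ident user [time] "METHOD path PROTO" status size "referer" "agent"
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def count_downloads(lines, suffix=".pdf"):
    """Count successful GET requests per path for files ending in `suffix`.

    Hypothetical helper: every hit is counted whether the request arrived
    via a landing page, a direct link, or a non-browser client.
    """
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m is None:
            continue  # skip lines that are not in the assumed format
        if m["method"] != "GET" or m["status"] != "200":
            continue  # ignore HEAD requests, errors, redirects, etc.
        path = m["path"].split("?", 1)[0]  # drop any query string
        if path.lower().endswith(suffix):
            counts[path] += 1
    return counts


if __name__ == "__main__":
    # Assumption: the log lives at ./access.log; adjust for your server.
    with open("access.log", encoding="utf-8", errors="replace") as fh:
        for path, hits in count_downloads(fh).most_common(10):
            print(hits, path)
```

Because this reads the request lines themselves, it makes no difference whether the client ran any JavaScript or ever saw the landing page.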


I admit, the stuff I'm serving is a little different from what most people on this list serve, but we also have the issue that the collections are so large that we don't want people retrieving the files unless they really need them.  We serve multiple TB per day -- I'd rather a person figure out if they want a file *before* they retrieve it, rather than download a few GB of data and find out it won't serve their purposes.

It might not help our 'look how much we serve!' metrics to justify our funding, but it helps keep our costs down, and I personally believe it helps with goodwill in our designated community, as they don't spend a day (or more) downloading only to find it's not what they thought.  (And it fits in with Ranganathan's 4th law better than saving them from an extra click.)