We are one of those institutions that did this: negotiated for lots of content YEARS ago (before the providers really knew what
they or we were in for...).
We have locally loaded records from the ISI databases, INSPEC, BIOSIS, and the Department of Energy (as well as from full-text
publishers, but that is another story and system entirely). Aside from the contracts, I can also attest to the major amount of
work it has been. We have 95M bibliographic records, stored on more than 75 TB of disk, and counting. It's all running on Solr, with a local interface
and the distributed aDORe repository on the back end. About 2 FTE keep it running in production now.
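For anyone wondering what "a local interface on Solr" boils down to in practice, here is a minimal sketch of querying an index like this over HTTP in Python. The host, core, and field names are made up for illustration; they are not our actual setup.

import requests

SOLR_SELECT = "http://solr.example.org:8983/solr/biblio/select"  # hypothetical core

def search(query, rows=10):
    # Ask Solr's select handler for a JSON response with a few stored fields.
    params = {
        "q": query,                      # e.g. 'title:"carbon nanotube" AND year:2009'
        "wt": "json",
        "rows": rows,
        "fl": "id,title,author,year",    # hypothetical field names
    }
    resp = requests.get(SOLR_SELECT, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["response"]["docs"]

for doc in search('title:"federated search"'):
    print(doc["id"], doc.get("title"))

Nothing exotic on the query side; the hard part is everything upstream of the index.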
Over the 15 years we've been loading this, we've had to migrate it 3 times and deal with all the dirty metadata, duplication,
and other difficult issues around scale, not to mention the lack of content-provider "interest" in supporting the few of us who do this kind of stuff.
We believe we have now settled on a standardized format (MPEG-21 DIDL and MARCXML, with some other standards mixed in), accessible
through protocol-based services (OpenURL, REST, OAI-PMH, etc.), so we hope we won't have to mess with the data records
again and can move on to other, more interesting things.
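If it helps make "protocol-based services" concrete, a bare-bones OAI-PMH harvest of MARCXML records looks roughly like the sketch below. The base URL and metadataPrefix are placeholders, not our actual endpoint; a real harvester would check Identify and ListMetadataFormats first.

import requests
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
BASE_URL = "http://repository.example.org/oai"   # hypothetical endpoint

def harvest(metadata_prefix="marcxml"):
    # Walk ListRecords pages, following resumptionToken until the harvest is done.
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        resp = requests.get(BASE_URL, params=params, timeout=60)
        resp.raise_for_status()
        root = ET.fromstring(resp.content)
        for record in root.iter(OAI + "record"):
            yield record                          # one record wrapper (header + MARCXML)
        token = root.find(".//" + OAI + "resumptionToken")
        if token is None or not (token.text or "").strip():
            break
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

for i, rec in enumerate(harvest()):
    if i >= 3:
        break
    print(ET.tostring(rec, encoding="unicode")[:200])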
It is nice to have and very fast - it very much beats federated search - and it allows us (finally) to begin to build neat services (for licensed users only!). Data mining?
Of course that's a goal, but talk about sticky areas of contract negotiation. And in the end, you never have everything someone
needs when they want all content about something specific. And yes, local loading is expensive, for a lot of reasons.
Ex Libris, Summon, etc. are now getting into the game from this angle. We can so feel their pain, but I hope technology
and content-provider engagement have improved enough to make it a bit easier for them! And it definitely offers a level of usability
much improved over federated search.
My .02,
Miriam Blake
Los Alamos National Laboratory Research Library
On 6/30/10 3:20 PM, "Rosalyn Metz" <[log in to unmask]> wrote:
I know that there are institutions that have negotiated contracts for just
the content, sans interface. But those that I know of have TONS of money
and are using a third-party interface that ingests the data for them. I'm not
sure what the terms of those contracts were or how they get the data, but it
can be done.
On Wed, Jun 30, 2010 at 5:07 PM, Cory Rockliff <[log in to unmask]> wrote:
> We're looking at an infrastructure based on MarkLogic running on Amazon
> EC2, so the scale of data to be indexed shouldn't actually be that big of an
> issue. Also, as I said to Jonathan, I only see myself indexing a handful of
> highly relevant resources, so we're talking millions, rather than 100s of
> millions, of records.
>
>
> On 6/30/2010 4:22 PM, Walker, David wrote:
>
>> You might also need to factor an extra server or three (in the cloud or
>> otherwise) into that equation, given that we're talking 100s of millions of
>> records that will need to be indexed.
>>
>>
>>
>>> companies like iii and Ex Libris are the only ones with
>>> enough clout to negotiate access
>>>
>>>
>> I don't think III is doing any kind of aggregated indexing, hence their
>> decision to try and leverage APIs. I could be wrong.
>>
>> --Dave
>>
>> ==================
>> David Walker
>> Library Web Services Manager
>> California State University
>> http://xerxes.calstate.edu
>> ________________________________________
>> From: Code for Libraries [[log in to unmask]] On Behalf Of Jonathan
>> Rochkind [[log in to unmask]]
>> Sent: Wednesday, June 30, 2010 1:15 PM
>> To: [log in to unmask]
>> Subject: Re: [CODE4LIB] DIY aggregate index
>>
>> Cory Rockliff wrote:
>>
>>
>>> Do libraries opt for these commercial 'pre-indexed' services simply
>>> because they're a good value proposition compared to all the work of
>>> indexing multiple resources from multiple vendors into one local index,
>>> or is it that companies like iii and Ex Libris are the only ones with
>>> enough clout to negotiate access to otherwise-unavailable database
>>> vendors' content?
>>>
>>>
>>>
>> A little bit of both, I think. A library probably _could_ negotiate
>> access to that content... but it would be a heck of a lot of work. Once
>> you factor in the staff time for those negotiations, it becomes a good value
>> proposition, regardless of how much the licensing would cost you. And
>> yeah, then there's the staff time to actually ingest and normalize and
>> troubleshoot data flows for all that stuff on a regular basis -- I've
>> heard stories of libraries that tried to do that in the early 90s and it
>> was nightmarish.
>>
>> So, actually, I guess I've arrived at convincing myself it's mostly
>> "good value proposition", in that a library probably can't afford to do
>> that on their own, with or without licensing issues.
>>
>> But I'd really love to see you try anyway, maybe I'm wrong. :)
>>
>>
>>
>>> Can I assume that if a database vendor has exposed their content to me
>>> as a subscriber, whether via z39.50 or a web service or whatever, that
>>> I'm free to cache and index all that metadata locally if I so choose? Is
>>> this something to be negotiated on a vendor-by-vendor basis, or is it an
>>> impossibility?
>>>
>>>
>>>
>> I doubt you can assume that. I don't think it's an impossibility.
>>
>> Jonathan
>>
>
>
> --
> Cory Rockliff
> Technical Services Librarian
> Bard Graduate Center: Decorative Arts, Design History, Material Culture
> 18 West 86th Street
> New York, NY 10024
> T: (212) 501-3037
> [log in to unmask]
>
>