Eric Lease Morgan wrote:

> [snip] Screen scraping is for the birds!! If we, libraries, are so
> much about standards, then why do we tolerate interfaces to indexes
> that force us to do screen scraping. As soon as they change their
> interface you have all sorts of new work to do. "Just give me the data."

I would be the last one to disagree with the intent of Eric's
statement.  Consistent, standards-driven interfaces make content more
accessible for all.

Part of the challenge of the "dark web" is that it is full of unique
sets of data, both large and small, that have significance to specific
users in specific contexts and come with relatively custom interfaces to
suit them (in the best of worlds, built with a standard set of HTML
widgets).  In courses on cataloguing many, many years ago I was drawn to
the insight that publishers don't give a damn about librarians and
consistent title pages (and versos).  Designers especially don't give a
damn.  I'm not sure why I'd ever hoped that the web would be better.

We do screen scraping not from choice, but from necessity.

In my case, I do as little as possible, framing the target so that users
can carry on in the native environment.  All I really need are (a rough
sketch follows the list):
    a) the success/failure indicators (a result count when possible)
    b) a quick tweak to supply a base href
    c) the ability to pass permissions along (session IDs, cookies, etc.)
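For what it's worth, a minimal sketch of that kind of light-touch
wrapping might look like the Python below.  The target URL, the query
parameter name, and the result-count pattern are all made up for the
example; every target needs its own.

    import re
    import requests

    def wrap_target(query, session_cookies=None):
        """Search a hypothetical target and lightly wrap what it returns."""
        base = "https://catalogue.example.org/"             # assumed target URL
        resp = requests.get(base + "search",
                            params={"q": query},            # assumed parameter name
                            cookies=session_cookies or {})  # (c) pass permissions along

        # (a) success/failure indicator: scrape a result count if one is visible
        match = re.search(r"(\d+)\s+results", resp.text, re.IGNORECASE)
        count = int(match.group(1)) if match else None

        # (b) the one tweak: add a <base href> so relative links keep working
        html = resp.text.replace("<head>", '<head><base href="%s">' % base, 1)

        return count, html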

In short, if you can't get just the data (Eric's point, and the moral
high ground), touch the wrapper as little as possible and move on.
Various screen-scraping exercises have to find their own stopping point
in manipulating the content of the result set/HTML page.  The less you
touch, the further it scales.

Perhaps that's the essential difference between a "federated" search and
one where you attempt to "unify" the result set.  The approach I have
taken does not attempt to dedupe or re-sort or merge the disparate
results in any useful way.

Walter Lewis
Halton Hills