rob caSSon wrote:
>[snip]it's very incomplete, but if anyone feels like taking a look, here it is:
>there is a tarball with the cgis, php scripts, and example result lists
>for the dbs
>i'm mainly posting this in case someone can figure out how better to
>parse some of the database search result lists (jstor has been
>particularly problematic)....oh, how i long for xml output.....
>the current scrapers i've got are ebsco, jstor, dataware (ohio-centric),
>and the III catalog....i'll tackle lexis, and a few others in the near future....
>anyway, comments/help appreciated....i'm using perl, lwp,
>html::treebuilder, cgi, and uri, and my perl is rusty at best....
I did a couple of proof-of-concept things over the winter using php and
curl, which is available on a number of platforms. I can't comment on
how well it performs relative to the configuration you're using but
libcurl is a smart, reasonably well-supported toolset. In my
configuration, I attempted to derive the number of results from the first
screen returned, using a simple regular expression. Failing that, I simply
put up a "success" flag, based largely on the *absence* of the target's
variation on the "no records could be found..." message.
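The count-then-fallback check can be sketched as follows (a rough sketch in Python rather than php/curl; the regexes and the sample vendor strings are illustrative, not the actual patterns from any target):

```python
import re

# look for an explicit hit count on the first results screen
HIT_COUNT = re.compile(r"(\d[\d,]*)\s+(?:records?|results?|hits?)\b", re.I)
# a target's variation on the "no records could be found..." message
NO_HITS = re.compile(r"no (?:records|results)[^.<]*found", re.I)

def parse_first_screen(html):
    """Return (success, count); count is None when only the absence
    of the failure message could be used as a success flag."""
    m = HIT_COUNT.search(html)
    if m:
        return True, int(m.group(1).replace(",", ""))
    if NO_HITS.search(html):
        return False, None
    # no count and no failure message: flag success, count unknown
    return True, None

print(parse_first_screen("<b>147 records</b> matched your query"))
print(parse_first_screen("Sorry, no records could be found."))
```

The second branch is the weaker signal, since any unrecognized error page from the target would also be flagged as a success.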
In terms of the navigation metaphor [note to those using the archives:
this URL is subject to change], try
A search like "Chicora" returns a meaningful, but not overwhelming set
of results from a number of targets including BGSU in Ohio.
When I tackled Ebsco, I ran into site-authentication issues: cookies
were passed to the search gateway but not on to the client browser.
Peter Binkley, at the University of Alberta, recommended a proxy
configuration to work around this issue. Essentially, those connections
would have to continue to operate inside a proxied search-gateway session.
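The gateway side of that arrangement amounts to keeping a cookie jar on the server and replaying the vendor's session cookies on every follow-up request. A minimal sketch with Python's standard `http.cookiejar` (the cookie name, value, and host here are made up for illustration):

```python
import http.cookiejar
import urllib.request

jar = http.cookiejar.CookieJar()

# simulate the Set-Cookie the vendor's auth step would hand the gateway
session_cookie = http.cookiejar.Cookie(
    version=0, name="EBSCO_SESSION", value="abc123",
    port=None, port_specified=False,
    domain="search.example.com", domain_specified=True,
    domain_initial_dot=False,
    path="/", path_specified=True,
    secure=False, expires=None, discard=True,
    comment=None, comment_url=None, rest={},
)
jar.set_cookie(session_cookie)

# a later search request through the gateway gets the cookie replayed
req = urllib.request.Request("http://search.example.com/results")
jar.add_cookie_header(req)
print(req.get_header("Cookie"))  # EBSCO_SESSION=abc123
```

The point is that the jar lives with the gateway for the life of the proxied session; the client browser never sees the vendor's cookies at all.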
I don't know how the perl tools stack up in terms of parallel search
streams. The php/curl combination is purely serial and the last targets
will time out if there is a tardy responder in the middle of the serial
queue. Art Rhyno, at the University of Windsor, suggested a parallel
approach might be possible in a Cocoon environment. This has the
advantage of passing all the inbound HTML pages through JTidy, giving
you the XHTML/XML-compliant input stream you wanted (in most cases, even
when the output from the target was some distance from compliance).
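The serial-versus-parallel problem can be seen in a small sketch: issue all the target requests at once and give the batch a shared deadline, so one tardy responder costs only its own slot rather than stalling everything behind it. This is Python's `concurrent.futures` standing in for whatever the real fetch layer would be, with made-up target names and delays:

```python
import concurrent.futures
import time

def fetch(target, delay):
    # stand-in for one HTTP round trip to a search target
    time.sleep(delay)
    return "ok"

# hypothetical targets; "jstor" plays the tardy responder
targets = {"ebsco": 0.1, "jstor": 2.0, "catalog": 0.1}

results, timed_out = {}, []
with concurrent.futures.ThreadPoolExecutor(max_workers=len(targets)) as pool:
    futures = {pool.submit(fetch, name, delay): name
               for name, delay in targets.items()}
    # one shared deadline for the whole batch, instead of serial waits
    done, not_done = concurrent.futures.wait(futures, timeout=0.5)
    for fut in done:
        results[futures[fut]] = fut.result()
    for fut in not_done:
        timed_out.append(futures[fut])

print(sorted(results), sorted(timed_out))
```

The fast targets come back inside the deadline and the slow one is reported as timed out, rather than timing out every target queued behind it.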