rob caSSon wrote:

>[snip]its very incomplete, but if anyone feels like taking a look, here it is:
>there is a tarball with the cgis, php scripts, and example result lists
>for the dbs
>
>i'm mainly posting this in case someone can figure out how better to
>parse some of the database search result lists (jstor has been
>particularly problematic)....oh, how i long for xml output.....
>
>the current scrapers i've got are ebsco, jstor, dataware (ohio-centric),
>and the III catalog....i'll tackle lexis, and a few others in the near
>future.....
>
>anyway, comments/help appreciated....i'm using perl, lwp,
>html::treebuilder, cgi, and uri, and my perl is rusty at best....
>

I did a couple of proof-of-concept things over the winter using PHP and curl, which is available on a number of platforms. I can't comment on how well it performs relative to the configuration you're using, but libcurl is a smart, reasonably well supported toolset.

In my configuration, I attempted to extract the number of results from the first screen returned, using a simple regular expression. Failing that, I simply put up a "success" flag, largely based on the *absence* of the target's variation on the "no records could be found..." message.

In terms of the navigation metaphor [note to those using the archives: this URL is subject to change]: try
http://roy.halinet.on.ca/GreatLakes/Search/search.php
A search like "Chicora" returns a meaningful, but not overwhelming, set of results from a number of targets, including BGSU in Ohio.

When I tackled Ebsco, I ran into issues of site authentication via cookies that were passed to the search gateway but not on to the client browser. Peter Binkley, at the University of Alberta, recommended a proxy configuration to work around this issue. Essentially, those connections would have to continue to operate inside a session proxied by the search gateway.

I don't know how the Perl tools stack up in terms of parallel search streams.
The PHP/curl combination is purely serial, and the last targets will time out if there is a tardy responder in the middle of the serial queue.

Art Rhyno, at the University of Windsor, suggested a parallel approach might be possible in a Cocoon environment. This has the advantage of passing all the inbound HTML pages through JTidy and giving you the XHTML/XML-compliant input stream you wanted (in most cases, even when the output from the target was some distance from compliance).

Walter Lewis
Halton Hills
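
P.S. A rough sketch of the hit-count check described above, written in Python rather than PHP. The function name, both patterns and the sample pages are invented for illustration and are not taken from the actual gateway; every target would need its own pair of patterns.

```python
import re


def summarize_hits(html, count_pattern, no_hits_pattern):
    """Classify one target's first result screen.

    Returns ("count", n) when a hit count could be extracted,
    ("none", 0) when the target's "no records" message is present,
    and ("success", None) when neither matched, i.e. results are
    assumed but their number is unknown.
    """
    match = re.search(count_pattern, html, re.IGNORECASE)
    if match:
        # Drop thousands separators such as "1,204" before converting.
        return "count", int(match.group(1).replace(",", ""))
    if re.search(no_hits_pattern, html, re.IGNORECASE):
        return "none", 0
    # Fallback: the "no records" message is absent, so flag success.
    return "success", None


# Hypothetical patterns for one imaginary target.
COUNT_RE = r"(\d[\d,]*)\s+records?\s+found"
NO_HITS_RE = r"no records could be found"

print(summarize_hits("<p>1,204 records found</p>", COUNT_RE, NO_HITS_RE))
print(summarize_hits("<p>No records could be found.</p>", COUNT_RE, NO_HITS_RE))
print(summarize_hits("<ul><li>Chicora</li></ul>", COUNT_RE, NO_HITS_RE))
```

The fallback is deliberately optimistic: an error page that matches neither pattern would also be flagged as a success, which is the price of relying on the absence of a message.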