As Walter pointed out, you can get an XHTML/XML stream from most web sites with Tidy <http://tidy.sourceforge.net>. There are options for Java, Perl and Python, and it can be run as an external program from other environments. Unfortunately, XHTML still leaves you at the mercy of trying to infer meaning from presentational output. On the plus side, it means you can use XSLT for drilling down to what you want, e.g.:

<xsl:template match="a[contains(@class,'medium-text')]">

In this case, ACM's Digital Library uses the class attribute to put results listings in a certain font size. You can leverage this, but it's not very intuitive and is probably as fragile as other screen-scraping approaches.
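For anyone not working in XSLT, the same kind of class-based selection can be sketched in Python with the standard library's ElementTree. This is only an illustration under stated assumptions: the sample markup, the href values and the function name find_links_by_class are made up for the example and are not taken from ACM's actual pages, and it assumes Tidy has already produced well-formed XHTML in the usual XHTML namespace. ElementTree's limited XPath support has no contains(), so the substring test is done in plain Python instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample of Tidy's XHTML output; the real ACM markup will differ.
SAMPLE_XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <a class="medium-text" href="/citation?id=1">First result</a>
    <a class="small-text" href="/help">Help</a>
    <a class="medium-text bold" href="/citation?id=2">Second result</a>
    <a href="/about">About</a>
  </body>
</html>"""

# Assumption: the document's elements live in the standard XHTML namespace.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def find_links_by_class(xhtml_text, class_fragment):
    """Return (href, text) pairs for <a> elements whose class attribute
    contains class_fragment, mirroring contains(@class, ...) in XPath."""
    root = ET.fromstring(xhtml_text)
    results = []
    for anchor in root.iter(XHTML_NS + "a"):
        # A missing class attribute is treated as an empty string.
        if class_fragment in anchor.get("class", ""):
            results.append((anchor.get("href"), (anchor.text or "").strip()))
    return results


if __name__ == "__main__":
    for href, text in find_links_by_class(SAMPLE_XHTML, "medium-text"):
        print(href, text)
```

Like the XSLT match, this is a substring test, so it would also pick up a class such as "medium-text-header". It shares the same fragility: if the site renames the class, the selection silently returns nothing.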

Peter Binkley, who is helping Walter and me with this, has suggested including Schematron <http://www.schematron.com/> in the mix. This might simplify the process considerably, but I haven't tried it yet.

In case anyone is interested, the "transformer" (in Cocoon parlance) that produces the XHTML and passes the cookie/session information is available at:

http://librarycog.uwindsor.ca/wibs/seth (use seth/seth4u for userid/password)

The readme file has the gritty details. I don't know if we will ever see the vendors provide XML output that conforms to a documented schema or DTD, but even consistent use of constructs like "div" and "class" in their HTML would be really helpful for this kind of thing.

art