On Wed, May 12, 2004 at 03:57:47PM -0400, rob caSSon wrote:
> my main problem right now is the parsing....ugh, ugh...ugh....

Boy howdy! It's like a rat's nest sometimes.

I haven't done HTML parsing in Perl for about a year now, but with
Python I am having good success with the microdom module from the
Twisted Framework[1]. The parseString() function, which returns a
largely DOM-like object, has an optional argument 'beExtremelyLenient',
which, if true, lets it handle a lot of crummy HTML.

Usually I run the HTML file through tidy first, and in some cases I
need to do a little manual fix-up before that. Something like this:

-------------------------Snippet-------------------------
import urllib
from mx.Tidy import tidy
from twisted.web import microdom

rawdoc = urllib.urlopen('http://foo.com/trash.html').read()
#-- fix up raw HTML here before passing to tidy, if necessary
tidydoc = tidy(rawdoc)[2]
domdoc = microdom.parseString(tidydoc, beExtremelyLenient=True)
#-- do stuff with quasi-DOM object
---------------------------------------------------------

I use Mozilla's DOM inspector as a visual guide to the DOM of
database-generated web pages, to work out the functions to get the
wanted parts of the document.

I'm using these techniques presently to scrape patent records from the
USPTO site.

Chuck

[1] http://www.twistedmatrix.com/
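
As a rough illustration of the "do stuff with quasi-DOM object" step, here is a minimal sketch of pulling links out of a parsed document. It makes several assumptions that are not in the post above: it is written for Python 3, it uses the standard library's xml.dom.minidom as a stand-in for microdom (so it only copes with well-formed input, such as tidy output), and the sample markup, node_text() and extract_links() are made-up names for illustration only. The DOM calls used here (getElementsByTagName, hasAttribute, getAttribute, childNodes) should carry over to a microdom document, but check that against the Twisted docs.

```python
from xml.dom import minidom

# Hypothetical stand-in for tidy's output: a small, well-formed XHTML page.
TIDY_OUTPUT = """<html>
  <head><title>Search results</title></head>
  <body>
    <table>
      <tr><td><a href="/rec/1">First record</a></td></tr>
      <tr><td><a href="/rec/2">Second record</a></td></tr>
    </table>
  </body>
</html>"""


def node_text(node):
    """Concatenate the text nodes found beneath a DOM node."""
    parts = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            parts.append(child.data)
        else:
            # Recurse into child elements; childless nodes contribute nothing.
            parts.append(node_text(child))
    return "".join(parts)


def extract_links(document):
    """Return (href, link text) pairs for every <a> element that has an href."""
    links = []
    for anchor in document.getElementsByTagName("a"):
        if anchor.hasAttribute("href"):
            links.append((anchor.getAttribute("href"), node_text(anchor).strip()))
    return links


domdoc = minidom.parseString(TIDY_OUTPUT)
links = extract_links(domdoc)
for href, text in links:
    print(href, text)
```

The idea is the same as with the DOM inspector: find the element names and attributes that mark the wanted parts of the page, then write small functions like these to collect them.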