On Wed, May 12, 2004 at 03:57:47PM -0400, rob caSSon wrote:
> my main problem right now is the parsing....ugh, ugh...ugh....
Boy howdy! It's like a rat's nest sometimes.
I haven't done HTML parsing in Perl for about a year now, but with
Python I am having good success with the microdom module from the
Twisted Framework[1]. The parseString() function, which returns a
largely DOM-like object, has an optional argument 'beExtremelyLenient',
which, if true, lets it handle a lot of crummy HTML. Usually I run the
HTML file through tidy first, and in some cases I need to do a little
manual fix-up before that. Something like this:
-------------------------Snippet-------------------------
import urllib
from mx.Tidy import tidy
from twisted.web import microdom

#-- fetch the page as one string
rawdoc = urllib.urlopen('http://foo.com/trash.html').read()
#-- fix up raw HTML here before passing to tidy, if necessary
#-- element [2] of tidy()'s result tuple is the cleaned-up markup
tidydoc = tidy(rawdoc)[2]
#-- lenient mode copes with whatever tidy could not repair
domdoc = microdom.parseString(tidydoc, beExtremelyLenient=True)
#-- do stuff with quasi-DOM object
---------------------------------------------------------
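The manual fix-up step depends entirely on the page in question. As a rough sketch only, here is the sort of thing it can look like, assuming the damage is stray control characters and mixed line endings. The fix_up() name and the specific repairs are hypothetical, not taken from any real scraper, and the code is written for current Python rather than the Python 2 of the snippet above.

```python
import re

# Hypothetical repairs: which ones a given page needs is an assumption.
# Matches ASCII control characters except tab, LF and CR.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def fix_up(raw_html):
    """Strip control characters, then normalise line endings to LF."""
    text = _CONTROL_CHARS.sub("", raw_html)
    return text.replace("\r\n", "\n").replace("\r", "\n")


cleaned = fix_up("<p>one\x00</p>\r\n<p>two</p>\r")
```

The result of fix_up() would then be handed to tidy in place of the raw string.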
I use Mozilla's DOM inspector as a visual guide to the DOM of
database-generated web pages, to work out which calls will pull the
parts I want out of the document.
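Once the inspector shows where the data sits, pulling it out is mostly a matter of walking the tree. The sketch below uses the standard library's xml.dom.minidom as a stand-in so that it runs without Twisted installed; as far as I know microdom offers the same getElementsByTagName() and childNodes, but treat that as an assumption. The sample markup and the helper names text_of() and table_to_dict() are made up for illustration.

```python
from xml.dom import minidom

# Made-up, already-tidied markup standing in for a database-generated page.
SAMPLE = """<html><body>
<table>
<tr><td class="label">Title</td><td class="value">Widget</td></tr>
<tr><td class="label">Inventor</td><td class="value">A. Person</td></tr>
</table>
</body></html>"""


def text_of(node):
    """Concatenate every text node found anywhere below node."""
    parts = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            parts.append(child.data)
        else:
            parts.append(text_of(child))
    return "".join(parts)


def table_to_dict(doc):
    """Map the first cell of each two-cell row to the second cell."""
    record = {}
    for row in doc.getElementsByTagName("tr"):
        cells = row.getElementsByTagName("td")
        if len(cells) == 2:
            record[text_of(cells[0]).strip()] = text_of(cells[1]).strip()
    return record


doc = minidom.parseString(SAMPLE)
record = table_to_dict(doc)
```

Rows that do not have exactly two cells are skipped, which is usually what you want when a page mixes layout tables in with the data.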
I'm using these techniques presently to scrape patent records from
the USPTO site.
Chuck
[1] http://www.twistedmatrix.com/