LISTSERV 16.5 - CODE4LIB Archives

On Wed, May 12, 2004 at 03:57:47PM -0400, rob caSSon wrote:

> my main problem right now is the parsing....ugh, ugh...ugh....

Boy howdy!  It's like a rat's nest sometimes.

I haven't done HTML parsing in Perl for about a year now, but with
Python I am having good success with the microdom module from the
Twisted Framework[1].  The parseString() function, which returns a
largely DOM-like object, has an optional argument 'beExtremelyLenient',
which, if true, lets it handle a lot of crummy HTML.  Usually I run the
HTML file through tidy first, and in some cases I need to do a little
manual fix-up before that.  Something like this:
-------------------------Snippet-------------------------
import urllib
from mx.Tidy import tidy
from twisted.web import microdom

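#-- fetch the page as one string
rawdoc = urllib.urlopen('http://foo.com/trash.html').read()
#-- fix up raw HTML here before passing to tidy, if necessary
#-- tidy() returns (nerrors, nwarnings, outputdata, errordata); keep outputdata
tidydoc = tidy(rawdoc)[2]
#-- lenient mode copes with whatever bad markup tidy left behind
domdoc = microdom.parseString(tidydoc, beExtremelyLenient=True)
#-- do stuff with quasi-DOM object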
---------------------------------------------------------
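
The "manual fix-up" step depends entirely on the page, so the following is only a rough sketch of the kind of thing I mean. The function name fix_up and both rules in it are made up for illustration; a real page will need its own rules.

```python
import re


def fix_up(rawdoc):
    """Hypothetical pre-tidy clean-up; the two rules below are examples only."""
    # Drop stray NUL characters, which a parser may reject outright.
    cleaned = rawdoc.replace("\x00", "")
    # Remove HTML comments so markup hidden inside them cannot confuse later steps.
    cleaned = re.sub(r"<!--.*?-->", "", cleaned, flags=re.DOTALL)
    return cleaned
```

The result of fix_up(rawdoc) would then be what gets handed to tidy in place of rawdoc.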

I use Mozilla's DOM inspector as a visual guide to the DOM of
database-generated web pages, which helps me work out which calls
will pull the parts I want out of the document.
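
To give an idea of what "getting the wanted parts" looks like once the structure is known, here is a small sketch. It uses the standard library's xml.dom.minidom as a stand-in for microdom so that it runs without Twisted, and it works on an inline string that is already well-formed. The markup, the field names and the helper names (text_of, extract_fields) are all invented for the example. Since microdom is largely DOM-like, similar calls may carry over, but I have not checked that here.

```python
from xml.dom import minidom

# Hypothetical, already-tidied markup standing in for a database-generated page.
TIDY_DOC = """<html><body>
<table id="results">
<tr><td class="label">Title</td><td class="value">Widget</td></tr>
<tr><td class="label">Inventor</td><td class="value">A. Person</td></tr>
</table>
</body></html>"""


def text_of(node):
    """Concatenate all text nodes beneath a DOM node."""
    parts = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            parts.append(child.data)
        else:
            parts.append(text_of(child))
    return "".join(parts)


def extract_fields(markup):
    """Map each label cell to its value cell, one table row at a time."""
    dom = minidom.parseString(markup)
    fields = {}
    for row in dom.getElementsByTagName("tr"):
        cells = row.getElementsByTagName("td")
        # Only rows shaped like label/value pairs are of interest here.
        if len(cells) == 2:
            fields[text_of(cells[0]).strip()] = text_of(cells[1]).strip()
    return fields


fields = extract_fields(TIDY_DOC)
```

The DOM inspector is what tells you that the data sits in two-cell table rows in the first place; the code just walks the same path.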

I'm using these techniques presently to scrape patent records from
the USPTO site.

Chuck

[1] http://www.twistedmatrix.com/