On Mon, 2004-02-09 at 11:41, Walter Lewis wrote:
> One of the issues that I bumped into was that what passes for HTML in
> some email programs is [insert expletive of choice here]. Putting it in
> an XML data store was going to cause tons of validation errors.
Some success might be found with TagSoup:
http://home.ccil.org/~cowan/XML/tagsoup/
It delivers SAX events from less than well-formed HTML. It doesn't
correct validation or style problems though... just provides a
consistent, well-formed interface to sloppy HTML.
An alternative approach is JTidy, which does a good job of fixing many
validation problems, but it may fail depending on how bad the HTML is:
http://jtidy.sourceforge.net/
TagSoup doesn't fail... "Just Keep[s] On Truckin'"
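To make the "well-formed interface to sloppy HTML" idea concrete, here is a rough sketch. It is not TagSoup and does not use TagSoup's API. It is a stand-in built on Python's standard-library html.parser, and the names EventCollector and balanced_events are made up for this illustration. It assumes the simplest possible repair strategy (close whatever is still open, ignore stray end tags), which is far cruder than what TagSoup actually does.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Records SAX-style events and balances tags the input left open.

    Hypothetical helper for illustration only; not part of TagSoup.
    """

    # Elements that never take an end tag in HTML.
    VOID = {"br", "hr", "img", "input", "meta", "link"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.events = []      # (kind, value) tuples in document order
        self.open_tags = []   # stack of elements still awaiting an end tag

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))
        if tag in self.VOID:
            # Void elements are closed immediately so the stream stays balanced.
            self.events.append(("end", tag))
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray end tag with no matching start: ignore it
        # Close everything opened after the matching start tag, then the tag itself.
        while self.open_tags:
            top = self.open_tags.pop()
            self.events.append(("end", top))
            if top == tag:
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.events.append(("text", text))

    def close(self):
        # Flush any buffered text first, then close whatever is still open.
        super().close()
        while self.open_tags:
            self.events.append(("end", self.open_tags.pop()))


def balanced_events(html):
    """Return a balanced list of start/text/end events for sloppy HTML."""
    collector = EventCollector()
    collector.feed(html)
    collector.close()
    return collector.events


if __name__ == "__main__":
    for event in balanced_events("<p>Hello <b>world</p><i>bye"):
        print(event)
```

Given the input "<p>Hello <b>world</p><i>bye", the unclosed b and i elements still get matching end events, so a downstream consumer sees a balanced stream. As with TagSoup, nothing here validates the markup or fixes its style; it only keeps going.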
--
Kevin S. Clarke <[log in to unmask]>
Lane Medical Library, Stanford University