On Thu, May 13, 2004 at 08:27:01AM -0500, Eric Lease Morgan wrote:
> On May 12, 2004, at 9:44 PM, Chuck Bearden wrote:
>
> >>my main problem right now is the parsing....ugh, ugh...ugh....
> >
> >Boy howdy!  It's like a rat's nest sometimes.
>
> Screen scraping is for the birds!!

Albatross!  Albatross!  Bloody albatross!

Seriously, I do it as little as possible.  It happens to be the best way
forward for the project I'm involved in.  I don't have the luxury of
saying that I don't need data from the USPTO, and that I'll just wait
for a proper OAI server or XML-RPC interface to it.  When I compare it
to being able to harvest valid XML records in batches of 5000 from
PubMed, well then I do feel the albatross around my neck.

That being said, I have found that web pages of records generated from
databases and templates *are* indeed structured data; it's just that
they are (to borrow a PC phrase) "differently structured".  The HTML may
be crappy, but it's crappy in a consistent way, and it can be fixed
and navigated in a consistent way.  Necessity is the mother of invention
and the midwife of screen-scraping.
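
To make the "consistently crappy" point concrete, here is a minimal sketch
in Python using only the standard library's html.parser.  Everything in it
is an assumption for illustration: the sample markup, the field names, and
the helper names (FieldScraper, scrape_fields) are made up and are not taken
from the USPTO site.  It assumes each record is a table row with a label
cell and a value cell, and that closing tags are often missing.

```python
from html.parser import HTMLParser


class FieldScraper(HTMLParser):
    """Collect (label, value) pairs from rows shaped like <tr><td>Label<td>Value.

    Closing </td> and </tr> tags are treated as optional: a new <td> or <tr>
    implicitly ends whatever cell or row was open before it.
    """

    def __init__(self):
        super().__init__()
        self.rows = []      # finished (label, value) tuples
        self._cells = None  # cells of the row being read, or None outside a row
        self._text = None   # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._flush_row()          # an unclosed previous row ends here
            self._cells = []
        elif tag == "td" and self._cells is not None:
            self._flush_cell()         # an unclosed previous cell ends here
            self._text = []

    def handle_endtag(self, tag):
        if tag == "td":
            self._flush_cell()
        elif tag in ("tr", "table"):
            self._flush_row()

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def _flush_cell(self):
        if self._text is not None:
            # Collapse runs of whitespace, including line breaks, to one space.
            self._cells.append(" ".join("".join(self._text).split()))
            self._text = None

    def _flush_row(self):
        self._flush_cell()
        if self._cells is not None and len(self._cells) == 2:
            self.rows.append(tuple(self._cells))
        self._cells = None


def scrape_fields(html):
    """Return a dict mapping each label to its value for one record page."""
    parser = FieldScraper()
    parser.feed(html)
    parser.close()
    parser._flush_row()  # pick up a final row if the page ends without </table>
    return dict(parser.rows)


# Hypothetical template output: unclosed cells and rows, stray whitespace.
SAMPLE = """
<table>
<tr><td><b>Title</b><td>Widget   with
 improved flange
<tr><td><b>Inventor</b></td><td>Doe; Jane
</table>
"""

if __name__ == "__main__":
    for label, value in scrape_fields(SAMPLE).items():
        print(f"{label}: {value}")
```

The design choice is to encode the template's quirks once (here, optional
closing tags and ragged whitespace) and then treat every page the same way.
A different template would need different rules, so this is a sketch of the
approach rather than a general-purpose scraper.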

Chuck