On Thu, May 13, 2004 at 08:27:01AM -0500, Eric Lease Morgan wrote:
> On May 12, 2004, at 9:44 PM, Chuck Bearden wrote:
>
> >>my main problem right now is the parsing....ugh, ugh...ugh....
> >
> >Boy howdy! It's like a rat's nest sometimes.
>
> Screen scraping is for the birds!!

Albatross! Albatross! Bloody albatross!

Seriously, I do it as little as possible. It happens to be the best way
forward for the project I'm involved in. I don't have the luxury of
saying that I don't need data from the USPTO, and that I'll just wait
for a proper OAI server or XML-RPC interface to it. When I compare it
to being able to harvest valid XML records in batches of 5000 from
PubMed, well then I do feel the albatross around my neck.

That being said, I have found that web pages of records generated from
databases and templates *are* indeed structured data; it's just that
they are (to borrow a PC phrase) "differently structured". The HTML may
be crappy, but it's crappy in a consistent way, and it can be fixed
and navigated in a consistent way. Necessity is the mother of invention
and the midwife of screen-scraping.
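
To make "crappy in a consistent way" concrete, here is a minimal sketch
in Python using only the standard library's html.parser. Everything
specific in it is an assumption made up for illustration: the sample page
layout (label/value table rows with unclosed <td> and <tr> tags), the
field labels, and the names RecordParser and scrape_record. None of it is
taken from the USPTO's pages or from any real project. The point is only
that once the template's habits are known, the same small set of rules
works on every record.

```python
from html.parser import HTMLParser


class RecordParser(HTMLParser):
    """Collect label/value pairs from rows of a (hypothetical) record table.

    Assumes each row looks like <tr><td><b>Label:</b><td>Value, with the
    closing </td> and </tr> tags missing, as sloppy templates often emit.
    """

    def __init__(self):
        super().__init__()
        self.fields = {}       # label -> value
        self._cells = []       # text of each cell in the current row
        self._in_cell = False  # True once a <td> has opened in this row

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <B> and <b> look the same here.
        if tag == "tr":
            self._flush_row()  # a new row implicitly ends the previous one
        elif tag == "td":
            self._cells.append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "table":
            self._flush_row()  # the last row is never closed explicitly

    def handle_data(self, data):
        if self._in_cell:
            self._cells[-1] += data

    def _flush_row(self):
        # Collapse runs of whitespace, then keep rows shaped "Label:", value.
        cells = [" ".join(c.split()) for c in self._cells]
        if len(cells) == 2 and cells[0].endswith(":"):
            self.fields[cells[0][:-1]] = cells[1]
        self._cells = []
        self._in_cell = False

    def close(self):
        super().close()
        self._flush_row()  # in case the page ends without </table>


def scrape_record(html):
    """Return a dict of the fields found in one record page."""
    parser = RecordParser()
    parser.feed(html)
    parser.close()
    return parser.fields


# Made-up sample page: unclosed tags, mixed-case tags, stray line breaks.
SAMPLE = """
<table>
<tr><td><b>Title:</b><td>Widget  polishing
 apparatus
<tr><td><B>Inventor:</B><td>Doe; Jane
<tr><td><b>Filed:</b><td>May 12, 2004
</table>
"""

if __name__ == "__main__":
    for label, value in scrape_record(SAMPLE).items():
        print(label, "=", value)
```

The parser never tries to repair the markup in general; it just leans on
the one thing the template gets wrong the same way every time, namely
that a new <tr> or <td> always marks the end of the previous one.
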
Chuck