On Thu, May 13, 2004 at 08:27:01AM -0500, Eric Lease Morgan wrote:
> On May 12, 2004, at 9:44 PM, Chuck Bearden wrote:
>
> >>my main problem right now is the parsing....ugh, ugh...ugh....
> >
> >Boy howdy! It's like a rat's nest sometimes.
>
> Screen scraping is for the birds!!

Albatross! Albatross! Bloody albatross!

Seriously, I do it as little as possible. It happens to be the best
way forward for the project I'm involved in. I don't have the luxury
of saying that I don't need data from the USPTO, and that I'll just
wait for a proper OAI server or XML-RPC interface to it. When I
compare it to being able to harvest valid XML records in batches of
5000 from PubMed, well then I do feel the albatross around my neck.

That being said, I have found that web pages of records generated
from databases and templates *are* indeed structured data, it's just
that they are (to borrow a PC phrase) "differently structured". The
HTML may be crappy, but it's crappy in a consistent way, and it can
be fixed and navigated in a consistent way.

Necessity is the mother of invention and the midwife of
screen-scraping.

Chuck