On Thu, May 13, 2004 at 11:23:21AM -0400, Walter Lewis wrote:
> Chuck Bearden wrote:
> >That being said, I have found that web pages of records generated from
> >databases and templates *are* indeed structured data, it's just that
> >they are (to borrow a PC phrase) "differently structured". The HTML may
> >be crappy, but it's crappy in a consistent way, and it can be fixed
> >and navigated in a consistent way. Necessity is the mother of invention
> >and the midwife of screen-scraping.
> I'm not sure why the midwife imagery triggered this thought but ...
> For how many useful targets would it be possible to define a consistent
> intermediate layer structure that would
> - handle an SRU/SRW search
> - transform it into a "native" database search
> - transform the results into an SRU/SRW friendly result set
> and still return them in a reasonable time?
> I'm not (necessarily) suggesting a centralized service that would do
> this (a la OCLC) but rather a set of protocols that I could drop into a
> locally managed site for targets that we choose to address in this
Are you thinking of a Perl DBI-like architecture, where the SRW or SRU
portion is analogous to the DBI, presenting a common interface to the
programmer, and the target-specific portion is analogous to the DBD::Pg
or DBD::MySQL portion, sort of like a driver for the db or website?
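To make the analogy concrete, here's a rough sketch in Python (rather than
Perl, just to illustrate) of what I mean. All the names here are invented
for the example; the point is only the shape: one common interface, one
driver class per target, like DBI sitting on top of the DBD:: modules.

```python
# Hypothetical sketch of a DBI-like gateway: a common SRU-ish front end
# with pluggable, target-specific drivers (the DBD:: analogue).
from abc import ABC, abstractmethod


class TargetDriver(ABC):
    """Like a DBD:: module: knows one target's query syntax and results."""

    @abstractmethod
    def translate_query(self, cql_query: str) -> str:
        """Turn a CQL-style query into the target's native search syntax."""

    @abstractmethod
    def parse_results(self, raw: str) -> list:
        """Turn the target's raw response into SRU-friendly records."""


class ExampleScrapeDriver(TargetDriver):
    # Placeholder logic; a real driver would build the site's actual
    # query URL and scrape its HTML.
    def translate_query(self, cql_query: str) -> str:
        return cql_query.replace("title=", "TTL/")

    def parse_results(self, raw: str) -> list:
        return [{"title": line} for line in raw.splitlines() if line]


class SruGateway:
    """Like DBI: one interface to the programmer, many drivers behind it."""

    def __init__(self):
        self.drivers = {}

    def register(self, name: str, driver: TargetDriver):
        self.drivers[name] = driver

    def search(self, target: str, cql_query: str, raw_response: str) -> list:
        drv = self.drivers[target]
        native = drv.translate_query(cql_query)  # step: native search
        # (a real gateway would send `native` to the target here; the
        # response is passed in directly so the sketch stays self-contained)
        return drv.parse_results(raw_response)   # step: SRU-friendly set
```

Dropping a new target into a locally managed site would then just mean
writing one more driver class and registering it.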
As for "reasonable time", that's much more difficult. I know from
experience that the USPTO patent record site responds quite slowly
(and doesn't support HTTP 1.1 with persistent connections, for that
matter). A patent search can get up to 50 patent numbers at a shot,
and from these numbers you can formulate a URL to the actual HTML-ified
record. But it's slow.
> Can the problem be abstracted sufficiently?
I would think so. It seems to me that the trick would be in the
differing semantics of the returned records from different sources,
unless each group of targets contains similar entities.
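One way to tame those differing semantics would be a per-target crosswalk:
a simple mapping from each source's field names onto one shared vocabulary,
so the gateway hands back comparable records no matter where they came
from. A toy sketch (field names and targets invented for illustration):

```python
# Hypothetical crosswalks mapping each target's fields onto a shared,
# Dublin-Core-ish vocabulary. Fields without a mapping are dropped.
CROSSWALKS = {
    "patents": {"invention_title": "title",
                "inventor": "creator",
                "issue_date": "date"},
    "catalog": {"245a": "title",
                "100a": "creator",
                "260c": "date"},
}


def normalize(record: dict, target: str) -> dict:
    """Rename a raw record's fields using the target's crosswalk."""
    walk = CROSSWALKS[target]
    return {walk[k]: v for k, v in record.items() if k in walk}
```

The hard part, of course, is deciding what the shared vocabulary should be
when one group of targets holds patents and another holds bib records;
within a group of similar entities the crosswalk stays honest.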
> Can we build in
> alerts to trigger actions when the structure of a given result doesn't
> match the pattern we've been expecting (i.e. site change alert) ?
That seems plausible. Something like a regression test for each website
target, as well as "pings" of some sort to see if the target is "up".
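The regression-test idea might look something like this: each target
carries a small structural "fingerprint" (landmarks we expect to find in
its result pages), and a check that fails loudly when the page stops
matching. Again, names and patterns here are made up for the sketch:

```python
# Sketch of a site-change alert: per-target landmark patterns that a
# healthy result page should match. A non-empty return means the site
# has probably changed under us.
import re

EXPECTED_LANDMARKS = {
    "exampletarget": [
        re.compile(r"<table[^>]*class=.results."),
        re.compile(r"Results \d+ - \d+ of \d+"),
    ],
}


def structure_ok(target: str, html: str) -> list:
    """Return the landmarks missing from the page (empty == still matches
    the pattern we've been scraping)."""
    return [p.pattern for p in EXPECTED_LANDMARKS[target]
            if not p.search(html)]
```

Run that against a canned known-good page after every scrape (plus a plain
fetch as the "ping"), and you'd get your alert the day the vendor redesigns.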
I'm going to hop back over to Eric's side for a moment, though, and
suggest that we don't want to undertake something that would "enable"
the vendors/providers to feel that they could drag their feet in
providing XML-RPC/SOAP/OAI/SR[WU] access to their data. We don't want
providers and vendors to say "Oh look, they are managing nicely without