A belated comment, with a couple of strategies to reduce the maintenance
required for a screen-scraping system:

I'm doing some screen-scraping these days for a one-off project, using
Cocoon to fetch HTML and tidy it into XHTML, then a stylesheet to pull out
the bits of data that I want. It's all very easy to put together as a
pipeline. The high-maintenance part, if this were a long-term commitment,
would be the xpaths that select the HTML elements where the relevant data
sits, since those xpaths will change whenever the source page is
redesigned. It would not be hard to develop a stylesheet that would display
the source code of an XHTML page with an xpath or xpath snippet for each
HTML element, something like this:

<html>
        <body>
                <table xpath="/html/body/table[1]">
                        <tr xpath="/html/body/table[1]/tr[1]">

Etc. etc. Perhaps such a stylesheet already exists. You could then paste
those xpaths into the stylesheet that extracts the data. You would probably
want to allow for methods of selection other than position, using class
attributes etc. as selectors. It wouldn't be possible to cover every
eventuality, but it would at least make maintenance of the extraction
stylesheet easier - though probably not easy enough to hand to someone
without at least basic XSL skills.
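
A rough, untested sketch of such an annotating stylesheet (assuming the
tidied XHTML carries no namespace; if tidy emits the XHTML namespace it
would have to be declared and used in the match patterns):

<?xml version="1.0" encoding="UTF-8"?>
<!-- annotate-xpaths.xsl (hypothetical): copy an XHTML document, adding
     a positional xpath attribute to every element -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="*">
    <xsl:copy>
      <xsl:attribute name="xpath">
        <!-- walk down from the root, emitting one positional step per
             ancestor-or-self element -->
        <xsl:for-each select="ancestor-or-self::*">
          <xsl:value-of select="concat('/', name(), '[',
            count(preceding-sibling::*[name() = name(current())]) + 1,
            ']')"/>
        </xsl:for-each>
      </xsl:attribute>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- attributes and text pass through unchanged -->
  <xsl:template match="@*|text()">
    <xsl:copy/>
  </xsl:template>

</xsl:stylesheet>

This version indexes every step, so it would print
/html[1]/body[1]/table[1] rather than the tidier form in the example
above, but the paths it produces would paste straight into the
extraction stylesheet.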

Finally, we need to know when the input format changes. We can get partway
there by building validation into the pipeline, probably after the
extraction stage, to catch e.g. empty elements that shouldn't be empty. It
should also be possible to specify some signatures for a given input
format that are likely to break if the site is redesigned. E.g. something
like "count(/html/body/table[1]/tr) = 3" would become false if they
changed the number of rows in the first table. If we specified, say, three
of these rules for each input format and tested them as part of the data
extraction stylesheet, we could at least set up an automatic notification
system to alert us when the input format changes. It will take some
experimentation to determine what kinds of signatures work best.
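
As a sketch (untested, and the paths and class names here are invented
for illustration), the signature tests could sit at the top of the
extraction stylesheet's root template and report failures through
xsl:message:

<!-- check-signatures fragment (hypothetical), folded into the data
     extraction stylesheet -->
<xsl:template match="/">
  <!-- signature 1: the first table should still have exactly 3 rows -->
  <xsl:if test="not(count(/html/body/table[1]/tr) = 3)">
    <xsl:message>SIGNATURE FAILED: row count of first table changed</xsl:message>
  </xsl:if>
  <!-- signature 2: the title cell should still carry its class -->
  <xsl:if test="not(/html/body/table[1]/tr[1]/td[@class = 'title'])">
    <xsl:message>SIGNATURE FAILED: title cell lost class="title"</xsl:message>
  </xsl:if>
  <!-- ... third signature here, then proceed with normal extraction -->
  <xsl:apply-templates/>
</xsl:template>

Most XSLT processors write xsl:message to standard error, so a wrapper
script or cron job could watch the log for "SIGNATURE FAILED" lines and
send the notification.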

Peter

Peter Binkley
Digital Initiatives Technology Librarian
Information Technology Services
4-30 Cameron Library
University of Alberta Libraries
Edmonton, Alberta
Canada T6G 2J8
Phone: (780) 492-3743
Fax: (780) 492-9243
e-mail: [log in to unmask]



> -----Original Message-----
> From: Roy Tennant [mailto:[log in to unmask]]
> Sent: Wednesday, March 03, 2004 09:32 PM
> To: [log in to unmask]
> Subject: Re: [CODE4LIB] index of open access journals
>
>
> These comments are all good ones, and those of you who know
> me (and Walter is in that number) know that I'm nothing if
> not practical. In my defense I can only put forward the fact
> that I suggested a "profile" idea which would hopefully
> abstract to at least one level the kind of maintenance that
> would be required. That is, I would not want to go into (name
> your favorite language here) code every time a page changed
> that we were basically screen-scraping. That's a recipe for
> disaster. Rather, I was hoping we could come up with a method
> that would allow virtually anyone (not just code jockeys) to
> update some key elements that the program would then use to
> properly process the page. This of course would still rely
> upon the very tenuous fact that the typical journal HTML
> makes any sense whatsoever.
>
> But that's just one level of what I was after. I was also
> just trying to make the _general_ point that we are not
> necessarily limited to exactly what we find _in_situ_. We
> can, with imagination and the right tools, manipulate what is
> there to our advantage. And that was really the point I was
> trying to make (clumsily, admittedly). Let's think
> imaginatively about how we might be able to take what we can
> easily get and improve it with information from other
> sources, such as Walter's good idea about snatching RSS feeds
> (good), or some kind of software manipulation such as I
> suggested (less good).
>
> Finally, as a practical man, I realize that we will never be
> successful if we rely on journal publishers to do metadata,
> or page coding, the way we wish them to. I mean, we may as
> well just give up now if that is what is required. Therefore,
> if we wish to do this, we _must_ come up with an
> infrastructure that can accommodate no metadata whatsoever.
> That, my friends, is life. It's also why the "semantic web"
> is a complete non-starter. So the sooner we start dealing
> with reality, the better off we'll all be. Roy
>
>
> On Mar 3, 2004, at 6:06 PM, Dinberg Donna wrote:
>
> > Responding to Roy's interesting suggestion and being mindful of
> > Walter's/Cliff's cautions, my tale of woe in the hard-copy world was
> > always wanting a way to get at that "In Brief" stuff without having
> > to eyeball the journal.  Today, online "In Brief" notices still need
> > to be found efficiently by some of us for various reasons.  Anything
> > that improves retrieval of these smaller items would be welcomed by
> > me.  You are correct, Walter, that the best federated search results
> > are those resulting from standards-based procedures; but I like
> > Roy's idea, too, for the other, smaller stuff.
> >
> > Back to lurking now.
> > Din.
> >
> > Donna Dinberg
> > Systems Librarian/Analyst
> > Virtual Reference Canada
> > Library and Archives Canada
> > Ottawa, ON   K1A 0N4
> > Voice:  613-995-9227
> > E-mail:  [log in to unmask]
> >
> > <Opinions all mine, of course.  Usual disclaimers apply.>
> >
> >
> >
> >> -----Original Message-----
> >> From: Walter Lewis [mailto:[log in to unmask]]
> >> Sent: Wednesday, March 03, 2004 7:18 PM
> >> To: [log in to unmask]
> >> Subject: Re: [CODE4LIB] index of open access journals
> >>
> >>
> >> Roy Tennant wrote:
> >>
> >>> [snip] There may be other ways to leverage more information out
> >>> of what we're indexing. For example, a number of journals have
> >>> sections, such as "In Brief" from D-Lib Magazine [snip] It would
> >>> of course take more work to both set up and maintain,
> >>> but the result would be better.
> >>
> >> I am reminded of a piece of advice Cliff Lynch offered at an
> >> Access conference I attended in the early days of the web ('95 in
> >> Fredericton) where he talked about the fundamental fragility of
> >> programs that supplied web content by screen scraping vt100
> >> interfaces.
> >> <snip> The best federated search
> >> results, IMHO, hang on standard search and result protocols like
> >> Z39.50 where the underlying structure is abstracted into
> >> standardized access points and published record syntax.
> >
>