Roy Tennant wrote:
> [snip] Let's think
> imaginatively about how we might be able to take what we can easily get
> and improve it with information from other sources, such as Walter's
> good idea about snatching RSS feeds (good), or some kind of software
> manipulation such as I suggested (less good).
My idea would only have value if the site in question issues an RSS
feed, which raises a separate set of issues. And as RSS feeds have
historically been rare, and are most often spun up from sites that are
better structured in the first place, perhaps it doesn't take us very
far. At best it could only be one among the suite of strategies that
need to be deployed.
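
For what it's worth, checking whether a page advertises a feed at all is
cheap. A minimal sketch, assuming the site follows the common
autodiscovery convention of a <link rel="alternate"> element in its
HTML; the names find_feeds and FeedLinkFinder are mine, for illustration
only:

```python
from html.parser import HTMLParser

# Media types commonly used to advertise a feed (an assumption; extend as needed).
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collects href values of <link rel="alternate"> elements that look like feeds."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = {k: (v or "") for k, v in attrs}
        rels = a.get("rel", "").lower().split()
        is_feed = a.get("type", "").lower() in FEED_TYPES
        if "alternate" in rels and is_feed and a.get("href"):
            self.feeds.append(a["href"])


def find_feeds(html):
    """Return the feed URLs a page advertises, or an empty list if it advertises none."""
    finder = FeedLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.feeds
```

An empty list is the common case, which is the point: the harvester has
to carry on without the feed.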
> Therefore, if we wish to do this, we _must_ come up
> with an infrastructure that can accommodate no metadata whatsoever.
> That, my friends, is life. It's also why the "semantic web" is a
> complete non-starter. So the sooner we start dealing with reality, the
> better off we'll all be.
Agreed. And please don't think I doubt that this is a road we have to
go down, or that I'm not prepared to walk it with the rest of
you. There is a whole discipline within librarianship the essence of
which is creating excellent metadata out of content formatted by others
who follow only loose conventions (and in some cases deliberately flout
those conventions). AACRx anyone? On the other hand, I suspect some of
the issues that turn up at the next level in might make a hardened
Serials Librarian break down and cry.
Having said that, a reasoned, abstracted architecture for deriving
context from content that deals with life as it is presented is an
absolute requirement. And it should be a breeze to factor in the semantic
web on the rare occasions that it wanders by.
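
To make that a little more concrete, here is a rough sketch of the sort
of fallback chain I have in mind: use declared metadata when a page
happens to carry it, and otherwise derive something from the content
itself. Everything here is an assumption for illustration, including the
names PageScanner and describe, the choice of a DC.title meta element as
the "declared" case, and the order of the fallbacks.

```python
from html.parser import HTMLParser


class PageScanner(HTMLParser):
    """Gathers declared metadata plus the raw text a title could be derived from."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.text = {"title": "", "h1": "", "body": ""}
        self._current = None  # element whose text is currently being captured

    def handle_starttag(self, tag, attrs):
        a = {k: (v or "") for k, v in attrs}
        if tag == "meta" and a.get("name") and a.get("content"):
            self.meta[a["name"].lower()] = a["content"]
        elif tag in ("title", "h1", "script", "style"):
            if tag == "h1" and self.text["h1"]:
                return  # keep only the first heading
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current in ("script", "style"):
            return  # never treat code or styling as content
        self.text[self._current or "body"] += data


def describe(html):
    """Return the best available title and a note on where it came from."""
    s = PageScanner()
    s.feed(html)
    s.close()
    candidates = [
        ("declared metadata", s.meta.get("dc.title", "")),
        ("title element", s.text["title"]),
        ("first heading", s.text["h1"]),
        ("leading text", s.text["body"]),
    ]
    for source, value in candidates:
        value = " ".join(value.split())  # collapse runs of whitespace
        if value:
            return {"title": value[:80], "source": source}
    return {"title": "", "source": "nothing usable"}
```

The declared-metadata case is just the first entry in the list, so a
page with no metadata whatsoever costs nothing extra, and richer sources
can be slotted in ahead of the fallbacks when they turn up.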
Parenthetically, for my sins, I said I'd take on the digitizing of about
35 years of one organization's newsletters. OCR is the easy part. The
data will be marked up with TEI/XML but if Eric were to come indexing,
it is likely that all that he would see (today) would be a set of HTML
pages in close formation. I don't yet have anything configured on that
site to expose the underlying XML files. How many others are
in the same boat?