Roy Tennant wrote:
> [snip] Let's think
> imaginatively about how we might be able to take what we can easily get
> and improve it with information from other sources, such as Walter's
> good idea about snatching RSS feeds (good), or some kind of software
> manipulation such as I suggested (less good).

My idea would only have value if the site in question issues an RSS feed, which is a separate set of issues. And as RSS is rarely issued historically, and is most often spun up from better structured sites in the first place, perhaps it doesn't take us very far. At best it could only be among the suite of strategies that need to be deployed.

> Therefore, if we wish to do this, we _must_ come up
> with an infrastructure that can accommodate no metadata whatsoever.
> That, my friends, is life. It's also why the "semantic web" is a
> complete non-starter. So the sooner we start dealing with reality, the
> better off we'll all be.

Agreed. And please don't think I doubt that this is a road we have to go down, or that I'm not prepared to walk it with the rest of you. There is a whole discipline within librarianship the essence of which is creating excellent metadata out of content formatted by others who follow only loose conventions (and in some cases deliberately flout those conventions). AACRx anyone? On the other hand, I suspect some of the set of issues presented at the next level in might make a hardened Serials Librarian break down and cry.

Having said that, a reasoned, abstracted architecture for deriving context from content that deals with life as it is presented is an absolute requirement. And it should be a breeze to factor in the semantic web on the rare occasions that it wanders by.

Parenthetically, for my sins, I said I'd take on the digitizing of about 35 years of one organization's newsletters. OCR is the easy part.
The data will be marked up with TEI/XML but if Eric were to come indexing, it is likely that all that he would see (today) would be a set of HTML pages in close formation. I don't, yet, have any configuration on that site for the exposure of the underlying XML files. How many others are in the same boat?

Walter Lewis
Halton Hills