The approach described by Peter is also how I have been thinking about
this. If the content is only available in HTML, it's hard to beat Tidy for
doing a passable job of getting content into XHTML, and from there,
stylesheets can leverage whatever structure is available, such as
it is, and subject to the problems that Peter flagged. One other building
block that might be useful in this context is Composite Capability/Preference
Profiles (CC/PP); see
<http://www.webstandards.org/learn/askw3c/feb2004.html>. One section of
this document states:
"XHTML is powerful because it is XML, or so we've been taught. And the
power of XML is often demonstrated through the use of XSLT, the
transformation language for XML. Combining the possibility to transform
XHTML content through XSLT with the flexibility and accuracy provided by
CC/PP makes it possible to transform hypertext content on-the-fly beyond
what style sheets already allow. You can show tabular content in a linear
fashion for agents that can't handle tables, transform a long XHTML
document with many sections in an SVG slideshow and so on, with very few
limitations."
Maybe CC/PP could be used to profile the layout needed to better expose
content to harvesters and other applications.
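
As a rough illustration of the "tabular content in a linear fashion" case
from the quote, here is a minimal Python sketch. It is only an assumption
about how such a step might look: the profile is a plain dict with a
made-up key `tables_supported` rather than a real CC/PP (RDF) profile,
the function names are hypothetical, and the standard-library ElementTree
stands in for an XSLT transform. It expects well-formed XHTML, such as
Tidy output.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Serialize XHTML elements without a namespace prefix.
ET.register_namespace("", XHTML_NS)


def q(tag):
    """Qualify a tag name with the XHTML namespace."""
    return f"{{{XHTML_NS}}}{tag}"


def cell_text(cell):
    """Collect all text inside a cell, ignoring inline markup."""
    return "".join(cell.itertext()).strip()


def linearize_tables(root):
    """Replace each table with a flat list of 'header: value' items."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    for table in list(root.iter(q("table"))):
        rows = table.findall(f".//{q('tr')}")
        if not rows or table not in parents:
            continue
        # Assumption: the first row holds the column headers as th cells.
        headers = [cell_text(th) for th in rows[0].findall(q("th"))]
        ul = ET.Element(q("ul"))
        for row in rows:
            for i, td in enumerate(row.findall(q("td"))):
                label = headers[i] if i < len(headers) else f"column {i + 1}"
                li = ET.SubElement(ul, q("li"))
                li.text = f"{label}: {cell_text(td)}"
        # Swap the table for the list at the same position in its parent.
        parent = parents[table]
        index = list(parent).index(table)
        ul.tail = table.tail
        parent.remove(table)
        parent.insert(index, ul)
    return root


def adapt(xhtml_text, profile):
    """Return the document adapted to a (hypothetical) capability profile."""
    root = ET.fromstring(xhtml_text)
    if not profile.get("tables_supported", True):
        linearize_tables(root)
    return ET.tostring(root, encoding="unicode")
```

A harvester profile could then be just another profile: calling
`adapt(doc, {"tables_supported": False})` returns the same document with
each table replaced by a flat list, while a profile that supports tables
leaves the markup alone.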
art