Hello,
This is the answer to a question that was sent to me privately in response
to my previous posting. However, I thought others might be interested, so
I'm posting it here. I hope this is alright.
> What does this package do? Will it let me do a search on the web for PDF
> or HTML documents with Dublin Core metadata associated with the
> document?
The package currently retrieves two kinds of data:
1. XML data conforming to the Open Archives Initiative
Protocol for Metadata Harvesting (OAI-PMH). For example,
if you enter the following URL in the address field of a browser:
http://oai.uni-tuebingen.de/OAIServer/oai2.aspx?verb=ListRecords&from=2006-02-01&metadataPrefix=oai_dc
you get a file of XML code where the names of the tags are based on
Dublin Core. It's not necessary to use a browser; it is also
possible to retrieve this data with a program using library
functions. The data is transmitted using the Hypertext Transfer
Protocol (HTTP).
2. Pica data transmitted using the Z39.50 family of protocols. Pica
is similar to USMARC and is used by many libraries in Germany and the
Netherlands.
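To illustrate the first kind of retrieval, here is a minimal sketch in
Python that builds a ListRecords request like the URL above and pulls the
Dublin Core titles out of the XML that comes back. It is only an
illustration and not code from the package: the function names are
hypothetical, and I am assuming the response uses the standard Dublin Core
element namespace.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core element set (assumed to be used by the server).
DC_NS = "http://purl.org/dc/elements/1.1/"


def build_list_records_url(base_url, from_date, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL (hypothetical helper)."""
    query = urllib.parse.urlencode({
        "verb": "ListRecords",
        "from": from_date,
        "metadataPrefix": metadata_prefix,
    })
    return base_url + "?" + query


def fetch(url):
    """Retrieve the raw response body over HTTP."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def extract_titles(xml_text):
    """Return the text of every dc:title element in an OAI-PMH response."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.iter("{%s}title" % DC_NS)]
```

Calling extract_titles(fetch(url)) with a URL built by
build_list_records_url() would list the titles of the harvested records,
provided the server is reachable and answers with oai_dc metadata.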
> I want to find HTML pages that have DC metadata elements in their
> <head></head> segments. I do not want to get pages with DC elements ONLY
> in the <body></body> part of the document.
Parsing HTML is a fairly simple task. It wouldn't be difficult to
exclude the body of an HTML document from consideration. PDF is
partly text-based, though its content is often stored in compressed
streams, so extracting metadata from it takes somewhat more work but
is also feasible.
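To give an idea of how the HTML side might look, here is a minimal sketch
in Python that collects Dublin Core elements from the <head> of a page and
ignores any that occur in the <body>. It assumes the common convention of
<meta> elements whose name attribute starts with "DC." (for example
name="DC.title"); the class and function names are hypothetical, and pages
that omit explicit <head> tags would need extra handling.

```python
from html.parser import HTMLParser


class HeadDCParser(HTMLParser):
    """Collect <meta name="DC.*"> elements that appear inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.dc_elements = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head:
            attributes = dict(attrs)
            name = attributes.get("name") or ""
            # Assumption: DC metadata is marked by a "DC." name prefix.
            if name.lower().startswith("dc."):
                self.dc_elements.append((name, attributes.get("content")))

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def dc_metadata_in_head(html_text):
    """Return (name, content) pairs for DC meta elements found in <head>."""
    parser = HeadDCParser()
    parser.feed(html_text)
    parser.close()
    return parser.dc_elements
```

A page would count as a match if dc_metadata_in_head() returns a non-empty
list for it.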
It isn't clear to me how you want to perform this search. Do you have
certain criteria for finding documents to examine? It would be
possible to write a "web crawler" that searches through the entire
internet (in theory), testing all files that it recognizes as being in
HTML or PDF format, but it would be doing a lot of work for rather
meager results.
> If it does not do this, do you think you could develop a program that
> does?
I would need more exact specifications, but if I understand correctly
what you want, I could certainly program it.
Laurence