LISTSERV 16.5 - CODE4LIB Archives

What are some of the more popular and useful bibliographic databases/indexes with well-structured output?

If it were easy (trivial) for our readers to get sets of well-structured data out of our bibliographic databases, then it would be relatively easy for us to write software enabling readers to use and understand (that is, evaluate) their data. What databases/indexes lend themselves to this solution? Let me elaborate.

JSTOR’s Data For Research service provides access to the totality of JSTOR, sans the articles themselves, unless you are authorized. [1] A person can search JSTOR and then request a data dump complete with citations, keyword frequencies, and n-grams. This data can then be used to create a report (a timeline, tag clouds, or concordances, for example) illustrating the characteristics of the found set. About six months ago I wrote a program, the beginnings of such a report. [2]

Suppose a reader diligently used something like EndNote, Zotero, or RefWorks to save and manage their bibliographic citations of interest. If the reader were to export some or all of their bibliographic data to a file, then the result would be well-structured and computer readable. Things like titles, authors, keywords/subjects, maybe abstracts, and citations would be neatly delimited. If this file were read by a second computer program, new views of the data could be manifested. Again, a timeline could be created. Word clouds could be created. An analysis could be done against the data to determine frequent authors. Relationships between authors might be exposed. All of this would assist the reader in evaluating their found set.
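To make this concrete, here is a minimal sketch in Python, assuming the reader exported their citations as RIS (a plain-text format that EndNote, Zotero, and RefWorks can all write). The function names and the two sample records are hypothetical, made up for illustration only; a real export would carry more tags and messier values than this handles.

```python
import re
from collections import Counter
from itertools import combinations

# An RIS line looks like "AU  - Surname, Given":
# a two-character tag, two spaces, a hyphen, then the value.
TAG_LINE = re.compile(r"^([A-Z][A-Z0-9])  - ?(.*)$")

# Hypothetical sample data, not real citations.
SAMPLE_RIS = """TY  - JOUR
AU  - Alpha, A.
AU  - Beta, B.
PY  - 2010
TI  - First example title
ER  - 
TY  - JOUR
AU  - Alpha, A.
PY  - 2012
TI  - Second example title
ER  - 
"""

def parse_ris(text):
    """Return a list of records; each record maps an RIS tag to a list of values."""
    records, current = [], {}
    for line in text.splitlines():
        match = TAG_LINE.match(line.strip())
        if not match:
            continue  # skip blank lines and anything that is not a tag line
        tag, value = match.group(1), match.group(2).strip()
        if tag == "ER":  # ER marks the end of a record
            if current:
                records.append(current)
            current = {}
        else:
            current.setdefault(tag, []).append(value)
    if current:  # tolerate a final record with no ER line
        records.append(current)
    return records

def frequent_authors(records, n=5):
    """Return the n most frequently occurring authors (AU tags)."""
    counts = Counter(a for r in records for a in r.get("AU", []))
    return counts.most_common(n)

def timeline(records):
    """Return (year, number of records) pairs, sorted by year (PY tags)."""
    years = Counter()
    for r in records:
        for value in r.get("PY", []):
            found = re.match(r"\d{4}", value)
            if found:
                years[found.group(0)] += 1
    return sorted(years.items())

def coauthor_pairs(records):
    """Count how often each pair of authors appears on the same record."""
    pairs = Counter()
    for r in records:
        authors = sorted(set(r.get("AU", [])))
        pairs.update(combinations(authors, 2))
    return pairs

if __name__ == "__main__":
    records = parse_ris(SAMPLE_RIS)
    print(frequent_authors(records))
    print(timeline(records))
    print(coauthor_pairs(records).most_common(5))
```

The three small functions correspond to the frequent-author list, the timeline, and the author relationships mentioned above. A word cloud would follow the same pattern, counting words in the TI or KW values instead of authors.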

Through the use of APIs I can search things like WorldCat, the HathiTrust, or the Internet Archive. The result could be (for better or for worse) MARC records. Again, analysis could be done against this data not to find information (that has already been done), but rather to evaluate the data: to look for patterns and anomalies.
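As a rough illustration of what "patterns and anomalies" could mean here, the sketch below assumes the search results have already been saved as MARCXML (the XML serialization of MARC, in the http://www.loc.gov/MARC21/slim namespace). It does not call any of the services' APIs. The function names and the three sample records are hypothetical. It counts subject headings (field 650, subfield a) as a pattern and lists records with no subject heading at all as an anomaly.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace used by MARCXML documents.
NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Hypothetical sample data, not real catalog records.
SAMPLE_MARCXML = """<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <controlfield tag="001">rec1</controlfield>
    <datafield tag="650" ind1=" " ind2="0">
      <subfield code="a">Libraries</subfield>
    </datafield>
  </record>
  <record>
    <controlfield tag="001">rec2</controlfield>
    <datafield tag="650" ind1=" " ind2="0">
      <subfield code="a">Libraries</subfield>
    </datafield>
    <datafield tag="650" ind1=" " ind2="0">
      <subfield code="a">Text mining</subfield>
    </datafield>
  </record>
  <record>
    <controlfield tag="001">rec3</controlfield>
  </record>
</collection>"""

def subfield_values(root, tag, code):
    """Return the text of every subfield `code` found in datafields `tag`."""
    path = f".//marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
    return [el.text.strip() for el in root.findall(path, NS)
            if el.text and el.text.strip()]

def subject_counts(marcxml):
    """Pattern: how often does each subject heading (650 $a) occur?"""
    root = ET.fromstring(marcxml)
    return Counter(subfield_values(root, "650", "a"))

def records_missing_field(marcxml, tag):
    """Anomaly: control numbers (001) of records lacking datafield `tag`."""
    root = ET.fromstring(marcxml)
    missing = []
    for record in root.iter("{http://www.loc.gov/MARC21/slim}record"):
        if record.find(f"marc:datafield[@tag='{tag}']", NS) is None:
            ident = record.find("marc:controlfield[@tag='001']", NS)
            missing.append(ident.text if ident is not None else None)
    return missing

if __name__ == "__main__":
    print(subject_counts(SAMPLE_MARCXML).most_common(10))
    print(records_missing_field(SAMPLE_MARCXML, "650"))
```

Only the Python standard library is used, so nothing needs installing; a dedicated MARC library would be the better choice for binary MARC or for anything beyond simple counting.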

Put another way, instead of trying to force people to do the best and most perfect bibliographic search, allow them to do broad searches and then provide supplementary tools enabling the reader to examine the results. It is not about find. It is about use & understand.

I prefer XML to other data structures, but I will not necessarily limit myself to XML. What information sources would you suggest I use? Here is a short, unordered list:

* JSTOR Data For Research data
* Zotero (RDF) XML output
* WorldCat, HathiTrust, Internet Archive

After I write the “search results evaluation tool”, I will then go to the next step and provide tools for the “distant reading” of individual items à la my PDF2TXT application. [3]

We here in libraries can no longer just give people access to information because people have more access than they know what to do with. Instead, I think an opportunity exists for us to provide tools for evaluating the information they have so they can use & understand it. Call it “scalable, computer-supplemented information literacy”.

[1] Data For Research - http://dfr.jstor.org
[2] JSTOR Tool - http://dh.crc.nd.edu/sandbox/jstor-tool/
[3] PDF2TXT - http://dh.crc.nd.edu/sandbox/pdf2txt.cgi

—
Eric Morgan
University of Notre Dame