If you could create an index of just about any content you desired, then what content might you index?


Background

I have been working with a number of colleagues to index sets of content and provide enhanced services against search results. To date we have indexed more than 500,000 scholarly scientific journal articles on the topic of COVID-19. We have also indexed about 30,000 books from the venerable Project Gutenberg. Behind the scenes we have done very similar things to about half of a collection called Early English Books Online. We have also developed tools to enhance search results applied against Internet Archive Scholar.

This work is currently sponsored by two distinct organizations. The first is an organization called XSEDE, and the work is hosted at the Pittsburgh Supercomputing Center. The second is Microsoft AI for Health. Using the resources of these two sponsors, we have more or less accomplished our project goals. Yes, there are many ways we can enhance our existing implementations, but those enhancements do not require high performance computing systems.

That said, we desire to use these computer systems to the maximum. We literally have spare cycles that we can spend creating additional indexes and enhanced services against search results.


The Question

Considering your particular library and clientele, if you could index just about anything, then what would it be? Examples might include:

  * all items manifested as format Y
  * all items on subject X
  * all items written between dates D and E
  * all items written by author Z
  * all open access materials collected in... I forget the name but it does exist
  * anything and everything you own that is also in the HathiTrust
  * the sum of theses, dissertations, books, reports, or papers written at your institution

We might not be able to index the content of your library, but your answers to the question might apply to libraries in general, and that would be helpful.

Alternatively, can you think of a set of content which is freely available and applicable to a wide audience? The open access material alluded to above is a good candidate. So would something like the whole of arXiv.

There are some limitations regarding the types of content. For example, the content has to be full text in nature; a large set of metadata-only records won't really work. Content which is already widely used is better than content that is not. Content that is already digitized is a must; there is no time to digitize content. Ironically, the content does not have to be thoroughly associated with metadata; to some degree the project's system generates or extracts the necessary metadata. Finally, content that does not have to be scraped from the 'Net is better than content that does; you would be surprised how difficult it is to download all the articles of a single issue of an open access journal, let alone the whole run of a journal.
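
If it helps to weigh a suggestion against these limitations, they can be written down as a small checklist. What follows is only a sketch in Python, not part of the project's system: the Collection class, its field names, and the functions are all hypothetical and made up for illustration. Full text and already-digitized are treated as requirements, while wide use and easy bulk download are treated as preferences. Metadata is deliberately left out because it is not required.

```python
from dataclasses import dataclass


@dataclass
class Collection:
    """A candidate collection; every field name here is hypothetical."""
    name: str
    full_text: bool      # full text, not metadata-only records
    digitized: bool      # already digitized
    widely_used: bool    # already widely used
    bulk_download: bool  # obtainable without scraping the 'Net


def is_eligible(c: Collection) -> bool:
    """Requirements: the content is full text and already digitized."""
    return c.full_text and c.digitized


def preference(c: Collection) -> int:
    """Count how many of the two preferences are met (0, 1, or 2)."""
    return int(c.widely_used) + int(c.bulk_download)


def rank(candidates: list[Collection]) -> list[Collection]:
    """Drop ineligible collections, then list the most preferred first."""
    eligible = [c for c in candidates if is_eligible(c)]
    return sorted(eligible, key=preference, reverse=True)
```

Given a list of candidates, rank() returns only those meeting both requirements, ordered by how many preferences they satisfy.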

Our project has spare cycles, and it behooves us to use them to the fullest extent. We are looking for additional content. What might you suggest? If you can identify something, then collecting it, pre-processing it, indexing it, and providing access to the sum of all those things can be a real and tangible output. Do you have any ideas?

"When you have a hammer, everything begins to look like a nail."

--
Eric Lease Morgan
Navari Family Center for Digital Scholarship
Hesburgh Libraries
University of Notre Dame

574/485-6870