LISTSERV 16.5 - CODE4LIB Archives

I have learned how to exploit SQLite so it supports full text indexing, and I can see this being used in many different ways here in Library Land.

This past weekend I used OAI to harvest the totality of ITAL (Information Technology and Libraries). This resulted in 700+ PDF files. I then fed the whole of the archive to my Reader, and this produced a "study carrel". [1, 2] Among many other things, the carrel contains an SQLite database file. On my desktop computer I augmented the database to include two additional tables. The first includes the plain text of each article. The second includes columns for just about every other field in the database as well as the full text. This second table is an SQLite-ism called "FTS5", and it supports fielded search, stemming, Boolean logic, phrase search, and relevance-ranked output. In short, FTS5 functions very similarly to Solr sans many of Solr's expenses: Java, a server, a network, lots o' configuration, etc. FTS5 is not a replacement for Solr, but it is a viable alternative.
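
For readers who have not seen FTS5, here is a minimal sketch of the two-table idea: an ordinary table holding the plain text, and an FTS5 virtual table built from it. The table names, column names, and sample rows are invented for illustration and are not the schema of the carrel's database. It assumes the sqlite3 command-line shell is installed and was compiled with FTS5.

```shell
#!/usr/bin/env bash
# Sketch only: table names, column names, and rows are invented for
# illustration; they are not the schema of the study carrel's database.
set -eu

DB=$(mktemp)

sqlite3 "$DB" <<'SQL'
-- An ordinary table of bibliographics plus the plain text of each article.
CREATE TABLE articles (id INTEGER PRIMARY KEY, author TEXT, title TEXT, body TEXT);
INSERT INTO articles (author, title, body) VALUES
  ('doe', 'Indexing with SQLite', 'Full text indexing in small libraries'),
  ('roe', 'MARC records today',   'Every library still exchanges MARC records');

-- The FTS5 virtual table; the porter tokenizer adds stemming.
CREATE VIRTUAL TABLE idx USING fts5(author, title, body, tokenize='porter');
INSERT INTO idx (rowid, author, title, body)
  SELECT id, author, title, body FROM articles;
SQL

# search QUERY: run an FTS5 query and list matches, best match first.
search() {
    local q
    q=$(printf '%s' "$1" | sed "s/'/''/g")   # double single quotes for SQL
    sqlite3 "$DB" "SELECT author, title FROM idx WHERE idx MATCH '$q' ORDER BY rank;"
}

search 'author:doe'          # fielded search
search 'libraries'           # stemming: also matches "library"
search '"marc records"'      # phrase search
search 'library NOT marc'    # Boolean logic
```

Each call prints matching rows in sqlite3's default pipe-delimited form. Ordering by the hidden "rank" column is what gives relevance-ranked output.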

I have made my efforts temporarily available at the following URL, complete with all the PDF files, all of the plain text files, the database file, a rudimentary Bash interface to query the database, and a "cookbook" of sample queries:

http://dh.crc.nd.edu/tmp/ital-index.zip

Any number of interfaces in any number of languages could be written against the database, even a Webbed interface, but I created a Bash interface so I could exploit standard output and piping. Sample uses of the interface include:

* find articles about MARC - ./bin/search.sh lines keyword:marc

* find articles by an author - ./bin/search.sh lines author:truitt

* find a phrase - ./bin/search.sh lines "marc must die"

* find articles and output bibliographics as CSV - ./bin/search.sh csv keyword:marc > articles.csv
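
As a rough idea of how a CSV mode like the last example could be built on top of FTS5, the sketch below leans on the sqlite3 shell's own -csv and -header options. It is not the search.sh in the zip file; the function name to_csv, the table, and the rows are all made up for illustration, and it assumes a sqlite3 shell compiled with FTS5.

```shell
#!/usr/bin/env bash
# Sketch only: this is not the author's search.sh; names and rows are invented.
set -eu

DB=$(mktemp)

sqlite3 "$DB" <<'SQL'
CREATE VIRTUAL TABLE idx USING fts5(author, title, keyword);
INSERT INTO idx VALUES ('doe', 'Cataloging at scale', 'marc');
INSERT INTO idx VALUES ('roe', 'Linked data basics',  'rdf');
SQL

# to_csv QUERY: print rows matching an FTS5 query as CSV with a header row.
to_csv() {
    local q
    q=$(printf '%s' "$1" | sed "s/'/''/g")   # double single quotes for SQL
    sqlite3 -header -csv "$DB" \
        "SELECT author, title FROM idx WHERE idx MATCH '$q' ORDER BY rank;"
}

to_csv 'keyword:marc'
```

Redirecting the output to a file gives something a spreadsheet can open directly. Note that the sqlite3 shell ends CSV rows with a carriage return and line feed.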

Here's a tricky one. Find articles and concatenate the full text of each article to a single file for further analysis ("reading"):

* ./bin/search.sh tabs marc | while read; do FILE=$(cut -f11); cat $FILE; done > ./results.txt
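
A note on why the one-liner above is tricky: "read" consumes the first row of output, and "cut" then reads everything that remains in one go, so the loop body runs only once, the first row is skipped, and file names containing spaces would be split apart. A per-row variant might look like the following sketch. It assumes, as the one-liner does, that column 11 holds the path to a plain text file. The function name concat_fulltext is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: concat_fulltext is a hypothetical helper, not part of the zip file.

# concat_fulltext: read tab-delimited rows on standard input and print the
# contents of the file named in column 11 of each row.
concat_fulltext() {
    cut -f11 | while IFS= read -r file; do
        # Skip anything that is not an existing file, such as a header row.
        if [ -f "$file" ]; then
            cat -- "$file"
        fi
    done
}

# Usage, assuming the interface described above:
#   ./bin/search.sh tabs marc | concat_fulltext > ./results.txt
```

Because each path is read into a quoted variable one row at a time, every row is handled the same way, and paths with spaces survive intact.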

What is really cool (or "kewl") is how lightweight and transferable the whole thing is. SQLite is platform-independent. It requires no server. It requires no network. On the downside, you do need to know SQL, mostly.

I can imagine a number of scenarios for librarianship:

* Digitize collection. Create database describing it. Add
collection to database. Give away database complete with desktop
scripts to use it.

* Create subset of catalog's MARC records. Create database of
subset. Give away the resulting "personal catalogs".

* Do a cool search against an (open access) journal archive.
Download articles. Do natural language processing against the
articles to enhance bibliographic description. Add the whole
thing to a database. Give away database.

Without a doubt, SQLite is my current favorite database.

[1] Reader - https://distantreader.org
[2] ITAL study carrel - https://library.distantreader.org/carrels/ital/index.htm

--
Eric Lease Morgan
Digital Initiatives Librarian, Navari Family Center for Digital Scholarship
Hesburgh Libraries

University of Notre Dame
250E Hesburgh Library
Notre Dame, IN 46556
o: 574-631-8604
w: cds.library.nd.edu