LISTSERV 16.5 - CODE4LIB Archives

The 25th (wow) issue of the Code4Lib Journal is now available at
http://journal.code4lib.org/issues/issues/issue25

Here is what you will find inside:

Editorial introduction: On libraries, code, support, inspiration, and
collaboration
Dan Scott
Reflections on the occasion of the 25th issue of the Code4Lib Journal:
sustaining a community for support, inspiration, and collaboration at the
intersection of libraries and information technology.

Getting What We Paid for: a Script to Verify Full Access to E-Resources
Kristina M. Spurgin
Libraries regularly pay for packages of e-resources containing hundreds to
thousands of individual titles. Ideally, library patrons could access the
full content of all titles in such packages. In reality, library staff and
patrons inevitably stumble across inaccessible titles, but no library has
the resources to manually verify full access to all titles, and basic URL
checkers cannot check for access. This article describes the E-Resource
Access Checker—a script that automates the verification of full access.
With the Access Checker, library staff can identify all inaccessible titles
in a package and bring these problems to content providers’ attention to
ensure we get what we pay for.

Opening the Door: A First Look at the OCLC WorldCat Metadata API
Terry Reese
Libraries have long relied on OCLC’s WorldCat database as a way to
cooperatively share bibliographic data and declare library holdings to
support interlibrary loan services. As curator, OCLC has traditionally
mediated all interactions with the WorldCat database through their various
cataloging clients to control access to the information. As more and more
libraries look for new ways to interact with their data and streamline
metadata operations and workflows, these clients have become bottlenecks
and an inhibitor of library innovation. To address some of these concerns,
in early 2013 OCLC announced the release of a set of application
programming interfaces (APIs) supporting read and write access to the
WorldCat database. These APIs offer libraries their first opportunity to
develop new services and workflows that directly interact with the WorldCat
database, and provide opportunities for catalogers to begin redefining how
they work with OCLC and their data.

Docker: a Software as a Service, Operating System-Level Virtualization
Framework
John Fink
Docker is a relatively new method of virtualization available natively for
64-bit Linux. Compared to more traditional virtualization techniques,
Docker is lighter on system resources, offers a git-like system of commits
and tags, and can be scaled from your laptop to the cloud.

A Metadata Schema for Geospatial Resource Discovery Use Cases
Darren Hardy and Kim Durante
We introduce a metadata schema that focuses on GIS discovery use cases for
patrons in a research library setting. Text search, faceted refinement, and
spatial search and relevancy are among GeoBlacklight’s primary use cases
for federated geospatial holdings. The schema supports a variety of GIS
data types and enables contextual, collection-oriented discovery
applications as well as traditional portal applications. One key limitation
of GIS resource discovery is the general lack of normative metadata
practices, which has led to a proliferation of metadata schemas and
duplicate records. The ISO 19115/19139 and FGDC standards specify metadata
formats, but are intricate, lengthy, and not focused on discovery.
Moreover, they require sophisticated authoring environments and cataloging
expertise. Geographic metadata standards target preservation and quality
measure use cases, but they do not provide for simple inter-institutional
sharing of metadata for discovery use cases. To this end, our schema reuses
elements from Dublin Core and GeoRSS to leverage their normative semantics,
community best practices, open-source software implementations, and
extensive examples already deployed in discovery contexts such as web
search and mapping. Finally, we discuss a Solr implementation of the schema
using a “geo” extension to MODS.

Ebooks without Vendors: Using Open Source Software to Create and Share
Meaningful Ebook Collections
Matt Weaver
The Community Cookbook project began with wondering how to take local
cookbooks in the library’s collection and create a recipe database. The
final website is both a recipe website and a collection of ebook versions of
local cookbooks. This article will discuss the use of open source software
at every stage in the project, which proves that an open source publishing
model is possible for any library.

Within Limits: mass-digitization from scratch
Pieter De Praetere
The provincial library of West-Vlaanderen (Belgium) is digitizing a large
part of its iconographic collection. For various technical and financial
reasons, no specialist software was used. FastScan is a set of VBS scripts
developed by the author using off-the-shelf software that was either
included in MS Windows (XP & 7) or already installed (ImageMagick,
IrfanView, littlecms, exiv2). This scripting package has boosted the
digitization effort immensely. The article shows what software was used,
the problems that occurred, and how the tools were scripted together.

A Web Service for File-Level Access to Disk Images
Sunitha Misra, Christopher A. Lee and Kam Woods
Digital forensics tools have many potential applications in the curation of
digital materials in libraries, archives and museums (LAMs). Open source
digital forensics tools can help LAM professionals to extract digital
contents from born-digital media and make more informed preservation
decisions. Many of these tools have ways to display the metadata of the
digital media, but few provide file-level access without having to mount
the device or use complex command-line utilities. This paper describes a
project to develop software that supports access to the contents of digital
media without having to mount or download the entire image. The work
examines two approaches to creating this tool: first, a graphical user
interface running on a local machine; second, a web-based application
running in a web browser. The project incorporates existing open source
forensics tools and libraries including The Sleuth Kit and libewf along
with the Flask web application framework and custom Python scripts to
generate web pages supporting disk image browsing.

Processing Government Data: ZIP Codes, Python, and OpenRefine
Frank Donnelly
While there is a vast amount of useful US government data on the web, some
of it is in a raw state that is not readily accessible to the average user.
Data librarians can improve accessibility and usability for their patrons
by processing data to create subsets of local interest and by appending
geographic identifiers to help users select and aggregate data. This case
study illustrates how census geography crosswalks, Python, and OpenRefine
were used to create spreadsheets of non-profit organizations in New York
City from the IRS Tax-Exempt Organization Masterfile. This paper
illustrates the utility of Python for data librarians and should be
particularly insightful for those who work with address-based data.

Indexing Bibliographic Database Content Using MariaDB and Sphinx Search
Server
Arie Nugraha
Fast retrieval of digital content has become mandatory for library and
archive information systems. Many software applications have emerged to
handle the indexing of digital content, from low-level ones such as Apache
Lucene, to more RESTful and web-services-ready ones such as Apache Solr and
ElasticSearch. Solr’s popularity among library software developers makes it
the “de facto” standard software for indexing digital content. For content
(full-text content or bibliographic description) already stored inside a
relational DBMS such as MariaDB (a fork of MySQL) or PostgreSQL, Sphinx
Search Server (Sphinx) is a suitable alternative. This article introduces
how to use Sphinx with MariaDB databases to index database content and
gives some examples of Sphinx API usage.

Solving Advanced Encoding Problems with FFMPEG
Josh Romphf
Previous articles in the Code4Lib Journal cover the capabilities of FFMPEG
in great detail. Given these excellent introductions, this article tackles
some of the common problems users might face, dissecting more complicated
commands and suggesting their possible uses.

HathiTrust Ingest of Locally Managed Content: A Case Study from the
University of Illinois at Urbana-Champaign
Kyle R. Rimkus & Kirk M. Hess
In March 2013, the University of Illinois at Urbana-Champaign Library
adopted a policy to more closely integrate the HathiTrust Digital Library
into its own infrastructure for digital collections. Specifically, the
Library decided that the HathiTrust Digital Library would serve as a
trusted repository for many of the library’s digitized book collections, a
strategy that favors relying on HathiTrust over locally managed access
solutions whenever this is feasible. This article details the thinking
behind this policy, as well as the challenges of its implementation,
focusing primarily on technical solutions for “remediating” hundreds of
thousands of image files to bring them in line with HathiTrust’s strict
specifications for deposit. This involved implementing HTFeed, a Perl 5
application developed at the University of Michigan for packaging content
for ingest into HathiTrust, and its many helper applications (JHOVE to
detect metadata problems, ExifTool to detect metadata issues and repair
missing image metadata, and Kakadu to create JPEG 2000 files), as well as a
file format conversion process using ImageMagick. Today, Illinois has over
1600 locally managed volumes queued for ingest, and has submitted over 2300
publicly available titles to the HathiTrust Digital Library.