LISTSERV 16.5 - CODE4LIB Archives

On Nov 21, 2022, at 5:12 PM, RD B <[log in to unmask]> wrote:

> We (Kelvin Smith Library, Case Western Reserve University) are considering
> the ProQuest TDM Studio:
> 
>   https://about.proquest.com/en/products-services/TDM-Studio/
> 
> I was curious if anyone here had any direct experience with the system they
> could share, or if there were alternatives that the community recommends
> and why.
> 
> -- 
> R. David Beales - [log in to unmask] - 732-299-0390
> Library, Earth, Sol System, Orion-Cygnus Arm of the Milky Way Galaxy,
> Laniakea Supercluster


A couple of years ago I experimented with TDM Studio, and I can report that it worked as advertised. 

More specifically, Studio worked like the handful of similar services: one from LexisNexis, one from JSTOR, and the one from the HathiTrust. What does that mean? It means a person:

  1. searches the given collection
  2. has the results subsetted to a secure location
  3. analyzes the results using tools and APIs
     provided by the vendor
  4. exports the results
  5. repeats until done

Many times the tools and APIs require a working knowledge of the Python programming language, and then there is the learning curve of the specific tools. The tools usually include a number of modeling techniques: bibliography creation, ngram analysis, topic modeling, and full text searching. Having worked in this area for more than a few years now, I think these techniques ought to be considered rudimentary, and additional techniques such as the application of grammars, semantic indexing, and collocations ought to be included.
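To give a sense of how rudimentary a technique like ngram analysis is, here is a minimal sketch in plain Python. It is only an illustration, not code from TDM Studio or any other vendor; the function names and the sample sentence are made up for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def ngrams(tokens, n=2):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_ngrams(text, n=2, k=3):
    """Count the ngrams in a text and return the k most frequent."""
    return Counter(ngrams(tokenize(text), n)).most_common(k)

# Hypothetical sample text, used only for illustration.
sample = "text mining and data mining are related; text mining uses text"
print(top_ngrams(sample, n=2, k=1))
```

Collocation measures build on counts like these by asking whether a pair of words co-occurs more often than chance would predict; libraries such as NLTK offer that sort of scoring.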

All of the vendors have their hands tied by contract and copyright. Each vendor has made agreements with publishers not to freely share content, but it is not possible to do text mining, natural language processing, or data science with words sans the content. Consequently each vendor implements a variation on Step #2, above. The process would be a h3ll of a lot easier if the student, researcher, or scholar could:

  1. search content
  2. select items of interest
  3. download selected items sans click,
     save, click, save, click, save, etc.
  4. use a wide variety of GUI tools,
     command-line tools, or programming
     languages to do the analysis

Here licensing is probably the limiting factor, not copyright. 

Do I know of open source alternatives? No, not really, but I hope my Reader addresses some of these problems. Given a set of files of just about any number and just about any ilk and saved in a local folder/directory, the Reader:

  * converts the files to plain text
  * does all sorts of feature extraction against the result
  * distills the features into a data set (a "study carrel")
  * provides the means to compute against the data set, and
    the computing could be done with GUI tools (like OpenRefine
    or AntConc), command-line tools (like grep or jq), or
    programming libraries (like Python's NLTK or spaCy)
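As a rough illustration of what "computing against" a folder of plain text files can look like, here is a minimal Python sketch that tallies word frequencies across every .txt file in a directory. It does not use the Reader's own code or reflect how a study carrel is laid out; the function name, the file names, and the throwaway demo folder are all assumptions made for the example.

```python
import re
import tempfile
from collections import Counter
from pathlib import Path

def word_frequencies(folder):
    """Count word tokens across every .txt file in a folder."""
    counts = Counter()
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts

# Demonstration against a temporary folder holding two tiny, made-up files.
with tempfile.TemporaryDirectory() as demo:
    Path(demo, "a.txt").write_text("Libraries license content.", encoding="utf-8")
    Path(demo, "b.txt").write_text("Content is licensed, not owned.", encoding="utf-8")
    demo_counts = word_frequencies(demo)
    print(demo_counts.most_common(1))
```

The same loop could feed the text to NLTK or spaCy instead of a simple counter once the files are on the local disk.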

In the end the Reader supports all of the modeling techniques alluded to above as well as a few others. Consequently, a person can search any vendor for content of interest, download the results (through click, click, click), and do analysis against the result.

Like all software, the Reader is never done and ought to be considered beta-ware. See:

  https://distantreader.org

HTH

P.S. David, nice signature.

--
Eric Lease Morgan
Navari Family Center for Digital Scholarship
Hesburgh Libraries
University of Notre Dame

574/631-8604
https://cds.library.nd.edu