Hi David, Eric ~



I'm going to step in and offer a few clarifying comments about Constellate (https://constellate.org/, the JSTOR-affiliated service Eric mentions!).



We are similar to, say, ProQuest's TDM Studio or Gale's Digital Scholar Lab in that we have a user interface for building datasets (with some visualizations) and a cloud-based compute environment.  However, our model is a little different.  Constellate allows anyone in the world to build datasets of content and download them -- you do not need to participate in Constellate or have a subscription to the original content.  Content in Constellate is designated as either rights-restricted or open.  When rights-restricted data is included in a dataset, that dataset only includes non-consumptive data, whereas for open content the dataset also includes the full text.  Over 3 million documents in Constellate from JSTOR, Portico, and third-party resources are open.  We also have a carve-out for JSTOR rights-restricted content, whereby, after a formal request review, we will package that full text up for researchers.  These services are not part of our subscription offering and are available to anyone.  You can read more about what content is included and what its permissions<https://constellate.org/docs/data-sources> are, as well as our dataset options<https://constellate.org/docs/dataset-options>.

We currently limit datasets to 25,000 documents for non-participants, but we are happy to help folks who need more content (most don't, however), and we are working to raise the UI limit to a larger number (probably around 200,000 -- roughly the size above which the files simply become too big for most people).  We don't get a lot of requests for larger datasets, so it hasn't become enough of a priority to bump the other items on our to-do list.
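
If it helps, here is a minimal sketch of reading a downloaded Constellate dataset in Python.  Treat the particulars -- the compressed JSON-lines file and field names like "unigramCount" -- as illustrative assumptions, and check them against the documentation linked above and your own download; the file name is a placeholder.

  # a minimal sketch, not an official Constellate client; the file name and
  # field names are assumptions -- verify against your own dataset
  import gzip, json
  from collections import Counter

  unigrams = Counter()
  with gzip.open("my-dataset.jsonl.gz", "rt", encoding="utf-8") as handle:
      for line in handle:
          document = json.loads(line)
          # non-consumptive features (word counts) are present for every
          # document; full text is only included for open content
          unigrams.update(document.get("unigramCount", {}))
  print(unigrams.most_common(25))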

Our primary focus is on teaching and learning - which is the real benefit of a Constellate subscription.  We believe that the ability to read, understand, and communicate data as information is one of the most essential skills for the future of education and employment. Because of that, we sought to build a text analysis program that enables every librarian and faculty member in all disciplines to teach these skills. Constellate combines the content and tools users need to perform text analysis alongside a defined curriculum, robust tutorials, live classes taught by text analysis experts, and an inspiring and supportive community of users.  Our participants get to send their community members to our classes<https://constellate.org/events/constellate-class-intermediate-python> and use Constellate (including the cloud-based compute environment) to teach!  We don't actually see much research happening in the Constellate Lab.  As Eric said, most researchers want to do that work locally in their own environments.



If you'd like to learn more about Constellate or have questions, just let me know!



~ Amy


--
Amy J. Kirchhoff (she/her)
Constellate<https://constellate.org/> Business Manager / Portico, JSTOR
Twitter: @AmyPlusFour

Take your research further with text and data analysis skills!





-----Original Message-----
From: Code for Libraries <[log in to unmask]> On Behalf Of Eric Lease Morgan
Sent: Tuesday, November 22, 2022 9:06 AM
To: [log in to unmask]
Subject: Re: [CODE4LIB] ProQuest TDM? Alternatives? Open source alternatives?





On Nov 21, 2022, at 5:12 PM, RD B <[log in to unmask]> wrote:



> We (Kelvin Smith Library, Case Western Reserve University) are

> considering the ProQuest TDM Studio:

>

>   https://about.proquest.com/en/products-services/TDM-Studio/

>

> I was curious if anyone here had any direct experience with the system

> they could share, or if there were alternatives that the community

> recommends and why.

>

> --

> R. David Beales - [log in to unmask] - 732-299-0390 Library, Earth,

> Sol System, Orion-Cygnus Arm of the Milky Way Galaxy, Laniakea

> Supercluster





A couple of years ago I experimented with TDM Studio, and I can report that it worked as advertised.



More specifically, Studio worked like the handful of similar services: one from LexisNexis, one from JSTOR, and one from HathiTrust. What does that mean? It means a person:



  1. searches the given collection

  2. results are subsetted to a secure location

  3. using tools and APIs provided by the vendor,

     analysis is done against the results

  4. results are exported

  5. repeat until done



Many times the tools and APIs require a working knowledge of the Python programming language, and then there is the curve of learning the specific tools. The tools usually include a number of modeling techniques: bibliography creation, ngram analysis, topic modeling, and full-text searching. After working in this area for more than a few years now, I believe these techniques ought to be considered rudimentary, and additional techniques such as the application of grammars, semantic indexing, and collocations ought to be included.
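
For example, collocations -- one of the techniques I wish were included -- can be computed with just a few lines of Python. The sketch below uses NLTK, and the input file name ("corpus.txt") is only a placeholder for any plain-text corpus:

  # a rough sketch of collocation extraction with NLTK; the file name is a
  # placeholder, and whitespace tokenization keeps the example simple
  from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

  with open("corpus.txt", encoding="utf-8") as handle:
      tokens = handle.read().lower().split()

  finder = BigramCollocationFinder.from_words(tokens)
  finder.apply_freq_filter(3)   # ignore pairs occurring fewer than 3 times
  print(finder.nbest(BigramAssocMeasures.likelihood_ratio, 10))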



All of the vendors have their hands tied by contract and copyright. Each vendor has made agreements with publishers not to freely share content, but it is not possible to do text mining, natural language processing, or data science with words sans the content. Consequently, each vendor implements a variation on Step #2, above. The process would be a h3ll of a lot easier if the student, researcher, or scholar could:



  1. search content

  2. select items of interest

  3. download selected items sans click,

     save, click, save, click, save, etc.

  4. use a wide variety of GUI tools,

     command-line tools, or programming

     languages to do the analysis



Here licensing is probably the limiting factor, not copyright.
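
To make the idea concrete, here is a sketch of what Step #3 could look like -- bulk downloading a short list of selected items instead of the click-save routine. The URLs are placeholders, and the whole thing assumes the license actually permits such harvesting:

  # a sketch of bulk downloading selected items; the URLs are placeholders,
  # and licensing is assumed to allow this kind of harvesting
  from pathlib import Path
  from urllib.request import urlretrieve

  urls = [
      "https://example.org/item-0001.pdf",
      "https://example.org/item-0002.pdf",
  ]
  folder = Path("downloads")
  folder.mkdir(exist_ok=True)
  for url in urls:
      urlretrieve(url, folder / url.rsplit("/", 1)[-1])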



Do I know of open source alternatives? No, not really, but I hope my Reader addresses some of these problems. Given a set of files of just about any number and just about any ilk, saved in a local folder/directory, the Reader:



  * converts the files to plain text

  * does all sorts of feature extraction against the result

  * distills the features into a data set (a "study carrel")

  * provides the means to compute against the data set, and

    the computing could be done with GUI tools (like OpenRefine

    or AntConc), command-line tools (like grep or jq), or

    programming libraries (like Python's NLTK or spaCy)



In the end the Reader supports all of the modeling techniques alluded to above as well as a few others. Consequently, a person can search any vendor for content of interest, download the results (through click, click, click), and do analysis against the result.
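
And once the documents have been distilled down to plain text, computing against them can be as simple as the following spaCy sketch. This is not the Reader's own toolbox, just the sort of thing a study carrel makes possible; the folder name is a placeholder:

  # a generic sketch, not part of the Reader itself; the folder name is a
  # placeholder for any directory of plain-text files
  from collections import Counter
  from pathlib import Path
  import spacy

  nlp = spacy.load("en_core_web_sm")
  lemmas = Counter()
  for file in Path("carrel/txt").glob("*.txt"):
      doc = nlp(file.read_text(encoding="utf-8"))
      lemmas.update(token.lemma_ for token in doc
                    if token.is_alpha and not token.is_stop)
  print(lemmas.most_common(25))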



Like all software, the Reader is never done and ought to be considered beta-ware. See:



  https://distantreader.org



HTH



P.S. David, nice signature.



--

Eric Lease Morgan

Navari Family Center for Digital Scholarship, Hesburgh Libraries, University of Notre Dame



574/631-8604

https://cds.library.nd.edu