Many of us teach or lead text analytics and data mining classes and projects, so some of you might find this open dataset helpful.
Please share widely. The dataset was created to be used by all and sundry in and out of the classroom.
HTRC is excited to announce the release of the Extracted Features 2.0 dataset! This new version of Extracted Features offers volume- and page-level data for 17+ million volumes in the HathiTrust Digital Library. The data include:
* Bibliographic metadata
* Computationally inferred metadata about the page, such as language and line counts
* Tokens (words), parts of speech, and their per-page counts
Overall, the dataset represents more than 6 billion pages of text from the digital library and includes nearly 3 trillion tokens from the corpus.
Not only does this release extend the number of volumes in HathiTrust available as Extracted Features, it also incorporates linked data such that names in the files are linked to external authorities when possible.
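For anyone planning a class exercise, here is a minimal Python sketch of how the per-page token and part-of-speech counts described above might be rolled up into volume-level word counts. The page records, the field names "seq" and "tokenPosCount", and the helper function names are simplified assumptions made for illustration only; please check the data schema documentation linked below for the actual file layout before relying on any of them.

```python
from collections import Counter

# Hypothetical, simplified page records. The field names "seq" and
# "tokenPosCount" are assumptions for this sketch; consult the published
# schema for the real structure of an Extracted Features file.
sample_pages = [
    {"seq": "00000001",
     "tokenPosCount": {"whale": {"NN": 2}, "the": {"DT": 5}}},
    {"seq": "00000002",
     "tokenPosCount": {"whale": {"NN": 1, "NNP": 1}, "sea": {"NN": 3}}},
]


def volume_token_counts(pages):
    """Sum each token's counts across all parts of speech and all pages."""
    totals = Counter()
    for page in pages:
        for token, pos_counts in page["tokenPosCount"].items():
            totals[token.lower()] += sum(pos_counts.values())
    return totals


def page_token_totals(pages):
    """Return the total number of tokens on each page, keyed by page sequence."""
    return {
        page["seq"]: sum(
            sum(pos_counts.values())
            for pos_counts in page["tokenPosCount"].values()
        )
        for page in pages
    }


counts = volume_token_counts(sample_pages)
per_page = page_token_totals(sample_pages)
print(counts.most_common(3))
print(per_page)
```

Because the dataset provides counts rather than running text, this kind of bag-of-words aggregation is the natural starting point; word order within a page is not available.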
Learn more about the release and data schema: https://wiki.htrc.illinois.edu/x/kYC2B
Download Extracted Features 2.0 files: https://wiki.htrc.illinois.edu/x/_QGGAQ
Contact [log in to unmask] with any questions.