LISTSERV 16.5 - CODE4LIB Archives

On Sep 27, 2017, at 8:18 AM, Eric Lease Morgan wrote:

>>>>> Does anybody here know how to access a Python compressed sparse row format (CSR) object? -> http://bit.ly/2fPj42V
>>>> 
>>>> Do you have a link to the code you're using?
>>> 
>>> Yes, thank you. See -> http://dh.crc.nd.edu/sandbox/htrc-workset-browser/bin/topic-model.py  --ELM
>> 
>> I'm not familiar with the APIs in question, but--if I'm looking at this right, your CSR matrix (tfidf) looks like it would have columns corresponding with topics and rows corresponding with documents...
>> 
>> Jason, this is REALLY close, and I have begun to include it at the very end of my code. Thank you! ‘More later.


Jason’s suggestions were very helpful. After some more hacking on my topic modeling program, I am labeling it version 1.0. See -> http://dh.crc.nd.edu/sandbox/htrc-workset-browser/bin/topic-model.py

To resolve my issues I had to:

  1. learn that my vectorizer (TfidfVectorizer) could take
     a list of file names as input, so each row of the
     resulting matrices corresponds, in order, to one
     file name

  2. exploit Jason’s suggestion to extract the file names
     from a list of ranked (sorted) topics

If you have a directory of plain text files, and if you have Python (as well as all of its friends) installed, then you can run the program something like this:

  $ ./topic-model.py ./shamanism/text/ 5 5 3

The result will be a list of topics (think “subject terms”) and their most associated files:

  * god; church; christian; divine; christ
    o ./shamanism/text/uva.x002756372.txt
    o ./shamanism/text/uc1.$b43226.txt
    o ./shamanism/text/mdp.39015062241685.txt

  * god; gods; primitive; spirits; worship
    o ./shamanism/text/hvd.ah59xi.txt
    o ./shamanism/text/uc2.ark+=13960=t3mw2kc4t.txt
    o ./shamanism/text/mdp.39015025016869.txt

  * social; cultural; culture; group; economic
    o ./shamanism/text/uc1.b4558415.txt
    o ./shamanism/text/uc1.$b604512.txt
    o ./shamanism/text/mdp.39015003464057.txt

  * russian; siberia; river; south; russia
    o ./shamanism/text/uc2.ark+=13960=t1qf8s666.txt
    o ./shamanism/text/mdp.39015068416885.txt
    o ./shamanism/text/njp.32101068979754.txt

  * cf; hebrew; el; text; babylonian
    o ./shamanism/text/nyp.33433081840559.txt
    o ./shamanism/text/wu.89097203632.txt
    o ./shamanism/text/uc2.ark+=13960=t84j0cs6h.txt
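
Judging from the invocation and the listing above, the three integers look like the number of topics, the number of terms per topic, and the number of files per topic; which of the two 5s is which is a guess. Below is a standard-library-only sketch of reading such arguments and printing results in the same layout. The helpers parse_args, list_text_files, and format_topics are hypothetical and not taken from topic-model.py.

```python
import glob
import os


def parse_args(argv):
    """Split argv into (directory, n_topics, n_terms, n_files)."""
    if len(argv) != 5:
        raise SystemExit(f"Usage: {argv[0]} <directory> <topics> <terms> <files>")
    directory = argv[1]
    # Assumed order of the integers: topics, terms per topic, files per topic
    n_topics, n_terms, n_files = (int(value) for value in argv[2:5])
    return directory, n_topics, n_terms, n_files


def list_text_files(directory):
    """Return the sorted .txt paths found directly inside directory."""
    return sorted(glob.glob(os.path.join(directory, "*.txt")))


def format_topics(results):
    """Render (terms, files) pairs as a bulleted topic listing."""
    lines = []
    for terms, files in results:
        lines.append("  * " + "; ".join(terms))
        lines.extend("    o " + path for path in files)
        lines.append("")  # blank line between topics
    return "\n".join(lines)


if __name__ == "__main__":
    # One topic copied from the listing above, just to show the layout
    sample = [(
        ["russian", "siberia", "river", "south", "russia"],
        ["./shamanism/text/njp.32101068979754.txt"],
    )]
    print(format_topics(sample))
```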

What’s really cool is that I can now search my corpus for terms like god, church, or Christian, and the ranked files float to the top. This particular topic modeling process works for me. Now I can “turn the knobs” to improve the results, as well as consider plotting the results on a Cartesian plane to visualize similarity.
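
One possible reading of "search my corpus", sketched under assumptions: score each file by the summed TF-IDF weight of the query terms and list the best scorers first. The helper search_files is hypothetical, and topic-model.py may do its searching quite differently.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def search_files(filenames, query_terms, n_files=3):
    """Return (path, score) pairs for the files that best match query_terms."""
    vectorizer = TfidfVectorizer(input="filename", stop_words="english")
    tfidf = vectorizer.fit_transform(filenames)  # CSR matrix, one row per file

    # Map each query term to its column; terms missing from the corpus are skipped
    vocabulary = vectorizer.vocabulary_
    columns = [vocabulary[t.lower()] for t in query_terms if t.lower() in vocabulary]
    if not columns:
        return []

    # Sum the selected columns row by row to get one score per file
    scores = np.asarray(tfidf[:, columns].sum(axis=1)).ravel()
    ranked = scores.argsort()[::-1][:n_files]
    return [(filenames[i], float(scores[i])) for i in ranked if scores[i] > 0]
```

Query terms are lowercased because the vectorizer lowercases the corpus by default, so "Christian" and "christian" hit the same column.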

Fun with text mining. 

--
Eric Lease Morgan
University of Notre Dame