On Sep 27, 2017, at 8:18 AM, Eric Lease Morgan <[log in to unmask]> wrote:
>>>>> Does anybody here know how to access a Python compressed sparse row format (CSR) object? -> http://bit.ly/2fPj42V
>>>>
>>>> Do you have a link to the code you're using?
>>>
>>> Yes, thank you. See —> http://dh.crc.nd.edu/sandbox/htrc-workset-browser/bin/topic-model.py —ELM
>>
>> I'm not familiar with the APIs in question, but--if I'm looking at this right, your CSR matrix (tfidf) looks like it would have columns corresponding with topics and rows corresponding with documents...
>>
>> Jason, this is REALLY close, and I have begun to include it at the very end of my code. Thank you! More later.
Jason’s suggestions were very helpful, and after some hacking on my topic modeling program I am calling it version 1.0. See -> http://dh.crc.nd.edu/sandbox/htrc-workset-browser/bin/topic-model.py
To resolve my issues I had to:
1. learn that my vectorizer (TfidfVectorizer) could take
   a list of file names as input, so the file names
   come along for the ride in the resulting matrices
2. exploit Jason’s suggestion to extract the file names
   from a list of ranked (sorted) topics
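For anyone following along, the second step can be sketched in a few lines of plain Python. This is only an illustration, not the code in topic-model.py: the function name, the file names, and the weights are all made up, and it assumes a document-by-topic weight matrix has already been computed with its rows in the same order as the list of file names given to the vectorizer.

```python
def top_files_per_topic(file_names, doc_topic, n_files=3):
    """Return, for each topic, the n_files file names with the largest weights.

    file_names -- list of paths, in the order they were given to the vectorizer
    doc_topic  -- one row per file, one column per topic (hypothetical weights)
    """
    n_topics = len(doc_topic[0])
    ranked = []
    for topic in range(n_topics):
        # Sort the row indexes by their weight for this topic, heaviest first.
        order = sorted(range(len(file_names)),
                       key=lambda row: doc_topic[row][topic],
                       reverse=True)
        # Row i of the matrix belongs to file i, so the index maps back to a name.
        ranked.append([file_names[row] for row in order[:n_files]])
    return ranked


# Hypothetical data: four files, two topics.
files = ["a.txt", "b.txt", "c.txt", "d.txt"]
weights = [
    [0.9, 0.1],
    [0.2, 0.7],
    [0.6, 0.3],
    [0.0, 0.8],
]

for topic, names in enumerate(top_files_per_topic(files, weights, n_files=2)):
    print(topic, names)
```

The point is simply that the row index is the only link between a weight and its file, which is why keeping the file names in input order matters.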
If you have a directory of plain text files, and if you have Python (as well as all of its friends) installed, then you can run the program something like this:
$ ./topic-model.py ./shamanism/text/ 5 5 3
The result will be a list of topics (think “subject terms”) and their most associated files:
* god; church; christian; divine; christ
  o ./shamanism/text/uva.x002756372.txt
  o ./shamanism/text/uc1.$b43226.txt
  o ./shamanism/text/mdp.39015062241685.txt
* god; gods; primitive; spirits; worship
  o ./shamanism/text/hvd.ah59xi.txt
  o ./shamanism/text/uc2.ark+=13960=t3mw2kc4t.txt
  o ./shamanism/text/mdp.39015025016869.txt
* social; cultural; culture; group; economic
  o ./shamanism/text/uc1.b4558415.txt
  o ./shamanism/text/uc1.$b604512.txt
  o ./shamanism/text/mdp.39015003464057.txt
* russian; siberia; river; south; russia
  o ./shamanism/text/uc2.ark+=13960=t1qf8s666.txt
  o ./shamanism/text/mdp.39015068416885.txt
  o ./shamanism/text/njp.32101068979754.txt
* cf; hebrew; el; text; babylonian
  o ./shamanism/text/nyp.33433081840559.txt
  o ./shamanism/text/wu.89097203632.txt
  o ./shamanism/text/uc2.ark+=13960=t84j0cs6h.txt
What’s really cool is that I can now search my corpus for terms like god, church, or christian, and the ranked files float to the top. This particular topic modeling process works for me. Now I can “turn the knobs” to improve the results, and I can consider plotting the results on a Cartesian plane to visualize similarity.
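One way such a search could work, sketched under assumptions: the topics and files are copied from the first two entries of the sample output above, but the data structure and the files_for_term function are hypothetical and are not taken from topic-model.py.

```python
# Each topic is a (terms, ranked files) pair; values copied from the sample output.
topics = [
    (["god", "church", "christian", "divine", "christ"],
     ["./shamanism/text/uva.x002756372.txt",
      "./shamanism/text/uc1.$b43226.txt",
      "./shamanism/text/mdp.39015062241685.txt"]),
    (["god", "gods", "primitive", "spirits", "worship"],
     ["./shamanism/text/hvd.ah59xi.txt",
      "./shamanism/text/uc2.ark+=13960=t3mw2kc4t.txt",
      "./shamanism/text/mdp.39015025016869.txt"]),
]


def files_for_term(term, topics):
    """Return the ranked files of every topic whose terms include the query term."""
    term = term.lower()  # topic terms are lowercase in the output
    hits = []
    for terms, files in topics:
        if term in terms:
            for name in files:
                if name not in hits:  # keep first-seen order, skip duplicates
                    hits.append(name)
    return hits


for name in files_for_term("church", topics):
    print(name)
```

A term such as "god" appears in both topics, so it returns the files from both, in topic order.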
Fun with text mining.
—
Eric Lease Morgan
University of Notre Dame