On Sep 27, 2017, at 8:18 AM, Eric Lease Morgan wrote:
>>>>> Does anybody here know how to access a Python compressed sparse row format (CSR) object? -> http://bit.ly/2fPj42V
>>>> Do you have a link to the code you're using?
>>> Yes, thank you. See -> http://dh.crc.nd.edu/sandbox/htrc-workset-browser/bin/topic-model.py --ELM
>> I'm not familiar with the APIs in question, but--if I'm looking at this right, your CSR matrix (tfidf) looks like it would have columns corresponding with topics and rows corresponding with documents...
>> Jason, this is REALLY close, and I have begun to include it at the very end of my code. Thank you! ‘More later.
Jason’s suggestions were very helpful, and after some more hacking on my topic modeling program, I am now calling it version 1.0. See -> http://dh.crc.nd.edu/sandbox/htrc-workset-browser/bin/topic-model.py
To resolve my issues I had to:
1. learn that my vectorizer (TfidfVectorizer) could take
a list of file names as input (input='filename'), thus
the rows of the resulting matrices line up with the
file names
2. exploit Jason’s suggestion to extract the file names
from a list of ranked (sorted) topics
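
For anyone following along, here is a minimal sketch of those two steps. It is not the code from topic-model.py. The use of NMF as the topic model, the English stop-word list, and the function names build_doc_topic and rank_files_by_topic are all assumptions made for illustration:

```python
def build_doc_topic(filenames, n_topics):
    """Return a (documents x topics) weight matrix for the given files."""
    # Imported here so the ranking helper below works without scikit-learn.
    from sklearn.decomposition import NMF
    from sklearn.feature_extraction.text import TfidfVectorizer

    # input="filename" tells the vectorizer to open and read each path itself,
    # so row i of the resulting CSR matrix corresponds to filenames[i].
    vectorizer = TfidfVectorizer(input="filename", stop_words="english")
    tfidf = vectorizer.fit_transform(filenames)

    # Assumption: NMF stands in for whatever topic model is actually used.
    model = NMF(n_components=n_topics, random_state=0)
    return model.fit_transform(tfidf)


def rank_files_by_topic(doc_topic, filenames, n_files):
    """For each topic (column), list the n_files file names with the largest weights."""
    n_topics = len(doc_topic[0])
    ranked = []
    for topic in range(n_topics):
        # Sort row indices by this topic's weight, heaviest first.
        order = sorted(range(len(filenames)),
                       key=lambda row: doc_topic[row][topic],
                       reverse=True)
        ranked.append([filenames[row] for row in order[:n_files]])
    return ranked
```

Given a list of paths in filenames, something like rank_files_by_topic(build_doc_topic(filenames, 5), filenames, 3) should return three file names for each of five topics. The ranking helper only relies on row i of the matrix belonging to filenames[i], which is why passing the file names straight to the vectorizer matters.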
If you have a directory of plain text files, and if you have Python (as well as all of its friends) installed, then you can run the program something like this:
$ ./topic-model.py ./shamanism/text/ 5 5 3
The result will be a list of topics (think “subject terms”) and the files most strongly associated with each:
* god; church; christian; divine; christ
* god; gods; primitive; spirits; worship
* social; cultural; culture; group; economic
* russian; siberia; river; south; russia
* cf; hebrew; el; text; babylonian
What’s really cool is that I can now search my corpus for terms like god, church, or Christian, and the ranked files float to the top. This particular topic modeling process works for me. Now I can “turn the knobs” to improve the results, as well as consider plotting the results on a Cartesian plane to visualize similarity.
Fun with text mining.
Eric Lease Morgan
University of Notre Dame