On Sep 27, 2017, at 8:18 AM, Eric Lease Morgan wrote:

>>>>> Does anybody here know how to access a Python compressed sparse row format (CSR) object? -> http://bit.ly/2fPj42V
>>>>
>>>> Do you have a link to the code you're using?
>>>
>>> Yes, thank you. See -> http://dh.crc.nd.edu/sandbox/htrc-workset-browser/bin/topic-model.py --ELM
>>
>> I'm not familiar with the APIs in question, but--if I'm looking at this right, your CSR matrix (tfidf) looks like it would have columns corresponding with topics and rows corresponding with documents...
>>
>> Jason, this is REALLY close, and I have begun to include it at the very end of my code. Thank you! More later.

Jason’s suggestions were very helpful, and after hacking on my topic modeling program I am labeling it version 1.0. See -> http://dh.crc.nd.edu/sandbox/htrc-workset-browser/bin/topic-model.py

To resolve my issues I had to:

  1. learn that my vectorizer (TfidfVectorizer) can take a list of file names as input, so the file names come along for the ride in the resulting matrices
  2. exploit Jason’s suggestion to extract the file names from a list of ranked (sorted) topics

If you have a directory of plain text files, and if you have Python (as well as all of its friends) installed, then you can run the program something like this:

  $ ./topic-model.py ./shamanism/text/ 5 5 3

The result will be a list of topics (think “subject terms”) and their most associated files:

  * god; church; christian; divine; christ
    o ./shamanism/text/uva.x002756372.txt
    o ./shamanism/text/uc1.$b43226.txt
    o ./shamanism/text/mdp.39015062241685.txt

  * god; gods; primitive; spirits; worship
    o ./shamanism/text/hvd.ah59xi.txt
    o ./shamanism/text/uc2.ark+=13960=t3mw2kc4t.txt
    o ./shamanism/text/mdp.39015025016869.txt

  * social; cultural; culture; group; economic
    o ./shamanism/text/uc1.b4558415.txt
    o ./shamanism/text/uc1.$b604512.txt
    o ./shamanism/text/mdp.39015003464057.txt

  * russian; siberia; river; south; russia
    o ./shamanism/text/uc2.ark+=13960=t1qf8s666.txt
    o ./shamanism/text/mdp.39015068416885.txt
    o ./shamanism/text/njp.32101068979754.txt

  * cf; hebrew; el; text; babylonian
    o ./shamanism/text/nyp.33433081840559.txt
    o ./shamanism/text/wu.89097203632.txt
    o ./shamanism/text/uc2.ark+=13960=t84j0cs6h.txt

What’s really cool is that I can now search my corpus for terms like god, church, or Christian, and the ranked files float to the top. This particular topic modeling process works for me, and now I can “turn the knobs” to improve the results, as well as consider plotting the results on a Cartesian plane to visualize similarity.

Fun with text mining.

--
Eric Lease Morgan
University of Notre Dame
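
For anyone who wants to try the two steps listed above without reading the whole script, here is a minimal sketch. It is not the code in topic-model.py: the function names (top_indices, rank_topics, model_directory) are hypothetical, and the use of NMF, the "*.txt" glob, English stop words, and decode_error="ignore" are all my assumptions. The one thing taken from the post is that TfidfVectorizer accepts a list of file names, so row i of every resulting matrix belongs to the i-th file name.

```python
import glob
import os


def top_indices(weights, n):
    """Return the positions of the n largest values, largest first."""
    return sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:n]


def rank_topics(doc_topic, topic_term, filenames, terms, n_terms, n_files):
    """Pair each topic's top terms with the files that weigh most on it.

    doc_topic  -- one row per document, one column per topic
    topic_term -- one row per topic, one column per vocabulary term
    Both may be nested lists or NumPy arrays.
    """
    results = []
    for t, term_weights in enumerate(topic_term):
        label = [terms[i] for i in top_indices(term_weights, n_terms)]
        column = [row[t] for row in doc_topic]  # weight of topic t in every file
        files = [filenames[i] for i in top_indices(column, n_files)]
        results.append((label, files))
    return results


def model_directory(directory, n_topics):
    """Vectorize every *.txt file in a directory and factor it into topics.

    Assumption: NMF stands in for whatever decomposition the original uses.
    """
    # Imported here so the ranking helpers above work without scikit-learn.
    from sklearn.decomposition import NMF
    from sklearn.feature_extraction.text import TfidfVectorizer

    filenames = sorted(glob.glob(os.path.join(directory, "*.txt")))
    # input="filename" makes the vectorizer open and read each file itself.
    vectorizer = TfidfVectorizer(
        input="filename", stop_words="english", decode_error="ignore"
    )
    tfidf = vectorizer.fit_transform(filenames)  # sparse CSR: files x terms
    nmf = NMF(n_components=n_topics, random_state=0)
    doc_topic = nmf.fit_transform(tfidf)  # dense: files x topics
    # get_feature_names_out() requires scikit-learn 1.0 or later.
    terms = vectorizer.get_feature_names_out()
    return doc_topic, nmf.components_, filenames, terms
```

To get a listing shaped like the one above, one could call model_directory("./shamanism/text/", 5), pass the four returned values to rank_topics along with 5 and 3, and print each label joined with "; " followed by its files. The topics will not necessarily match the ones shown in the post, since the decomposition and its settings here are guesses.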