LISTSERV 16.5 - CODE4LIB Archives

Does anybody here know how to query a compressed sparse row (CSR) format object in Python? [1]

I am using Python to do a bit of topic modeling (think “classification”), and so far the results are more than plausible. However, the output lists only the topics, not the documents that correspond to each topic. Along the way, my script creates a compressed sparse row format object, and it looks something like this:

  (0, 16099)	0.055924002143
  (0, 9497)	0.0256051292226
  (0, 16202)	0.140746540109
  (0, 38982)	0.000842900625312
  :	:
  (309, 40805)	0.0435077792741
  (309, 45679)	0.0435077792741
  (309, 19462)	0.0435077792741
  (309, 8346)	0.0435077792741
  (309, 31204)	0.0435077792741

Here the first column denotes a document identifier, the second column denotes a topic identifier, and the third column denotes the score of the topic in the document. In the example above, document #0 is a lot about topic #16202 but not a lot about topic #38982.
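To make the layout concrete, here is a minimal sketch that builds a tiny object of the same kind. It assumes the object is a scipy.sparse.csr_matrix; the name toy and all of the numbers are made up for illustration and are not taken from my real data.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical toy data: 3 documents (rows) by 5 topics (columns).
rows = np.array([0, 0, 1, 2, 2])                 # document identifiers
cols = np.array([1, 4, 2, 0, 4])                 # topic identifiers
vals = np.array([0.14, 0.02, 0.30, 0.05, 0.25])  # scores

# Only the five non-zero scores are stored; every other cell is an implicit zero.
toy = csr_matrix((vals, (rows, cols)), shape=(3, 5))

print(toy)        # lists each stored entry as (row, column) followed by its value
print(toy.shape)  # (3, 5)
print(toy.nnz)    # 5 stored entries
```

Each printed entry corresponds to one document/topic pair with a non-zero score, which is why most pairs never appear in the listing.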

I want to query my CSR object. For example, given a topic identifier (e.g., 48692), return a list of all document identifiers and scores from the object. I will then sort the scores to find the documents that most significantly use the given topic.

I can’t for the life of me figure out how to get what I need. I can get the value of a specific cell like this, where tfidf is my CSR object:

  >>> print( tfidf[ 309, 31204 ] )
  0.0435077792741
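The same square-bracket indexing also accepts a slice, so one possible approach is to pull out a whole column and sort it. This is only a sketch under the assumption that the object is a scipy.sparse.csr_matrix; the helper name documents_for_topic and the toy numbers are hypothetical.

```python
from scipy.sparse import csr_matrix

def documents_for_topic(matrix, topic_id):
    """Return (document_id, score) pairs for one column, highest score first."""
    # Slicing one column gives an (n_documents x 1) sparse matrix;
    # converting it to COO format exposes the row indices and values directly.
    column = matrix[:, topic_id].tocoo()
    pairs = zip(column.row.tolist(), column.data.tolist())
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy data: 3 documents by 5 topics.
rows = [0, 0, 1, 2, 2]
cols = [1, 4, 2, 0, 4]
vals = [0.14, 0.02, 0.30, 0.05, 0.25]
toy = csr_matrix((vals, (rows, cols)), shape=(3, 5))

ranked = documents_for_topic(toy, 4)
print(ranked)  # [(2, 0.25), (0, 0.02)]
```

CSR is laid out for fast row access, so if many topics need to be looked up, converting once with tocsc() and slicing the result should make the column lookups cheaper.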

Any help would be greatly appreciated.

[1] CSR - http://bit.ly/2fPj42V

—
Eric Morgan