I'm not familiar with the APIs in question, but if I'm looking at this right, your CSR matrix (tfidf) looks like it would have columns corresponding with topics and rows corresponding with documents. If that's the case, you could maybe do something like this:
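1. Use tfidf.getcol() to get the column corresponding to your chosen topic. Looks like that should give you a single-column (n x 1) sparse matrix holding every document's score for that topic.
2. Convert that to a dense array of scores using .toarray(), flatten it to one dimension with .ravel() (otherwise each score ends up wrapped in its own 1-element list), and then turn it into a plain list with .tolist().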
3. Use a list comprehension and "enumerate" to generate explicit doc IDs based on each document's position in the list, creating a list of 2-element lists or tuples, (doc_id, score). While you're at it, you could filter the list comprehension to give you only the documents with scores that are greater than 0, or some other threshold.
4. Pass the results through the built-in "sorted" function to sort your list of tuples based on score.
>>> topic = 9497
>>> score_thresh = 0
>>> topic_scores = tfidf.getcol(topic).toarray().ravel().tolist()
>>> docs_and_scores = [(doc_id, score) for doc_id, score in enumerate(topic_scores) if score > score_thresh]
>>> most_relevant_docs = sorted(docs_and_scores, key=lambda x: x[1], reverse=True)
The resulting "most_relevant_docs" variable should be a list of tuples that looks something like this (for example):
[(102, 0.9), (33, 0.875), (365, 0.874), ...]
Not sure if that's helpful...? There's probably a more numpy/scipy way of doing the above using actual numpy array methods (especially the 4th line).
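For what it's worth, here's a rough sketch of what a more numpy-flavored version might look like. It builds a tiny made-up matrix to stand in for tfidf (the values and the rank_docs helper name are just my inventions for illustration, not anything from the topic-model.py script), slices out one column, and uses a boolean mask plus argsort in place of the list comprehension and sorted():

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy stand-in for the real tfidf matrix: rows are documents, columns are
# topics. These values are made up purely for illustration.
tfidf = csr_matrix(np.array([
    [0.0, 0.9,   0.0],
    [0.2, 0.0,   0.0],
    [0.0, 0.875, 0.1],
    [0.0, 0.0,   0.3],
]))


def rank_docs(matrix, col, score_thresh=0.0):
    """Return (doc_id, score) tuples for one column, highest score first.

    Hypothetical helper, not part of scipy or of the original script.
    """
    # Slice out a single column and flatten it to a 1-D array of scores.
    scores = matrix[:, col].toarray().ravel()
    # Keep only the row indices (doc IDs) whose score clears the threshold.
    doc_ids = np.flatnonzero(scores > score_thresh)
    # Reorder those doc IDs by descending score.
    order = np.argsort(-scores[doc_ids], kind="stable")
    doc_ids = doc_ids[order]
    return [(int(d), float(scores[d])) for d in doc_ids]


most_relevant_docs = rank_docs(tfidf, 1)
print(most_relevant_docs)
```

Note that this still converts the whole column to a dense array, which should be fine for one column at a time but might be worth rethinking if you have a very large number of documents.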
Jason
-----Original Message-----
From: Code for Libraries [mailto:[log in to unmask]] On Behalf Of Eric Lease Morgan
Sent: Tuesday, September 26, 2017 12:33 PM
To: [log in to unmask]
Subject: [EXT] Re: [CODE4LIB] accessing a python compressed sparse row format object
On Sep 26, 2017, at 1:28 PM, Andromeda Yelton <[log in to unmask]> wrote:
>> Does anybody here know how to access a Python compressed sparse row format
>> (CSR) object? [1]
>>
>> [1] CSR - http://bit.ly/2fPj42V
>
> Do you have a link to the code you're using?
Yes, thank you. See —> http://dh.crc.nd.edu/sandbox/htrc-workset-browser/bin/topic-model.py —ELM