
On Sep 26, 2017, at 4:41 PM, Thomale, Jason <[log in to unmask]> wrote:

>>>> Does anybody here know how to access a Python compressed sparse row format (CSR) object? [1]
>>>> 
>>>> [1] CSR - http://bit.ly/2fPj42V
>>> 
>>> Do you have a link to the code you're using?
>> 
>> Yes, thank you. See --> http://dh.crc.nd.edu/sandbox/htrc-workset-browser/bin/topic-model.py  --ELM
> 
> I'm not familiar with the APIs in question, but--if I'm looking at this right, your CSR matrix (tfidf) looks like it would have columns corresponding with topics and rows corresponding with documents. If that's the case, you could maybe do something like this:
> 
>   1. Use tfidf.getcol() to get the column corresponding
>      to your chosen topic. Looks like that should give you a
>      single-column (n x 1) sparse matrix of all document
>      scores for that topic.
> 
>   2. Cast that to a dense array of scores using .toarray(),
>      flatten it with .ravel(), and then convert it to a
>      plain list with .tolist(). (I think?)
> 
>   3. Use a list comprehension and "enumerate" to generate
>      explicit doc IDs based on each document's position in
>      the list, creating a list of 2-element lists or tuples,
>      (doc_id, score). While you're at it, you could filter
>      the list comprehension to give you only the documents
>      with scores that are greater than 0, or some other
>      threshold.
> 
>   4. Pass the results through the built-in "sorted"
>      function, with reverse=True, to sort your list of
>      tuples by score, highest first.
> 

> >>> topic = 9497
> >>> score_thresh = 0
> >>> topic_scores = tfidf.getcol(topic).toarray().ravel().tolist()
> >>> docs_and_scores = [(doc_id, score) for doc_id, score in enumerate(topic_scores) if score > score_thresh]
> >>> most_relevant_docs = sorted(docs_and_scores, key=lambda x: x[1], reverse=True)
> 
> The resulting "most_relevant_docs" variable should be a list of tuples that looks something like this (for example):
> [(102, 0.9), (33, 0.875), (365, 0.874), ...]
> 
> Not sure if that's helpful...? There's probably a more numpy/scipy way of doing the above using actual numpy array methods (especially the 4th line).
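
For what it's worth, here is a rough sketch of what the "more numpy/scipy way" might look like. It is only an illustration under assumptions: the tiny `tfidf` matrix below is made-up toy data standing in for the real one (rows as documents, columns as topics), and `top_docs` is a hypothetical helper name, not something from topic-model.py.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy stand-in for the thread's `tfidf` matrix (assumed layout:
# one row per document, one column per topic). Values are invented.
tfidf = csr_matrix(np.array([
    [0.0, 0.9, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 0.3, 0.2],
    [0.0, 0.7, 0.0],
]))

def top_docs(matrix, topic, score_thresh=0.0):
    """Return (doc_id, score) pairs for one column, highest score first."""
    # Pull out the column and flatten it to a dense 1-D array of scores.
    scores = matrix.getcol(topic).toarray().ravel()
    # Row indices (document IDs) whose score exceeds the threshold.
    doc_ids = np.flatnonzero(scores > score_thresh)
    # Sort those documents by descending score.
    order = np.argsort(-scores[doc_ids], kind="stable")
    return [(int(d), float(scores[d])) for d in doc_ids[order]]

most_relevant_docs = top_docs(tfidf, topic=1)
print(most_relevant_docs)  # [(0, 0.9), (3, 0.7), (2, 0.3)]
```

The filtering and sorting happen on the numpy array, so the Python-level list is only built once at the end, for the documents that survive the threshold.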


Jason, this is REALLY close, and I have begun to include it at the very end of my code. Thank you! ‘More later. code4lib++  --Eric Morgan