On Sep 16, 2019, at 12:20 PM, Athina Livanos-Propst wrote:
> I'm starting to think around a project that would involve key terms from other types of text (transcripts, captions, documents). I'm basically trying to build a tool that I can use to extract key terms from larger strings of text, i.e. pull out the important words from a larger sentence.
Fun!
Keyword extraction comes in many forms, each with its own strengths & weaknesses, but to do any such task, one's data MUST first be transformed into plain text. Here are three approaches:
  * Frequencies - Counting & tabulating each & every ngram in a text, sans stop words, is a good place to start. It is rather unsophisticated, especially for uni-grams (one-word "phrases"), but the frequency of bi-grams, tri-grams, etc., can be quite insightful. If the frequency of uni-grams is extracted, consider first reducing each word to its lemma, so inflected forms are counted together and you get a more holistic picture of the document in question.
  * TFIDF - Many relevancy ranking algorithms are rooted in TFIDF (term frequency / inverse document frequency). TFIDF considers the frequency of a word in a document, the size of the document, the number of documents in the corpus containing the word, and the total number of documents in the corpus. Calculating the TFIDF score for each word in a document, and then setting a significance threshold, is a well-understood method of keyword extraction. Here's a pointer to a Perl program doing such work --> https://github.com/ericleasemorgan/reader/blob/master/bin/classify.pl
  * TextRank - This graph-based algorithm is modeled on PageRank, the algorithm behind Google, and it is probably what you want to use, especially since there is a handy-dandy Python library which implements it --> https://radimrehurek.com/gensim/summarization/keywords.html  Here is an example Python script:
  #!/usr/bin/env python
  # txt2keywords.py - given a file, output a list of keywords
  # note: gensim.summarization was removed in gensim 4.0, so this needs gensim 3.x
  # configure; increase or decrease to change the number of desired output words
  RATIO = 0.01
  # require
  from gensim.summarization import keywords
  import sys
  # sanity check; sys.exit writes the message to STDERR and returns a non-zero status
  if len( sys.argv ) != 2 :
      sys.exit( 'Usage: ' + sys.argv[ 0 ] + ' <file>' )
  # initialize; "filename" avoids shadowing a common built-in name
  filename = sys.argv[ 1 ]
  # slurp up the given file, and close it when done
  with open( filename, 'r' ) as handle :
      text = handle.read()
  # process each keyword; can't get much simpler
  for keyword in keywords( text, ratio=RATIO, split=True, lemmatize=True ) :
      print( keyword )
  # done
  sys.exit( 0 )
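For comparison, the first two approaches (frequencies and TFIDF) can be roughed out with nothing but Python's standard library. What follows is a minimal sketch, not a port of classify.pl. The function names, the tiny stop word list, and the simple regular-expression tokenizer are all assumptions made for illustration, and the TFIDF weighting is the plain textbook form (relative term frequency times the log of the inverse document frequency), one of several variants in use:

```python
#!/usr/bin/env python
# txt2scores.py - sketch of n-gram frequencies and TFIDF; standard library only
# (hypothetical script; names and stop words are illustrative assumptions)

import math
import re
from collections import Counter

# assumption: a tiny, illustrative stop word list; use a fuller one in practice
STOPWORDS = { 'a', 'an', 'and', 'are', 'in', 'is', 'it', 'of', 'on', 'the', 'to' }

def tokenize( text ) :
    """Lowercase the text and return its words, sans stop words."""
    words = re.findall( r"[a-z']+", text.lower() )
    return [ word for word in words if word not in STOPWORDS ]

def ngram_frequencies( tokens, n=1 ) :
    """Count each run of n adjacent tokens; stop words were already removed,
    so an n-gram may span a gap where a stop word used to be."""
    ngrams = zip( *( tokens[ i: ] for i in range( n ) ) )
    return Counter( ' '.join( ngram ) for ngram in ngrams )

def tfidf( documents ) :
    """Given a list of token lists, return one { word: score } dict per document."""
    # document frequency: the number of documents containing each word
    df = Counter( word for tokens in documents for word in set( tokens ) )
    scores = []
    for tokens in documents :
        counts = Counter( tokens )
        scores.append( {
            word: ( count / len( tokens ) ) * math.log( len( documents ) / df[ word ] )
            for word, count in counts.items()
        } )
    return scores

if __name__ == '__main__' :
    # a toy corpus, invented for this example
    corpus = [ 'The cat sat on the mat.', 'The dog sat on the log.', 'Cats and dogs are pets.' ]
    documents = [ tokenize( text ) for text in corpus ]
    # the most frequent bi-grams of the first document
    print( ngram_frequencies( documents[ 0 ], n=2 ).most_common( 3 ) )
    # the words of the first document, highest TFIDF score first
    for word, score in sorted( tfidf( documents )[ 0 ].items(), key=lambda item: -item[ 1 ] ) :
        print( word, round( score, 3 ) )
```

A word appearing in every document gets a score of zero, which is the point: it does not distinguish one document from another. To turn the scores into keywords, keep only the words whose score exceeds a threshold of your choosing. The sketch does no lemmatization, so "cat" and "cats" are counted separately.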
<plug>By the way, the Distant Reader does all of this sort of work, and more --> https://distantreader.org  Feed the reader a set of files, and it will compute keywords, extract parts-of-speech & named-entities, summarize your documents, etc. Sample outputs are here --> http://carrels.distantreader.org</plug>
-- 
Eric Lease Morgan
Digital Initiatives Librarian, Navari Family Center for Digital Scholarship
University of Notre Dame