On Oct 22, 2020, at 2:25 PM, Edward M. Corrado <[log in to unmask]> wrote:
> I have a set of just over 60,000 theses and dissertations abstracts that I
> want to automatically create keywords/topics from. Does anyone have any
> recommendations for text mining or other tools to start with?
I do this sort of thing on a regular basis, and I use two Python libraries/modules:
1. textacy.ke.scake
2. textacy.ke.yake
Textacy is built on top of another library called "spaCy".
To use the libraries, one:
1. gets a string
2. creates a spaCy doc object from the string
3. applies the scake or yake methods to the object
4. gets back a keyword (or phrase) plus a score
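For example, here is a minimal sketch of those four steps; the sample string and the value of topn are merely illustrative, and textacy 0.10.x plus spaCy's small English model are assumed to be installed:

  # a minimal sketch of the four steps above
  import spacy
  from textacy.ke.yake import yake

  # step 1: get a string
  text = 'This thesis applies machine learning to the classification of theses and dissertations.'

  # step 2: create a spaCy doc object from the string
  model = spacy.load( 'en_core_web_sm' )
  doc = model( text )

  # steps 3 and 4: apply the yake method and get back keywords plus scores
  for keyword, score in yake( doc, topn=5 ) :
      print( keyword, score )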
Attached is a script which takes a file as input and outputs a tab-delimited stream of keywords/phrases.
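For example, assuming the script is saved as txt2keywords.py and given a (hypothetical) file of an abstract named abstract.txt, it might be run like this:

  python txt2keywords.py abstract.txt > keywords.tsv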
--
Eric Morgan
#!/usr/bin/env python
# txt2keywords.py - given a file, output a tab-delimited list of keywords
# configure
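# NOTE: since TOPN is a float less than one, textacy interprets it as a
# proportion of the candidate terms rather than an absolute count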
TOPN = 0.005
MODEL = 'en_core_web_sm'
# require
import textacy.preprocessing
from textacy.ke.scake import scake
from textacy.ke.yake import yake
import spacy
import os
import sys
# sanity check
if len( sys.argv ) != 2 :
    sys.stderr.write( 'Usage: ' + sys.argv[ 0 ] + " <file>\n" )
    sys.exit( 1 )
# initialize
file = sys.argv[ 1 ]
# open the given file and unwrap it
with open( file ) as f :
    text = f.read()
text = textacy.preprocessing.normalize.normalize_quotation_marks( text )
text = textacy.preprocessing.normalize.normalize_hyphenated_words( text )
text = textacy.preprocessing.normalize.normalize_whitespace( text )
# compute the identifier
id = os.path.basename( os.path.splitext( file )[ 0 ] )
# initialize model
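# spaCy refuses to process texts longer than its max_length, so size the
# limit to fit the given text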
maximum = len( text ) + 1
model = spacy.load( MODEL, max_length=maximum )
doc = model( text )
# output a header
print( "id\tkeyword" )
# track found keywords to avoid duplicates
keywords = set()
# process and output each keyword with yake, which will produce unigrams
for keyword, score in yake( doc, topn=TOPN ) :
    if keyword not in keywords :
        print( "\t".join( [ id, keyword ] ) )
        keywords.add( keyword )
# process and output each keyword with scake, which will typically produce
# keyphrases; removing lemmatization with normalize=None seems to produce
# better results
for keyword, score in scake( doc, normalize=None, topn=TOPN ) :
    if keyword not in keywords :
        print( "\t".join( [ id, keyword ] ) )
        keywords.add( keyword )
# done
sys.exit()