Sounds like a classic use case for the tf–idf measure.
For those with no background in information retrieval, see
...let us be heard from red core to black sky
On Sat, 11 Jul 2020 at 06:58, Eric Lease Morgan wrote:
> To stop word, or not to stop word? That is the question.
> Seriously, I am working with a team of people to index and analyze a set of 65,000 to 100,000 full-text scientific journal articles, all on the topic of COVID-19. We have indexed the data set and created subsets of the data, affectionately called "study carrels". Each study carrel is characterized by a short name and a few bibliographic-like features. Within each study carrel are a number of different analyses, such as n-gram frequencies, part-of-speech enumerations, and topic modeling.
> Each article in each carrel also has a set of "keywords" extracted from it. These keywords are computed, and for all intents and purposes, the computation is pretty good. For example, see the set of keywords from a particular carrel (linked below). Unfortunately, many of the study carrels have very similar sets of keywords. If you peruse the set of all the carrels, you see a preponderance of keywords such as "cell", "covid-19", "SARS", and "patient". These words occur so frequently that they become (almost) meaningless.
> My question to y'all is, "When and where should I add something like 'cell', or better yet 'covid-19', to my list of stopwords?"
>  data set of articles - https://www.semanticscholar.org/cord19
>  study carrels - https://cord.distantreader.org/carrels/INDEX.HTM
>  example keywords - https://cord.distantreader.org/carrels/kaggle-risk-factors/index.htm#keywords
> Eric Morgan
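
To make the tf–idf suggestion concrete, here is a minimal sketch in plain Python. It assumes each study carrel can be reduced to a flat list of its extracted keywords; the carrel names and keyword lists below are made up for illustration and are not taken from the actual data set. The function names (tf_idf, stopword_candidates) and the 0.9 threshold are likewise my own choices, not anything the Distant Reader provides. The idea is that a keyword appearing in every carrel gets an inverse document frequency of log(1) = 0, so it drops out of the ranking without being hand-added to a stopword list.

```python
import math
from collections import Counter

def document_frequency(docs):
    """Count, for each term, how many documents (carrels) contain it."""
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    return df

def tf_idf(docs):
    """Return {doc_name: {term: score}} using tf = count/len and idf = log(N/df)."""
    n_docs = len(docs)
    df = document_frequency(docs)
    scores = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[name] = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return scores

def stopword_candidates(docs, max_df_ratio=0.9):
    """Terms found in at least max_df_ratio of the documents (threshold is an assumption)."""
    n_docs = len(docs)
    df = document_frequency(docs)
    return sorted(t for t, c in df.items() if c / n_docs >= max_df_ratio)

# Hypothetical carrels and keywords, for illustration only.
carrels = {
    "carrel-a": ["covid-19", "patient", "cell", "ventilator", "ventilator"],
    "carrel-b": ["covid-19", "patient", "cell", "vaccine"],
    "carrel-c": ["covid-19", "patient", "masks", "transmission"],
}

if __name__ == "__main__":
    print("stopword candidates:", stopword_candidates(carrels))
    for name, terms in tf_idf(carrels).items():
        # Rank each carrel's keywords by how distinctive they are.
        ranked = sorted(terms.items(), key=lambda kv: kv[1], reverse=True)
        print(name, ranked[:3])
```

So one possible answer to "when and where": compute document frequency across all the carrels after keyword extraction, and either weight keywords by tf–idf when displaying them or treat anything above the threshold as a corpus-specific stopword. Which of the two is better probably depends on whether the common words are still useful inside the other analyses.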