On Jul 21, 2025, at 5:57 PM, Wolfe, Erin <[log in to unmask]> wrote:
> I did a little bit of work in this direction last year, where I tokenized a text into sentences and used a BERT model to create embeddings for each sentence. Then I took a predefined large dictionary of related terms (i.e., all related to the same general topic) and embedded each of these terms. I then used a cosine similarity check to try to identify sentences that were related to the topic based on embedding similarity.
>
> The results were interesting and often correct, but not nearly accurate enough to use them in a meaningful way. Granted, this was using a zero-shot untrained match (“bert-base-uncased”). Likely fine tuning this on a training set of data would have yielded better results. However, I ended up going a different route for this project that gave me more precise results, so I didn’t explore the embeddings approach much further.
>
> It’s an interesting topic for discussion, though, and I think there’s definitely some promise there!
>
> --
> Erin
Interesting!
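For what it's worth, here is a minimal sketch of the sentence-embedding approach you describe, as I understand it. It assumes bert-base-uncased with mean pooling over the last hidden state; the input file, lexicon, and similarity threshold are just placeholders, not details from your project:

# a minimal sketch of sentence embeddings + cosine similarity;
# the file name, lexicon, and threshold below are placeholders
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    # return one mean-pooled vector per input string
    inputs = tokenizer(texts, padding=True, truncation=True,
                       return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    mask = inputs["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# naive sentence split; a real tokenizer (NLTK, spaCy) would do better
text = open("text.txt").read()
sentences = [s.strip() for s in text.split(".") if s.strip()]
lexicon = ["economy", "solitude", "nature"]   # the topic terms

sentence_vectors = embed(sentences)
term_vectors = embed(lexicon)

# cosine similarity of every sentence against every term
scores = torch.nn.functional.cosine_similarity(
    sentence_vectors.unsqueeze(1), term_vectors.unsqueeze(0), dim=-1)

# keep sentences whose best-matching term clears a hand-picked threshold
for sentence, score in zip(sentences, scores.max(dim=1).values):
    if score > 0.6:
        print(round(score.item(), 3), "\t", sentence)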
With my recent interest in lexicons, I thought about applying zero-shot classification to content. Here's how:
1. articulate a lexicon
2. create a collection of content (documents, paragraphs, sentences, etc.)
3. use zero-shot classification to classify the content, using the lexicon as the classification system
I have done this a few times, most recently with a set of 4,500 reference questions; working with a colleague, I classified the questions with the purpose of understanding the types of questions being asked. To some degree, the same process could be applied to title/abstract combinations from journal articles.
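For the curious, here is a rough sketch of the three steps above in Python, assuming the Hugging Face transformers library. The lexicon and questions are made-up examples, not the ones from the reference-question project:

# a rough sketch of zero-shot classification with a lexicon as labels;
# the lexicon and questions are illustrative only
from transformers import pipeline

# step #1: articulate a lexicon
lexicon = ["circulation", "research help", "technology support", "policy"]

# step #2: create a collection of content (here, a few reference questions)
questions = [
    "How do I renew the books I have checked out?",
    "Can you help me find articles about medieval manuscripts?",
    "Why can't I connect to the wireless network in the reading room?",
]

# step #3: classify the content using the lexicon as the classification system
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

for question in questions:
    result = classifier(question, candidate_labels=lexicon)
    print(result["labels"][0], "|", question)   # print the top-scoring label

The nice thing about this approach is that the "classification system" is nothing more than a list of words and phrases, so it can be revised and re-run as often as desired.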
Like all things, the process was not perfect. That said, it was very insightful and, IMHO, can be seen as a supplement to more traditional analysis processes.
--
Eric Morgan <[log in to unmask]>