I did a little bit of work in this direction last year, where I tokenized a text into sentences and used a BERT model to create embeddings for each sentence. Then I took a predefined large dictionary of related terms (i.e., all related to the same general topic) and embedded each of those terms. I then used cosine similarity between the sentence embeddings and the term embeddings to try to identify sentences related to the topic.
The results were interesting and often correct, but not nearly accurate enough to use in a meaningful way. Granted, this was a zero-shot match using an off-the-shelf model (“bert-base-uncased”); fine-tuning on a training set would likely have yielded better results. However, I ended up going a different route for this project that gave me more precise results, so I didn’t explore the embeddings approach much further.
It’s an interesting topic for discussion, though, and I think there’s definitely some promise there!
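For the curious, here is a minimal sketch of the kind of pipeline described above, assuming the Hugging Face transformers library and mean-pooled BERT embeddings; the topic terms, input file, and threshold are placeholders, not the ones I actually used:

# sentence_topic_match.py - a sketch: embed sentences and topic terms
# with BERT, then flag sentences whose cosine similarity to any term
# exceeds a threshold; the terms and threshold below are illustrative only
import torch
from nltk import sent_tokenize
from transformers import AutoTokenizer, AutoModel

MODEL = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)
model.eval()

def embed(texts):
    # tokenize a batch of strings and mean-pool BERT's last hidden state
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        output = model(**batch)
    mask = batch['attention_mask'].unsqueeze(-1)
    summed = (output.last_hidden_state * mask).sum(dim=1)
    return summed / mask.sum(dim=1)

# placeholder topic dictionary, threshold, and input file
terms = ['astronomy', 'telescope', 'planet']
threshold = 0.60
text = open('example.txt').read()

# embed the sentences and the topic terms
sentences = sent_tokenize(text)
sentence_vectors = embed(sentences)
term_vectors = embed(terms)

# cosine similarity between every sentence and every term
similarities = torch.nn.functional.cosine_similarity(
    sentence_vectors.unsqueeze(1), term_vectors.unsqueeze(0), dim=-1)

# report sentences whose best match exceeds the threshold
for sentence, scores in zip(sentences, similarities):
    if scores.max().item() >= threshold:
        print(round(scores.max().item(), 2), sentence, sep='\t')

Mean pooling is just one common way to turn token-level BERT output into a single sentence vector; a model trained specifically for sentence similarity would likely do better out of the box.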
Erin
From: Code for Libraries <[log in to unmask]> on behalf of Eric Lease Morgan <[log in to unmask]>
Date: Friday, July 18, 2025 at 1:19 PM
To: [log in to unmask] <[log in to unmask]>
Subject: Re: [CODE4LIB] embeddings to query full text
On Jul 18, 2025, at 12:41 PM, Tara Calishain <[log in to unmask]> wrote:
> I have zero experience but I love this idea. I have been thinking a lot
> lately about "atomizing" topics into clouds of contextual items for purpose
> of exploring/assembling with different search methods, and your breaking
> things down into sentences fits right in.
Tara, you echo something I have come to believe, namely, sentences are the smallest unit of information in a narrative text. In my mind, words are merely data, but when words are associated with each other they become information.
That said, it is almost trivial to extract all the sentences from a plain text file. Here is something written in Python that uses the Natural Language Toolkit to do the heavy lifting:
# file2sentences.py - given a plain text file, output all of its sentences
# require
from nltk import sent_tokenize
from sys import argv, exit
from re import sub
# get input
if len( argv ) != 2 : exit( 'Usage: ' + argv[ 0 ] + " <file>" )
file = argv[ 1 ]
# read the given file and parse it into sentences
text = open(file).read()
sentences = sent_tokenize(text)
# apply rudimentary normalization to the sentences
sentences = [sentence.replace('\t', ' ') for sentence in sentences]
sentences = [sentence.replace('\r', ' ') for sentence in sentences]
sentences = [sentence.replace('\n', ' ') for sentence in sentences]
sentences = [sentence.replace('- ', '') for sentence in sentences]
sentences = [sub(' +', ' ', sentence) for sentence in sentences]
sentences = [sub('^ ', '', sentence) for sentence in sentences]
sentences = [sub(' $', '', sentence) for sentence in sentences]
# output
for sentence in sentences : print(sentence)
# done
exit()
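Assuming NLTK and its sentence tokenizer data ("punkt") have been installed, the script can be run against a plain text file and its output saved, something like this:

  python file2sentences.py example.txt > example.sentences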
If one were to create a database of sentences, then I believe the database would require at least three fields in order to be useful:
1. a field to be used as an identifier, and that could be the name of the file whence the sentence came
2. a field containing an integer denoting the sentence's ordinal (first, second, third, fourth, etc.)
3. a field for the sentence itself; duh!
For extra credit, the database could include a joined table for bibliographics: authors, titles, dates, etc.
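By way of illustration, here is a rough sketch of such a schema using Python's built-in sqlite3 module; the database, table, and column names are merely suggestions, and the input file is a placeholder:

# sentences2db.py - a sketch of a sentences database plus a joined
# table for bibliographics; all names here are only suggestions
import sqlite3
from nltk import sent_tokenize

connection = sqlite3.connect('sentences.db')
connection.executescript('''
    CREATE TABLE IF NOT EXISTS items (
        item_id TEXT PRIMARY KEY,  -- identifier; e.g., the source file's name
        author  TEXT,
        title   TEXT,
        date    TEXT
    );
    CREATE TABLE IF NOT EXISTS sentences (
        item_id  TEXT REFERENCES items(item_id),
        ordinal  INTEGER,          -- the sentence's position in the text
        sentence TEXT,             -- the sentence itself
        PRIMARY KEY (item_id, ordinal)
    );
''')

# parse a given file into sentences and add them to the database
file = 'example.txt'
sentences = sent_tokenize(open(file).read())
connection.executemany(
    'INSERT OR REPLACE INTO sentences VALUES (?, ?, ?)',
    [(file, ordinal, sentence) for ordinal, sentence in enumerate(sentences, start=1)])
connection.commit()

The pair of item_id and ordinal makes a natural primary key, and the items table carries the bibliographics.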
Adding embeddings to the database is an extra step, but given the Python script and database outline, one can go a long, long way when it comes to taking a traditional back-of-the-book index to the next level. For example, one could use rudimentary counting and tabulating techniques to measure the frequency of words ("themes") in the sentences and illustrate how those words are used over the course of the text. Such is something many people want to know, namely, "To what degree is a given idea manifested in a text, and how does that idea ebb and flow?"
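Here is one rough way to do that sort of counting against the sentences table sketched above; the theme words and file name are placeholders:

# themes2counts.py - a sketch: count how often a handful of "theme"
# words appear in each tenth of a text, using the sentences table above
import sqlite3
from collections import Counter

# placeholder themes and item
themes = ['love', 'war', 'death']
item = 'example.txt'

connection = sqlite3.connect('sentences.db')
rows = connection.execute(
    'SELECT ordinal, sentence FROM sentences WHERE item_id = ? ORDER BY ordinal',
    (item,)).fetchall()

# tabulate theme frequencies by decile so the ebb and flow is visible
counts = Counter()
for ordinal, sentence in rows:
    decile = (ordinal - 1) * 10 // len(rows)
    words = sentence.lower().split()
    for theme in themes:
        counts[(theme, decile)] += words.count(theme)

# output a simple table: theme, decile, frequency
for theme in themes:
    for decile in range(10):
        print(theme, decile + 1, counts[(theme, decile)], sep='\t')

Plotting the resulting frequencies by decile would then show where in the text each theme waxes and wanes.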
--
Eric Morgan <[log in to unmask]>