On Jul 18, 2025, at 12:41 PM, Tara Calishain <[log in to unmask]> wrote:
> I have zero experience but I love this idea. I have been thinking a lot
> lately about "atomizing" topics into clouds of contextual items for purpose
> of exploring/assembling with different search methods, and your breaking
> things down into sentences fits right in.
Tara, you echo something I have come to believe, namely, that sentences are the smallest units of information in a narrative text. In my mind, words are merely data, but when words are associated with each other they become information.
That said, it is almost trivial to extract all the sentences from a plain text file. Here is something written in Python that uses the Natural Language Toolkit (NLTK) to do the heavy lifting:
# file2sentences.py - given a plain text file, output all of its sentences
# require
from nltk import sent_tokenize
from sys import argv, exit
from re import sub
# get input
if len( argv ) != 2 : exit( 'Usage: ' + argv[ 0 ] + " <file>" )
file = argv[ 1 ]
# read the given file and parse it into sentences
text = open(file).read()
sentences = sent_tokenize(text)
# apply rudimentary normalization to the sentences
sentences = [sentence.replace('\t', ' ') for sentence in sentences]
sentences = [sentence.replace('\r', ' ') for sentence in sentences]
sentences = [sentence.replace('\n', ' ') for sentence in sentences]
sentences = [sentence.replace('- ', '') for sentence in sentences]
sentences = [sub(' +', ' ', sentence) for sentence in sentences]
sentences = [sub('^ ', '', sentence) for sentence in sentences]
sentences = [sub(' $', '', sentence) for sentence in sentences]
# output
for sentence in sentences : print(sentence)
# done
exit()
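Note that NLTK's sentence tokenizer relies on its punkt models (punkt_tab in newer versions of the Toolkit); if they have not already been downloaded, then something like this will fetch them:

  python -c 'import nltk; nltk.download( "punkt" )'

From there, running the script against any plain text file and saving the result might look like the following, where walden.txt is merely a hypothetical input:

  python file2sentences.py walden.txt > walden.sentences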
If one were to create a database of sentences, then I believe the database would require at least three fields in order to be useful:
1. a field to be used as an identifier, and that could be
the name of the file whence the sentence came
2. a field containing an integer denoting the sentence's
ordinal (first, second, third, fourth, etc.)
3. a field for the sentence itself; duh!
For extra credit, the database could include a joined table of bibliographic information: authors, titles, dates, etc.
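For what it's worth, below is a minimal sketch of such a database implemented with Python's built-in sqlite3 module. The script's name, the database's name (sentences.db), and the field names are merely of my own choosing, and normalization of the sentences has been omitted for brevity:

# sentences2database.py - a sketch: parse a plain text file and save its sentences to SQLite
# require
import sqlite3
from nltk import sent_tokenize
from sys import argv, exit
# get input
if len( argv ) != 2 : exit( 'Usage: ' + argv[ 0 ] + " <file>" )
file = argv[ 1 ]
# create the database and its two tables
db = sqlite3.connect( 'sentences.db' )
db.execute( '''CREATE TABLE IF NOT EXISTS sentences (
  identifier TEXT,
  ordinal    INTEGER,
  sentence   TEXT )''' )
db.execute( '''CREATE TABLE IF NOT EXISTS bibliographics (
  identifier TEXT PRIMARY KEY,
  author     TEXT,
  title      TEXT,
  date       TEXT )''' )
# parse the given file into sentences and save each one along with its ordinal
text      = open( file ).read()
sentences = sent_tokenize( text )
for ordinal, sentence in enumerate( sentences, start=1 ) :
    db.execute( 'INSERT INTO sentences VALUES ( ?, ?, ? )', ( file, ordinal, sentence ) )
# clean up and done
db.commit()
db.close()
exit()

Populating the bibliographics table is then a matter of one more INSERT per file, and the two tables can be joined on the identifier field.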
Adding embeddings to the database is an extra step, but given the Python script and the database outline above, one can go a long, long way when it comes to taking a traditional back-of-the-book index to the next level. For example, one could use rudimentary counting and tabulating techniques to measure the frequency of words ("themes") in the sentences and illustrate how those words are used over the course of a text. Such is something many people want to know, namely, "To what degree is a given idea manifested in a text, and how does that idea ebb and flow?"
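As a trivial example of such counting and tabulating, and assuming the sentences.db database sketched above, the following script divides a text into ten more-or-less equal parts and counts how often a given word appears in each part. Again, the script's name and its inputs are hypothetical:

# theme2counts.py - a sketch: tabulate the frequency of a word ("theme") across a text
# require
import sqlite3
from re import findall, escape
from sys import argv, exit
# get input
if len( argv ) != 3 : exit( 'Usage: ' + argv[ 0 ] + " <identifier> <word>" )
identifier = argv[ 1 ]
word       = argv[ 2 ].lower()
# read all of the sentences associated with the given identifier, in order
db        = sqlite3.connect( 'sentences.db' )
rows      = db.execute( 'SELECT sentence FROM sentences WHERE identifier = ? ORDER BY ordinal', ( identifier, ) ).fetchall()
sentences = [ row[ 0 ] for row in rows ]
# divide the text into ten parts and count the word's occurrences in each part
size   = max( 1, len( sentences ) // 10 )
counts = []
for i in range( 0, len( sentences ), size ) :
    chunk = ' '.join( sentences[ i:i + size ] ).lower()
    counts.append( len( findall( r'\b' + escape( word ) + r'\b', chunk ) ) )
# output and done
print( counts )
exit()

Running something like 'python theme2counts.py walden.txt nature' would output a list of ten-or-so integers, and charting those integers would illustrate how the theme ebbs and flows over the course of the text.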
--
Eric Morgan <[log in to unmask]>