I have zero experience but I love this idea. I have been thinking a lot
lately about "atomizing" topics into clouds of contextual items for the purpose
of exploring/assembling with different search methods, and your breaking
things down into sentences fits right in.
On Fri, Jul 18, 2025 at 12:35 PM Eric Lease Morgan <
[log in to unmask]> wrote:
> To what degree have people here explored the use of embeddings to query
> full text? I have done some work in this regard, and I have found the
> results to be very informative.
>
> Kinda sorta, I want to address questions to a corpus and get back answers.
> For example, given Jane Austen's Emma, I want to know, "Who is Emma?"
> Alternatively, I want to create a corpus of books on a given topic -- such
> as epistemology -- and ask the question, "What is knowledge?"
>
> To address such things, I have created a system that:
>
> 1. extracts all the sentences from each item in a given corpus
> 2. saves the sentences as records in a database
> 3. loops through the sentences, vectorizes ("indexes") each one,
> and saves the results back to the database
>
> I can then:
>
> 1. garner a query
> 2. vectorize the query
> 3. search the database
> 4. return the N closest matching sentences
>
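The indexing and search steps listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the system described in the message: the `embed` function is a toy hashed bag-of-words stand-in for whatever embedding model is actually used, the SQLite schema is hypothetical, and the names `split_sentences`, `index`, and `search` are invented for the example.

```python
import json
import math
import re
import sqlite3
from collections import Counter

DIMENSIONS = 64  # hypothetical vector size for the toy embedding


def split_sentences(text):
    """Naive splitter: break after ., !, or ? when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def embed(sentence):
    """Toy stand-in for a real embedding model: hashed bag of words, unit length."""
    vector = [0.0] * DIMENSIONS
    for word, count in Counter(re.findall(r"[a-z']+", sentence.lower())).items():
        # A simple stable hash (not Python's salted hash()) keeps runs reproducible.
        bucket = sum(ord(c) * (i + 1) for i, c in enumerate(word)) % DIMENSIONS
        vector[bucket] += count
    norm = math.sqrt(sum(v * v for v in vector)) or 1.0
    return [v / norm for v in vector]


def index(connection, text):
    """Extract sentences, save them as records, then vectorize and save the vectors."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS sentences "
        "(id INTEGER PRIMARY KEY, sentence TEXT, embedding TEXT)"
    )
    connection.executemany(
        "INSERT INTO sentences (sentence) VALUES (?)",
        [(s,) for s in split_sentences(text)],
    )
    rows = connection.execute(
        "SELECT id, sentence FROM sentences WHERE embedding IS NULL"
    ).fetchall()
    for row_id, sentence in rows:
        connection.execute(
            "UPDATE sentences SET embedding = ? WHERE id = ?",
            (json.dumps(embed(sentence)), row_id),
        )
    connection.commit()


def search(connection, query, n):
    """Vectorize the query and return the n closest sentences by cosine similarity."""
    query_vector = embed(query)
    scored = []
    for sentence, embedding in connection.execute(
        "SELECT sentence, embedding FROM sentences"
    ):
        vector = json.loads(embedding)
        # Both vectors are unit length, so the dot product is the cosine similarity.
        score = sum(a * b for a, b in zip(query_vector, vector))
        scored.append((score, sentence))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in scored[:n]]


connection = sqlite3.connect(":memory:")
index(connection, "Emma spoke for her. The weather was fine. Poor little Emma!")
results = search(connection, "emma", 2)
print(" ".join(results))
```

Joining the returned sentences with spaces gives the "paragraph N sentences long" mentioned in the message. A real embedding model would match on meaning rather than shared words, and a vector-aware database would avoid the full scan done here.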
> The result is a paragraph N sentences long, and now I use any combination
> of the following to make sense of -- read and understand -- the results:
>
> 1. consume the paragraph using the traditional reading process
> 2. reformat the paragraph into smaller paragraphs, which is akin
> to data science clustering
> 3. apply a large-language model to summarize the paragraph
> 4. apply retrieval-augmented generation (RAG) to the results and
> ask a specific question
>
> The whole thing has been a whole lot of fun. For example, here is an
> abbreviated interaction I had with my system regarding Emma:
>
> # search Emma, and return 16 sentences closest to the query "emma"
> $ ./bin/search.sh emma emma 16
>
> With all dear Emma's little faults, she is an excellent creature.
> I have a very sincere interest in Emma. Emma will be happy to
> entertain you. Emma spoke for her. Emma could not forgive her.
> repeated Emma. No more is Emma. Poor little Emma! " Emma had
> done. So Emma thought, at least. (turning to Emma.) Emma was in
> no danger of forgetting. " Emma seriously hoped she would. "
> Emma was most sincerely interested. " Emma could say no more. "
> Emma could not doubt.
>
>
> # use an LLM to summarize the result
> $ ./bin/summarize.sh
>
> Overall, this passage highlights the complexities of human
> relationships and the importance of sincerity and genuine
> interest in building meaningful connections with others.
>
>
> # use the result as the content for a RAG query
> $ ./bin/elaborate.sh 'who is emma'
>
> Based on the quotes provided, it seems that Emma is a unique and
> fascinating individual. Here are some possible characteristics of
> Emma:
>
> 1. caring and empathetic
> 2. interested in others
> 3. willing to entertain
> 4. passionate
> 5. forgiving
> 6. memorable
> 7. reflective
> 8. optimistic
> 9. polite
>
> Of course, these are just some possible interpretations based on
> the given quotes. The true nature of Emma may be much more
> complex and multifaceted!
>
>
> While I do not assert the results are correct, I do assert the results are
> more than plausible. They are excellent pieces of food for thought. They
> are hints and pointers for further investigation.
>
> I have used this system to read all sorts of things on topics like
> philosophy, science, religion, government, and medicine. I have used this
> system to read, understand, and introduce myself to Jung, Marx, Plato,
> Twain, and Locke. Through the process I have learned of different
> definitions of knowledge, the many forms of justice, and how the definition
> of art has changed over time.
>
> Now, imagine this. Imagine all the books in your library have been
> digitized. Imagine each book is associated with a database, and the
> database is a list of each sentence in the book. Now imagine querying the
> book and getting back all the sentences -- not page numbers -- matching the
> query. In my mind, such a thing is very much like a back-of-the-book index
> but taken to the next level.
>
> Finally, I do not advocate this sort of thing as a replacement for
> traditional reading. Just like any tool, it can be used improperly. On the
> other hand, it could address the problem of information overload. I can
> just hear students saying, "I have done the most correct bibliographic
> database search, and I have identified two hundred relevant articles on my
> topic. How do I read them!?"
>
> What experiences do y'all have with this sort of technology, and to what
> degree do you believe it is something feasible for libraries to implement?
>
> --
> Eric Morgan <[log in to unmask]>
>