To what degree have people here explored the use of embeddings to query full text? I have done some work in this regard, and I have found the results to be very informative.
Kinda sorta, I want to address questions to a corpus and get back answers. For example, given Jane Austen's Emma, I want to know, "Who is Emma?" Alternatively, I want to create a corpus of books on a given topic -- such as epistemology -- and ask the question, "What is knowledge?"
To address such things, I have created a system that:
1. extracts all the sentences from each item in a given corpus
2. saves the sentences as records in a database
3. loops through each sentence, vectorizes ("indexes") it,
and saves the result back to the database
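The three indexing steps above can be sketched as follows. This is only a minimal illustration, not the system described in this message: the table layout, the function names, and the naive regular-expression sentence splitter are all assumptions, and `toy_vectorize` is a word-count stand-in for whatever embedding model is actually used.

```python
import json
import re
import sqlite3
from collections import Counter


def split_sentences(text):
    """Naively split text on ., !, or ? followed by whitespace (an assumption)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def toy_vectorize(sentence):
    """Stand-in for a real embedding model: a sparse word-count vector."""
    return dict(Counter(re.findall(r"[a-z']+", sentence.lower())))


def index_corpus(connection, items):
    """Extract sentences, save them as records, then vectorize each record."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS sentences "
        "(id INTEGER PRIMARY KEY, item TEXT, sentence TEXT, vector TEXT)"
    )
    # Steps 1 and 2: extract the sentences and save them as records.
    for name, text in items.items():
        for sentence in split_sentences(text):
            connection.execute(
                "INSERT INTO sentences (item, sentence) VALUES (?, ?)",
                (name, sentence),
            )
    # Step 3: loop through the unindexed records and save each vector back.
    rows = connection.execute(
        "SELECT id, sentence FROM sentences WHERE vector IS NULL"
    ).fetchall()
    for row_id, sentence in rows:
        connection.execute(
            "UPDATE sentences SET vector = ? WHERE id = ?",
            (json.dumps(toy_vectorize(sentence)), row_id),
        )
    connection.commit()


connection = sqlite3.connect(":memory:")
index_corpus(
    connection,
    {"emma": "Emma spoke for her. Emma could not forgive her. Poor little Emma!"},
)
```

Storing the vector as JSON text keeps the sketch dependency-free; a real implementation would more likely store dense vectors in a binary column or a dedicated vector index.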
I can then:
1. garner a query
2. vectorize the query
3. search the database
4. return the N closest matching sentences
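The search steps above might look something like the following sketch. It assumes cosine similarity as the measure of "closest", which this message does not actually specify, and it again uses a word-count `toy_vectorize` in place of a real embedding model. The `records` list stands in for rows read back from the database.

```python
import math
import re
from collections import Counter


def toy_vectorize(text):
    """Stand-in for a real embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, records, n):
    """Vectorize the query and return the n sentences closest to it."""
    query_vector = toy_vectorize(query)
    ranked = sorted(
        records,
        key=lambda record: cosine(query_vector, record[1]),
        reverse=True,
    )
    return [sentence for sentence, _ in ranked[:n]]


# Hypothetical (sentence, vector) records, as if read back from the database.
sentences = ["Emma could not forgive her.", "Poor little Emma!", "No more is Emma."]
records = [(sentence, toy_vectorize(sentence)) for sentence in sentences]

results = search("emma", records, 2)
print(" ".join(results))  # the "paragraph N sentences long"
```

Comparing the query against every record is fine for a single book; a larger corpus would probably call for an approximate nearest-neighbor index instead.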
The result is a paragraph N sentences long, and I then use any combination of the following to make sense of -- read and understand -- the results:
1. consume the paragraph using the traditional reading process
2. reformat the paragraph into smaller paragraphs, which is akin
to data science clustering
3. apply a large-language model to summarize the paragraph
4. apply retrieval-augmented generation (RAG) to the results and
ask a specific question
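For the last option, the retrieved sentences become the context for a question. The sketch below shows only the prompt-assembly half of that idea; the prompt wording and the function name `build_rag_prompt` are assumptions, and the call to a language model is left out because this message does not say which model or library is used.

```python
def build_rag_prompt(question, sentences):
    """Combine retrieved sentences and a question into one prompt string."""
    context = "\n".join(f"- {sentence}" for sentence in sentences)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical input: sentences returned by an earlier search.
prompt = build_rag_prompt(
    "who is emma",
    ["Emma spoke for her.", "Emma could not forgive her.", "Poor little Emma!"],
)
print(prompt)
```

The resulting string would then be handed to whatever model does the elaborating. Restricting the model to the supplied context is one common way to keep answers tied to the retrieved sentences.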
The whole thing has been a whole lot of fun. For example, here is an abbreviated interaction I had with my system regarding Emma:
# search Emma, and return 16 sentences closest to the query "emma"
$ ./bin/search.sh emma emma 16
With all dear Emma's little faults, she is an excellent creature.
I have a very sincere interest in Emma. Emma will be happy to
entertain you. Emma spoke for her. Emma could not forgive her.
repeated Emma. No more is Emma. Poor little Emma! " Emma had
done. So Emma thought, at least. (turning to Emma.) Emma was in
no danger of forgetting. " Emma seriously hoped she would. "
Emma was most sincerely interested. " Emma could say no more. "
Emma could not doubt.
# use an LLM to summarize the result
$ ./bin/summarize.sh
Overall, this passage highlights the complexities of human
relationships and the importance of sincerity and genuine
interest in building meaningful connections with others.
# use the result as the content for a RAG query
$ ./bin/elaborate.sh 'who is emma'
Based on the quotes provided, it seems that Emma is a unique and
fascinating individual. Here are some possible characteristics of
Emma:
1. caring and empathetic
2. interested in others
3. willing to entertain
4. passionate
5. forgiving
6. memorable
7. reflective
8. optimistic
9. polite
Of course, these are just some possible interpretations based on
the given quotes. The true nature of Emma may be much more
complex and multifaceted!
While I do not assert the results are correct, I do assert the results are more than plausible. They are excellent pieces of food for thought. They are hints and pointers for further investigation.
I have used this system to read all sorts of things on topics like philosophy, science, religion, government, and medicine. I have used this system to read, understand, and introduce myself to Jung, Marx, Plato, Twain, and Locke. Through the process I have learned of different definitions of knowledge, the many forms of justice, and how the definition of art has changed over time.
Now, imagine this. Imagine all the books in your library have been digitized. Imagine each book is associated with a database, and the database is a list of each sentence in the book. Now imagine querying the book and getting back all the sentences -- not page numbers -- matching the query. In my mind, such a thing is very much like a back-of-the-book index but taken to the next level.
Finally, I do not advocate this sort of thing as a replacement for traditional reading. Just like any tool, it can be used improperly. On the other hand, it could address the problem of information overload. I can just hear students saying, "I have done the most correct bibliographic database search, and I have identified two hundred relevant articles on my topic. How do I read them!?"
What experiences do y'all have with this sort of technology, and to what degree do you believe it is something feasible for libraries to implement?
--
Eric Morgan