What distance measure do you suggest I use when implementing vector similarity search?
I have piles o' sentences. Almost more than I can count, literally. I have successfully looped through subsets of these sentences, vectorized them (think "indexed"), and stored the results in a Postgres database using an extension called pgvector. [1]
To query the database one submits strings. Unlike most database implementations, the queries require zero syntax. None. Example queries include:
* nature
* nature is was
* man is was men are were
* ahab peleg bildad mate stood cabin stranger arm
* what is the definition of social justice?
These queries are vectorized in the same way the original sentences were vectorized. The query is compared to all the vectors in the database, and the vectors computed as "closest" to the query are returned. In other words, I get back a set of sentences containing some or all of the words from the query. Example results might be:
* Nature is the fundamental environment in which we live.
* And then he said, "Man is beside himself with arrogance."
* The first mate looked at Ahab, which many call Pegleg, and stood in amazement.
* The definition of social justice is nuanced.
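The lookup described above amounts to a nearest-neighbor search, and a rough sketch of it may help frame the question. This is only an illustration: pgvector does the comparison inside Postgres, while the function below does it in plain Python. The name `nearest`, the three-dimensional vectors, and the "sentence A/B/C" labels are all made up for the example, and Euclidean distance is used only as a placeholder measure.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(query_vector, stored, k=2, distance=euclidean_distance):
    """Return the k (distance, sentence) pairs closest to query_vector.

    `stored` is a list of (sentence, vector) pairs; every vector is
    compared to the query, which is a brute-force scan.
    """
    scored = [(distance(query_vector, vector), sentence)
              for sentence, vector in stored]
    return sorted(scored)[:k]

# Hypothetical, tiny stand-ins for the vectorized sentences.
stored = [
    ("sentence A", [1.0, 0.0, 0.0]),
    ("sentence B", [0.0, 1.0, 0.0]),
    ("sentence C", [1.0, 1.0, 0.0]),
]

results = nearest([1.0, 0.2, 0.0], stored, k=2)
for dist, sentence in results:
    print(f"{dist:.3f}  {sentence}")
```

Swapping the `distance` argument for another function is the only change needed to try a different measure, which is exactly the choice at issue here.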
Herein lies the rub: there are many ways to compute "closest". Which ought I employ? Pgvector supports what seem to be the most common measures:
* Euclidean distance
* Manhattan distance
* Cosine distance
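For reference, the three measures can be written out in a few lines of plain Python. This is a minimal sketch of the textbook definitions, not of pgvector's internals; the function names and the two sample vectors are made up for illustration. As far as I know, pgvector exposes these measures as the SQL operators `<->` (Euclidean), `<+>` (Manhattan, in recent versions), and `<=>` (cosine), but check the documentation [1] for the version you have installed.

```python
import math

def euclidean_distance(a, b):
    """Square root of the summed squared differences (L2)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    """Sum of the absolute differences (L1)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    """One minus the cosine of the angle between the vectors.

    Undefined for an all-zero vector (division by zero).
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Two hypothetical sparse vectors pointing in the same direction;
# the second is simply twice as long as the first.
query = [1.0, 0.0, 2.0, 0.0]
sentence = [2.0, 0.0, 4.0, 0.0]

print(euclidean_distance(query, sentence))
print(manhattan_distance(query, sentence))
print(cosine_distance(query, sentence))
```

The sample pair shows the main practical difference: Euclidean and Manhattan distance grow with vector length, while cosine distance looks only at direction and so treats these two vectors as essentially identical.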
There are many Internet postings comparing and contrasting such distance measures. For example, see something from a site called Medium. [2]
Despite my reading, I'm stymied. Which measure should I employ? All of my vectors are the same size, but since my vectors are "sparse", meaning they contain many zero values, I think I'm leaning away from Cosine distance and towards Euclidean distance. Finally, to be honest, no matter what measure I employ, the results are very similar. Go figure.
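One possible explanation for the similar results, offered as a guess: if the vectors happen to be normalized to unit length (nothing above says they are), then squared Euclidean distance equals exactly twice the cosine distance, so the two measures must rank candidates in the same order. The sketch below checks that identity on made-up vectors; `normalize` and the sample data are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical sparse vectors, all scaled to unit length.
query = normalize([1.0, 0.0, 2.0, 0.0])
candidates = [normalize(v) for v in (
    [2.0, 0.0, 1.0, 0.0],
    [0.0, 3.0, 0.0, 1.0],
    [1.0, 1.0, 2.0, 0.0],
)]

# Rank the candidate indexes under each measure.
by_euclidean = sorted(range(len(candidates)),
                      key=lambda i: euclidean_distance(query, candidates[i]))
by_cosine = sorted(range(len(candidates)),
                   key=lambda i: cosine_distance(query, candidates[i]))

print(by_euclidean, by_cosine)

# For unit vectors: euclidean**2 == 2 * cosine_distance (up to rounding).
for c in candidates:
    gap = euclidean_distance(query, c) ** 2 - 2 * cosine_distance(query, c)
    print(f"{gap:.12f}")
```

If the stored vectors are not unit length, the identity no longer holds and the rankings can diverge, so comparing the vector norms in the database would be a quick way to test this guess.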
Do you have any suggestions?
[1] https://github.com/pgvector/pgvector
[2] https://medium.com/advanced-deep-learning/understanding-vector-similarity-b9c10f7506de
--
Eric Morgan
Navari Family Center for Digital Scholarship
Hesburgh Libraries
University of Notre Dame