Dear Eric,
Think of the distance measures in terms of Lp-norms ( https://en.wikipedia.org/wiki/Lp_space ):
Euclidean and Cosine distances are both basically L2 norms, while the Manhattan distance is the L1-norm. For sparse settings you typically want to use an L1-norm (as in mathematical optimization); search for "L1-norm" and "sparsity" for explanations. An alternative L1-like distance to try would be the Canberra distance. Then again, all of this depends heavily on the numerical algorithms employed and their configurations.
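To make the comparison concrete, here is a minimal sketch of the four measures in plain Python. The function names are my own and are not taken from pgvector or any other library; the Canberra version skips coordinates where both entries are zero, which is one common convention and an assumption on my part.

```python
import math

def minkowski(a, b, p):
    """L_p distance between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def manhattan(a, b):
    # L1-norm of the difference vector
    return minkowski(a, b, 1)

def euclidean(a, b):
    # L2-norm of the difference vector
    return minkowski(a, b, 2)

def cosine_distance(a, b):
    # 1 - cos(angle); uses the L2-norms of a and b for scaling
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def canberra(a, b):
    # L1-like: each coordinate difference is scaled by |x| + |y|;
    # coordinates where both entries are zero contribute nothing
    total = 0.0
    for x, y in zip(a, b):
        denom = abs(x) + abs(y)
        if denom > 0:
            total += abs(x - y) / denom
    return total
```

Note that cosine_distance is undefined for an all-zero vector (division by zero), so such vectors would need to be filtered out beforehand.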
Best
Christian Himpe
PS: ArcadeDB has an overview of similarity measures: https://docs.arcadedb.com/#similarity
________________________________________
From: Code for Libraries <[log in to unmask]> on behalf of Eric Lease Morgan <[log in to unmask]>
Sent: Friday, 16 May 2025 15:23:28
To: [log in to unmask]
Subject: [CODE4LIB] distance measures in vector similarity search
What distance measure do you suggest I use when implementing vector similarity search?
I have piles o' sentences. Almost more than I can count, literally. I have successfully looped through subsets of these sentences, vectorized them (think "indexed"), and stored the result in a Postgres database through the use of an extension called pgvector. [1]
To query the database one submits strings. Unlike most database implementations, the queries require zero syntax. None. Example queries include:
* nature
* nature is was
* man is was men are were
* ahab peleg bildad mate stood cabin stranger arm
* what is the definition of social justice?
These queries are vectorized in the same way the original sentences were vectorized. The query is compared to all the vectors in the database, and the vectors computed as "closest" to the query are returned. In other words, I get back a set of sentences containing some or all of the words from the query. Example results might be:
* Nature is the fundamental environment in which we live.
* And then he said, "Man is beside himself with arrogance."
* The first mate looked at Ahab, which many call Pegleg, and stood in amazement.
* The definition of social justice is nuanced.
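The compare-and-return-closest step described above can be sketched as a brute-force search in plain Python. This is only an illustration of the idea, not how pgvector works internally; the names nearest and rows are hypothetical, and the toy vectors stand in for real vectorized sentences.

```python
import heapq

def euclidean(a, b):
    """L2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def nearest(query_vec, rows, distance=euclidean, k=3):
    """Return the k (distance, sentence) pairs closest to query_vec.

    rows is an iterable of (sentence, vector) pairs; every stored
    vector is compared to the query, so this is an exhaustive scan.
    """
    scored = ((distance(query_vec, vec), sentence) for sentence, vec in rows)
    return heapq.nsmallest(k, scored)
```

Any other distance function with the same two-argument shape can be passed through the distance parameter, which makes it easy to compare the rankings that different measures produce for the same query.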
Herein lies the rub: there are many ways to compute "closest". Which ought I employ? Pgvector supports the seemingly most common measures:
* Euclidean distance
* Manhattan distance
* Cosine distance
There are many Internet postings comparing and contrasting such distance measures. For example, see something from a site called Medium. [2]
Despite my reading, I'm stymied. Which measure should I employ? All of my vectors are the same size, but since my vectors are "sparse", meaning they contain many zero values, I think I'm leaning away from Cosine distance and towards Euclidean distance. Finally, to be honest, no matter what measure I employ, the results are very similar. Go figure.
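One possible reason for the similar results, offered only as a sketch: if the vectors happen to be normalized to unit length (an assumption; nothing above says whether they are), then the squared Euclidean distance equals exactly twice the Cosine distance, so both measures rank the neighbors identically. The helper names below are hypothetical.

```python
import math

def normalize(v):
    """Scale v to unit L2 length (v must not be all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Two toy sparse vectors, normalized to unit length
a = normalize([3.0, 0.0, 4.0, 0.0])
b = normalize([0.0, 1.0, 2.0, 0.0])

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b) = 2 * cosine_distance
lhs = euclidean(a, b) ** 2
rhs = 2.0 * cosine_distance(a, b)
print(math.isclose(lhs, rhs))
```

If the stored vectors are not normalized, the two measures can disagree, because Euclidean distance is sensitive to vector length while Cosine distance is not.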
Do you have any suggestions?
[1] https://github.com/pgvector/pgvector
[2] https://medium.com/advanced-deep-learning/understanding-vector-similarity-b9c10f7506de
--
Eric Morgan
Navari Family Center for Digital Scholarship
Hesburgh Libraries
University of Notre Dame