Dear Eric,
I am glad to see code4lib folks engaging with these techniques. I am also working on an experimental project to develop some experience with ML techniques that center on vectorized representations of bibliographic description. I note this simply as a caveat, and to say that I hope this is not read as criticism but as constructive discussion in an area (the math of data science) in which library technologists should, in my opinion, develop greater expertise.
I am not sure it makes sense to simply create some kind of vector representation of sentences or sentence fragments for the purpose of search/retrieval. Or, at a minimum, it seems like you would need to provide more details on exactly how the vectorization is implemented and how it is intended to function.
First, I would note that the basic cosine similarity function [1] for two vectors looks *a lot* like what many of us in library land understand about search index relevancy in systems like Lucene/Solr based on term frequency–inverse document frequency (TF-IDF) [2]. In order to see this, I recommend playing around with a few small binary vectors. Define the vector space (array size) as all words (terms/tokens) in your corpus/vocabulary with a defined order (i.e., sorted). Then for each document being vectorized, if that document includes a word, place a 1 in that word's position in its vector, otherwise a 0.
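For reference, the definition behind [1] is simply the dot product of the two vectors divided by the product of their magnitudes:

  cosine_similarity(A, B) = (A . B) / (||A|| x ||B||)
                          = sum(Ai * Bi) / ( sqrt(sum(Ai^2)) x sqrt(sum(Bi^2)) )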
The instructional benefit of doing this is that the math becomes very easy: when all vector values are 0s and 1s, the summations in the cosine similarity function reduce to simple counting. The dot product in the numerator is just the count of terms the two documents share, and the magnitudes in the denominator are just the square roots of each document's term counts. In other words, you can see how the cosine similarity function relates to basic term-overlap percentages, and that brings you back to the relationship between good old Boolean operations and overlap percentages relative to the corpus, i.e., TF-IDF.
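To make that concrete, here is a tiny sketch in Python with toy documents and a toy vocabulary, both invented purely for illustration:

import math

# toy corpus; documents and vocabulary are invented for illustration
docs = {
    "doc1": "the whale is a mammal",
    "doc2": "the whale is not a fish",
    "doc3": "call me ishmael",
}

# the vector space: every word in the corpus, in a defined (sorted) order
vocabulary = sorted({word for text in docs.values() for word in text.split()})

def binary_vector(text):
    """Place a 1 in a word's position if the document contains it, otherwise a 0."""
    words = set(text.split())
    return [1 if word in words else 0 for word in vocabulary]

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    magnitude = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / magnitude if magnitude else 0.0

vectors = {name: binary_vector(text) for name, text in docs.items()}

# with 0/1 vectors, the dot product is simply the count of shared terms
print(cosine_similarity(vectors["doc1"], vectors["doc2"]))  # four shared terms, high similarity
print(cosine_similarity(vectors["doc1"], vectors["doc3"]))  # no shared terms, 0.0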
So this brings me to my initial point: the reason I don't think vectorization makes sense (or maybe it is better phrased as "is not that useful") for a *search* UX is that TF-IDF is already nearly the same thing conceptually, but one that is simpler to understand in terms of Boolean logic. Thinking about relevancy scores (similarity scores between a user query and a list of matching documents) in Boolean terms can be easier than thinking through the math behind the scores. A cosine similarity function is conceptually just a math-y way to do relevancy ranking of a Boolean logic-based search.
I am discovering that the challenge here is that we need to first define the meaning/intention behind a vector space. A vector is a numerical representation of some data space and it is very hard to quantify the richness/depth of words in human language. So what does the quantification represent? The simplest form of statistics is percentages, a distribution of values. Search in terms of TF-IDF already has us covered for term distribution in a corpus for the purpose of search-style information retrieval. So what other percentages/distributions of bibliographic description terms might yield interesting results? What is the purpose/meaning/intention of the vector space?
This is a very new area of study for me, so I would love to hear from others in the code4lib community with more experience, especially as relates to the math. If I have something wrong here, please correct me.
Cheers,
Steve
[1] https://en.wikipedia.org/wiki/Cosine_similarity#Definition
[2] https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Definition
On May 16, 2025, at 9:13 AM, Himpe, Christian <[log in to unmask]> wrote:
Dear Eric,
Think of the distance measures in terms of Lp-norms (https://en.wikipedia.org/wiki/Lp_space):
Euclidean and cosine distances are both essentially based on the L2 norm, while the Manhattan distance is the L1 norm. For sparse settings you typically want to use an L1 norm (in mathematical optimization); google "L1 norm and sparsity" for explanations. An alternative L1-like distance to try would be the Canberra distance. Then again, this all depends heavily on the employed numerical algorithms and their configurations.
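To illustrate, here is a small sketch computing each of these distances for a pair of invented vectors, using SciPy's distance functions:

import numpy as np
from scipy.spatial import distance

# two invented, somewhat sparse vectors, purely for illustration
a = np.array([1.0, 0.0, 0.0, 2.0, 0.0, 3.0])
b = np.array([0.0, 0.0, 1.0, 2.0, 0.0, 1.0])

print("Euclidean (L2):", distance.euclidean(a, b))  # square root of the sum of squared differences
print("Manhattan (L1):", distance.cityblock(a, b))  # sum of absolute differences
print("Cosine distance:", distance.cosine(a, b))    # 1 minus cosine similarity
print("Canberra:", distance.canberra(a, b))         # weighted L1; emphasizes differences in small values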
Best
Christian Himpe
PS: ArcadeDB has an overview of similarity measures: https://docs.arcadedb.com/#similarity
________________________________________
From: Code for Libraries <[log in to unmask]> on behalf of Eric Lease Morgan <[log in to unmask]>
Sent: Friday, May 16, 2025, 15:23:28
To: [log in to unmask]
Subject: [CODE4LIB] distance measures in vector similarity search
What distance measure do you suggest I use when implementing vector similarity search?
I have piles o' sentences. Almost more than I can count, literally. I have successfully looped through subsets of these sentences, vectorized them (think "indexed"), and stored the result in a Postgres database through the use of an extension called pgvector. [1]
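To give a sense of the shape of that indexing step, here is a minimal sketch assuming a sentence-transformers embedding model and psycopg2; the model name, table name, and column names are illustrative assumptions only, not necessarily the actual setup:

import psycopg2
from sentence_transformers import SentenceTransformer  # assumed embedding model, for illustration

sentences = ["Call me Ishmael.", "Nature is the fundamental environment in which we live."]

# vectorize ("index") the sentences; the model choice is an assumption
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(sentences)

# store the vectors with pgvector; table and column names are hypothetical
connection = psycopg2.connect("dbname=corpus")
cursor = connection.cursor()
cursor.execute("CREATE EXTENSION IF NOT EXISTS vector")
cursor.execute("CREATE TABLE IF NOT EXISTS sentences (id bigserial PRIMARY KEY, text text, embedding vector(384))")
for sentence, embedding in zip(sentences, embeddings):
    cursor.execute(
        "INSERT INTO sentences (text, embedding) VALUES (%s, %s::vector)",
        (sentence, str(list(map(float, embedding)))),
    )
connection.commit()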
To query the database one submits strings. Unlike most database implementations, the queries require zero syntax. None. Example queries include:
* nature
* nature is was
* man is was men are were
* ahab peleg bildad mate stood cabin stranger arm
* what is the definition of social justice?
These queries are vectorized in the same way the original sentences were vectorized. The query is compared to all the vectors in the database, and the vectors computed as "closest" to the query are returned. In other words, I get back a set of sentences containing some or all of the words from the query. Example results might be:
* Nature is the fundamental environment in which we live.
* And then he said, "Man is beside himself with arrogance."
* The first mate looked at Ahab, whom many call Pegleg, and stood in amazement.
* The definition of social justice is nuanced.
Herein lies the rub: there are many ways to compute "closest", so which ought I employ? Pgvector supports the seemingly most common measures (a hedged query sketch follows the list):
* Euclidean distance
* Manhattan distance
* Cosine distance
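Here is a sketch of what choosing among those measures looks like at query time, where pgvector's distance operators select the measure; the table, column, and model names are the same illustrative assumptions as in the indexing sketch above:

import psycopg2
from sentence_transformers import SentenceTransformer  # same assumed model as the indexing sketch

# vectorize the query exactly as the stored sentences were vectorized
model = SentenceTransformer("all-MiniLM-L6-v2")
query = "what is the definition of social justice?"
query_embedding = str(list(map(float, model.encode(query))))

# pgvector exposes one operator per distance measure:
#   <->  Euclidean (L2) distance
#   <+>  Manhattan (L1) distance (recent pgvector versions, if I recall correctly)
#   <=>  cosine distance
sql = """
    SELECT text, embedding <=> %s::vector AS distance
    FROM sentences
    ORDER BY distance
    LIMIT 5
"""

connection = psycopg2.connect("dbname=corpus")
cursor = connection.cursor()
cursor.execute(sql, (query_embedding,))
for text, dist in cursor.fetchall():
    print(round(dist, 4), text)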
There are many Internet postings comparing and contrasting such distance measures. For example, see something from a site called Medium. [2]
Despite my reading, I'm stymied. Which measure should I employ? All of my vectors are the same size, but since my vectors are "sparse", meaning they contain many zero values, I think I'm leaning away from cosine distance and towards Euclidean distance. Finally, to be honest, no matter what measure I employ, the results are very similar. Go figure.
Do you have any suggestions?
[1] https://github.com/pgvector/pgvector
[2] https://medium.com/advanced-deep-learning/understanding-vector-similarity-b9c10f7506de
--
Eric Morgan
Navari Family Center for Digital Scholarship
Hesburgh Libraries
University of Notre Dame