19 Jan 2025: The Unreasonable Effectiveness of Shortening Vectors

So following up on 12 Jan 2025: I ain’t afraid of nothin’, I was ready to take the plunge into vector database and all the complexity that entails (using a CDC pattern to sync data between my Postgres and my new vector database etc.).

However, after speaking to Ammar, he gave me two suggestions:

First, we could try shortening the vectors. While checking out the OpenAI API, I realized this was a possibility, but I did not seriously consider this possibility until he mentioned it.
At query time, load the vectors and do an in-memory HNSW vector index + search. Conveniently, Ammar has written a (perf-optimized?) library for this in Golang.

Shortening vectors

The TLDR is this worked like a charm. While looking up on this (for both the theoretical basis + the accepted best practice), I could not find very much except for:

This OpenAI blog post, which says

[O]n the MTEB benchmark, a text-embedding-3-large embedding can be shortened to a size of 256 while still outperforming an unshortened text-embedding-ada-002 embedding with a size of 1536.
A friend pointed out that vectors that are Matryoshka (??) can have their dimensions reduced this way. In the same OpenAI blog post, the footnote links to this paper, which contains the following:

To be honest, this wasn’t very much to go on and I also did not fully understand the theory behind shortening vectors. I was sceptical: even if this successfully sped up the search, isn’t it not scalable? What if the size of the database gets even larger?

But I knew the only way to know for sure is to try, and I tested the same queries across:

A moderately sized database (~200k vectors)
A larger database of ~1 million vectors

At least for my use case, this was unreasonably effective:

The HNSW index is now much smaller; around the same size as the data
The sequential scan is now much faster, allowing me to raise the threshold before I route to HNSW index search instead
The filtering problem was more or less solved because once the threshold is raised, even after filtering we still have sufficiently many results to show. (I guess this is because, for my use case, I ultimately only needed a fixed number of results to show to the user on the frontend)

While I am not afraid of toil and hard work, this was truly an example of the 80/20 rule. Setting up a new database with a shortened vector column was pretty easy and the results are really good. In fact, it took much longer to sync the embeddings (and exposed a few flaws in my syncing process). While I was waiting for the sync to complete, I decided to try the in-memory vector index approach.

Foray into Golang

The idea is an enticing as it sounds crazy. But the napkin math works out:

Each 256D vector is 1KB. So if I want to search across 20k vectors, that would be 20MB. Assuming a good database connection, it should not take too long to query 20MB of data.