20 Jan 2025: Cool blog post on cosine similarity and vector embeddings

Came across this and thought it was cool:

Don't use cosine similarity carelessly - Piotr Migdał

I also came across the following guide from Supabase and this particular paragraph stuck with me:

Cosine distance is a safe default when you don't know whether or not your embeddings are normalized. If you know for a fact that they are normalized (for example, your embedding is returned from OpenAI), you can use negative inner product (<#>) for better performance:
-- Match documents using negative inner product (<#>)
create or replace function match_documents (
  query_embedding vector(512),
  match_threshold float,
  match_count int
)
returns setof documents
language sql
as $$
  select *
  from documents
  where documents.embedding <#> query_embedding < -match_threshold
  order by documents.embedding <#> query_embedding asc
  limit least(match_count, 200);
$$;
Note that since <#> is negative, we negate match_threshold accordingly in the where clause. For more information on the different operators, see the pgvector docs.
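
To convince myself why the swap is safe, here is a minimal sketch in plain Python (not pgvector itself). The helper names and the toy vectors are my own, made up for illustration. It assumes that <=> means cosine distance (1 minus cosine similarity) and <#> means negative inner product, as the pgvector docs describe. For unit-length vectors the two differ only by a constant of 1, so the ordering is identical; for unnormalized vectors the ordering can change.

```python
import math

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    # Scale a vector to unit length.
    norm = math.sqrt(inner_product(v, v))
    return [x / norm for x in v]

def cosine_distance(a, b):
    # What pgvector's <=> computes: 1 - cosine similarity.
    norms = math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    return 1.0 - inner_product(a, b) / norms

def negative_inner_product(a, b):
    # What pgvector's <#> computes.
    return -inner_product(a, b)

def rank(docs, query, distance):
    # Document keys ordered from nearest to farthest.
    return sorted(docs, key=lambda k: distance(docs[k], query))

# Case 1: normalized embeddings (toy data, hypothetical).
query = normalize([0.2, 0.9, 0.4])
docs = {
    "a": normalize([0.1, 0.8, 0.5]),
    "b": normalize([0.9, 0.1, 0.0]),
    "c": normalize([-0.3, 0.7, 0.2]),
}
by_cosine = rank(docs, query, cosine_distance)
by_neg_ip = rank(docs, query, negative_inner_product)

# Case 2: unnormalized embeddings, where vector length skews the inner product.
raw_query = [1.0, 0.0]
raw_docs = {"short": [1.0, 0.0], "long": [10.0, 10.0]}
raw_by_cosine = rank(raw_docs, raw_query, cosine_distance)
raw_by_neg_ip = rank(raw_docs, raw_query, negative_inner_product)

print(by_cosine == by_neg_ip)          # same order when normalized
print(raw_by_cosine, raw_by_neg_ip)    # order differs when not normalized
```

In the second case the long vector wins on inner product purely because of its magnitude, even though the short one points in exactly the same direction as the query. That is the sense in which cosine distance is the "safe default" when you are not sure about normalization.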

In my own performance optimization work, this was one stone I had left unturned.