Came across this and thought it was cool:
Don't use cosine similarity carelessly - Piotr Migdał
I also came across the following guide from Supabase and this particular paragraph stuck with me:
Cosine distance is a safe default when you don't know whether or not your embeddings are normalized. If you know for a fact that they are normalized (for example, your embedding is returned from OpenAI), you can use negative inner product (`<#>`) for better performance:

```sql
-- Match documents using negative inner product (<#>)
create or replace function match_documents (
  query_embedding vector(512),
  match_threshold float,
  match_count int
)
returns setof documents
language sql
as $$
  select *
  from documents
  where documents.embedding <#> query_embedding < -match_threshold
  order by documents.embedding <#> query_embedding asc
  limit least(match_count, 200);
$$;
```
Note that since `<#>` is negative, we negate `match_threshold` accordingly in the `where` clause. For more information on the different operators, see the pgvector docs.
In my performance optimization, this was one stone I left unturned.
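To convince myself the swap is safe, here is a minimal sketch in plain Python (not from the Supabase guide) of why the two operators agree on unit-length vectors: cosine distance becomes `1 - dot(a, b)`, so it is just the negative inner product shifted by 1, and the ranking cannot change. The toy vectors and helper names are made up for illustration; `cosine_distance` and `negative_inner_product` mimic what pgvector's `<=>` and `<#>` compute.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    # Scale v to unit length, as OpenAI embeddings already are
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine_distance(a, b):
    # Mimics pgvector's <=> operator: 1 - cosine similarity
    return 1.0 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def negative_inner_product(a, b):
    # Mimics pgvector's <#> operator
    return -dot(a, b)

# Hypothetical toy data, normalized up front
query = normalize([0.2, 0.9, 0.4])
docs = {
    "a": normalize([0.1, 1.0, 0.3]),
    "b": normalize([0.9, 0.1, 0.2]),
    "c": normalize([0.3, 0.5, 0.8]),
}

# Rank documents with each operator, closest first
by_cosine = sorted(docs, key=lambda k: cosine_distance(docs[k], query))
by_neg_ip = sorted(docs, key=lambda k: negative_inner_product(docs[k], query))

# Same filter as the SQL where clause: <#> result below -match_threshold
match_threshold = 0.8
matches = [k for k in by_neg_ip
           if negative_inner_product(docs[k], query) < -match_threshold]

print(by_cosine, by_neg_ip, matches)
```

Both sorts give the same order, and the threshold filter keeps only documents whose inner product with the query exceeds `match_threshold`. This only holds when every vector is normalized; with unnormalized embeddings the two rankings can differ, which is why cosine distance is the safe default.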