

Over the last few months, I have built and launched a free semantic search tool for GitHub called SemHub. In this blog post, I share what I’ve learned and why I’ve failed, so that other builders can learn from my experience. This post runs long, so I have signposted each section and marked the ones I consider particularly insightful with an asterisk (*).

I have also summarized my key lessons here:

  1. Default to pgvector, avoid premature optimization.
  2. You can probably get away with shorter embeddings if you’re using Matryoshka embedding models.
  3. Filtering with vector search may be harder than you expect.
  4. If you love full-stack TypeScript and use AWS, you’ll love SST. One day, I hope I can recommend Cloudflare in equally strong terms too.
  5. Building is only half the battle. You have to solve a big enough problem and meet your users where they’re at.

Genesis

GitHub is the default place to host open source projects. But this privileged position also means that GitHub does not really need to compete on improving its UX. I am sure many developers are like me and have schlep blindness about how bad GitHub’s UX is. Just in terms of searching issues:

At Coder.com, we manage multiple public and private repos on GitHub and we encounter these pain points daily. Wouldn’t it be nice to be able to search across, not just the repos we own, but the repos of similar projects, to see how they approach a given problem? What if I don’t know whether the issue I am searching for is open or closed? What if I want to perform a fuzzy search?

Fortunately, for all its flaws, GitHub has a fairly open API. Ammar, a cofounder at Coder, took matters into his own hands and built a semantic search feature in Coder’s internal tool, which works surprisingly well! Surely the wider open-source community could benefit from this? He brought me on to do exactly that and, thus, the idea for SemHub was born.

What is SemHub

SemHub’s goal is to enable semantic issue search across multiple GitHub repos, in a form that anyone can use. Its core features are simple: