
Sift · Open Source


Why I Built a Retrieval Engine Without Embeddings

By Oren · May 2026 · 6 min read

Everyone says embeddings are the answer to search and retrieval. I tried them. They got 25% accuracy on my data.

So I built something else.

The problem I was solving

I needed a retrieval engine for a real workload — hundreds of thousands of document pairs, queried in natural language. The kind of thing you'd normally throw a vector database at.

I tried embeddings first. Sentence transformers, vector similarity, the standard approach everyone recommends. On a benchmark of 175 real-world queries, the embedding approach found the right answer 25% of the time.

I tried hybrid retrieval next — embeddings plus BM25 keyword search. That got me to 33%.

Both numbers are unacceptable for any production system.

Accuracy on the 175-query benchmark: Embeddings 25% · Hybrid 33% · Sift 100%.

What went wrong with embeddings

Embeddings compress meaning into fixed-length vectors. That compression loses information. When your documents contain domain-specific language, abbreviations, technical terms, or structured data, the embedding model hasn't seen enough of your domain to represent it properly.

The vector similarity score tells you "these texts are kind of related." It doesn't tell you "this is the answer."

The fundamental problem: embeddings optimize for semantic similarity, not retrieval accuracy. "Similar" and "correct" are not the same thing.

When a support agent searches for "how to reset MFA token for enterprise SSO," the embedding model returns articles about password resets, MFA setup guides, SSO overview pages — all "similar" to the query, none of them the answer. The actual procedure document has different vocabulary, different structure, but it's the one the agent needs.
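You can see this failure mode in miniature with nothing but the standard library. The sketch below ranks three toy "documents" against that query by cosine similarity over bag-of-words vectors (the document titles and texts are invented for illustration, not from Sift's benchmark). The correct procedure document uses different vocabulary, so it scores lowest, while the vocabulary-adjacent but wrong articles float to the top:

```python
import math
from collections import Counter

def bow(text: str) -> Counter:
    """Bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

query = bow("how to reset mfa token for enterprise sso")

# Hypothetical knowledge-base articles.
docs = {
    "Password Reset Guide": bow(
        "how to reset your password reset steps for sso"
    ),
    "MFA Setup Guide": bow(
        "mfa setup enable mfa token for your account"
    ),
    "Enterprise SSO MFA Token Reset Procedure": bow(
        "revoke the hardware authenticator then reissue credentials via the identity console"
    ),
}

# Highest similarity first -- the correct doc lands at the bottom.
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
```

The right answer shares almost no surface vocabulary with the query, so a similarity ranking buries it. Real embedding models are far better than raw bag-of-words, but the structural problem is the same: the score measures overlap in representation space, not correctness.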

What I built instead

Sift takes a completely different approach. Instead of converting text to vectors, it builds deterministic signatures from the text itself. Think of it like a fingerprint — every document gets a unique structural identity based on what's actually in it.

When a query comes in, Sift doesn't compute similarity. It matches structural patterns. The result is either the right answer or nothing. No "sort of related" results cluttering the output.
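The actual signature construction is documented in RECIPE.md; as a rough illustration of the match-or-nothing idea only, here is a minimal stdlib sketch where a "signature" is a hash of normalized content tokens. Everything here — the stopword list, the hashing scheme, the index entries — is invented for illustration and is not Sift's algorithm:

```python
import hashlib
from typing import Optional

# Illustrative stopword list, not Sift's.
STOPWORDS = {"how", "to", "for", "a", "an", "the", "of", "in"}

def signature(text: str) -> str:
    """Deterministic fingerprint: lowercase, drop stopwords,
    de-duplicate and sort the remaining tokens, then hash."""
    tokens = sorted({t for t in text.lower().split() if t not in STOPWORDS})
    return hashlib.sha256(" ".join(tokens).encode()).hexdigest()

# Index maps signature -> document title.
index = {
    signature("enterprise sso mfa token reset"):
        "Enterprise SSO MFA Token Reset Procedure",
}

def lookup(query: str) -> Optional[str]:
    # Exact structural match or nothing -- no similarity scores.
    return index.get(signature(query))

lookup("how to reset MFA token for enterprise SSO")
# -> 'Enterprise SSO MFA Token Reset Procedure'
```

Because the signature is deterministic, the same normalized content always produces the same fingerprint, and a query either hits it exactly or returns nothing. There is no ranked list of "sort of related" candidates to sift through.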

# Build signatures for your documents
$ python -m sift index --source ./documents/
Indexed 904,271 pairs in 47.3s
 
# Query in natural language
$ python -m sift query "how to reset MFA token for enterprise SSO"
Found: Enterprise SSO MFA Token Reset Procedure (confidence: 1.0, time: 11ms)
 
# Run the full benchmark
$ python -m sift benchmark --queries 175
Results: 175/175 correct · avg 15ms · peak RAM 272MB

The algorithm is fully documented in RECIPE.md on GitHub. No black box. You can read exactly how every signature is built and every match is made.

The numbers

Metric                   Sift      Embeddings   Hybrid
Accuracy (175 queries)   100%      25%          33%
Avg response time        15ms      ~200ms       ~350ms
Dependencies             0         5–20+        10–30+
GPU required             No        Usually      Usually
Works offline            Yes       Rarely       Rarely
RAM (904K pairs)         ~272MB    2–8GB        3–10GB

Zero dependencies means zero dependencies. Sift runs on Python's standard library. No numpy, no torch, no transformers, no sentence-transformers, no FAISS, no ChromaDB. Import it and it works.

Who this is for

  • Offline applications where cloud APIs aren't available
  • Privacy-sensitive environments — healthcare, legal, finance
  • Edge computing and IoT with limited hardware
  • RAG pipelines where retrieval accuracy determines output quality
  • Anyone tired of managing vector database infrastructure
  • Startups that can't afford Pinecone/Weaviate/ChromaDB costs

Where Sift doesn't fit

Sift is built for structured retrieval — finding the right document for a specific query. It's not a general-purpose semantic search engine. If you need "find me things vaguely related to this concept," embeddings are better for that. If you need "find me the exact right answer," Sift is better.

It also works best when you can define your domain. The signature-building process benefits from knowing what your documents look like. Generic web search across arbitrary content is not the target use case.

Try it

Sift is open source under MIT license. The code, the algorithm, and the full benchmark are on GitHub.

Get Sift

Open source. MIT license. Zero dependencies. Start retrieving in 5 minutes.

View on GitHub · Need help integrating?