Why I Built a Retrieval Engine Without Embeddings
Everyone says embeddings are the answer to search and retrieval. I tried them. They got 25% accuracy on my data.
So I built something else.
The problem I was solving
I needed a retrieval engine for a real workload — hundreds of thousands of document pairs, queried in natural language. The kind of thing you'd normally throw a vector database at.
I tried embeddings first. Sentence transformers, vector similarity, the standard approach everyone recommends. On a benchmark of 175 real-world queries, the embedding approach found the right answer 25% of the time.
I tried hybrid retrieval next — embeddings plus BM25 keyword search. That got me to 33%.
Both numbers are unacceptable for any production system.
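For context, "hybrid" here means fusing a vector ranking with a BM25 keyword ranking. The post doesn't say how the two score lists were combined; reciprocal rank fusion is one common scheme, sketched below with made-up document IDs:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: each ranking contributes 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy rankings for illustration — not real benchmark data.
embedding_ranking = ["doc_b", "doc_a", "doc_c"]  # vector-similarity order
bm25_ranking      = ["doc_a", "doc_c", "doc_b"]  # keyword-match order
print(rrf([embedding_ranking, bm25_ranking]))
```

Fusion like this can only reorder the candidates the underlying retrievers surface — if neither ranking contains the right document, no amount of score mixing recovers it.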
What went wrong with embeddings
Embeddings compress meaning into fixed-length vectors. That compression loses information. When your documents contain domain-specific language, abbreviations, technical terms, or structured data, the embedding model hasn't seen enough of your domain to represent it properly.
The vector similarity score tells you "these texts are kind of related." It doesn't tell you "this is the answer."
The fundamental problem: embeddings optimize for semantic similarity, not retrieval accuracy. "Similar" and "correct" are not the same thing.
When a support agent searches for "how to reset MFA token for enterprise SSO," the embedding model returns articles about password resets, MFA setup guides, SSO overview pages — all "similar" to the query, none of them the answer. The actual procedure document has different vocabulary, different structure, but it's the one the agent needs.
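The failure mode is easy to reproduce with a toy example. The vectors below are made up for illustration — three dimensions instead of hundreds — but they show how cosine similarity ranks topically adjacent documents above the one with the right answer:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dim "embeddings": the query lands near several
# topically related documents, none of which is the answer.
query            = [0.9, 0.4, 0.1]
password_reset   = [0.8, 0.5, 0.2]    # similar topic, wrong document
mfa_setup_guide  = [0.85, 0.45, 0.15] # similar topic, wrong document
actual_procedure = [0.2, 0.3, 0.95]   # different vocabulary, right document

for name, vec in [("password_reset", password_reset),
                  ("mfa_setup_guide", mfa_setup_guide),
                  ("actual_procedure", actual_procedure)]:
    print(f"{name}: {cosine(query, vec):.3f}")
```

The setup guide and the password-reset article score near 1.0; the actual procedure, written in different vocabulary, scores lowest — so a top-k cutoff drops exactly the document the agent needed.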
What I built instead
Sift takes a completely different approach. Instead of converting text to vectors, it builds deterministic signatures from the text itself. Think of it like a fingerprint — every document gets a unique structural identity based on what's actually in it.
When a query comes in, Sift doesn't compute similarity. It matches structural patterns. The result is either the right answer or nothing. No "sort of related" results cluttering the output.
The algorithm is fully documented in RECIPE.md on GitHub. No black box. You can read exactly how every signature is built and every match is made.
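To make "deterministic signatures, exact match or nothing" concrete, here is an illustrative sketch of the general idea — this is my simplification, not Sift's actual algorithm (that's what RECIPE.md specifies). It hashes an order-independent token fingerprint and does exact dictionary lookup:

```python
import hashlib
import re

def signature(text: str) -> str:
    # Normalize: lowercase, keep alphanumeric tokens, sort unique tokens
    # so the signature is deterministic regardless of word order.
    tokens = sorted(set(re.findall(r"[a-z0-9]+", text.lower())))
    return hashlib.sha256(" ".join(tokens).encode()).hexdigest()

index: dict[str, str] = {}

def add_document(doc_id: str, key_phrases: list[str]) -> None:
    # Each key phrase becomes a structural key pointing at the document.
    for phrase in key_phrases:
        index[signature(phrase)] = doc_id

def lookup(query: str):
    # Exact structural match: the right answer or nothing.
    return index.get(signature(query))

add_document("kb-1042", ["reset MFA token enterprise SSO"])
print(lookup("enterprise SSO: reset MFA token"))  # word order doesn't matter
print(lookup("how do I change my avatar"))        # no match -> None
```

Note the contract this buys you: lookups are O(1), fully deterministic, and never return a "sort of related" result — at the cost of needing signatures robust enough to absorb query variation, which is the hard part the real recipe has to solve.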
The numbers
| Metric | Sift | Embeddings | Hybrid |
|---|---|---|---|
| Accuracy (175 queries) | 100% | 25% | 33% |
| Avg response time | 15ms | ~200ms | ~350ms |
| Dependencies | 0 | 5–20+ | 10–30+ |
| GPU required | No | Usually | Usually |
| Works offline | Yes | Rarely | Rarely |
| RAM (904K pairs) | ~272MB | 2–8GB | 3–10GB |
Zero dependencies means zero dependencies. Sift runs on Python's standard library. No numpy, no torch, no transformers, no sentence-transformers, no FAISS, no ChromaDB. Import it and it works.
Who this is for
- Offline applications where cloud APIs aren't available
- Privacy-sensitive environments — healthcare, legal, finance
- Edge computing and IoT with limited hardware
- RAG pipelines where retrieval accuracy determines output quality
- Anyone tired of managing vector database infrastructure
- Startups that can't afford Pinecone/Weaviate/ChromaDB costs
Where Sift doesn't fit
Sift is built for structured retrieval — finding the right document for a specific query. It's not a general-purpose semantic search engine. If you need "find me things vaguely related to this concept," embeddings are the better tool. If you need "find me the exact right answer," Sift is.
It also works best when you can define your domain. The signature-building process benefits from knowing what your documents look like. Generic web search across arbitrary content is not the target use case.
Try it
Sift is open source under MIT license. The code, the algorithm, and the full benchmark are on GitHub.
Get Sift
Open source. MIT license. Zero dependencies. Start retrieving in 5 minutes.
View on GitHub ↗ Need help integrating? →