Modern AI applications are no longer powered by simple keyword matching. Production-grade Retrieval-Augmented Generation (RAG) systems use a combination of:
Semantic Vector Search + Keyword-Based Search
This approach is called Hybrid Search.
Hybrid Search dramatically improves retrieval quality in AI applications like:
- AI Chatbots
- Enterprise Knowledge Search
- Healthcare AI Systems
- Resume Search Engines
- Legal Document Assistants
- Customer Support AI
- Internal Company Search
In this article, we will build a production-style Hybrid Search system using:
- FastAPI
- Qdrant
- Sentence Transformers
- BM25
- Python
Why Pure Vector Search Is Not Enough
Vector databases like Qdrant are excellent at understanding semantic meaning.
Example:
Query:
What is Pankaj's contact information?
Vector search can correctly retrieve:
Mobile: (+91) 1234567890
because it understands contextual meaning.
However, vector search struggles with:
- Email addresses
- Phone numbers
- Invoice IDs
- Employee IDs
- Exact names
- Codes
- Technical keywords
Example:
What is Pankaj email id?
Pure semantic search may retrieve:
Professional summary
instead of:
Pankajthapa4@gmail.com
This is where Hybrid Search becomes essential.
What Is Hybrid Search?
Hybrid Search combines:
| Search Type | Purpose |
|---|---|
| Semantic Vector Search | Understand meaning/context |
| BM25 Keyword Search | Exact keyword relevance |
Production AI systems combine both scores to improve retrieval quality.
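For reference, the keyword side is usually scored with the Okapi BM25 formula. The form below is the textbook version; the symbols are introduced here for illustration and do not come from the article, and implementations such as rank-bm25 differ in details like the exact IDF variant:

$$
\text{BM25}(D, Q) = \sum_{q_i \in Q} \text{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b\,\frac{|D|}{\text{avgdl}}\right)}
$$

Here $f(q_i, D)$ is how often query term $q_i$ occurs in document $D$, $|D|$ is the document length in tokens, $\text{avgdl}$ is the average document length in the corpus, and $k_1$ and $b$ are tuning parameters (rank-bm25's `BM25Okapi` defaults to $k_1 = 1.5$ and $b = 0.75$). The sum has no fixed upper bound, so BM25 scores are not on the same scale as cosine similarity, which matters when the two are combined.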
Architecture:
User Query
↓
Generate Embedding
↓
Vector Search (Qdrant)
↓
BM25 Keyword Ranking
↓
Combine Scores
↓
Return Best Chunks
↓
Send Context to LLM
Hybrid retrieval along these lines is found in products such as:
- Perplexity AI
- Azure AI Search
- Elasticsearch AI Search
- Pinecone Hybrid Search
- Weaviate Hybrid Search
Step 1 — Install Dependencies
Install required libraries:
```bash
pip install qdrant-client sentence-transformers rank-bm25 fastapi uvicorn
```
Step 2 — Initialize Embedding Model
We will use:
all-MiniLM-L6-v2
This model converts text into 384-dimensional vector embeddings.
```python
from sentence_transformers import SentenceTransformer

# Load the model once at startup; it is reused for every query.
embedding_model = SentenceTransformer(
    "all-MiniLM-L6-v2"
)
```
Step 3 — Initialize Qdrant
```python
from qdrant_client import QdrantClient

# Connect to a Qdrant instance running locally on its default port.
qdrant_client = QdrantClient(
    host="localhost",
    port=6333
)

COLLECTION_NAME = "ai_documents"
```
Step 4 — Create BM25 Search Function
The rank-bm25 library was already installed in Step 1. Now create the BM25 scoring logic:
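```python
from rank_bm25 import BM25Okapi


def calculate_bm25_score(
    query: str,
    documents: list[str]
):
    # BM25Okapi cannot be built from an empty corpus, so return early.
    if not documents:
        return []

    # Simple whitespace tokenization; punctuation stays attached to words.
    tokenized_docs = [
        doc.lower().split()
        for doc in documents
    ]
    bm25 = BM25Okapi(tokenized_docs)
    tokenized_query = query.lower().split()

    # One score per document, in the same order as `documents`.
    scores = bm25.get_scores(
        tokenized_query
    )
    return scores
```
BM25 ranks documents based on keyword relevance.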
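One detail worth checking in the function above is tokenization: splitting on whitespace leaves punctuation attached, so a query token like `number?` never matches the document token `number`. The sketch below shows one possible replacement tokenizer using only the standard library. The `tokenize` helper and its regular expression are illustrative assumptions, not part of rank-bm25 or the original code.

```python
import re

# Hypothetical helper: keeps letters and digits, and lets '@', '.', '_',
# '+' and '-' join alphanumeric runs so an email address stays one token.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:[@._+-][a-z0-9]+)*")


def tokenize(text: str) -> list[str]:
    """Lowercase the text and return punctuation-free tokens."""
    return TOKEN_PATTERN.findall(text.lower())


query = "What is Pankaj phone number?"

# Whitespace splitting keeps the trailing question mark on the last token.
print(query.lower().split())

# The regex tokenizer drops it, so "number" can match document tokens.
print(tokenize(query))

# Email addresses survive as a single exact-match token.
print(tokenize("Email: Pankajthapa4@gmail.com"))
```

If this route is taken, the same tokenizer would need to be applied to both the documents and the query inside `calculate_bm25_score`, otherwise the two sides stop matching.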
Step 5 — Implement Hybrid Search
Now let’s combine:
Vector Score + BM25 Score
```python
def search_qdrant(
    query: str,
    top_k: int = 3
):
    # Embed the query and fetch the top_k nearest chunks from Qdrant.
    query_embedding = embedding_model.encode(query)
    results = qdrant_client.query_points(
        collection_name=COLLECTION_NAME,
        query=query_embedding.tolist(),
        limit=top_k
    )

    # Chunk text is read from the "text" payload key (empty if missing).
    texts = [
        (point.payload or {}).get("text", "")
        for point in results.points
    ]

    # BM25 is computed over the retrieved chunks only, so it re-ranks
    # the vector candidates rather than searching the whole collection.
    bm25_scores = calculate_bm25_score(
        query=query,
        documents=texts
    )

    documents = []
    for index, point in enumerate(results.points):
        text = texts[index]
        vector_score = point.score
        keyword_score = bm25_scores[index]
        # Weighted sum: 70% vector similarity, 30% raw BM25 score.
        hybrid_score = (
            vector_score * 0.7
        ) + (
            keyword_score * 0.3
        )
        documents.append({
            "text": text,
            "vector_score": round(vector_score, 4),
            "bm25_score": round(float(keyword_score), 4),
            "hybrid_score": round(float(hybrid_score), 4)
        })

    # Highest hybrid score first.
    documents = sorted(
        documents,
        key=lambda x: x["hybrid_score"],
        reverse=True
    )
    return documents
```
How Hybrid Ranking Works
Suppose the query is:
What is pankaj phone number?
Retrieved chunks:
| Chunk | Vector Score | BM25 Score |
|---|---|---|
| Resume Summary | 0.81 | 0.4 |
| Phone Number Chunk | 0.76 | 4.8 |
Hybrid score formula:
```python
hybrid_score = (
    vector_score * 0.7
) + (
    bm25_score * 0.3
)
```
Final ranking:
| Chunk | Hybrid Score |
|---|---|
| Phone Number Chunk | 1.97 ✅ |
| Resume Summary | 0.69 |
Now the correct chunk ranks highest.
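The arithmetic behind this ranking can be checked with a few lines of Python. The sketch below recomputes the weighted sum for both chunks (for the phone-number chunk, 0.76 × 0.7 + 4.8 × 0.3 = 1.972) and then shows a variant that min-max normalizes the BM25 scores first. The normalization step is an assumption added here for illustration, not part of the article's code: raw BM25 scores have no upper bound while cosine similarity stays within [-1, 1], so without rescaling the 0.7/0.3 weights do not reflect the real influence of each signal. The helper names are hypothetical.

```python
def hybrid_score(
    vector_score: float,
    bm25_score: float,
    vector_weight: float = 0.7,
    keyword_weight: float = 0.3,
) -> float:
    """Weighted sum of a vector score and a keyword score."""
    return vector_score * vector_weight + bm25_score * keyword_weight


def min_max(values: list[float]) -> list[float]:
    """Rescale values to [0, 1]; a constant list maps to all zeros."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]


# Scores from the example table: (chunk, vector score, raw BM25 score).
chunks = [
    ("Resume Summary", 0.81, 0.4),
    ("Phone Number Chunk", 0.76, 4.8),
]

# Variant 1: raw BM25 scores, as in the article's formula.
raw = {name: hybrid_score(vec, bm25) for name, vec, bm25 in chunks}

# Variant 2: BM25 scores rescaled to [0, 1] before weighting.
normalized_bm25 = min_max([bm25 for _, _, bm25 in chunks])
normalized = {
    name: hybrid_score(vec, norm)
    for (name, vec, _), norm in zip(chunks, normalized_bm25)
}

for name in raw:
    print(f"{name}: raw={raw[name]:.3f} normalized={normalized[name]:.3f}")
```

In this example the phone-number chunk ranks first under both variants. With only two candidates, min-max normalization is extreme (one chunk gets 0 and the other 1), so it is more informative on larger candidate sets.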
Step 6 — Create FastAPI Endpoint
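```python
from fastapi import FastAPI

app = FastAPI()


@app.get("/search-docs")
def search_docs(
    query: str
):
    # Declared with plain `def` so FastAPI runs the blocking embedding
    # and Qdrant calls in its threadpool instead of on the event loop.
    results = search_qdrant(query)
    return {
        "query": query,
        "results": results
    }
```
Run server (assuming the code above is saved as app/main.py):
```bash
uvicorn app.main:app --reload
```
Step 7 — Test Hybrid Search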
Test:
http://127.0.0.1:8000/search-docs?query=What is Himanshu phone number
Expected result (one entry of the "results" list in the response):
```json
{
    "text": "Mobile: (+91) 7579414837",
    "vector_score": 0.3015,
    "bm25_score": 4.921,
    "hybrid_score": 1.687
}
```
Why Hybrid Search Matters in Production
Pure semantic search alone is often insufficient for enterprise systems, where many queries depend on exact identifiers.
Hybrid Search improves:
- Exact information retrieval
- Email search
- Contact search
- Invoice lookup
- Healthcare IDs
- Employee records
- Legal document search
This becomes critical in:
- Healthcare Systems
- Banking Systems
- Enterprise Knowledge Platforms
- SaaS AI Applications
Production-Grade Hybrid Search Architecture
Modern enterprise RAG systems typically use:
- FastAPI
- Qdrant
- BM25
- Redis Cache
- OpenAI/Groq
- Rerankers
Advanced systems further add:
- Cross-encoder reranking
- Metadata filtering
- Query rewriting
- Multi-stage retrieval
- Access control
- Tenant isolation
Final Thoughts
Hybrid Search is one of the most important upgrades you can make to a RAG system.
By combining:
Semantic Understanding + Keyword Precision
you create retrieval systems that are:
- More accurate
- More production-ready
- More scalable
- More enterprise-friendly
If you are building modern AI applications, Hybrid Search should be considered a foundational architecture pattern rather than an optional enhancement.
