Building Production-Ready Hybrid Search in RAG: Combining Semantic Search with BM25 Keyword Search

Modern AI applications rarely rely on keyword matching alone. Many production-grade Retrieval-Augmented Generation (RAG) systems combine:

Semantic Vector Search
+
Keyword-Based Search

This approach is called:

Hybrid Search

Hybrid Search can noticeably improve retrieval quality in AI applications such as:

  • AI Chatbots
  • Enterprise Knowledge Search
  • Healthcare AI Systems
  • Resume Search Engines
  • Legal Document Assistants
  • Customer Support AI
  • Internal Company Search

In this article, we will build a production-style Hybrid Search system using:

  • FastAPI
  • Qdrant
  • Sentence Transformers
  • BM25
  • Python

Why Pure Vector Search Is Not Enough

Vector databases like Qdrant are excellent at understanding semantic meaning.

Example:

Query:

What is Pankaj's contact information?

Vector search can correctly retrieve:

Mobile: (+91) 1234567890

because it understands contextual meaning.

However, vector search often struggles when the answer depends on an exact token, such as:

  • Email addresses
  • Phone numbers
  • Invoice IDs
  • Employee IDs
  • Exact names
  • Codes
  • Technical keywords

Example:

What is Pankaj email id?

Pure semantic search may retrieve:

Professional summary

instead of:

Pankajthapa4@gmail.com

This is where Hybrid Search becomes essential.


What Is Hybrid Search?

Hybrid Search combines:

Search Type            | Purpose
Semantic Vector Search | Understand meaning/context
BM25 Keyword Search    | Exact keyword relevance

Production AI systems combine both scores to improve retrieval quality.

Architecture:

User Query
↓
Generate Embedding
↓
Vector Search (Qdrant)
↓
BM25 Keyword Ranking
↓
Combine Scores
↓
Return Best Chunks
↓
Send Context to LLM

Variants of this pattern appear in products such as:

  • Perplexity AI
  • Azure AI Search
  • Elasticsearch AI Search
  • Pinecone Hybrid Search
  • Weaviate Hybrid Search

Step 1 — Install Dependencies

Install required libraries:

pip install qdrant-client sentence-transformers rank-bm25 fastapi uvicorn

Step 2 — Initialize Embedding Model

We will use:

all-MiniLM-L6-v2

This model converts text into 384-dimensional vector embeddings.

from sentence_transformers import SentenceTransformer

# Load the pretrained model once at startup and reuse it for every query.
embedding_model = SentenceTransformer(
    "all-MiniLM-L6-v2"
)
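
To make the idea of comparing embeddings concrete, here is a minimal sketch of cosine similarity, one common way to score how close two vectors are. The three-dimensional vectors are invented for illustration, and `cosine_similarity` is a hypothetical helper. Whether your Qdrant collection actually uses cosine distance depends on how the collection was created, which this article does not cover.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Toy vectors, invented for illustration only.
query_vec = [1.0, 0.0, 1.0]
similar_vec = [2.0, 0.0, 2.0]    # same direction as the query
unrelated_vec = [0.0, 1.0, 0.0]  # orthogonal to the query

print(cosine_similarity(query_vec, similar_vec))    # close to 1.0
print(cosine_similarity(query_vec, unrelated_vec))  # 0.0
```

Vectors pointing in the same direction score near 1 regardless of their length, which is why a chunk can rank well on meaning without sharing any exact keyword with the query.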

Step 3 — Initialize Qdrant

from qdrant_client import QdrantClient

qdrant_client = QdrantClient(
    host="localhost",
    port=6333
)

COLLECTION_NAME = "ai_documents"

Step 4 — Create BM25 Search Function

The rank-bm25 library was already installed in Step 1.

Now create the BM25 scoring logic:

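from rank_bm25 import BM25Okapi


def calculate_bm25_score(
    query: str,
    documents: list[str]
):
    """Return one BM25 score per document for the given query."""

    # Guard against an empty corpus, which BM25Okapi cannot handle.
    if not documents:
        return []

    # Simple whitespace tokenization; lowercasing makes matching case-insensitive.
    tokenized_docs = [
        doc.lower().split()
        for doc in documents
    ]

    bm25 = BM25Okapi(tokenized_docs)

    tokenized_query = query.lower().split()

    scores = bm25.get_scores(
        tokenized_query
    )

    return scores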

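BM25 ranks documents by keyword relevance: a document scores higher when it contains query terms, especially terms that are rare among the documents being compared. Note that the statistics here are computed only over the documents passed to the function.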
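
One caveat with `lower().split()` is that punctuation stays attached to words, so a query token like `number?` will not match `number` in a chunk. A small regex tokenizer is one way around this. The sketch below is an illustration rather than part of the original pipeline: `tokenize` and `TOKEN_PATTERN` are hypothetical names, and the pattern is an assumption that deliberately keeps an email address as a single token.

```python
import re

# Hypothetical pattern: try an email-like token first, otherwise take a
# run of letters, digits, or underscores. Punctuation such as "?" or ":"
# is dropped.
TOKEN_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+|\w+")


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into keyword tokens."""
    return TOKEN_PATTERN.findall(text.lower())


print(tokenize("What is Pankaj phone number?"))
# ['what', 'is', 'pankaj', 'phone', 'number']
print(tokenize("Email: Pankajthapa4@gmail.com"))
# ['email', 'pankajthapa4@gmail.com']
```

To try it, call `tokenize(doc)` and `tokenize(query)` in place of the two `.lower().split()` calls in `calculate_bm25_score`.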


Step 5 — Implement Hybrid Search

Now let's combine the two scores:

Vector Score
+
BM25 Score

def search_qdrant(
    query: str,
    top_k: int = 3
):
    # Embed the query; this should be the same model used at indexing time.
    query_embedding = embedding_model.encode(query)

    # Retrieve the top_k nearest chunks from Qdrant.
    results = qdrant_client.query_points(
        collection_name=COLLECTION_NAME,
        query=query_embedding.tolist(),
        limit=top_k
    )

    texts = [
        point.payload.get("text", "")
        for point in results.points
    ]

    # Score only the retrieved chunks against the query keywords.
    bm25_scores = calculate_bm25_score(
        query=query,
        documents=texts
    )

    documents = []

    for index, point in enumerate(results.points):

        text = point.payload.get("text", "")

        vector_score = point.score

        keyword_score = bm25_scores[index]

        # Weighted sum: 70% semantic similarity, 30% keyword relevance.
        hybrid_score = (
            vector_score * 0.7
        ) + (
            keyword_score * 0.3
        )

        documents.append({
            "text": text,
            "vector_score": round(vector_score, 4),
            "bm25_score": round(float(keyword_score), 4),
            "hybrid_score": round(float(hybrid_score), 4)
        })

    # Highest hybrid score first.
    documents = sorted(
        documents,
        key=lambda x: x["hybrid_score"],
        reverse=True
    )

    return documents
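
Vector similarity scores usually fall in a narrow range, while BM25 scores are unbounded, so a raw weighted sum can let the keyword side dominate regardless of the 0.7/0.3 weights. One common remedy is to min-max normalize each score list before combining them. The following is a sketch under that assumption, not the method used in `search_qdrant`; `min_max` and `fuse_scores` are hypothetical helpers, and the scores are invented.

```python
def min_max(scores: list[float]) -> list[float]:
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    if not scores:
        return []
    low, high = min(scores), max(scores)
    if high == low:
        return [0.0 for _ in scores]
    return [(s - low) / (high - low) for s in scores]


def fuse_scores(
    vector_scores: list[float],
    bm25_scores: list[float],
    vector_weight: float = 0.7,
) -> list[float]:
    """Weighted sum of the two score lists after min-max normalization."""
    if len(vector_scores) != len(bm25_scores):
        raise ValueError("score lists must have the same length")
    keyword_weight = 1.0 - vector_weight
    norm_vec = min_max(vector_scores)
    norm_kw = min_max(bm25_scores)
    return [
        vector_weight * v + keyword_weight * k
        for v, k in zip(norm_vec, norm_kw)
    ]


# Invented scores for three candidate chunks.
fused = fuse_scores([0.81, 0.76, 0.50], [0.4, 4.8, 0.0])
print([round(s, 4) for s in fused])
```

After normalization both signals contribute on the same 0 to 1 scale, so the weights mean what they say. Because `search_qdrant` requests only `top_k` candidates, fusion can only reorder those same chunks; requesting more candidates than you intend to return and truncating after fusion gives the keyword score more room to help.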

How Hybrid Ranking Works

Suppose the query is:

What is pankaj phone number?

Retrieved chunks:

Chunk              | Vector Score | BM25 Score
Resume Summary     | 0.81         | 0.4
Phone Number Chunk | 0.76         | 4.8

Hybrid score formula:

hybrid_score = (
    vector_score * 0.7
) + (
    bm25_score * 0.3
)

Final ranking:

Chunk              | Hybrid Score
Phone Number Chunk | 1.97 ✅
Resume Summary     | 0.69

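The phone number chunk now ranks highest even though its vector score is lower, because its BM25 score reflects the exact keyword matches.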
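
The weighted sum can be checked with a few lines of Python, using the vector and BM25 scores from the example:

```python
# Scores taken from the example above.
chunks = {
    "Resume Summary": {"vector": 0.81, "bm25": 0.4},
    "Phone Number Chunk": {"vector": 0.76, "bm25": 4.8},
}

# Apply the 0.7 / 0.3 weighting to each chunk.
hybrid = {
    name: s["vector"] * 0.7 + s["bm25"] * 0.3
    for name, s in chunks.items()
}

# Print chunks from highest to lowest hybrid score.
for name, score in sorted(hybrid.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.2f}")
# Phone Number Chunk: 1.97
# Resume Summary: 0.69
```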


Step 6 — Create FastAPI Endpoint

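from fastapi import FastAPI

app = FastAPI()


# A plain "def" lets FastAPI run the blocking search in a worker thread
# instead of stalling the event loop.
@app.get("/search-docs")
def search_docs(
    query: str
):

    results = search_qdrant(query)

    return {
        "query": query,
        "results": results
    }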

Run the server (this command assumes the code lives in app/main.py):

uvicorn app.main:app --reload

Step 7 — Test Hybrid Search

Test the endpoint in a browser:

http://127.0.0.1:8000/search-docs?query=What is Himanshu phone number
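
Browsers encode the spaces in this URL automatically, but scripts and some HTTP clients do not. A minimal sketch of building the same request URL with the Python standard library, assuming the server is listening on the host and port shown above:

```python
from urllib.parse import urlencode

base_url = "http://127.0.0.1:8000/search-docs"
params = {"query": "What is Himanshu phone number"}

# urlencode escapes spaces (as "+") and other reserved characters.
url = f"{base_url}?{urlencode(params)}"
print(url)
# http://127.0.0.1:8000/search-docs?query=What+is+Himanshu+phone+number
```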

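Each entry in the "results" list has the following shape (the scores are illustrative and will vary with your data):

{
    "text": "Mobile: (+91) 7579414837",
    "vector_score": 0.3015,
    "bm25_score": 4.921,
    "hybrid_score": 1.687
}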

Why Hybrid Search Matters in Production

Pure semantic search is often insufficient for enterprise systems, where users frequently look up exact identifiers.

Hybrid Search improves:

  • Exact information retrieval
  • Email search
  • Contact search
  • Invoice lookup
  • Healthcare IDs
  • Employee records
  • Legal document search

This becomes critical in:

  • Healthcare Systems
  • Banking Systems
  • Enterprise Knowledge Platforms
  • SaaS AI Applications

Production-Grade Hybrid Search Architecture

An enterprise RAG stack built on this approach might include:

  • FastAPI
  • Qdrant
  • BM25
  • Redis Cache
  • OpenAI/Groq
  • Rerankers

Advanced systems further add:

  • Cross-encoder reranking
  • Metadata filtering
  • Query rewriting
  • Multi-stage retrieval
  • Access control
  • Tenant isolation

Final Thoughts

Hybrid Search is one of the most important upgrades you can make to a RAG system.

By combining:

Semantic Understanding
+
Keyword Precision

you create retrieval systems that are:

  • More accurate
  • More production-ready
  • More scalable
  • More enterprise-friendly

If you are building modern AI applications, Hybrid Search should be considered a foundational architecture pattern rather than an optional enhancement.
