Ranking in RAG: Easy Python Example & Comparison to Normal RAG

If you’re new to the world of AI and large language models, the idea of ranking in Retrieval‑Augmented Generation (RAG) might sound a little intimidating. Don’t worry: this blog post explains it in simple, clear terms, and by the end you’ll understand:

  • What ranking in RAG means
  • The different types of ranking or re-ranking used in RAG systems
  • A code example in Python to illustrate how ranking is done
  • A comparison between a “normal” RAG pipeline and a RAG pipeline with ranking

Let’s jump in!


What is RAG?

First, let’s set the scene. What is RAG?

The term Retrieval-Augmented Generation (RAG) refers to a technique where a large language model (LLM) is helped by a retrieval system:

  • The system retrieves relevant documents or text chunks from some external knowledge base.
  • Then the model uses that retrieved information plus the query to generate an answer or response.
  • The key benefit: the LLM can leverage up-to-date facts or domain-specific content without needing to be entirely retrained.

In simple terms: you ask a question → the system fetches helpful documents → the model writes an answer informed by those documents.
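That three-step flow can be sketched in a few lines of plain Python. This is a toy illustration only: the `retrieve` function below is a naive keyword-overlap scorer and `generate` just assembles the prompt an LLM would receive — a real system would use an embedding model and an actual LLM call.

```python
def retrieve(query, knowledge_base, k=2):
    # Naive keyword-overlap retrieval: score each document by shared words.
    q_words = set(query.lower().split())
    scored = [(doc, len(q_words & set(doc.lower().split()))) for doc in knowledge_base]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in scored[:k]]

def generate(query, context_docs):
    # Stand-in for an LLM call: just build the prompt it would see.
    context = "\n".join(context_docs)
    return f"Answer '{query}' using:\n{context}"

kb = [
    "RAG combines retrieval with generation.",
    "Pizza originated in Italy.",
    "Retrieval fetches relevant documents for the model.",
]
query = "What does retrieval do?"
docs = retrieve(query, kb)
print(generate(query, docs))
```

The structure is the point here: fetch first, then generate with the fetched material in the prompt.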


What is Ranking in RAG?

Now that we know what RAG is, what do we mean by ranking (or often re-ranking) in that context?

In the retrieval stage of a RAG pipeline, the system generally retrieves many candidate documents (or chunks) in response to the query. But not all retrieved items are equally good. So:

  • Ranking means sorting / ordering those retrieved candidates by how relevant they are to the query.
  • Re-ranking means doing a second pass (often using a more powerful model) to refine the relevance ordering of those retrieved items.

Why is this important? Because if you feed many mediocre documents to the LLM, it might generate a fuzzy or incorrect answer. Better inputs → better generation.
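Stripped to its essentials, ranking is just sorting candidates by a relevance score and keeping the best ones. The documents and scores below are made up for illustration:

```python
# Retrieved candidates come back paired with rough relevance scores;
# ranking orders them so the best evidence reaches the LLM first.
candidates = [
    ("Loosely related trivia", 0.31),
    ("Directly answers the query", 0.92),
    ("Background context", 0.55),
]
ranked = sorted(candidates, key=lambda item: item[1], reverse=True)
top_docs = [doc for doc, score in ranked[:2]]
print(top_docs)
```

Re-ranking follows the same pattern, except the scores come from a second, stronger model rather than the retriever.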

As one article explains:

“Reranking in RAG… refers to the process of reordering or refining a set of initially retrieved documents based on their relevance to a user’s query.”


Types of Ranking in RAG

Let’s break down some common types or methods of ranking / re-ranking in RAG pipelines:

| Type | What it does | Notes |
|---|---|---|
| Initial retrieval ranking | The first set of documents returned by a retriever (e.g., BM25, vector search) is ordered by a basic score. | Fast, but may not be very precise. |
| Re-ranking (second-pass ranking) | A stronger model (e.g., a cross-encoder, BERT-based) takes the top candidates and produces refined scores and ordering. | Improves relevance but adds cost/time. |
| Contextual / list-wise ranking | Considers not just each document alone but how the documents work together (i.e., list dependencies). | More advanced; useful for complex queries. |
| Selection / dynamic ranking | Instead of a fixed top-K, passages are selected dynamically based on ranking or a score cutoff; recent research suggests “selection” can sometimes replace ranking. | Emerging area. |
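To make the last row of the table concrete, here is a minimal sketch of selection / dynamic ranking: instead of always keeping a fixed top-K, keep every candidate whose score clears a threshold. The documents, scores, and cutoff are all illustrative:

```python
# Dynamic selection: keep candidates above a relevance threshold
# rather than a fixed top-K count.
scored_docs = [
    ("Highly relevant passage", 0.88),
    ("Somewhat relevant passage", 0.64),
    ("Marginal passage", 0.22),
]
THRESHOLD = 0.5  # illustrative cutoff; tuned per application in practice
selected = [doc for doc, score in scored_docs if score >= THRESHOLD]
print(selected)
```

The practical effect: easy queries may pass fewer documents to the LLM, hard queries more, instead of a one-size-fits-all K.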

Code Implementation in Python for Ranking in RAG

Here’s a beginner-friendly Python snippet showing how you could implement ranking (re-ranking) in a simple RAG pipeline. The aim is to illustrate the concept rather than provide a full production system.

# Install (if needed) e.g.:
# pip install sentence-transformers transformers faiss-cpu

from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import numpy as np

# 1. Load embedding model for retrieval
embed_model = SentenceTransformer('all-mpnet-base-v2')

# 2. Suppose we have some documents (chunks)
docs = [
    "The history of pizza dates back to ancient times in Italy.",
    "Python is a programming language used for AI and web dev.",
    "Re-ranking in RAG improves retrieval quality by ordering better documents.",
    "Benefits of RAG include up-to-date info and domain specific knowledge."
]
doc_embeddings = embed_model.encode(docs, convert_to_tensor=True)

# 3. Given a query
query = "What is re-ranking in RAG?"
query_embedding = embed_model.encode(query, convert_to_tensor=True)

# 4. Initial retrieval: compute cosine similarities
cosine_scores = torch.nn.functional.cosine_similarity(query_embedding, doc_embeddings)
top_initial_k = 3
topk_idx = torch.topk(cosine_scores, top_initial_k).indices.tolist()

initial_candidates = [(docs[i], float(cosine_scores[i])) for i in topk_idx]
print("Initial candidates:", initial_candidates)

# 5. Re-ranking: Load a cross-encoder (query + doc) for scoring
tokenizer = AutoTokenizer.from_pretrained('cross-encoder/ms-marco-MiniLM-L-12-v2')
model = AutoModelForSequenceClassification.from_pretrained('cross-encoder/ms-marco-MiniLM-L-12-v2')
model.eval()

reranked = []
for doc, _score in initial_candidates:
    inputs = tokenizer(query, doc, return_tensors='pt', truncation=True, padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    score = logits.squeeze().item()
    reranked.append((doc, score))

# Sort by re-ranked score
reranked_sorted = sorted(reranked, key=lambda x: x[1], reverse=True)
print("Re-ranked top documents:", reranked_sorted)

What this code does:

  • We embed documents and the query using a sentence embedding model (initial retrieval).
  • We pick the top K documents by cosine similarity.
  • Then we run a cross-encoder model (query + doc pair) to get a more accurate relevance score (re-ranking).
  • Finally we sort by this refined score and pick the best document(s) for the generator context.

This follows the standard pattern: retrieval → ranking → generation.
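The one step the snippet leaves out is generation. A minimal sketch of that final step, using hard-coded stand-ins for the re-ranked results (the scores here are illustrative, not output from the cross-encoder above):

```python
# Build the generation prompt from the best re-ranked documents.
# In a full pipeline, `reranked_sorted` would come from the re-ranking step,
# and `prompt` would be sent to an LLM.
reranked_sorted = [
    ("Re-ranking in RAG improves retrieval quality by ordering better documents.", 7.1),
    ("Benefits of RAG include up-to-date info and domain specific knowledge.", 2.3),
]
query = "What is re-ranking in RAG?"
top_k = 1
context = "\n".join(doc for doc, _ in reranked_sorted[:top_k])
prompt = (
    "Answer the question using the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {query}"
)
print(prompt)
```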


Comparison: Normal RAG vs RAG with Ranking

Let’s compare two pipelines to understand the difference:

Normal RAG (without explicit ranking layer)

  1. Retrieve top K documents using embedding or BM25.
  2. Directly pass those documents + query into the LLM for generation.
  3. Pros: simpler, faster.
  4. Cons: may include less relevant documents → lower answer quality, more risk of noise and hallucination.

RAG with Ranking / Re-Ranking

  1. Retrieve top N documents (N > K) to ensure good recall.
  2. Re-rank those N documents with a stronger model to pick top K best.
  3. Pass top K to the LLM for generation.
  4. Pros: improved relevance, better context for the LLM → better answers.
  5. Cons: slightly more complex, more computation/time.
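The N > K idea in the steps above can be sketched in a few lines. The document names, scores, and the `rerank` stand-in below are all hypothetical; a real re-ranker would be a cross-encoder as in the earlier snippet:

```python
# Retrieve a wider pool of N candidates for recall, then re-rank
# and keep only the best K for the LLM.
retrieved = [("doc_a", 0.71), ("doc_b", 0.69), ("doc_c", 0.55),
             ("doc_d", 0.52), ("doc_e", 0.40)]  # N = 5 from the retriever

def rerank(pairs):
    # Stand-in for a cross-encoder: pretend the stronger model
    # assigns these refined scores, promoting doc_b.
    refined = {"doc_a": 0.60, "doc_b": 0.95, "doc_c": 0.58,
               "doc_d": 0.30, "doc_e": 0.20}
    return sorted(((d, refined[d]) for d, _ in pairs),
                  key=lambda pair: pair[1], reverse=True)

top_k = [d for d, _ in rerank(retrieved)[:2]]  # K = 2 go to the LLM
print(top_k)
```

Note how the retriever's ordering (doc_a first) and the re-ranker's ordering (doc_b first) differ: that disagreement is exactly where re-ranking earns its extra cost.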

In short: adding a ranking step helps the RAG pipeline make smarter choices about which documents to give the model, leading to better responses.


Why Ranking Matters (For Beginners)

Here are some intuitive reasons why ranking is important in RAG (simple terms):

  • Imagine you ask a question and the system pulls ten documents, but seven of them are only loosely related. The model will generate from messy input → the answer may be fuzzy.
  • If you instead pick the top 3 highly relevant ones, the model has focused information → clearer answer.
  • Re-ranking helps pick the best of the retrieved rather than just the first retrieved.
  • Especially for large corpora and open-domain queries, the initial retrieval may bring in noisy / marginal results; ranking cleans that up.

Summary & Key Takeaways

  • RAG = retrieval + generation.
  • Ranking (especially re-ranking) = ordering retrieved candidates by relevance so the LLM gets better input.
  • There are different types of ranking (initial vs re-ranking vs selection).
  • A simple Python example shows how you could implement re-ranking.
  • Comparing normal RAG vs RAG with ranking shows the benefit: better relevance = better answers.
  • For beginners: think of ranking as “which documents should the AI look at first” — choosing better ones means the answer is stronger.

Looking Ahead: Where To Go From Here

If you’d like to dive deeper, here are some ideas:

  • Try implementing a full RAG pipeline (with retrieval, ranking, generation) using a library like LangChain or LlamaIndex.
  • Explore trade-offs: how many initial retrieved docs? how many to re-rank? what ranking model to use?
  • Explore domain-specific ranking: when your corpus is specialized (legal, medical), ranking becomes even more important.
  • Monitor metrics: how much does ranking improve answer accuracy, relevance, user satisfaction?