Chapter 1 — What is RAG?

RAG (Retrieval-Augmented Generation)

RAG is an AI architecture where:

LLM + External Knowledge Base

work together.

Instead of relying only on model training data, the AI retrieves real-time context from documents before answering.

Basic Flow

User Question
↓
Convert Question to Embedding
↓
Search Vector Database
↓
Retrieve Relevant Chunks
↓
Send Context to LLM
↓
Generate Final Answer

Why RAG?

RAG solves:

Hallucinations
Outdated AI knowledge
Enterprise document search
Internal company knowledge retrieval
Large document understanding

Chapter 2 — Embeddings

What are Embeddings?

Embeddings are vector representations of text.

Example:

"What is AI?"
↓
[0.123, -0.567, 0.991 ...]

These vectors help find semantic similarity.

What You Learned

Sentence Transformers
Semantic Search
Vector Similarity
Cosine Similarity

Chapter 3 — Vector Database (Qdrant)

What is Qdrant?

Qdrant is a vector database used to store embeddings and metadata.

Your Stored Payload

{
  "text": "...",
  "file_name": "...",
  "page_number": 1,
  "document_id": "...",
  "tenant_id": "...",
  "uploaded_by": "..."
}

Why Metadata Matters

Metadata enables:

Multi-tenancy
Filtering
Citations
Security
Access control

Chapter 4 — Chunking

Why Chunking Exists

LLMs cannot process huge documents directly.

So documents are split into smaller chunks.

Fixed Chunking

Initial Approach

text[i:i+500]

Problem

Breaks sentences
Loses context
Poor retrieval quality

Recursive Chunking

Better Strategy

Using:

RecursiveCharacterTextSplitter

Benefits

Preserves sentence structure
Better semantic meaning
Better retrieval quality

Chapter 5 — Parent-Child Chunking

Concept

Store:

Small child chunks for retrieval
Large parent chunks for context

Flow

Search small chunks
↓
Return larger parent context
↓
LLM gets better understanding

Benefits

Better retrieval precision
Better answer quality
Reduced hallucination

Tradeoff

More context = higher token cost

Chapter 6 — Hybrid Search

What is Hybrid Search?

Combination of:

Vector Search + Keyword Search

Why Needed?

Vector search may fail for:

Exact names
IDs
Phone numbers
Medical codes

Hybrid search improves accuracy.

Chapter 7 — Reranking

Problem

Initial retrieval may return partially relevant chunks.

Solution

Reranker model:

BAAI/bge-reranker-base

Flow

Initial Retrieval
↓
Reranker scores results
↓
Best chunks moved to top

Benefit

Higher retrieval precision.

Chapter 8 — Query Rewriting

Problem

Users ask vague questions.

Example:

"What about his healthcare experience?"

AI Rewrites Query

"What is Himanshu Joshi's healthcare domain experience?"

Benefit

Improves retrieval quality significantly.

Chapter 9 — Redis Caching

Why Caching?

Repeated LLM calls are expensive.

Solution

Store responses in Redis.

Flow

User Query
↓
Check Redis Cache
↓
Cache Hit → Return Fast
↓
Cache Miss → Call LLM

Benefit

Lower latency
Lower cost
Better scalability

Chapter 10 — Conversation Memory

Goal

Maintain chat history.

You Learned

Session-based memory
Redis conversation storage
Context-aware conversations

Chapter 11 — Streaming Responses

Problem

Users wait too long for full answer.

Solution

Token streaming.

Flow

LLM generates tokens
↓
UI receives tokens live
↓
ChatGPT-style streaming

Chapter 12 — Agentic AI

Traditional RAG

Retrieve → Answer

Agentic AI

AI can:

Think
Plan
Use tools
Retry
Make decisions

CrewAI Concepts

Agents
Tasks
Crews
Tool Calling

Chapter 13 — Tool Calling

What is Tool Calling?

LLM can invoke external functions.

Example:

Search Qdrant
Call APIs
Read documents
Use calculator

Your First Agent

Qdrant search agent using CrewAI.

Chapter 14 — RAG Evaluation (RAGAS)

Biggest Learning

A RAG system that “looks good” may fail in production.

What is RAGAS?

Framework to evaluate RAG quality.

Metrics Learned

Faithfulness

Is answer grounded in context?

Context Precision

Were retrieved chunks useful?

Context Recall

Did retrieval find needed information?

Chapter 15 — Citation-Based Answers

Problem

Users ask:

"How do I trust this answer?"

Solution

Return sources with answers.

Example:

Answer:
Phone number is 7579414837.

Source:
Resume.pdf Page 1

Importance

Critical for:

Healthcare AI
Legal AI
Enterprise Search
Compliance systems

Chapter 16 — Context Compression

Retrieve large context
↓
remove unnecessary text
↓
send only useful context to LLM

Why it is needed

After parent-child retrieval, you may send large parent chunks to the LLM.

That improves answer quality, but also increases:

token cost
latency
noise
chance of confusing the LLM

So production systems try to send:

minimum useful context

not maximum context.

Simple example

User asks:

What is Himanshu phone number?

Retrieved parent chunk may contain:

Email, phone, LinkedIn, summary, skills, projects, education...

But LLM only needs:

Mobile: (+91) 7579414837

Context compression keeps only the useful part.

Production flow

User Query
↓
Hybrid Search
↓
Reranking
↓
Parent Context Expansion
↓
Context Compression
↓
LLM Final Answer

Common production techniques

1. Extractive compression

Keep only relevant sentences from retrieved chunks.

Example:

Question: phone number?
Keep sentence containing "Mobile"

Fast and cheaper.

2. LLM-based compression

Ask an LLM:

From this context, keep only information useful for answering the question.

More accurate, but adds cost.

3. Reranker-based compression

Use reranker scores to keep only top chunks/sentences.

Good production balance.

4. Token-budget compression

Limit context to a fixed token budget.

Example:

Only send max 3000 tokens to LLM

Production recommendation

For your project, start with:

extractive sentence compression

because it is:

simple
fast
cheap
easy to debug

Then later add LLM-based compression.

Key idea

Parent-child retrieval gives more context.

Context compression removes unnecessary context.

Together:

Parent-child retrieval = better understanding
Context compression = lower cost and less noise

Chapter 16 — Production Learnings

Major Realizations

RAG is not just prompting.

Modern AI systems involve:

Retrieval Engineering
Vector Databases
Search Architecture
Evaluation Pipelines
AI Observability
Distributed Systems
Cost Optimization
Context Engineering

Technologies You Worked With

Backend

FastAPI
Python AsyncIO
Redis
Qdrant

AI Stack

Groq
CrewAI
Sentence Transformers
RAGAS
LangChain

AI Techniques

Hybrid Search
Parent-Child Retrieval
Reranking
Query Rewriting
Citation-based Answers

Github Link : https://github.com/Pankajthapa4/production-ai-playground

Linkdin Profile : www.linkedin.com/in/pankaj-thapa-0ba85055