Chapter 1 — What is RAG?
RAG (Retrieval-Augmented Generation)
RAG is an AI architecture in which an LLM and an external knowledge base work together.
Instead of relying only on the model's training data, the system retrieves relevant context from documents at query time before answering.
Basic Flow
User Question
↓
Convert Question to Embedding
↓
Search Vector Database
↓
Retrieve Relevant Chunks
↓
Send Context to LLM
↓
Generate Final Answer
Why RAG?
RAG helps with:
- Hallucinations
- Outdated AI knowledge
- Enterprise document search
- Internal company knowledge retrieval
- Large document understanding
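The basic flow above can be sketched end to end with toy stand-ins. This is a minimal illustration, not a production pipeline: `embed` is a bag-of-words counter standing in for a real embedding model, the "vector database" is a plain list, and the final LLM call is left out, so the sketch stops at the prompt that would be sent. All function names here are hypothetical.

```python
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def score(a, b):
    """Dot product of two sparse count vectors."""
    return sum(count * b[word] for word, count in a.items())


def retrieve(question, chunks, top_k=2):
    """Rank stored chunks against the question and keep the best top_k."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: score(q, embed(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(question, context_chunks):
    """Assemble the text that would be sent to the LLM."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


chunks = [
    "Qdrant stores embeddings and metadata.",
    "Redis caches LLM responses.",
    "FastAPI serves the backend.",
]
question = "What stores embeddings?"
prompt = build_prompt(question, retrieve(question, chunks, top_k=1))
```

In a real system, `embed` would call an embedding model and `retrieve` would query the vector database; the shape of the flow stays the same.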
Chapter 2 — Embeddings
What are Embeddings?
Embeddings are vector representations of text.
Example:
"What is AI?"
↓
[0.123, -0.567, 0.991 ...]
These vectors help find semantic similarity.
What You Learned
- Sentence Transformers
- Semantic Search
- Vector Similarity
- Cosine Similarity
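Cosine similarity, the last item above, compares the direction of two vectors and ignores their length. A minimal pure-Python sketch follows; real systems usually let the vector database or a numeric library compute this, and the short vectors below are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction; treat it as "not similar".
        return 0.0
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])
```

Values close to 1 mean the two texts point in the same semantic direction, values near 0 mean they are unrelated, and negative values mean they point in opposite directions.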
Chapter 3 — Vector Database (Qdrant)
What is Qdrant?
Qdrant is a vector database used to store embeddings and metadata.
Your Stored Payload
{
"text": "...",
"file_name": "...",
"page_number": 1,
"document_id": "...",
"tenant_id": "...",
"uploaded_by": "..."
}
Why Metadata Matters
Metadata enables:
- Multi-tenancy
- Filtering
- Citations
- Security
- Access control
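To see why metadata enables multi-tenancy and access control, here is a small in-memory sketch of payload filtering. It is not the Qdrant API: in Qdrant the equivalent condition would be passed as a filter on the search request, so that only matching points are considered. The helper name `filter_by_payload` and the sample points are hypothetical.

```python
def filter_by_payload(points, **required):
    """Keep only points whose payload matches every required key/value."""
    return [
        p for p in points
        if all(p["payload"].get(key) == value for key, value in required.items())
    ]


# Illustrative points using the payload fields listed above.
points = [
    {"id": 1, "payload": {"tenant_id": "t1", "file_name": "a.pdf", "page_number": 1}},
    {"id": 2, "payload": {"tenant_id": "t2", "file_name": "b.pdf", "page_number": 3}},
    {"id": 3, "payload": {"tenant_id": "t1", "file_name": "c.pdf", "page_number": 2}},
]

# A tenant should only ever see its own chunks.
tenant_points = filter_by_payload(points, tenant_id="t1")
```

The same payload fields (`file_name`, `page_number`) are what later make citations possible.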
Chapter 4 — Chunking
Why Chunking Exists
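LLMs have a limited context window, so they cannot process very large documents in one pass.
Documents are therefore split into smaller chunks that can be embedded and retrieved individually.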
Fixed Chunking
Initial Approach
text[i:i+500]
Problem
- Breaks sentences
- Loses context
- Poor retrieval quality
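The `text[i:i+500]` slice above, written out as a complete function, makes the problem easy to reproduce. This is a minimal sketch; the small `size` in the example is chosen only to make the mid-word cuts visible.

```python
def fixed_chunks(text, size=500):
    """Cut text into consecutive slices of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


sample = "RAG retrieves context. Then the LLM answers."
pieces = fixed_chunks(sample, size=10)
# Nothing is lost, but boundaries fall wherever the counter lands,
# including in the middle of words and sentences.
```

Because the boundaries ignore sentence structure, a chunk can start or end halfway through a thought, which is what hurts retrieval.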
Recursive Chunking
Better Strategy
Using:
RecursiveCharacterTextSplitter
Benefits
- Preserves sentence structure
- Better semantic meaning
- Better retrieval quality
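The idea behind recursive splitting can be sketched in plain Python: try the coarsest separator first (paragraphs), and fall back to finer ones (lines, sentences, words) only for pieces that are still too long. This is a simplified illustration of the strategy, not LangChain's implementation; the function name, the separator list, and the choice to drop separators at chunk boundaries are assumptions of this sketch.

```python
def recursive_split(text, max_len=500, separators=("\n\n", "\n", ". ", " ")):
    """Split on coarse separators first, recursing into finer ones when needed."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard cut.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_len:
            current = candidate  # keep merging small pieces together
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_len:
            current = piece
        else:
            # This piece alone is too long: retry with finer separators.
            chunks.extend(recursive_split(piece, max_len, finer))
    if current:
        chunks.append(current)
    return [c for c in chunks if c.strip()]


text = "RAG retrieves context. The LLM answers. Sources are cited."
sentences = recursive_split(text, max_len=30)
```

With `max_len=30`, the sample text breaks at sentence boundaries instead of mid-word, which is the benefit listed above.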
Chapter 5 — Parent-Child Chunking
Concept
Store:
- Small child chunks for retrieval
- Large parent chunks for context
Flow
Search small chunks
↓
Return larger parent context
↓
LLM gets better understanding
Benefits
- Better retrieval precision
- Better answer quality
- Reduced hallucination
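The search-small, return-large flow above comes down to a lookup from each retrieved child chunk to its parent. A minimal sketch, assuming each child carries a `parent_id` field (a hypothetical name, not part of the payload shown earlier) and that parents are kept in a dictionary:

```python
def expand_to_parents(child_hits, parents):
    """Map ranked child hits to parent texts, de-duplicated, in rank order."""
    seen, result = set(), []
    for hit in child_hits:
        parent_id = hit["parent_id"]
        if parent_id not in seen:
            seen.add(parent_id)
            result.append(parents[parent_id])
    return result


# Illustrative data: two parents, three child hits in ranked order.
parents = {
    "p1": "Full text of section 1 ...",
    "p2": "Full text of section 2 ...",
}
child_hits = [
    {"text": "child from section 2", "parent_id": "p2"},
    {"text": "child from section 1", "parent_id": "p1"},
    {"text": "another child from section 2", "parent_id": "p2"},
]
context = expand_to_parents(child_hits, parents)
```

De-duplicating matters because several children often point to the same parent, and sending that parent twice would only add token cost.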
Tradeoff
More context = higher token cost
Chapter 6 — Hybrid Search
What is Hybrid Search?
Combination of:
Vector Search + Keyword Search
Why Needed?
Vector search may fail for:
- Exact names
- IDs
- Phone numbers
- Medical codes
Hybrid search improves accuracy.
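One common way to combine a vector ranking with a keyword ranking is reciprocal rank fusion (RRF). The notes do not say which fusion method was used, so treat this as one option rather than the project's method. The document IDs are illustrative, and `k=60` is a conventional default for RRF.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs; each hit adds 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


vector_hits = ["a", "b", "c"]   # ranked by embedding similarity
keyword_hits = ["c", "a", "d"]  # ranked by exact keyword match
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Documents found by both searches rise to the top, while an exact-match hit such as an ID or phone number still appears even when the vector search missed it.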
Chapter 7 — Reranking
Problem
Initial retrieval may return partially relevant chunks.
Solution
Reranker model:
BAAI/bge-reranker-base
Flow
Initial Retrieval
↓
Reranker scores results
↓
Best chunks moved to top
Benefit
Higher retrieval precision.
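The reranking step can be sketched with a pluggable scoring function. In the real pipeline the score would come from the reranker model; here `overlap_score` is a deliberately crude, hypothetical stand-in so the sketch runs without any model, and the sample chunks are made up.

```python
import re


def overlap_score(query, chunk):
    """Stand-in scorer: fraction of query words that appear in the chunk."""
    q = set(re.findall(r"\w+", query.lower()))
    c = set(re.findall(r"\w+", chunk.lower()))
    return len(q & c) / len(q) if q else 0.0


def rerank(query, chunks, score_fn=overlap_score, top_n=3):
    """Score every retrieved chunk against the query and keep the best top_n."""
    ranked = sorted(chunks, key=lambda chunk: score_fn(query, chunk), reverse=True)
    return ranked[:top_n]


retrieved = [
    "Education details",
    "Worked on healthcare claims systems",
    "Healthcare domain experience summary",
]
best = rerank("healthcare experience", retrieved, top_n=2)
```

Swapping `score_fn` for a function that calls the reranker model leaves the rest of the flow unchanged.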
Chapter 8 — Query Rewriting
Problem
Users ask vague questions.
Example:
"What about his healthcare experience?"
AI Rewrites Query
"What is Himanshu Joshi's healthcare domain experience?"
Benefit
Improves retrieval quality significantly.
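Query rewriting is usually done by asking an LLM to make the question self-contained using the chat history. A minimal sketch of building such a prompt follows; the prompt wording and the function name are assumptions, and the LLM call itself is omitted.

```python
def build_rewrite_prompt(history, question):
    """Build a prompt asking an LLM to resolve pronouns and vague references."""
    conversation = "\n".join(f"{role}: {text}" for role, text in history)
    return (
        "Rewrite the final question so it is fully self-contained, "
        "replacing pronouns with the names they refer to.\n\n"
        f"Conversation:\n{conversation}\n\n"
        f"Final question: {question}\n"
        "Rewritten question:"
    )


history = [
    ("user", "Tell me about Himanshu Joshi."),
    ("assistant", "Himanshu Joshi is the candidate described in the resume."),
]
prompt = build_rewrite_prompt(history, "What about his healthcare experience?")
```

The rewritten question, not the original, is then embedded and used for retrieval.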
Chapter 9 — Redis Caching
Why Caching?
Repeated LLM calls are expensive.
Solution
Store responses in Redis.
Flow
User Query
↓
Check Redis Cache
↓
Cache Hit → Return Fast
↓
Cache Miss → Call LLM
Benefit
- Lower latency
- Lower cost
- Better scalability
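The cache flow above follows the cache-aside pattern. In this sketch a plain dictionary stands in for Redis so the example runs on its own; with Redis the same logic would use a get and a set, typically with an expiry. The key format and the `call_llm` parameter are hypothetical.

```python
import hashlib


def cache_key(query):
    """Normalize the query so trivial variations share one cache entry."""
    normalized = " ".join(query.lower().split())
    return "rag:" + hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def answer_with_cache(query, cache, call_llm):
    """Return (answer, was_cache_hit); only call the LLM on a miss."""
    key = cache_key(query)
    if key in cache:
        return cache[key], True
    answer = call_llm(query)
    cache[key] = answer  # store for the next identical question
    return answer, False


calls = []


def fake_llm(query):
    """Stand-in for an expensive LLM call; records how often it runs."""
    calls.append(query)
    return "RAG combines retrieval with generation."


cache = {}
first = answer_with_cache("What is RAG?", cache, fake_llm)
second = answer_with_cache("  what is rag? ", cache, fake_llm)
```

The second call is served from the cache, so the expensive function runs only once.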
Chapter 10 — Conversation Memory
Goal
Maintain chat history.
You Learned
- Session-based memory
- Redis conversation storage
- Context-aware conversations
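Session-based memory can be sketched as one bounded history per session ID. This in-memory version is only an illustration: the notes store conversations in Redis, and the class name and `max_turns` parameter here are hypothetical.

```python
from collections import defaultdict, deque


class SessionMemory:
    """Keeps the most recent messages for each session."""

    def __init__(self, max_turns=10):
        # A bounded deque drops the oldest message once the limit is reached.
        self._sessions = defaultdict(lambda: deque(maxlen=max_turns))

    def add(self, session_id, role, text):
        self._sessions[session_id].append((role, text))

    def history(self, session_id):
        return list(self._sessions[session_id])


memory = SessionMemory(max_turns=2)
memory.add("s1", "user", "Who is Himanshu?")
memory.add("s1", "assistant", "He is the candidate in the resume.")
memory.add("s1", "user", "What about his healthcare experience?")
memory.add("s2", "user", "Unrelated question")
```

Capping the history keeps the prompt within a predictable token budget, and separate session IDs keep conversations from leaking into each other.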
Chapter 11 — Streaming Responses
Problem
Users wait too long for the full answer.
Solution
Token streaming.
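In Python, token streaming maps naturally onto a generator: each token is yielded as soon as it exists instead of being collected into one final string. The sketch below uses a hard-coded token list in place of a real LLM stream; in a FastAPI backend, a generator like this is typically what a streaming response wraps. The function names are hypothetical.

```python
def stream_answer(tokens):
    """Yield tokens one at a time, as a streaming LLM client would."""
    for token in tokens:
        yield token


def consume(stream, on_token):
    """Forward each token to the UI callback and also build the full answer."""
    parts = []
    for token in stream:
        on_token(token)  # the UI can render this immediately
        parts.append(token)
    return "".join(parts)


received = []
full_answer = consume(stream_answer(["RAG ", "works", "."]), received.append)
```

The user sees the first token almost immediately, even though the total generation time is unchanged.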
Flow
LLM generates tokens
↓
UI receives tokens live
↓
ChatGPT-style streaming
Chapter 12 — Agentic AI
Traditional RAG
Retrieve → Answer
Agentic AI
AI can:
- Think
- Plan
- Use tools
- Retry
- Make decisions
CrewAI Concepts
- Agents
- Tasks
- Crews
- Tool Calling
Chapter 13 — Tool Calling
What is Tool Calling?
LLM can invoke external functions.
Example:
- Search Qdrant
- Call APIs
- Read documents
- Use a calculator
Your First Agent
Qdrant search agent using CrewAI.
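Underneath any agent framework, tool calling means that the LLM emits a tool name plus arguments and the application runs the matching function. The sketch below shows only that dispatch step. It is not the CrewAI API, and the tool names and the shape of the `call` dictionary are assumptions.

```python
def add(a, b):
    """A trivial calculator tool."""
    return a + b


def word_count(text):
    """A trivial document tool."""
    return len(text.split())


# Registry of functions the model is allowed to invoke.
TOOLS = {"add": add, "word_count": word_count}


def run_tool_call(call, tools):
    """Execute a tool call of the form {"name": ..., "arguments": {...}}."""
    name = call["name"]
    if name not in tools:
        raise ValueError(f"Unknown tool: {name}")
    return tools[name](**call["arguments"])


# A call as the LLM might emit it after deciding to use the calculator.
result = run_tool_call({"name": "add", "arguments": {"a": 2, "b": 3}}, TOOLS)
```

Rejecting unknown tool names keeps the model limited to functions the application has explicitly exposed.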
Chapter 14 — RAG Evaluation (RAGAS)
Biggest Learning
A RAG system that “looks good” may fail in production.
What is RAGAS?
Framework to evaluate RAG quality.
Metrics Learned
Faithfulness
Is answer grounded in context?
Context Precision
Were retrieved chunks useful?
Context Recall
Did retrieval find needed information?
Chapter 15 — Citation-Based Answers
Problem
Users ask:
"How do I trust this answer?"
Solution
Return sources with answers.
Example:
Answer:
Phone number is 7579414837.
Source:
Resume.pdf Page 1
Importance
Critical for:
- Healthcare AI
- Legal AI
- Enterprise Search
- Compliance systems
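Citations like the example above can be built directly from the `file_name` and `page_number` fields stored in each chunk's payload. A minimal sketch, with a hypothetical function name and an output layout that mirrors the example:

```python
def format_answer_with_sources(answer, chunks):
    """Append a de-duplicated source list built from chunk metadata."""
    sources = []
    for chunk in chunks:
        label = f"{chunk['file_name']} Page {chunk['page_number']}"
        if label not in sources:
            sources.append(label)
    return f"Answer:\n{answer}\nSource:\n" + "\n".join(sources)


# Two retrieved chunks from the same page should produce one citation.
used_chunks = [
    {"text": "...", "file_name": "Resume.pdf", "page_number": 1},
    {"text": "...", "file_name": "Resume.pdf", "page_number": 1},
]
response = format_answer_with_sources("Phone number is 7579414837.", used_chunks)
```

Only chunks that were actually sent to the LLM should be cited, otherwise the source list overstates what the answer is grounded in.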
Chapter 16 — Context Compression
Retrieve large context
↓
remove unnecessary text
↓
send only useful context to LLM
Why it is needed
After parent-child retrieval, you may send large parent chunks to the LLM.
That improves answer quality, but also increases:
- token cost
- latency
- noise
- chance of confusing the LLM
So production systems try to send the minimum useful context, not the maximum context.
Simple example
User asks:
What is Himanshu's phone number?
Retrieved parent chunk may contain:
Email, phone, LinkedIn, summary, skills, projects, education...
But the LLM only needs:
Mobile: (+91) 7579414837
Context compression keeps only the useful part.
Production flow
User Query
↓
Hybrid Search
↓
Reranking
↓
Parent Context Expansion
↓
Context Compression
↓
LLM Final Answer
Common production techniques
1. Extractive compression
Keep only relevant sentences from retrieved chunks.
Example:
Question: phone number?
Keep sentence containing "Mobile"
Fast and cheap.
2. LLM-based compression
Ask an LLM:
From this context, keep only information useful for answering the question.
More accurate, but adds cost.
3. Reranker-based compression
Use reranker scores to keep only top chunks/sentences.
Good production balance.
4. Token-budget compression
Limit context to a fixed token budget.
Example:
Only send max 3000 tokens to LLM
Production recommendation
For your project, start with:
extractive sentence compression
because it is:
- simple
- fast
- cheap
- easy to debug
Then later add LLM-based compression.
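Extractive sentence compression can start as simple keyword matching: split the retrieved context into sentences or lines and keep only those that share a keyword with the question. The sketch below is one minimal way to do it. The stopword list, the `extra_keywords` parameter (used to map "phone" to "Mobile"), and the placeholder lines in the sample context are all assumptions for illustration.

```python
import re

# Small illustrative stopword list; a real one would be longer.
STOPWORDS = {"what", "is", "the", "a", "an", "of", "for", "his", "her"}


def extractive_compress(question, context, extra_keywords=()):
    """Keep only sentences or lines that share a keyword with the question."""
    keywords = {
        word for word in re.findall(r"\w+", question.lower())
        if word not in STOPWORDS
    }
    keywords |= {word.lower() for word in extra_keywords}
    # Split on sentence-ending punctuation or on line breaks.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+|\n+", context) if s.strip()]
    return [s for s in sentences if keywords & set(re.findall(r"\w+", s.lower()))]


parent_chunk = (
    "Email: placeholder@example.com\n"
    "Mobile: (+91) 7579414837\n"
    "Skills: Python, FastAPI"
)
kept = extractive_compress(
    "What is Himanshu phone number?", parent_chunk, extra_keywords=["mobile"]
)
```

Because the logic is a plain keyword match, it is easy to inspect exactly why a sentence was kept or dropped, which is the debugging advantage mentioned above.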
Key idea
Parent-child retrieval gives more context.
Context compression removes unnecessary context.
Together:
Parent-child retrieval = better understanding
Context compression = lower cost and less noise
Chapter 17 — Production Learnings
Major Realizations
RAG is not just prompting.
Modern AI systems involve:
- Retrieval Engineering
- Vector Databases
- Search Architecture
- Evaluation Pipelines
- AI Observability
- Distributed Systems
- Cost Optimization
- Context Engineering
Technologies You Worked With
Backend
- FastAPI
- Python AsyncIO
- Redis
- Qdrant
AI Stack
- Groq
- CrewAI
- Sentence Transformers
- RAGAS
- LangChain
AI Techniques
- Hybrid Search
- Parent-Child Retrieval
- Reranking
- Query Rewriting
- Citation-based Answers
GitHub Link: https://github.com/Pankajthapa4/production-ai-playground
LinkedIn Profile: www.linkedin.com/in/pankaj-thapa-0ba85055
