AI RAG Learning Glossary & Practical Journey


Chapter 1 — What is RAG?

RAG (Retrieval-Augmented Generation)

RAG is an AI architecture where:

LLM + External Knowledge Base

work together.

Instead of relying only on what the model learned during training, the system retrieves relevant context from documents at query time, before answering.

Basic Flow

User Question

Convert Question to Embedding

Search Vector Database

Retrieve Relevant Chunks

Send Context to LLM

Generate Final Answer
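The flow above can be sketched end to end in a few lines. This is a toy illustration, not the project's implementation: the embedding is a plain word count, the "vector database" is a Python list, and the final LLM call is left out, so the sketch stops at the prompt that would be sent. All function names here are hypothetical.

```python
from collections import Counter
import math

def embed(text):
    """Toy embedding: word counts. A real system would use an embedding model."""
    return Counter(text.lower().split())

def similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, chunks, top_k=2):
    """Rank stored chunks by similarity to the question and keep the best ones."""
    question_vector = embed(question)
    ranked = sorted(chunks, key=lambda c: similarity(question_vector, embed(c)), reverse=True)
    return ranked[:top_k]

def build_prompt(question, context_chunks):
    """Combine retrieved context and the question into one LLM prompt."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

chunks = [
    "Qdrant stores embeddings and metadata.",
    "Redis caches repeated LLM responses.",
    "Chunking splits documents into smaller pieces.",
]
question = "What does Qdrant store?"
top = retrieve(question, chunks, top_k=1)
prompt = build_prompt(question, top)
```

Swapping `embed` for a real embedding model and the list for a vector database changes the quality of retrieval, but not the shape of the pipeline.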

Why RAG?

RAG helps with:

  • Hallucinations (answers are grounded in retrieved text)
  • Outdated AI knowledge
  • Enterprise document search
  • Internal company knowledge retrieval
  • Large document understanding

Chapter 2 — Embeddings

What are Embeddings?

Embeddings are vector representations of text.

Example:

"What is AI?"

[0.123, -0.567, 0.991 ...]

Texts with similar meaning map to nearby vectors, so comparing vectors finds semantic similarity.
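Cosine similarity is the usual way to compare two embeddings. The sketch below computes it in plain Python. The three-dimensional vectors are hand-made for illustration only; real embeddings from a Sentence Transformers model have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hand-made vectors, invented for illustration only.
ai_question = [0.9, 0.1, 0.0]
ml_question = [0.8, 0.2, 0.1]
cooking_question = [0.0, 0.2, 0.9]

sim_related = cosine_similarity(ai_question, ml_question)
sim_unrelated = cosine_similarity(ai_question, cooking_question)
```

Here `sim_related` comes out much higher than `sim_unrelated`, which is the property semantic search relies on.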

What You Learned

  • Sentence Transformers
  • Semantic Search
  • Vector Similarity
  • Cosine Similarity

Chapter 3 — Vector Database (Qdrant)

What is Qdrant?

Qdrant is a vector database used to store embeddings and metadata.

Your Stored Payload

{
"text": "...",
"file_name": "...",
"page_number": 1,
"document_id": "...",
"tenant_id": "...",
"uploaded_by": "..."
}

Why Metadata Matters

Metadata enables:

  • Multi-tenancy
  • Filtering
  • Citations
  • Security
  • Access control
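To make the multi-tenancy point concrete, here is a plain-Python sketch of filtering search hits by payload fields. Qdrant can apply payload filters inside the query itself, which is what you would normally use; this version only shows the logic. The hit structure, sample values, and `filter_hits` are assumptions made for the example.

```python
def filter_hits(hits, tenant_id, file_name=None):
    """Keep only hits whose payload matches the tenant (and file, if given)."""
    kept = []
    for hit in hits:
        payload = hit["payload"]
        if payload.get("tenant_id") != tenant_id:
            continue  # never show another tenant's data
        if file_name is not None and payload.get("file_name") != file_name:
            continue
        kept.append(hit)
    return kept

# Invented sample hits using the payload fields listed above.
hits = [
    {"score": 0.92, "payload": {"text": "Leave policy", "file_name": "hr.pdf",
                                "page_number": 3, "tenant_id": "tenant_a"}},
    {"score": 0.90, "payload": {"text": "Salary bands", "file_name": "pay.pdf",
                                "page_number": 1, "tenant_id": "tenant_b"}},
    {"score": 0.75, "payload": {"text": "Travel policy", "file_name": "travel.pdf",
                                "page_number": 7, "tenant_id": "tenant_a"}},
]
visible = filter_hits(hits, tenant_id="tenant_a")
```

Filtering after the search, as done here, can leave fewer results than requested, which is one reason to push the filter into the database query.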

Chapter 4 — Chunking

Why Chunking Exists

LLMs have a limited context window, so they cannot process huge documents directly.

So documents are split into smaller chunks.


Fixed Chunking

Initial Approach

text[i:i+500]

Problem

  • Breaks sentences
  • Loses context
  • Poor retrieval quality
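A small example shows how fixed-size slicing cuts through words and sentences. The chunk size is reduced from 500 to 20 characters so the effect is visible on a short string.

```python
def fixed_chunks(text, size):
    """Slice text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]

text = "RAG retrieves context first. Then the LLM answers."
chunks = fixed_chunks(text, 20)
# chunks == ["RAG retrieves contex", "t first. Then the LL", "M answers."]
```

No chunk contains a complete sentence, and the words "context" and "LLM" are split across chunk boundaries.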

Recursive Chunking

Better Strategy

Using:

RecursiveCharacterTextSplitter

Benefits

  • Preserves sentence structure
  • Better semantic meaning
  • Better retrieval quality
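The idea behind recursive splitting can be sketched without any library: try the coarsest separator first, and fall back to finer ones only when a piece is still too long. This is a simplified illustration of the strategy, not LangChain's RecursiveCharacterTextSplitter; the separator list and function names are assumptions.

```python
def split_keep(text, sep):
    """Split on sep but keep sep attached to the end of each piece."""
    parts = text.split(sep)
    pieces = [p + sep for p in parts[:-1]] + [parts[-1]]
    return [p for p in pieces if p]

def recursive_split(text, max_len, separators=("\n\n", ". ", " ")):
    """Split text into chunks of at most max_len characters, preferring coarse separators."""
    text = text.strip()
    if len(text) <= max_len:
        return [text] if text else []
    if not separators:
        # Nothing left to split on: fall back to a hard character cut.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in split_keep(text, sep):
        if len((current + piece).strip()) <= max_len:
            current += piece  # the piece still fits in the current chunk
            continue
        if current.strip():
            chunks.append(current.strip())
        if len(piece.strip()) <= max_len:
            current = piece
        else:
            # The piece alone is too long: split it with finer separators.
            chunks.extend(recursive_split(piece, max_len, finer))
            current = ""
    if current.strip():
        chunks.append(current.strip())
    return chunks

text = "RAG retrieves context first. Then the LLM answers. Sources are cited."
chunks = recursive_split(text, max_len=30)
```

With a 30-character limit, each chunk ends on a sentence boundary instead of in the middle of a word.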

Chapter 5 — Parent-Child Chunking

Concept

Store:

  • Small child chunks for retrieval
  • Large parent chunks for context

Flow

Search small chunks

Return larger parent context

LLM gets better understanding

Benefits

  • Better retrieval precision
  • Better answer quality
  • Reduced hallucination

Tradeoff

More context = higher token cost
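A minimal sketch of the parent-child lookup: search runs over the small child chunks, and each match is swapped for its parent, with duplicates removed. The word-overlap score is a toy stand-in for vector search, and the sample texts and the `parent_id` field are assumptions for the example.

```python
def word_overlap(query, text):
    """Toy relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def retrieve_parent_context(query, children, parents, top_k=2):
    """Rank child chunks, then return the distinct parents of the best ones."""
    ranked = sorted(children, key=lambda c: word_overlap(query, c["text"]), reverse=True)
    parent_ids = []
    for child in ranked[:top_k]:
        if child["parent_id"] not in parent_ids:
            parent_ids.append(child["parent_id"])  # keep order, skip duplicates
    return [parents[pid] for pid in parent_ids]

# Invented sample data.
parents = {
    "p1": "Contact section: email, phone and LinkedIn details of the candidate.",
    "p2": "Projects section: healthcare claims platform and a RAG chatbot.",
}
children = [
    {"text": "phone and email details", "parent_id": "p1"},
    {"text": "LinkedIn profile link", "parent_id": "p1"},
    {"text": "healthcare claims platform", "parent_id": "p2"},
]
context = retrieve_parent_context("candidate phone details", children, parents, top_k=2)
```

Both top children belong to the same parent, so the LLM receives that parent once rather than twice, which keeps the token cost in check.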

Chapter 6 — Hybrid Search

What is Hybrid Search?

Combination of:

Vector Search + Keyword Search

Why Needed?

Vector search may fail for:

  • Exact names
  • IDs
  • Phone numbers
  • Medical codes

Hybrid search improves accuracy.
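The notes above do not say how the two result lists are merged. One common choice is Reciprocal Rank Fusion (RRF), sketched below as an assumption rather than as the method used in this project. Each document scores 1 / (k + rank) in every list it appears in, and the scores are summed. The document IDs are invented.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of IDs into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_ranking = ["doc_a", "doc_b", "doc_c"]   # from semantic search
keyword_ranking = ["doc_c", "doc_a", "doc_d"]  # from keyword search
fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
# fused == ["doc_a", "doc_c", "doc_b", "doc_d"]
```

Documents found by both searches rise to the top, while a document found by only one search, such as an exact ID match, still makes the list.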


Chapter 7 — Reranking

Problem

Initial retrieval may return partially relevant chunks.

Solution

Reranker model:

BAAI/bge-reranker-base

Flow

Initial Retrieval

Reranker scores results

Best chunks moved to top

Benefit

Higher retrieval precision.
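The reranking step itself is small: score every (query, chunk) pair, sort, and keep the top results. In the sketch below the scoring function is a toy word-overlap stand-in. In a real pipeline it would be replaced by a cross-encoder such as BAAI/bge-reranker-base. The function names and sample chunks are hypothetical.

```python
def toy_score(query, chunk):
    """Stand-in for a reranker model: fraction of query words found in the chunk."""
    query_words = set(query.lower().split())
    return len(query_words & set(chunk.lower().split())) / len(query_words)

def rerank(query, chunks, score_fn, top_n=2):
    """Score each chunk against the query and return the top_n, best first."""
    scored = [(score_fn(query, chunk), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]

# Order as returned by the initial retrieval (invented sample).
initial = [
    "education and certifications",
    "experience in retail analytics",
    "healthcare domain experience at a claims platform",
]
best = rerank("healthcare domain experience", initial, toy_score, top_n=2)
```

The most relevant chunk was last in the initial results and is first after reranking.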


Chapter 8 — Query Rewriting

Problem

Users ask vague questions that only make sense with the earlier conversation.

Example:

"What about his healthcare experience?"

AI Rewrites Query

"What is Himanshu Joshi's healthcare domain experience?"

Benefit

The rewritten query is self-contained, so retrieval no longer depends on resolving "his" from chat history.
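A rewriting step usually amounts to one extra LLM call with the chat history in the prompt. The sketch below shows one possible shape. The prompt wording is an assumption, and `llm` is any callable that takes a prompt string and returns text; a fake one is used here so the example runs without an API key.

```python
def build_rewrite_prompt(history, question):
    """Build a prompt asking the LLM to make the question self-contained."""
    turns = "\n".join(f"{role}: {text}" for role, text in history)
    return (
        "Rewrite the final question so it is self-contained. "
        "Resolve pronouns using the conversation. "
        "Return only the rewritten question.\n\n"
        f"Conversation:\n{turns}\n\nFinal question: {question}"
    )

def rewrite_query(history, question, llm):
    """Return a standalone version of the question; skip the call if there is no history."""
    if not history:
        return question
    return llm(build_rewrite_prompt(history, question)).strip()

def fake_llm(prompt):
    """Hypothetical stand-in for a real LLM client."""
    return " What is Himanshu Joshi's healthcare domain experience?\n"

history = [("user", "Tell me about Himanshu Joshi."), ("assistant", "He is a software engineer.")]
rewritten = rewrite_query(history, "What about his healthcare experience?", fake_llm)
```

The rewritten question, not the original, is what gets embedded and sent to the retriever.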


Chapter 9 — Redis Caching

Why Caching?

Repeated LLM calls are expensive.

Solution

Store responses in Redis.

Flow

User Query

Check Redis Cache

Cache Hit → Return Fast

Cache Miss → Call LLM

Benefit

  • Lower latency
  • Lower cost
  • Better scalability
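The cache flow can be sketched with a plain dictionary standing in for Redis, so the example runs without a server. The key format, the query normalization, and the choice to include the tenant in the key are assumptions for illustration. With real Redis you would typically also set an expiry on each entry.

```python
import hashlib

def make_cache_key(tenant_id, query):
    """Normalize the query and hash it so equivalent questions share a key."""
    normalized = " ".join(query.lower().split())
    digest = hashlib.sha256(f"{tenant_id}:{normalized}".encode("utf-8")).hexdigest()
    return f"rag:answer:{digest}"

def cached_answer(query, tenant_id, cache, llm):
    """Return (answer, cache_hit). `cache` is a dict standing in for Redis."""
    key = make_cache_key(tenant_id, query)
    if key in cache:
        return cache[key], True   # cache hit: no LLM call
    answer = llm(query)           # cache miss: call the LLM and store the result
    cache[key] = answer
    return answer, False

calls = []
def fake_llm(query):
    """Hypothetical LLM stub that records how often it is called."""
    calls.append(query)
    return f"answer to: {query}"

cache = {}
first = cached_answer("What is RAG?", "tenant_a", cache, fake_llm)
second = cached_answer("  what is rag? ", "tenant_a", cache, fake_llm)
```

The second call differs only in case and spacing, so it is served from the cache and the LLM is called once.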

Chapter 10 — Conversation Memory

Goal

Maintain chat history.

You Learned

  • Session-based memory
  • Redis conversation storage
  • Context-aware conversations
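Session-based memory can be sketched as a per-session list of recent turns. The version below keeps everything in process memory and caps the history length. The class name and the `max_turns` limit are assumptions, and a Redis-backed version would store the same turns under a per-session key.

```python
from collections import defaultdict, deque

class SessionMemory:
    """Keeps the most recent turns for each chat session."""

    def __init__(self, max_turns=6):
        # deque(maxlen=...) silently drops the oldest turn when full.
        self._sessions = defaultdict(lambda: deque(maxlen=max_turns))

    def add(self, session_id, role, text):
        self._sessions[session_id].append((role, text))

    def history(self, session_id):
        return list(self._sessions[session_id])

memory = SessionMemory(max_turns=2)
memory.add("s1", "user", "Who is Himanshu Joshi?")
memory.add("s1", "assistant", "A software engineer.")
memory.add("s1", "user", "What about his healthcare experience?")
recent = memory.history("s1")
```

Capping the history keeps the prompt size bounded as the conversation grows.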

Chapter 11 — Streaming Responses

Problem

Users wait too long for the full answer.

Solution

Token streaming.

Flow

LLM generates tokens

UI receives tokens live

ChatGPT-style streaming
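In Python, streaming maps naturally onto generators: the producer yields tokens as they arrive and the consumer handles each one immediately. The sketch below fakes the LLM by yielding one word at a time; `fake_llm_stream` and `stream_to_ui` are hypothetical names. In a FastAPI backend, a generator like this could be wrapped in a StreamingResponse.

```python
def fake_llm_stream(answer):
    """Stand-in for a streaming LLM client: yields one word at a time."""
    words = answer.split(" ")
    for index, word in enumerate(words):
        is_last = index == len(words) - 1
        yield word if is_last else word + " "

def stream_to_ui(token_stream, on_token):
    """Forward each token as soon as it arrives and return the full text at the end."""
    parts = []
    for token in token_stream:
        on_token(token)  # e.g. push to the browser
        parts.append(token)
    return "".join(parts)

received = []
full_answer = stream_to_ui(fake_llm_stream("RAG retrieves context first"), received.append)
```

The user sees the first token after one generation step instead of waiting for the whole answer.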

Chapter 12 — Agentic AI

Traditional RAG

Retrieve → Answer

Agentic AI

AI can:

  • Think
  • Plan
  • Use tools
  • Retry
  • Make decisions

CrewAI Concepts

  • Agents
  • Tasks
  • Crews
  • Tool Calling

Chapter 13 — Tool Calling

What is Tool Calling?

The LLM can invoke external functions.

Examples:

  • Search Qdrant
  • Call APIs
  • Read documents
  • Use a calculator
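Under the hood, tool calling is a dispatch step: the model emits a structured request naming a tool and its arguments, and your code runs the matching function. The sketch below uses an illustrative JSON shape, not the exact format of any provider or of CrewAI. The tool names and `execute_tool_call` are hypothetical.

```python
import json

def add(a, b):
    """Calculator-style tool."""
    return a + b

def search_documents(query):
    """Hypothetical stand-in for a Qdrant search tool."""
    return [f"chunk matching '{query}'"]

TOOLS = {"add": add, "search_documents": search_documents}

def execute_tool_call(raw_call, tools=TOOLS):
    """Parse a JSON tool request and run the matching Python function."""
    call = json.loads(raw_call)
    name = call.get("name")
    if name not in tools:
        # Only registered tools can run; anything else is rejected.
        return {"error": f"unknown tool: {name}"}
    return {"result": tools[name](**call.get("arguments", {}))}

result = execute_tool_call('{"name": "add", "arguments": {"a": 2, "b": 3}}')
unknown = execute_tool_call('{"name": "delete_everything", "arguments": {}}')
```

The result is sent back to the LLM, which then decides whether to answer or call another tool.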

Your First Agent

Qdrant search agent using CrewAI.


Chapter 14 — RAG Evaluation (RAGAS)

Biggest Learning

A RAG system that “looks good” may fail in production.

What is RAGAS?

Framework to evaluate RAG quality.

Metrics Learned

Faithfulness

Is the answer grounded in the retrieved context?

Context Precision

Were the retrieved chunks relevant to the question?

Context Recall

Did retrieval find all the information needed to answer?
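To build intuition for the two retrieval metrics, here is a simplified, set-based version computed from hand-labeled chunk IDs. This is not how RAGAS computes them: RAGAS relies on LLM judgments and its formulas differ. The functions and IDs below are assumptions for illustration only.

```python
def simple_context_precision(retrieved_ids, relevant_ids):
    """Share of retrieved chunks that are actually relevant."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for chunk_id in retrieved_ids if chunk_id in relevant_ids)
    return hits / len(retrieved_ids)

def simple_context_recall(retrieved_ids, relevant_ids):
    """Share of relevant chunks that retrieval managed to find."""
    if not relevant_ids:
        return 1.0  # nothing was needed, so nothing was missed
    hits = sum(1 for chunk_id in relevant_ids if chunk_id in retrieved_ids)
    return hits / len(relevant_ids)

retrieved = ["c1", "c2", "c3", "c4"]   # what the retriever returned
relevant = {"c1", "c3", "c9"}          # hand-labeled ground truth

precision = simple_context_precision(retrieved, relevant)  # 2 of 4 retrieved are relevant
recall = simple_context_recall(retrieved, relevant)        # 2 of 3 relevant were found
```

Low precision means noise is reaching the LLM. Low recall means the answer may be missing from the context entirely.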

Chapter 15 — Citation-Based Answers

Problem

Users ask:

"How do I trust this answer?"

Solution

Return sources with answers.

Example:

Answer:
Phone number is 7579414837.

Source:
Resume.pdf Page 1
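Because each stored chunk carries `file_name` and `page_number` in its payload, the citation can be assembled directly from the chunks used to answer. A minimal sketch, with a hypothetical function name and the output layout copied from the example above:

```python
def format_answer_with_sources(answer, chunks):
    """Append a de-duplicated source list built from chunk metadata."""
    seen, sources = set(), []
    for chunk in chunks:
        key = (chunk["file_name"], chunk["page_number"])
        if key not in seen:
            seen.add(key)
            sources.append(f"{chunk['file_name']} Page {chunk['page_number']}")
    return f"Answer:\n{answer}\n\nSource:\n" + "\n".join(sources)

used_chunks = [
    {"text": "Mobile: (+91) 7579414837", "file_name": "Resume.pdf", "page_number": 1},
    {"text": "Contact details", "file_name": "Resume.pdf", "page_number": 1},
]
response = format_answer_with_sources("Phone number is 7579414837.", used_chunks)
```

Two chunks from the same page produce a single source line.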

Importance

Critical for:

  • Healthcare AI
  • Legal AI
  • Enterprise Search
  • Compliance systems

Chapter 16 — Context Compression

Concept

Retrieve large context

Remove unnecessary text

Send only useful context to LLM

Why it is needed

After parent-child retrieval, you may send large parent chunks to the LLM.

That improves answer quality, but also increases:

  • Token cost
  • Latency
  • Noise
  • Chance of confusing the LLM

So production systems try to send:

minimum useful context

not maximum context.

Simple example

User asks:

What is Himanshu's phone number?

Retrieved parent chunk may contain:

Email, phone, LinkedIn, summary, skills, projects, education...

But LLM only needs:

Mobile: (+91) 7579414837

Context compression keeps only the useful part.

Production flow

User Query

Hybrid Search

Reranking

Parent Context Expansion

Context Compression

LLM Final Answer

Common production techniques

1. Extractive compression

Keep only relevant sentences from retrieved chunks.

Example:

Question: phone number?
Keep sentence containing "Mobile"

Fast and cheap.

2. LLM-based compression

Ask an LLM:

From this context, keep only information useful for answering the question.

More accurate, but adds cost.

3. Reranker-based compression

Use reranker scores to keep only top chunks/sentences.

Good production balance.

4. Token-budget compression

Limit context to a fixed token budget.

Example:

Only send max 3000 tokens to LLM
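Token-budget compression can be sketched as a greedy loop over chunks that are already sorted best first. Token counting here is approximated by counting whitespace-separated words, which is an assumption: a real system would use the tokenizer of the target model. The function name is hypothetical.

```python
def fit_to_budget(chunks, max_tokens, count_tokens=lambda text: len(text.split())):
    """Keep the highest-ranked chunks until the token budget is used up."""
    kept, used = [], 0
    for chunk in chunks:  # assumed to be sorted best first
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break  # stop at the first chunk that does not fit
        kept.append(chunk)
        used += cost
    return kept

ranked_chunks = ["a b c", "d e f g", "h i"]  # 3, 4 and 2 "tokens"
selected = fit_to_budget(ranked_chunks, max_tokens=8)
# selected == ["a b c", "d e f g"]
```

Stopping at the first chunk that does not fit preserves the ranking order. Skipping it and trying smaller chunks would use the budget more fully at the cost of sending lower-ranked text.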

Production recommendation

For your project, start with:

extractive sentence compression

because it is:

  • Simple
  • Fast
  • Cheap
  • Easy to debug

Then later add LLM-based compression.
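A minimal sketch of extractive sentence compression, following the phone number example: split the retrieved context into sentences and keep only those containing a keyword. The keyword list is chosen by hand here, which is an assumption. Apart from the "Mobile" line, the sample context is invented.

```python
import re

def split_sentences(text):
    """Split on sentence-ending punctuation or line breaks."""
    parts = re.split(r"(?<=[.!?])\s+|\n+", text)
    return [part.strip() for part in parts if part.strip()]

def extractive_compress(context, keywords):
    """Keep only sentences that mention at least one keyword (case-insensitive)."""
    lowered = [keyword.lower() for keyword in keywords]
    kept = [
        sentence for sentence in split_sentences(context)
        if any(keyword in sentence.lower() for keyword in lowered)
    ]
    return "\n".join(kept)

parent_chunk = (
    "Mobile: (+91) 7579414837\n"
    "Summary: Engineer with healthcare domain experience. "
    "Skills: Python and FastAPI.\n"
    "Education: details omitted."
)
compressed = extractive_compress(parent_chunk, keywords=["mobile", "phone"])
# compressed == "Mobile: (+91) 7579414837"
```

Note that the question says "phone" while the document says "Mobile", so matching on question words alone would miss it. That gap is where reranker-based or LLM-based compression helps later.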

Key idea

Parent-child retrieval gives more context.

Context compression removes unnecessary context.

Together:

Parent-child retrieval = better understanding
Context compression = lower cost and less noise

Chapter 17 — Production Learnings

Major Realizations

RAG is not just prompting.

Modern AI systems involve:

  • Retrieval Engineering
  • Vector Databases
  • Search Architecture
  • Evaluation Pipelines
  • AI Observability
  • Distributed Systems
  • Cost Optimization
  • Context Engineering

Technologies You Worked With

Backend

  • FastAPI
  • Python AsyncIO
  • Redis
  • Qdrant

AI Stack

  • Groq
  • CrewAI
  • Sentence Transformers
  • RAGAS
  • LangChain

AI Techniques

  • Hybrid Search
  • Parent-Child Retrieval
  • Reranking
  • Query Rewriting
  • Citation-based Answers

GitHub Link: https://github.com/Pankajthapa4/production-ai-playground

LinkedIn Profile: www.linkedin.com/in/pankaj-thapa-0ba85055
