Ep #90: Vector Databases 101 (Part 2): Optimizing RAG for Production
Moving beyond basic RAG: Advanced chunking, Hybrid Search, and how to stop your AI from hallucinating.
Breaking down complex system design components
By Amit Raghuvanshi | The Architect’s Notebook
🗓️ Mar 12, 2026 · Deep Dive
The Chunking Dilemma
In Part 1, we built a basic RAG pipeline. But if you deploy that today, you will hit a wall. The #1 cause of bad RAG performance isn’t the model; it’s the chunking.
The Problem: Consider this legal text:
“Section 3.2: The contractor shall deliver the final report by December 31, 2024.
Payment terms are outlined in Appendix B. The report must include all findings
from the site inspection conducted in October.”

If you naively split at 50 characters:
Chunk 1: “Section 3.2: The contractor shall deliver the rep”
Chunk 2: “ort by December 31, 2024. Payment terms are outli”
Both chunks are semantically broken. The vector for Chunk 1 won’t know about the date. The vector for Chunk 2 won’t know who the “contractor” is.
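To make the failure concrete, here is a minimal sketch of such a naive fixed-size splitter (illustrative only; the clause text mirrors the example above):

```python
def naive_split(text: str, size: int = 50) -> list[str]:
    # Split into fixed-size character chunks, ignoring word
    # and sentence boundaries entirely.
    return [text[i:i + size] for i in range(0, len(text), size)]

clause = ("Section 3.2: The contractor shall deliver the final "
          "report by December 31, 2024. Payment terms are "
          "outlined in Appendix B.")

for chunk in naive_split(clause):
    print(repr(chunk))
# Every boundary lands mid-word or mid-sentence, so no single
# chunk contains the full "who + what + when" of the clause.
```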
Smart Chunking Strategies
Semantic Chunking: Don’t split by character count. Split by sentences or paragraphs. Use an embedding model to calculate similarity between sentences; if the topic shifts, start a new chunk.
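A minimal sketch of that idea, using a toy bag-of-words vector as a stand-in for a real embedding model (in production you would embed each sentence with your embedding model instead); the 0.2 threshold is an arbitrary illustration:

```python
import math

def embed(sentence: str) -> dict[str, float]:
    # Toy stand-in for a sentence-embedding model:
    # a bag-of-words term-frequency vector.
    vec: dict[str, float] = {}
    for word in sentence.lower().split():
        vec[word] = vec.get(word, 0.0) + 1.0
    return vec

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def semantic_chunk(sentences: list[str], threshold: float = 0.2) -> list[str]:
    # Start a new chunk whenever similarity to the previous
    # sentence drops below the threshold (a topic shift).
    chunks, current = [], [sentences[0]]
    for prev, sent in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(sent)) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```

With a real embedding model, consecutive sentences about the same contract clause score high and stay together, while an abrupt topic change starts a new chunk.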
Parent-Child Chunking: Store small chunks (for precise search) but link them to a larger “Parent” document. When you find the small chunk, you feed the Parent to the LLM for better context.
# Small chunk for retrieval
small_chunk = "Payment terms are outlined in Appendix B."
# Link to parent document ID for full context
parent_doc_id = "contract_2024_section3"

Overlapping Windows: Always maintain 10-20% overlap between chunks to prevent information loss at boundaries:
Chunk 1: “...contractor shall deliver the final report by December 31, 2024.”
Chunk 2: “by December 31, 2024. Payment terms are outlined in Appendix B...”

📖 Before You Close This Tab
If you found today’s breakdown valuable, you should look at my new book: The Architecture of Neural Scale: Volume 1(A) - Foundations of AI Systems.
The tech world is currently obsessed with treating LLMs as infinite black boxes. You send a prompt, you get a response. But what happens when that abstraction leaks? What happens when latency spikes or your cloud bill explodes?
This book strips away the vendor magic. It is a ground-up engineering guide to the physical reality of AI. We start at the mathematical foundations of neural networks and plunge straight into the “metal layer.” You will learn how to calculate VRAM, navigate memory bottlenecks, and design the infrastructure required to serve billions of parameters without your systems catching fire. No academic fluff. Just production engineering.
The 10% launch discount expires this weekend. If you are ready to graduate from API wrappers to actual AI architecture, grab your copy below.
Part 2: Vectors Are Bad at Keywords
Suppose a user searches for a specific part number: “Part-XH-992”. Vector search might return “Part-XH-993” because the two strings are semantically almost identical (both are “part numbers”). But for a mechanic, that is a catastrophic failure.
The Fix: Hybrid Search
Don’t rely on vectors alone:
Run a Keyword Search (BM25) to catch exact matches (“XH-992”).
Run a Vector Search to catch semantic intent (“replacement part”).
Combine results using Reciprocal Rank Fusion (RRF).
This gives you the best of both worlds: the precision of Elasticsearch and the understanding of ChatGPT.
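The fusion step can be sketched with the standard RRF formulation (score = Σ 1/(k + rank), with k commonly set to 60); the document IDs and result lists below are hypothetical:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Reciprocal Rank Fusion: each result list contributes
    # 1 / (k + rank) for every document it returns; k damps
    # the dominance of any single list's top hit.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical result lists for the query "Part-XH-992 replacement"
bm25_hits = ["doc_xh992_spec", "doc_parts_catalog", "doc_invoice"]
vector_hits = ["doc_replacement_guide", "doc_xh992_spec", "doc_parts_catalog"]

fused = rrf([bm25_hits, vector_hits])
# "doc_xh992_spec" ranks first: it appears near the top of BOTH lists.
```

Documents that both retrievers agree on float to the top, which is exactly the behavior you want for the part-number query above.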
🔒 Subscribe to read Evaluation, Cost & Future RAG
How do you know if your RAG system is good? A bad RAG system fails silently: the user gets an answer, but it’s wrong. You need observability.
In the rest of this deep dive, we will cover:
Choosing the Stack: Should you use Postgres or a dedicated vector database?
Distance Metrics: What are the different options for calculating distance?
RAG Evaluation: How to use “LLM-as-a-Judge” to grade your system’s accuracy automatically.
Query/Cost Optimization: Techniques to reduce embedding costs by 90%.
The Future: Agentic RAG and GraphRAG—where the LLM decides what to search for.