April 25, 2026

RAG Unpacked, Module 01: Building the Canonical Pattern from Scratch

Every serious LLM application eventually converges on the same architectural idea: before asking the model anything, retrieve the relevant information and hand it over as context. That pattern is retrieval-augmented generation (RAG), and it shows up everywhere from internal knowledge bases to customer support bots to legal research tools. Most people use it through an abstraction layer like LangChain and move on.

I wanted to actually understand it, not just use it. So I built it from scratch across five modules, each one unpacking a different facet of the problem. This is module 01, the canonical pattern. By the end of module 05 the goal is a complete, honest picture of how RAG works, where it breaks, and what actually moves the needle in production.

Five modules. Five facets. One complete picture of how RAG works in the real world.

The Series at a Glance

• Module 01: Intro to RAG. The canonical pattern, built raw. You are here.

• Module 02: Graph RAG. Retrieval over a knowledge graph (Neo4j) for when structure beats similarity.

• Module 03: Vectorless RAG. BM25 and keyword search for when you do not need vectors at all.

• Module 04: Evaluating RAG. Faithfulness, context precision, answer relevance. Measuring what matters.

• Module 05: Advanced RAG. Reranking, query rewriting, hybrid search. The patterns that move the needle.

Why Build It Raw

RAG has a strange property. Each individual piece is almost embarrassingly simple: a function call to embed text, a database query, a chat completion. What people get stuck on is how the pieces compose. Abstractions like LangChain hide that composition, which is great for shipping and bad for understanding. So I went direct: OpenAI's SDK for embeddings and completions, Pinecone for the vector database, Python. Nothing in between.
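To make the composition concrete, here is a minimal sketch of how the three pieces fit together. It is not the notebook code: `answer`, `build_prompt`, and the prompt wording are names and text I made up for illustration, and the three callables are placeholders for whatever embedding call, vector query, and chat completion you actually use.

```python
from typing import Callable, List, Sequence


def build_prompt(question: str, chunks: List[str]) -> str:
    """Augment: number the retrieved chunks and place them ahead of the question."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, 1))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


def answer(
    question: str,
    embed: Callable[[str], Sequence[float]],              # text -> vector
    search: Callable[[Sequence[float], int], List[str]],  # vector, k -> chunks
    complete: Callable[[str], str],                       # prompt -> model reply
    top_k: int = 3,
) -> str:
    """Retrieve, augment, generate, in that order."""
    query_vector = embed(question)           # 1. embed the question
    chunks = search(query_vector, top_k)     # 2. retrieve the closest chunks
    prompt = build_prompt(question, chunks)  # 3. put them in the prompt
    return complete(prompt)                  # 4. ask the model
```

Passing the three steps in as plain functions is a design choice for the sketch: it keeps the seams visible and lets you test the flow with stubs before any API key is involved.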

The abstraction you skip is the abstraction you understand. When something goes wrong there is no mystery layer to debug. When you want to swap a component, the seam is obvious because no framework is deciding where it has to be.

Three Notebooks: Crawl, Walk, Run

The module splits into three notebooks, each building on the previous one.

Crawl starts with five hand-written four-dimensional vectors and one similarity query. No embedding model, no LLM, no real corpus. Just enough to see that the whole shape of the retrieval step is: get vectors in, ask for the closest ones, get them back. When the query for 'almost pure cat with a tiny lean toward kitten' returns kitten, cat, tiger in that order, the geometry clicks.
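The same idea fits in a few lines of plain Python with no vector database at all. The vectors, the query, and the meaning of each dimension below are invented for illustration; they are not the values from the notebook, and I am assuming cosine similarity as the metric.

```python
import math


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical dimensions: [feline, canine, young, wild]
VECTORS = {
    "cat":    [1.0, 0.0, 0.1, 0.1],
    "kitten": [0.9, 0.0, 0.9, 0.0],
    "tiger":  [0.9, 0.0, 0.1, 0.9],
    "dog":    [0.0, 1.0, 0.1, 0.1],
    "puppy":  [0.0, 0.9, 0.9, 0.0],
}


def top_k(query, vectors, k=3):
    """Return the k names whose vectors are most similar to the query."""
    ranked = sorted(vectors, key=lambda name: cosine(query, vectors[name]), reverse=True)
    return ranked[:k]


# A made-up query: strongly feline, fairly young, not wild.
query = [0.9, 0.0, 0.6, 0.0]
print(top_k(query, VECTORS))  # ['kitten', 'cat', 'tiger']
```

A vector database does the same ranking, only over millions of vectors and with an index that avoids comparing the query against every one of them.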

Walk uses a toy 20-item movie catalog with 16-dimensional vectors and metadata. The focus is everything Pinecone lets you do once data is loaded: fetch by ID, filter queries, update values versus update metadata, delete by ID versus delete by filter, namespaces for multi-tenancy. Still no model. Keeping the embedding API out of the way keeps the focus on the operation itself.
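The filter and namespace semantics can be sketched without a Pinecone account. What follows is an in-memory stand-in, not the Pinecone client: `STORE`, `matches`, `filter_ids`, and `delete_by_filter` are hypothetical names, the movie records are invented, and only a subset of operators is handled. The filter dictionaries are modeled loosely on Pinecone's metadata filter syntax (`$eq`, `$gte`, `$in`, and so on).

```python
import operator

# Subset of comparison operators, keyed by their filter names.
OPS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$in": lambda value, options: value in options,
}


def matches(metadata, flt):
    """True if metadata satisfies every field condition in flt (implicit AND)."""
    for field, condition in flt.items():
        if field not in metadata:
            return False
        if not isinstance(condition, dict):
            condition = {"$eq": condition}  # bare value means equality
        for op, expected in condition.items():
            if not OPS[op](metadata[field], expected):
                return False
    return True


# namespace -> record id -> metadata (vectors omitted to keep the focus on filtering)
STORE = {
    "tenant-a": {
        "m1": {"genre": "sci-fi", "year": 1999},
        "m2": {"genre": "drama", "year": 2010},
        "m3": {"genre": "sci-fi", "year": 2014},
    },
    "tenant-b": {
        "m1": {"genre": "comedy", "year": 2001},
    },
}


def filter_ids(namespace, flt):
    """IDs in one namespace whose metadata passes the filter."""
    return sorted(rid for rid, md in STORE[namespace].items() if matches(md, flt))


def delete_by_filter(namespace, flt):
    """Delete every matching record, touching only the given namespace."""
    for rid in filter_ids(namespace, flt):
        del STORE[namespace][rid]


print(filter_ids("tenant-a", {"genre": "sci-fi", "year": {"$gte": 2000}}))  # ['m3']
```

Note that the ID `m1` exists in both namespaces with different metadata. That is the multi-tenancy point: a query or delete scoped to one namespace never sees the other.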

Run drops the toys. A real corpus of Pinecone documentation gets embedded with OpenAI text-embedding-3-small and wired into a full retrieve-augment-generate loop with gpt-4o-mini. There is also a performance section most tutorials skip: ingesting vectors one at a time is slow because every upsert is a network round trip. Batching into groups of 100 and firing those batches in parallel with pool_threads=20 produced a 6x speedup on wall-clock time. Not because anything got smarter, just because we stopped throwing network latency away.
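The batching-plus-parallelism idea is independent of any particular client. Here is a rough sketch using only the standard library: `fake_upsert` is a hypothetical stand-in that sleeps to simulate one network round trip, and `ThreadPoolExecutor` plays the role that `pool_threads` plays in the notebook. The batch size and worker count mirror the numbers above; the timings you get will not.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fake_upsert(batch):
    """Hypothetical upsert: one simulated round trip, returns how many items it took."""
    time.sleep(0.01)
    return len(batch)


def upsert_parallel(items, upsert=fake_upsert, batch_size=100, workers=20):
    """Send batches concurrently and return the total number of items upserted."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(upsert, batched(items, batch_size)))


records = list(range(1050))
print(upsert_parallel(records))  # 1050
```

With 1,050 records, one-at-a-time ingestion pays for 1,050 round trips in sequence. Batches of 100 cut that to 11 requests, and the thread pool lets those 11 overlap instead of queueing behind each other.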

What I Actually Learned

• Getting started is genuinely easy. A working vector search in under ten minutes, no YAML, no Kafka, no Kubernetes. The free tier is real and sufficient for a learning module. The barrier between curious and shipping is lower than I expected.

• The model is cheap. The ergonomics are the cost. A full run of notebook 3 costs under a cent. What actually takes time is deciding: which embedding model, what chunk size, what top-k, what prompt template, when to rerank, when to rewrite the query. Most of the work in a production RAG system lives in those choices.

• Maintenance is a different animal. Index freshness requires a pipeline that re-embeds and upserts changed chunks. Embedding model deprecation means a full corpus re-ingest since old and new vectors are not in the same space. Quality regresses silently without a pinned evaluation set. Cost monitoring matters because one bad prompt in a loop can spend more in a weekend than the rest of the year.
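On the index-freshness point, one common way to decide what to re-embed is to store a content hash per chunk at ingest time and compare on the next run. This is a sketch of that idea, not something taken from the notebooks: `content_hash` and `plan_refresh` are hypothetical helpers, and where the stored hashes live (a metadata field, a side table) is left open.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_refresh(current_chunks, stored_hashes):
    """Compare the current corpus with the hashes recorded at the last ingest.

    current_chunks: chunk id -> current text
    stored_hashes:  chunk id -> hash saved when the chunk was last embedded
    Returns (ids to re-embed and upsert, ids to delete from the index).
    """
    to_upsert = sorted(
        cid for cid, text in current_chunks.items()
        if stored_hashes.get(cid) != content_hash(text)  # new or changed
    )
    to_delete = sorted(cid for cid in stored_hashes if cid not in current_chunks)
    return to_upsert, to_delete


stored = {
    "a": content_hash("alpha"),
    "b": content_hash("beta"),
    "c": content_hash("gamma"),
}
current = {"a": "alpha", "b": "beta v2", "d": "delta"}
print(plan_refresh(current, stored))  # (['b', 'd'], ['c'])
```

Unchanged chunks cost nothing on a refresh, which keeps the embedding bill proportional to what actually changed. A model deprecation is the exception: every chunk has to be re-embedded regardless of its hash.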

Here is a one-page cheat sheet covering every Pinecone and OpenAI call used across the three notebooks.

Code and notebooks: github.com/allllc/rag-unpacked   |  Module 01 directly: 01-intro-to-rag

Next up: Module 02, Graph RAG, retrieval over a knowledge graph for when structure beats similarity.