Retrieval-Augmented Generation (RAG): Grounding AI on Your Data
RAG fetches the right facts first and hands them to the LLM, so its answer is based on your documents — not just its training.
What you will learn
- Explain why RAG reduces hallucinations
- Describe the retrieve-then-generate pipeline
- Define embeddings, vector search and a vector database
The problem RAG solves
From the last two lessons you know two limits of LLMs: they hallucinate, and they only know what was in their training data, so nothing recent and nothing private (such as your company handbook). Retrieval-Augmented Generation (RAG) addresses both limits, and it is how many real AI products work.
The idea in one line: find the relevant facts first, then ask the LLM to answer using them. Instead of relying on the model’s memory, you hand it the source text in the prompt.
The RAG pipeline
1. Prepare your documents (once): split them into small chunks and turn each chunk into an embedding (a vector of numbers capturing its meaning, from the NLP lesson). Store these in a vector database.
2. On a question: turn the question into an embedding too.
3. Retrieve: use vector search to find the chunks whose embeddings are closest to the question’s, i.e. the most relevant passages.
4. Augment: paste those chunks into the prompt as context.
5. Generate: ask the LLM to answer using only that context.
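The preparation step above (splitting documents into chunks) can be sketched in a few lines of Python. This is only a minimal illustration: the function name `split_into_chunks`, the `max_words` limit and the sample handbook text are all assumptions made for this example, and real systems often use more careful splitting rules.

```python
def split_into_chunks(text: str, max_words: int = 40) -> list[str]:
    """Group whole paragraphs into chunks of roughly max_words words each."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for para in paragraphs:
        n = len(para.split())
        # Start a new chunk when adding this paragraph would exceed the limit.
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Invented sample text standing in for a company handbook.
handbook = (
    "Refunds are accepted within 30 days of purchase.\n\n"
    "Shipping takes 3 to 5 business days.\n\n"
    "Support is open on weekdays."
)
chunks = split_into_chunks(handbook, max_words=10)
for chunk in chunks:
    print(repr(chunk))
```

Each chunk would then be turned into an embedding and stored, which is the part a vector database handles.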
# RAG in pseudo-code (the shape of every RAG system)
question = "What is our refund window?"
chunks = vector_db.search(question, top_k=3) # 2-3) embed the question, retrieve relevant text (step 1 was done once, beforehand)
prompt = f'''Answer using ONLY the context below.
Context:
{chunks}
Question: {question}''' # 4) augment
answer = llm.generate(prompt) # 5) generate
print(answer)
Output: Our refund window is 30 days from purchase. (per the company handbook)
Note: The model did not “know” the refund policy, because it was never trained on it. RAG retrieved the handbook chunk and put it in the prompt, so the answer is grounded in your real document.
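To watch the retrieve and augment steps run for real, here is a small self-contained sketch. It rests on several assumptions: `embed` just counts words (a toy stand-in for a real embedding model), `ToyVectorStore` is a hypothetical class that plays the role of `vector_db`, and the three handbook sentences are invented. No LLM is called; the sketch stops at the finished prompt.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: a bag of lowercase word counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two word-count vectors.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Hypothetical in-memory store: keeps (chunk, embedding) pairs."""

    def __init__(self, chunks: list[str]) -> None:
        self.items = [(chunk, embed(chunk)) for chunk in chunks]

    def search(self, question: str, top_k: int = 3) -> list[str]:
        q = embed(question)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:top_k]]


def build_prompt(question: str, chunks: list[str]) -> str:
    context = "\n".join(f"- {c}" for c in chunks)
    return (
        "Answer using ONLY the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )


vector_db = ToyVectorStore([
    "Our refund window is 30 days from purchase.",
    "Standard shipping takes 3 to 5 business days.",
    "The office is closed on public holidays.",
])
question = "What is our refund window?"
top_chunks = vector_db.search(question, top_k=1)  # retrieve
prompt = build_prompt(question, top_chunks)       # augment
print(prompt)  # this text is what would be sent to the LLM
```

Word counting only matches exact words, so it would miss a question phrased as "money back". A real embedding model is what lets the search match on meaning instead.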
The new words, simply
| Term | Plain meaning |
|---|---|
| Embedding | A list of numbers representing a piece of text’s meaning |
| Vector database | A store that holds embeddings and finds the closest ones fast |
| Vector search | Finding the chunks most similar in meaning to the question |
| Chunk | A small slice of a document (e.g. a paragraph) |
| Grounding | Basing the answer on supplied facts rather than the model’s memory |
Notice RAG reuses the embeddings idea straight from the NLP lesson: “closest meaning” is just “closest vectors”, the same distance maths that put king near queen.
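One common way to measure "closest vectors" is cosine similarity. The lesson does not say which measure a given vector database uses, so treat this as one typical choice rather than the only one. The three-number vectors below are invented for the arithmetic; real embeddings have hundreds or thousands of numbers.

```latex
\[
\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
\]
\[
\text{king} = (1, 2, 2), \quad \text{queen} = (2, 1, 2), \quad \text{banana} = (2, -2, 1), \quad \text{each of length } \sqrt{9} = 3
\]
\[
\cos(\text{king}, \text{queen}) = \frac{1 \cdot 2 + 2 \cdot 1 + 2 \cdot 2}{3 \cdot 3} = \frac{8}{9} \approx 0.89
\]
\[
\cos(\text{king}, \text{banana}) = \frac{1 \cdot 2 + 2 \cdot (-2) + 2 \cdot 1}{3 \cdot 3} = \frac{0}{9} = 0
\]
```

A score near 1 means the two vectors point the same way (similar meaning); a score near 0 means they are unrelated. Vector search simply returns the chunks with the highest scores against the question.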
Watch out: RAG does not retrain the model and does not change its weights. It only changes the prompt, by inserting retrieved facts at question time. That is why you can update your knowledge base just by adding or replacing documents (and embedding the new chunks), without any new training.
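A tiny sketch makes this concrete. Here the knowledge base is just a Python list, and `retrieve` uses a toy word-overlap score as a stand-in for vector search. Both helper functions and the gift-card policy are invented for illustration.

```python
def words(text: str) -> set[str]:
    # Lowercase the text and drop simple punctuation before splitting.
    return set(text.lower().replace("?", "").replace(".", "").split())


def retrieve(question: str, knowledge_base: list[str]) -> str:
    # Toy relevance score: number of words shared with the question.
    return max(knowledge_base, key=lambda chunk: len(words(question) & words(chunk)))


knowledge_base = ["Our refund window is 30 days from purchase."]
question = "How long is the refund window for gift cards?"

before = retrieve(question, knowledge_base)

# Updating knowledge is a plain data change: no training, no new weights.
knowledge_base.append("Gift cards have a refund window of 14 days.")

after = retrieve(question, knowledge_base)

print("Before update:", before)
print("After update: ", after)
```

The same question now pulls a different chunk into the prompt, and the model itself was never touched.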
Tip: Why teams love RAG: it grounds answers in your up-to-date, private data, cuts hallucinations (the facts are right there in the prompt), and lets the AI cite its source — all without the cost of training a model.
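Citing a source works because each retrieved chunk can carry a label saying where it came from. The sketch below shows one possible way to do that; the `Chunk` class, the `build_cited_prompt` function and the file name `handbook.md` are all hypothetical, and whether the model actually follows the citation instruction depends on the model.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str  # hypothetical label, e.g. a file name and section


def build_cited_prompt(question: str, chunks: list[Chunk]) -> str:
    # Number each passage so the answer can refer back to it as [1], [2], ...
    lines = [f"[{i}] ({c.source}) {c.text}" for i, c in enumerate(chunks, start=1)]
    context = "\n".join(lines)
    return (
        "Answer using ONLY the context below. "
        "After each claim, cite the passage number in square brackets.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )


retrieved = [
    Chunk("Our refund window is 30 days from purchase.", "handbook.md, Refunds"),
    Chunk("Refunds are paid to the original payment method.", "handbook.md, Refunds"),
]
cited_prompt = build_cited_prompt("What is our refund window?", retrieved)
print(cited_prompt)
```

Because the passage numbers map back to real documents, a reader can check the answer against the original text.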
Q. How does RAG reduce hallucinations and add fresh knowledge?
✍️ Practice
- Write the 5-step RAG pipeline from memory in your own words.
- For a cooking-recipes assistant, name what the documents, the chunks, and a typical question would be.
🏠 Homework
- Describe one task at your school or workplace where RAG (answering from your own documents) would beat a plain chatbot, and say why.