
Retrieval-Augmented Generation (RAG): Grounding AI on Your Data

RAG fetches the right facts first and hands them to the LLM, so its answer is based on your documents — not just its training.

What you will learn

Explain why RAG reduces hallucinations
Describe the retrieve-then-generate pipeline
Define embeddings, vector search and a vector database

The problem RAG solves

From the last two lessons you know two limits of LLMs: they hallucinate, and they only know what was in their training data (so nothing recent and nothing private, like your company handbook). Retrieval-Augmented Generation (RAG) addresses both: it reduces hallucinations and gives the model access to recent and private information. It is also how many real AI products work.

The idea in one line: find the relevant facts first, then ask the LLM to answer using them. Instead of relying on the model’s memory, you hand it the source text in the prompt.

The RAG pipeline

1. Prepare your documents (once): split them into small chunks and turn each chunk into an embedding (a vector of numbers capturing its meaning, from the NLP lesson). Store these in a vector database.
2. On a question: turn the question into an embedding too.
3. Retrieve: use vector search to find the chunks whose embeddings are closest to the question’s, i.e. the most relevant passages.
4. Augment: paste those chunks into the prompt as context.
5. Generate: ask the LLM to answer using only that context.

Retrieve relevant chunks, paste them in, then let the LLM answer

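# RAG in pseudo-code (the shape of most RAG systems)
# vector_db and llm are placeholders for your vector store and model client.
question = "What is our refund window?"

chunks = vector_db.search(question, top_k=3)   # 2-3) embed the question, retrieve relevant text
context = "\n\n".join(chunks)                  # assumes search returns a list of strings
                                               # (step 1, indexing, was done beforehand)

prompt = f'''Answer using ONLY the context below.
Context:
{context}

Question: {question}'''                          # 4) augment

answer = llm.generate(prompt)                    # 5) generate
print(answer)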

Output: Our refund window is 30 days from purchase. (per the company handbook)

Note: The model did not “know” the refund policy; it was never trained on it. RAG retrieved the handbook chunk and put it in the prompt, so the answer is grounded in your real document.
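To make the pipeline concrete, here is a minimal runnable sketch of the prepare, retrieve and augment steps. It makes several assumptions that are not part of the lesson: the handbook sentences are invented, the "embedding" is just a word count (a stand-in for a real embedding model), the "vector database" is a plain Python list, and the helper names chunk, embed, cosine, search and build_prompt are hypothetical. The final LLM call is left out, so the sketch stops at the finished prompt.

```python
import math
import re
from collections import Counter

def chunk(text, max_words=20):
    """Split text into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def embed(text):
    """Toy embedding: lowercase word counts (NOT a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(index, question, top_k=1):
    """Return the top_k chunk texts most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]

def build_prompt(question, chunks):
    """Paste the retrieved chunks into the prompt as context."""
    context = "\n\n".join(chunks)
    return f"Answer using ONLY the context below.\nContext:\n{context}\n\nQuestion: {question}"

# Prepare once: chunk and embed the documents (invented handbook text).
handbook = [
    "Refund policy: customers may request a refund within 30 days of purchase.",
    "Office hours: the support desk is open Monday to Friday from 9 to 5.",
    "Holiday policy: employees receive 25 days of paid leave each year.",
]
index = [(c, embed(c)) for doc in handbook for c in chunk(doc)]

# On a question: embed it, retrieve the closest chunk, build the prompt.
question = "What is our refund window?"
top_chunks = search(index, question, top_k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)
```

In a real system the word counts would be replaced by embeddings from a model, which can match "refund window" to a passage even when the wording differs, and the list would be replaced by a vector database that can search many chunks quickly.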

The new words, simply

Term	Plain meaning
Embedding	A list of numbers representing a piece of text’s meaning
Vector database	A store that holds embeddings and finds the closest ones fast
Vector search	Finding the chunks most similar in meaning to the question
Chunk	A small slice of a document (e.g. a paragraph)
Grounding	Basing the answer on supplied facts rather than the model’s memory

Notice RAG reuses the embeddings idea straight from the NLP lesson: “closest meaning” is just “closest vectors”, the same distance maths that put king near queen.
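One common way to measure "closest vectors" is cosine similarity, which compares the direction of two vectors. The sketch below is only an illustration: the three-number vectors are invented by hand (real embeddings have hundreds or thousands of numbers and come from a model), and cosine similarity is one of several distance measures a vector database might use.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Hand-made 3-number "embeddings", invented for illustration only.
vectors = {
    "king":   [0.9, 0.8, 0.1],
    "queen":  [0.9, 0.7, 0.2],
    "banana": [0.1, 0.0, 0.9],
}

for word in ("queen", "banana"):
    score = cosine_similarity(vectors["king"], vectors[word])
    print(f"king vs {word}: {score:.2f}")
```

With these made-up numbers, queen scores far higher than banana against king. Vector search does the same comparison between the question's embedding and every stored chunk, then keeps the highest scores.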

Watch out: RAG does not retrain the model and does not change its weights. It only changes the prompt, by inserting retrieved facts at question time. That is why you can update your knowledge base without any new training: add or re-index the documents, and the next question can already retrieve them.
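The sketch below illustrates that point: adding a document to the store changes what is retrieved, and no model is involved at all. It is a toy under stated assumptions: KnowledgeBase is a hypothetical class, simple word overlap stands in for real vector search, and the two policy sentences are invented.

```python
import re

def tokens(text):
    """Lowercase words in a text, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

class KnowledgeBase:
    """Hypothetical in-memory store; word overlap stands in for vector search."""

    def __init__(self):
        self.chunks = []

    def add(self, chunk):
        # Updating the knowledge base is just storing another chunk.
        self.chunks.append(chunk)

    def search(self, question):
        # Return the chunk sharing the most words with the question, if any.
        q = tokens(question)
        scored = [(len(q & tokens(c)), c) for c in self.chunks]
        best_score, best_chunk = max(scored, default=(0, None))
        return best_chunk if best_score > 0 else None

kb = KnowledgeBase()
kb.add("The refund window is 30 days from purchase.")

question = "Where can staff park?"
before = kb.search(question)   # nothing relevant is stored yet

kb.add("Parking policy: staff may park in lot B.")
after = kb.search(question)    # the new chunk is retrievable right away

print("before:", before)
print("after:", after)
```

The only thing that changed between the two searches is the stored data, which is why the same unchanged LLM can answer the question once the new chunk is pasted into its prompt.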

Tip: Why teams love RAG: it grounds answers in your up-to-date, private data, cuts hallucinations (the facts are right there in the prompt), and lets the AI cite its source — all without the cost of training a model.

Q. How does RAG reduce hallucinations and add fresh knowledge?

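Answer: RAG retrieves the most relevant chunks (via vector search over embeddings) and puts them in the prompt. The LLM then answers from those grounded facts instead of guessing from memory, which reduces hallucinations. Fresh knowledge comes from updating the documents in the knowledge base, with no retraining needed.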

✍️ Practice

Write the 5-step RAG pipeline from memory in your own words.
For a cooking-recipes assistant, name what the documents, the chunks, and a typical question would be.

🏠 Homework

Describe one task at your school or workplace where RAG (answering from your own documents) would beat a plain chatbot, and say why.