How LLMs Really Work: Tokens, Context & Hallucinations
Under the hood, an LLM splits text into tokens, predicts the next one from patterns learned in training, and can only see what fits in its context window.
What you will learn
- Explain tokens and next-token prediction concretely
- Define training vs inference and the context window
- Explain why hallucinations happen and how to reduce them
Everything is tokens
You met the headline idea already: an LLM (Large Language Model) generates text by predicting the next word. Now the real detail. LLMs do not work in whole words; they work in tokens, which are chunks of text, often a word or a word-piece. Before anything else happens, your text is chopped into tokens, each token is mapped to an integer ID, and each ID is looked up as a vector of numbers (an embedding, from the last lesson).
# How text is split into tokens (illustrative — real tokenizers differ)
text = "Chatbots are helpful"
tokens = ["Chat", "bots", " are", " helpful"] # ~4 tokens
print("token count:", len(tokens))
# A rough rule of thumb: ~100 tokens is about 75 English words.
Note: Output: token count: 4
“Chatbots” is split into two tokens (“Chat” + “bots”). This is why long documents cost more and why providers charge “per token”. Everything the model reads and writes is tokens.
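To make the splitting and the rule of thumb concrete, here is a minimal sketch. The vocabulary `VOCAB`, its ID numbers, and the helper names `tokenize` and `estimate_tokens` are made up for illustration, and the greedy longest-match rule is a simplification; real tokenizers learn their vocabulary from data and split differently.

```python
# Toy vocabulary: token text -> integer ID (invented for this example).
VOCAB = {"Chat": 0, "bots": 1, " are": 2, " helpful": 3}


def tokenize(text, vocab=VOCAB):
    """Split text into tokens by always taking the longest matching vocabulary entry."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no token matches at position {i}")
    return tokens


def estimate_tokens(word_count):
    """Rule of thumb from the lesson: about 100 tokens per 75 English words."""
    return round(word_count * 100 / 75)


tokens = tokenize("Chatbots are helpful")
ids = [VOCAB[t] for t in tokens]
print(tokens)                # ['Chat', 'bots', ' are', ' helpful']
print(ids)                   # [0, 1, 2, 3]
print(estimate_tokens(150))  # 200
```

The model never sees the letters, only the list of IDs, which is why cost and limits are counted in tokens rather than words.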
Next-token prediction, precisely
Given the tokens so far, the model outputs a probability for every token in its vocabulary (typically tens of thousands to a few hundred thousand options, each with a score). It then picks one (usually a likely one), appends it, and repeats. There is no plan and no database lookup; it is one enormous, very well-trained guess at a time.
- Turn the prompt into tokens.
- For the next position, score every possible token by how likely it is.
- Pick one likely token and append it.
- Feed the longer text back in and repeat — until an “end” token or a length limit.
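The loop above can be sketched in a few lines. This is a toy under stated assumptions: the score table `TOY_LOGITS` is invented, it looks only at the last token (a real model conditions on every token so far), and it always takes the single most likely token, whereas real systems often sample among the likely ones.

```python
import math

# Hypothetical scores: for each current token, a score for each possible next token.
TOY_LOGITS = {
    "The": {"cat": 2.0, "dog": 1.0, "<end>": -1.0},
    "cat": {"sat": 2.5, "ran": 1.5, "<end>": 0.0},
    "dog": {"ran": 2.0, "sat": 1.0, "<end>": 0.0},
    "sat": {"<end>": 3.0, "cat": -1.0},
    "ran": {"<end>": 3.0, "dog": -1.0},
}


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores.values())
    exps = {tok: math.exp(s - highest) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def generate(prompt_tokens, max_new_tokens=10):
    """Repeat: score next tokens, pick the most likely, append, stop at <end> or the limit."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = softmax(TOY_LOGITS[tokens[-1]])
        next_token = max(probs, key=probs.get)
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return tokens


print(generate(["The"]))  # ['The', 'cat', 'sat']
```

Nothing in `generate` checks whether the output is true; it only follows the scores, which is the point the rest of the lesson builds on.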
Training vs inference — two very different things
| | Training | Inference (using it) |
|---|---|---|
| When | Once, before release (months) | Every time you send a prompt |
| What happens | Reads trillions of tokens, adjusts billions of weights | Runs the fixed weights to predict tokens |
| Cost / speed | Enormous (huge GPU clusters) | Fast and cheap by comparison |
| Changes the model? | Yes — this is the learning | No — the weights are frozen |
This explains a common surprise: a chatbot does not learn from your conversation. Talking to it is inference — the weights are frozen. Its knowledge was baked in at training time, which is why it has a knowledge cutoff and has not heard of events after it.
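The difference can be seen in a deliberately tiny sketch: a "model" with a single weight instead of billions. The functions `predict` and `train_step` and all the numbers are illustrative assumptions, not how any real LLM is trained, but the shape is the same: training rewrites the weight, inference only reads it.

```python
def predict(weight, x):
    """Inference: use the weight, never change it."""
    return weight * x


def train_step(weight, x, target, learning_rate=0.1):
    """Training: nudge the weight to shrink the squared error on one example."""
    error = predict(weight, x) - target
    gradient = 2 * error * x
    return weight - learning_rate * gradient


# Training phase: the weight changes on every step.
weight = 0.0
for _ in range(50):
    weight = train_step(weight, x=1.0, target=3.0)

# Inference phase: the weight is frozen, however many questions we ask.
frozen = weight
answers = [predict(frozen, x) for x in (1.0, 2.0)]
print(round(frozen, 3), [round(a, 3) for a in answers])
```

After the loop the weight sits very close to 3.0, and calling `predict` any number of times leaves it exactly where training left it.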
The context window — its only short-term memory
The context window is the maximum amount of text (measured in tokens) the model can consider at once: the prompt plus the answer. Think of it as a whiteboard of a fixed size: everything the model “pays attention to” must fit on it. Go over the limit and something has to go; chat apps typically drop the oldest text so the rest still fits.
Watch out: The model has no memory beyond the context window. If a chat gets very long, earlier messages scroll out of the window and the model genuinely cannot see them any more — it is not ignoring you, it literally no longer has them.
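Here is a minimal sketch of how a chat app might trim history to fit the window. The helper names `count_tokens` and `fit_to_window` are hypothetical, the word-based token count is a crude stand-in for a real tokenizer, and actual products may summarise old messages instead of dropping them.

```python
def count_tokens(text):
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def fit_to_window(messages, max_tokens):
    """Keep the most recent messages that fit in max_tokens; older ones are dropped."""
    kept = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


chat = ["My name is Ada", "I like chess", "What is my name"]
print(fit_to_window(chat, max_tokens=8))
# ['I like chess', 'What is my name']
```

With a window of 8 tokens the first message no longer fits, so the model is asked "What is my name" without ever seeing the answer.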
Why hallucinations happen
A hallucination is when the model states something false but sounds confident. Now you can see why it happens: the model is trained to produce plausible-sounding next tokens, not true ones. When it does not know, it does not stop — it generates text that looks right, inventing a citation or a date that fits the pattern.
# Why a made-up answer appears
# Prompt: "The 2031 Nobel Prize in Physics was won by..."
# The model has no fact for 2031 (after its training cutoff),
# but "won by <a plausible name>" is a very likely token pattern,
# so it confidently completes with an invented name.
# It is pattern-completion, not fact-lookup.
Note: Output: (No output. The point is the cause: the model predicts plausible text, so a missing fact gets filled with a convincing guess. That is a hallucination.)
Knowing the mechanism tells you how to reduce hallucinations: give the model the facts in its context window (paste the source), ask it to say when it is unsure, and verify anything important. Grounding the model in real documents is exactly what RAG (retrieval-augmented generation) does, as you will see two lessons from now.
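The first two habits can be combined in one prompt. This is only a sketch: the function name `build_grounded_prompt`, the wording of the instruction, and the example source are assumptions, and no particular chatbot API is called. You would paste the resulting text into whichever tool you use, and still verify the answer.

```python
def build_grounded_prompt(question, source_text):
    """Build a prompt that pastes the source and asks the model to admit uncertainty."""
    return (
        "Answer using only the source below. "
        'If the source does not contain the answer, reply "I am not sure".\n\n'
        f"Source:\n{source_text}\n\n"
        f"Question: {question}\n"
    )


source = "The library opens at 9:00 and closes at 17:00 on weekdays."
prompt = build_grounded_prompt("When does the library close on weekdays?", source)
print(prompt)
```

Because the fact now sits inside the context window, the most plausible continuation is also the correct one, rather than an invented detail.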
Tip: One sentence to keep: an LLM predicts plausible tokens from frozen training, limited to what fits in its context window — it does not remember your chats, learn from them, or look facts up. That single idea explains its power and every one of its quirks.
Q. Why does an LLM hallucinate (give confident but false answers)?
✍️ Practice
- Estimate the token count of a 300-word email using the “100 tokens ≈ 75 words” rule of thumb.
- Explain in one sentence why a chatbot does not remember a conversation you had with it yesterday.
🏠 Homework
- Write down two practical things you can do to reduce the chance an LLM hallucinates on an important question.