Reinforcement Learning: Learning by Reward
An agent tries actions, gets rewards or penalties, and slowly learns which actions pay off — like training a pet with treats.
What you will learn
- Explain the agent–environment–reward loop
- Define state, action, reward and policy
- Trace a tiny Q-learning update by hand
Learning the way you train a pet
In supervised learning you hand the model the right answers. Reinforcement learning (RL) is different: nobody tells the agent the correct move. Instead it tries an action, sees what happens, and gets a reward (a treat) or a penalty (a telling-off). It then adjusts so that it earns more reward next time. It is much the same way you might train a dog to sit.
This connects straight back to the agent model from Unit 2: an agent senses its environment and acts on it. RL just adds one thing — a reward signal that tells the agent how well it is doing.
The five words you need
- State — the situation the agent is in right now (e.g. which square it stands on).
- Action — a move it can make from that state (up, down, left, right).
- Reward — a number the environment gives back: positive for good, negative for bad.
- Policy — the agent’s strategy: which action to pick in each state.
- Q-value — a score for a (state, action) pair: “how much total reward do I expect if I take this action here?”
Learning means improving the policy by updating Q-values from experience, so over time the agent prefers high-reward actions.
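To make these words concrete, here is a minimal sketch of how Q-values and a policy might be stored in Python. The two-state world, the state names "A" and "B", and every number in the table are invented for illustration; they are not learned values.

```python
# Hypothetical Q-table: every (state, action) pair maps to a score.
# The states "A" and "B" and all numbers below are made up for illustration.
q_table = {
    ("A", "left"): 0.0,
    ("A", "right"): 2.5,
    ("B", "left"): 6.8,
    ("B", "right"): 1.0,
}
ACTIONS = ["left", "right"]


def greedy_policy(state):
    """Policy: in each state, pick the action with the highest Q-value."""
    return max(ACTIONS, key=lambda action: q_table[(state, action)])


print(greedy_policy("A"))  # right
print(greedy_policy("B"))  # left
```

Here the policy is not stored separately: it is read off the Q-table, so changing a Q-value can change which action the agent prefers.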
The agent–environment loop
RL repeats one loop over and over, sometimes for millions of rounds:
1. Look at the current state.
2. Choose an action (early on, partly at random, to explore).
3. Do the action; the environment returns a reward and the new state.
4. Update the Q-value for the action you just took, nudging it toward the reward you received plus the best you expect from the new state.
5. Repeat from step 1 in the new state.
# The RL loop in pseudo-Python (the learning lives in update_Q)
state = start_state
done = False
while not done:
    action = choose_action(state)  # explore or exploit
    # the environment also reports whether the episode has ended
    reward, next_state, done = environment(state, action)
    update_Q(state, action, reward, next_state)  # learn from it
    state = next_state
Note: This snippet prints nothing. It shows the shape that most RL programs follow. The intelligence is inside update_Q, which gradually raises the Q-values of actions that lead to reward.
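The loop can be made runnable with a toy environment. The sketch below is one possible filling-in, not a standard implementation: the five-square corridor, the rewards (10 for the treat, -1 per ordinary step), the exploration rate EPSILON and the done flag returned by environment are all assumptions chosen for illustration.

```python
import random

# Hypothetical corridor: squares 0..4, start at square 0, treat at square 4.
N_SQUARES = 5
GOAL = N_SQUARES - 1
ACTIONS = ["left", "right"]
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2  # illustrative values

# Every (state, action) pair starts with a Q-value of 0.
Q = {(s, a): 0.0 for s in range(N_SQUARES) for a in ACTIONS}


def environment(state, action):
    """Move one square; reward 10 at the goal, -1 for any other step."""
    step = 1 if action == "right" else -1
    next_state = min(max(state + step, 0), GOAL)  # walls at both ends
    done = next_state == GOAL
    reward = 10 if done else -1
    return reward, next_state, done


def choose_action(state):
    """Explore with probability EPSILON, otherwise exploit the best-known action."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    best = max(Q[(state, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[(state, a)] == best])


def update_Q(state, action, reward, next_state, done):
    """Q-learning update; a finished episode has no future reward."""
    best_next_Q = 0.0 if done else max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next_Q - Q[(state, action)])


random.seed(0)  # fixed seed so repeated runs behave the same
for episode in range(200):
    state, done = 0, False
    while not done:
        action = choose_action(state)
        reward, next_state, done = environment(state, action)
        update_Q(state, action, reward, next_state, done)
        state = next_state

# Best-known action for each non-goal square after training
learned_policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)}
print(learned_policy)  # {0: 'right', 1: 'right', 2: 'right', 3: 'right'}
```

After 200 episodes the agent prefers "right" in every square, because each step to the right leads toward the treat while a step to the left only adds another -1.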
One Q-learning update by hand
The best-known rule for updating a Q-value comes from an algorithm called Q-learning. In words: nudge the old estimate toward the reward you just got plus the best you expect from the next state. The formula uses two dials: the learning rate alpha (how big a step to take, here 0.5) and the discount gamma (how much future reward counts, here 0.9):
# Q-learning update for the action we just took
# new_Q = old_Q + alpha * (reward + gamma * best_next_Q - old_Q)
old_Q = 0.0 # what we thought this action was worth
reward = 10 # the treat we just received
best_next_Q = 4.0 # best Q-value available from the next state
alpha, gamma = 0.5, 0.9
new_Q = old_Q + alpha * (reward + gamma * best_next_Q - old_Q)
print('updated Q =', new_Q)
Output: updated Q = 6.8
Note: Step by step: gamma * best_next_Q = 0.9 × 4 = 3.6; plus the reward 10 = 13.6; minus old_Q 0 = 13.6; times alpha 0.5 = 6.8; plus old_Q 0 = 6.8. The action that earned a treat is now worth 6.8 instead of 0, so the agent will favour it next time.
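The same rule can be wrapped in a small function and applied more than once. This is a sketch: the function name q_update and the idea of repeating the identical experience four times are illustrative assumptions, used only to show how the estimate moves.

```python
def q_update(old_q, reward, best_next_q, alpha=0.5, gamma=0.9):
    """Return the new Q-value after one Q-learning update."""
    target = reward + gamma * best_next_q    # what this experience suggests
    return old_q + alpha * (target - old_q)  # move part of the way there


q = 0.0
for step in range(1, 5):
    # Pretend the agent sees the same reward and next state every time.
    q = q_update(q, reward=10, best_next_q=4.0)
    print(step, round(q, 3))
# 1 6.8
# 2 10.2
# 3 11.9
# 4 12.75
```

With alpha = 0.5, each update closes half of the remaining gap to the target of 13.6, so the estimate creeps up rather than jumping there in one go.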
The explore-vs-exploit tension
An RL agent faces a dilemma every step: exploit (pick the action it already thinks is best) or explore (try something new that might be even better). Too much exploiting and it never discovers good moves; too much exploring and it never cashes in. Real agents usually explore a lot early on, then settle down.
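One common way to balance the two is an epsilon-greedy rule: with a small probability epsilon the agent explores, otherwise it exploits. The sketch below is an illustration under assumptions: the function names, the Q-values and the decay schedule (start at 1.0, multiply by 0.99 each episode, never go below 0.05) are made up, and other schedules are equally valid.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the best-scoring one.

    q_values maps each action name to its current Q-value for one state.
    """
    if rng.random() < epsilon:
        return rng.choice(list(q_values))   # explore
    return max(q_values, key=q_values.get)  # exploit


def decayed_epsilon(episode, start=1.0, end=0.05, decay=0.99):
    """Hypothetical schedule: shrink epsilon each episode, never below `end`."""
    return max(end, start * decay ** episode)


q_values = {"up": 1.0, "down": 0.2, "left": 3.5, "right": 0.0}  # made-up scores
for episode in (0, 100, 500):
    print(episode, round(decayed_epsilon(episode), 3))
# 0 1.0
# 100 0.366
# 500 0.05
```

At episode 0 every choice is random; by episode 500 the agent picks its best-known action about 95% of the time, which matches the "explore early, settle down later" pattern.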
Where RL is used
| Area | What the reward is | Famous example |
|---|---|---|
| Games | Winning the game | AlphaGo beating world champion Lee Sedol at Go (2016) |
| Robotics | Staying upright / reaching a target | Robots learning to walk or grasp |
| Recommendations | Clicks / watch time | Tuning what a feed shows you |
| Control systems | Energy saved | Cooling data centres efficiently |
Watch out: RL is powerful but tricky. Rewards are often delayed (you only win at the end of the game), so the agent must figure out which earlier moves deserve the credit. It also often needs huge numbers of trials, which is why much RL is trained in simulators rather than the real world.
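Discounting is one simple way to pass credit back to earlier moves. The sketch below assumes a made-up four-move game that pays nothing until a win worth 10 on the last move, and reuses gamma = 0.9 from the Q-learning example; it illustrates the idea and is not the only way to assign credit.

```python
def discounted_returns(rewards, gamma=0.9):
    """For each step, total future reward with later rewards shrunk by gamma."""
    returns = []
    running = 0.0
    # Work backwards so each step can reuse the total of the steps after it.
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


# Hypothetical game: three moves with no reward, then a win worth 10.
print([round(g, 2) for g in discounted_returns([0, 0, 0, 10])])
# [7.29, 8.1, 9.0, 10.0]
```

Every move gets some credit for the final win, but moves closer to the win get more.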
Tip: Remember the one-line summary: supervised learns from answers, unsupervised finds groups, reinforcement learns from rewards by trial and error.
Q. In reinforcement learning, what does the agent actually receive after each action?
✍️ Practice
- Redo the Q-update with reward = 0 and best_next_Q = 10. What is the new Q?
- For a maze robot, write down a sensible reward for: reaching the exit, hitting a wall, and taking a normal step.
🏠 Homework
- Describe a real situation (a game or a habit) as an RL problem: name the state, the actions, and the reward.