
Reinforcement Learning: Learning by Reward

An agent tries actions, gets rewards or penalties, and slowly learns which actions pay off — like training a pet with treats.

What you will learn

Explain the agent–environment–reward loop
Define state, action, reward and policy
Trace a tiny Q-learning update by hand

Learning the way you trained a pet

In supervised learning you hand the model the right answers. Reinforcement Learning (RL) is different: nobody tells the agent the correct move. Instead it tries an action, sees what happens, and gets a reward (a treat) or a penalty (a telling-off), then adjusts so it earns more reward next time. It is much like training a dog to sit.

This connects straight back to the agent model from Unit 2: an agent senses its environment and acts on it. RL just adds one thing — a reward signal that tells the agent how well it is doing.

The five words you need

State — the situation the agent is in right now (e.g. which square it stands on).
Action — a move it can make from that state (up, down, left, right).
Reward — a number the environment gives back: positive for good, negative for bad.
Policy — the agent’s strategy: which action to pick in each state.
Q-value: a score for a (state, action) pair: “how much total reward do I expect if I take this action here?”

Learning means improving the policy by updating Q-values from experience, so over time the agent prefers high-reward actions.
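The five terms map onto very simple Python objects. The sketch below is only an illustration, assuming a hypothetical world with two squares named "A" and "B"; the Q-values are made up, and the helper `greedy_policy` is a name invented for this example.

```python
# Hypothetical two-square world: every name and number here is illustrative.
ACTIONS = ["up", "down", "left", "right"]

# Q-table: one score per (state, action) pair.
Q = {
    ("A", "up"): 0.0, ("A", "down"): 0.0, ("A", "left"): -1.0, ("A", "right"): 2.5,
    ("B", "up"): 1.0, ("B", "down"): 0.0, ("B", "left"): 0.5, ("B", "right"): 0.0,
}

def greedy_policy(state):
    """Policy: in each state, pick the action with the highest Q-value."""
    return max(ACTIONS, key=lambda action: Q[(state, action)])

print(greedy_policy("A"))  # right
print(greedy_policy("B"))  # up
```

Here the policy is read straight off the Q-table, so changing a Q-value can change which action the agent prefers.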

The agent–environment loop

RL repeats one loop over and over, sometimes for millions of rounds:

Look at the current state.
Choose an action (early on, partly at random, to explore).
Do the action; the environment returns a reward and the new state.
Update the Q-value for the action you just took, nudging it toward the reward you received plus the discounted value of the best action available in the new state.
Repeat from step 1 in the new state.

Reinforcement learning: act, get a reward, update, repeat

# The RL loop in pseudo-Python (the learning lives in update_Q)
state = start_state
done = False
while not done:
    action = choose_action(state)          # explore or exploit
    reward, next_state, done = environment(state, action)   # done: has the episode ended?
    update_Q(state, action, reward, next_state)   # learn from it
    state = next_state

Note: Output: (No output — this is the shape every RL program follows. The intelligence is inside update_Q, which slowly raises the Q-values of actions that lead to reward.)
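To see the loop actually run, here is a minimal sketch that fills in the three helper functions for a toy problem. Everything specific is an assumption made for illustration: a hypothetical corridor of five squares with a treat worth 10 on the last one, an exploration rate `EPSILON` of 0.2, and 200 training episodes. It is one possible way to complete the skeleton, not a reference implementation.

```python
import random

# Hypothetical corridor: squares 0..4, the treat is on square 4.
N_STATES, GOAL = 5, 4
ACTIONS = [-1, +1]                      # step left or step right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2   # EPSILON is an assumed exploration rate

def environment(state, action):
    """Return (reward, next_state, done) for one move."""
    next_state = min(max(state + action, 0), N_STATES - 1)  # walls at both ends
    done = next_state == GOAL
    reward = 10 if done else 0
    return reward, next_state, done

def choose_action(Q, state, rng):
    """Explore with probability EPSILON, otherwise exploit (ties broken at random)."""
    if rng.random() < EPSILON:
        return rng.choice(ACTIONS)
    best = max(Q[(state, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if Q[(state, a)] == best])

def update_Q(Q, state, action, reward, next_state):
    """Q-learning: nudge the estimate toward reward + discounted best next value."""
    best_next_Q = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next_Q - Q[(state, action)])

def train(episodes=200, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = choose_action(Q, state, rng)
            reward, next_state, done = environment(state, action)
            update_Q(Q, state, action, reward, next_state)
            state = next_state
    return Q

Q = train()
# Best-known action in each non-goal square after training.
learned = [max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)]
print(learned)  # [1, 1, 1, 1]
```

After training, the preferred action in every square is +1 (step right), which is the shortest route to the treat.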

One Q-learning update by hand

The famous rule for updating a Q-value is Q-learning. In words: nudge the old estimate toward the reward you just got plus the best you expect from the next state. The formula uses two dials — the learning rate alpha (how big a step to take, here 0.5) and the discount gamma (how much future reward counts, here 0.9):

A single Q-learning update with real numbers

# Q-learning update for the action we just took
# new_Q = old_Q + alpha * (reward + gamma * best_next_Q - old_Q)

old_Q       = 0.0    # what we thought this action was worth
reward      = 10     # the treat we just received
best_next_Q = 4.0    # best Q-value available from the next state
alpha, gamma = 0.5, 0.9

new_Q = old_Q + alpha * (reward + gamma * best_next_Q - old_Q)
print('updated Q =', new_Q)

Note: Output: updated Q = 6.8. Step by step: gamma * best_next_Q = 0.9 × 4 = 3.6; plus the reward 10 = 13.6; minus old_Q 0 = 13.6; times alpha 0.5 = 6.8; plus old_Q 0 = 6.8. The action that earned a treat is now worth 6.8 instead of 0, so the agent will favour it next time.
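A single update only moves the estimate part of the way. As a sketch, suppose the agent saw the same reward and the same best_next_Q four times in a row (a simplification made purely for illustration). The helper `q_update` is a hypothetical name; it applies the same rule, written as "move a fraction alpha toward the target", where the target is reward + gamma * best_next_Q = 13.6.

```python
def q_update(old_Q, reward, best_next_Q, alpha=0.5, gamma=0.9):
    """One Q-learning update: move old_Q a fraction alpha toward the target."""
    target = reward + gamma * best_next_Q
    return old_Q + alpha * (target - old_Q)

Q = 0.0
history = []
for _ in range(4):
    Q = q_update(Q, reward=10, best_next_Q=4.0)
    history.append(round(Q, 3))

print(history)  # [6.8, 10.2, 11.9, 12.75]
```

With alpha = 0.5, each update closes half of the remaining gap to 13.6. A smaller alpha would learn more slowly but would be less thrown off by a single unusual reward.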

The explore-vs-exploit tension

An RL agent faces a dilemma every step: exploit (pick the action it already thinks is best) or explore (try something new that might be even better). Too much exploiting and it never discovers good moves; too much exploring and it never cashes in. Real agents usually explore a lot early on, then settle down.
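One common way to balance the two is an epsilon-greedy rule: with probability epsilon pick a random action, otherwise pick the best-known one, and shrink epsilon as training goes on. The sketch below assumes that approach; the function names, the example Q-values and the decay settings (start at 1.0, floor of 0.05, factor 0.99 per step) are all illustrative choices, not values from this lesson.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """Explore with probability epsilon, otherwise exploit the best-known action."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))       # explore: any action
    return max(q_values, key=q_values.get)      # exploit: highest Q-value

def decayed_epsilon(step, start=1.0, end=0.05, decay=0.99):
    """Shrink epsilon geometrically from start toward a floor of end."""
    return max(end, start * decay ** step)

rng = random.Random(0)
q_values = {"up": 0.0, "down": 0.0, "left": 1.0, "right": 6.8}

print(epsilon_greedy(q_values, epsilon=0.0, rng=rng))   # right
print(decayed_epsilon(0), decayed_epsilon(1000))        # 1.0 0.05
```

At step 0 the agent explores on every move; after many steps it explores only 5% of the time, which matches the "explore early, settle down later" pattern.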

Where RL is used

Area	What the reward is	Famous example
Games	Winning the game	AlphaGo defeating Go champion Lee Sedol (2016)
Robotics	Staying upright / reaching a target	Robots learning to walk or grasp
Recommendations	Clicks / watch time	Tuning what a feed shows you
Control systems	Energy saved	Cooling data centres efficiently

Watch out: RL is powerful but tricky: rewards are often delayed (you only win at the end of the game), so the agent must figure out which earlier moves deserve the credit. It also needs huge numbers of trials, which is why much RL is trained in simulators, not the real world.
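The discount gamma is one of the tools that spreads credit back to earlier moves. As a rough sketch, assuming a hypothetical episode of three unrewarded moves followed by a win worth 10, the function below (an illustrative helper, not part of Q-learning itself) works out how much discounted reward followed each move:

```python
def discounted_returns(rewards, gamma=0.9):
    """Return-to-go at each step: reward now plus gamma times the return after it."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):      # work backwards from the end of the episode
        running = reward + gamma * running
        returns.append(running)
    return list(reversed(returns))

# Hypothetical episode: three unrewarded moves, then a win worth 10.
rewards = [0, 0, 0, 10]
print([round(g, 2) for g in discounted_returns(rewards)])  # [7.29, 8.1, 9.0, 10.0]
```

Even the first move, which earned nothing at the time, ends up with a value of 7.29 because it led toward the win. Moves closer to the reward get more of the credit.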

Tip: Remember the one-line summary: supervised learns from answers, unsupervised finds groups, reinforcement learns from rewards by trial and error.

Q. In reinforcement learning, what does the agent actually receive after each action?

Answer: RL gives no labelled answers — only a reward signal and the resulting state. The agent must learn good behaviour from those rewards over many trials.

✍️ Practice

Redo the Q-update with reward = 0 and best_next_Q = 10. What is the new Q?
For a maze robot, write down a sensible reward for: reaching the exit, hitting a wall, and taking a normal step.

🏠 Homework

Describe a real situation (a game or a habit) as an RL problem: name the state, the actions, and the reward.