How Models Learn: Cost Functions & Gradient Descent
Under the hood, training is just walking downhill on an error landscape to find the best settings.
What you will learn
- Explain what a cost (loss) function measures
- Describe gradient descent in plain words
- See how the learning rate changes how fast the model learns
What does “fit” actually do?
Every time you call .fit(), the model is searching for the best settings — for a straight line, the best slope and intercept. But what does “best” mean, and how does it find them? Two ideas answer that: a cost function and gradient descent.
The cost function: a score for how wrong you are
A cost function (also called a loss function) is a single number that measures how wrong the model’s current predictions are. Big number = very wrong; small number = close. Training means changing the settings to make this number as small as possible.
A common cost for regression is the mean squared error: for each row, take the gap between the prediction and the truth, square it, then average. Let us compute it for a line that is a bit off.
# True values vs a model's predictions
y_true = [10, 20, 30]
y_pred = [12, 18, 33] # each guess is a little off
errors = [(t - p) for t, p in zip(y_true, y_pred)]
squared = [e**2 for e in errors]
cost = sum(squared) / len(squared)
print('Errors: ', errors)
print('Squared:', squared)
print('Cost (mean squared error):', round(cost, 2))
Note: Output:
Errors: [-2, 2, -3]
Squared: [4, 4, 9]
Cost (mean squared error): 5.67
The cost is 5.67. If we nudge the line so its predictions move closer to the truth, every squared error shrinks and the cost drops. Training’s whole job is to push this number down.
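To make the "nudge" concrete, here is a small sketch that scores two sets of predictions against the same truth. The helper name mean_squared_error and the closer predictions are made up for illustration; only the first set of predictions comes from the example above.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared gaps between truth and prediction."""
    gaps = [t - p for t, p in zip(y_true, y_pred)]
    return sum(g ** 2 for g in gaps) / len(gaps)

y_true = [10, 20, 30]
far_off = [12, 18, 33]   # the predictions from the example above
closer = [11, 19, 31]    # hypothetical predictions after a small nudge

cost_far = mean_squared_error(y_true, far_off)
cost_close = mean_squared_error(y_true, closer)

print('Cost before nudge:', round(cost_far, 2))
print('Cost after nudge: ', round(cost_close, 2))
```

Every gap in the second set is smaller, so the cost falls from 5.67 to 1.0.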
Gradient descent: walking downhill
Imagine the cost as a valley. Every possible setting of the model is a spot on the hillside, and its height is the cost. The lowest point in the valley is the best model. Gradient descent is how the model walks down to it.
- Start somewhere on the hill (a random guess for the settings).
- Look at the slope under your feet — which way is downhill (the gradient).
- Take a small step in that downhill direction.
- Repeat: each step lowers the cost a little.
- Stop when the ground is flat — you have reached the bottom (the best settings).
In effect, you are rolling a ball into the lowest point of the error valley. Each step makes the predictions a bit better.
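The steps above can be sketched for a straight line with two settings, a slope and an intercept. This is a minimal illustration, not how any particular library does it: the data, the starting point, the learning rate of 0.05, and the 2000 steps are all assumed values chosen so the example settles.

```python
# Hypothetical data that lies exactly on the line y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

def line_cost(slope, intercept):
    """Mean squared error of the line slope * x + intercept on the data."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

slope, intercept = 0.0, 0.0   # start somewhere on the hill
learning_rate = 0.05          # size of each downhill step (assumed value)
start_cost = line_cost(slope, intercept)

for step in range(2000):
    # Slope of the cost with respect to each setting (the gradient)
    gaps = [slope * x + intercept - y for x, y in zip(xs, ys)]
    grad_slope = 2 * sum(g * x for g, x in zip(gaps, xs)) / len(xs)
    grad_intercept = 2 * sum(gaps) / len(xs)
    # Step downhill: move each setting against its gradient
    slope -= learning_rate * grad_slope
    intercept -= learning_rate * grad_intercept

end_cost = line_cost(slope, intercept)
print('slope', round(slope, 2), 'intercept', round(intercept, 2))
print('cost went from', round(start_cost, 2), 'to', round(end_cost, 6))
```

The settings should end up very close to slope 2 and intercept 1, the line the data was built from.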
The learning rate: how big each step is
The learning rate is the size of each downhill step. It matters a lot:
| Learning rate | What happens | Result |
|---|---|---|
| Too small | Tiny steps | Learns correctly but very slowly |
| Just right | Sensible steps | Reaches the bottom efficiently |
| Too big | Giant leaps that overshoot | Bounces around, may never settle |
Here is a stripped-down gradient descent guessing a single number (the true value is 10). Watch the guess march toward 10, step by step.
target = 10.0
guess = 0.0
learning_rate = 0.3
for step in range(5):
    error = guess - target # how far off we are
    guess = guess - learning_rate * error # step downhill
    print('Step', step + 1, '-> guess', round(guess, 2))
Note: Output:
Step 1 -> guess 3.0
Step 2 -> guess 5.1
Step 3 -> guess 6.57
Step 4 -> guess 7.6
Step 5 -> guess 8.32
Each step moves the guess closer to 10: fast at first (big error, big step) and slowing as it nears the target, because each step is proportional to the remaining error. That slowing-down near the bottom is gradient descent in action.
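Why does subtracting learning_rate * error count as a downhill step? One way to read it, as an assumption that the text does not state: treat the cost for this toy problem as half the squared gap, 0.5 * (guess - target)**2. The slope of that cost at any guess is exactly guess - target, which is the error. The sketch below checks this numerically; the names cost and numeric_slope are illustrative.

```python
target = 10.0

def cost(guess):
    # Assumed cost for this toy problem: half the squared gap
    return 0.5 * (guess - target) ** 2

def numeric_slope(guess, h=1e-6):
    # Central-difference estimate of the slope of the cost at `guess`
    return (cost(guess + h) - cost(guess - h)) / (2 * h)

for guess in [0.0, 3.0, 8.0, 12.0]:
    error = guess - target
    print('guess', guess, 'error', error, 'slope', round(numeric_slope(guess), 4))
```

The slope is negative when the guess is below the target and positive when it is above, so stepping against the slope always moves toward the target.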
Watch out: A learning rate that is too high makes the guess overshoot and bounce further away each step instead of settling. If training “explodes” into huge numbers, lowering the learning rate is the first fix to try.
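As a rough illustration of that warning, the sketch below reruns the toy loop with two learning rates and tracks how far each guess is from the target. The function name run_descent and the rate 2.5 are chosen for illustration. For this particular toy loop, each step multiplies the distance to the target by |1 - learning_rate|, so a rate of 2.5 makes the distance grow 1.5 times per step.

```python
def run_descent(learning_rate, steps=5, target=10.0, start=0.0):
    """Repeat the toy update and return the distance from the target after each step."""
    guess = start
    distances = []
    for _ in range(steps):
        error = guess - target
        guess = guess - learning_rate * error
        distances.append(abs(guess - target))
    return distances

stable = run_descent(0.3)    # the rate used in the loop above
too_big = run_descent(2.5)   # assumed "too high" rate for this toy problem

print('rate 0.3 distances:', [round(d, 2) for d in stable])
print('rate 2.5 distances:', [round(d, 2) for d in too_big])
```

With the smaller rate the distances shrink every step; with the larger one they grow, which is the "exploding" behaviour described above.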
Tip: You will not usually code gradient descent yourself. scikit-learn handles the optimization inside .fit(), using gradient descent for some models and other solvers for others. But knowing it is just “downhill steps on a cost function” demystifies training, neural networks, and the learning_rate you met in boosting.
Q. What is gradient descent doing during training?
✍️ Practice
- Change learning_rate to 1.5 in the loop and watch the guess overshoot and bounce.
- Change it to 0.05 and note how many more steps it would need to reach the target.
🏠 Homework
- In your own words, explain the “ball rolling into a valley” picture of gradient descent and what the cost function and learning rate each represent.