
How Models Learn: Cost Functions & Gradient Descent

Under the hood, training is just walking downhill on an error landscape to find the best settings.

What you will learn

Explain what a cost (loss) function measures
Describe gradient descent in plain words
See how the learning rate changes how quickly the model learns

What does “fit” actually do?

Every time you call .fit(), the model is searching for the best settings — for a straight line, the best slope and intercept. But what does “best” mean, and how does it find them? Two ideas answer that: a cost function and gradient descent.

The cost function: a score for how wrong you are

A cost function (also called a loss function) is a single number that measures how wrong the model’s current predictions are. Big number = very wrong; small number = close. Training means changing the settings to make this number as small as possible.

A common cost for regression is the mean squared error: for each row, take the gap between the prediction and the truth, square it, then average. Let us compute it for a line that is a bit off.

A cost function turns “how wrong” into one number

# True values vs a model's predictions
y_true = [10, 20, 30]
y_pred = [12, 18, 33]      # each guess is a little off

errors  = [(t - p) for t, p in zip(y_true, y_pred)]
squared = [e**2 for e in errors]
cost = sum(squared) / len(squared)

print('Errors: ', errors)
print('Squared:', squared)
print('Cost (mean squared error):', round(cost, 2))

Note: Output:
Errors:  [-2, 2, -3]
Squared: [4, 4, 9]
Cost (mean squared error): 5.67

The cost is 5.67. If we nudge the line so its predictions move closer to the truth, every squared error shrinks and the cost drops. Training’s whole job is to push this number down.
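To check the claim that closer predictions give a lower cost, here is a minimal sketch. The helper name mean_squared_error and the "nudged" predictions are made up for illustration; only y_true and the first set of predictions come from the example above.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared gaps between truth and prediction."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared) / len(squared)

y_true = [10, 20, 30]
y_pred_before = [12, 18, 33]   # the predictions used earlier in this lesson
y_pred_after = [11, 19, 31]    # hypothetical nudged predictions, each closer to the truth

cost_before = mean_squared_error(y_true, y_pred_before)
cost_after = mean_squared_error(y_true, y_pred_after)

print('Cost before nudge:', round(cost_before, 2))   # 5.67
print('Cost after nudge: ', round(cost_after, 2))    # 1.0
```

Every gap shrank from 2 or 3 down to 1, so the cost fell from 5.67 to 1.0. Squaring is why large misses dominate the score: the single gap of 3 contributed 9 of the original 17.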

Gradient descent: walking downhill

Imagine the cost as a valley. Every possible setting of the model is a spot on the hillside, and its height is the cost. The lowest point in the valley is the best model. Gradient descent is how the model walks down to it.

Start somewhere on the hill (a random guess for the settings).
Look at the slope under your feet (the gradient). It points uphill, so downhill is the opposite direction.
Take a small step in that downhill direction.
Repeat: each step lowers the cost a little.
Stop when the ground is flat — you have reached the bottom (the best settings).

Picture a ball rolling into the lowest point of the error valley: each step makes the predictions a bit better.
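The five steps above can be sketched for a straight line with two settings, a slope and an intercept. This is an illustrative sketch, not what any particular library runs: the toy data, the helper names cost and gradient, the learning rate of 0.05, the flatness threshold of 1e-6, and the cap of 2000 steps are all assumptions chosen for this example.

```python
# Toy data lying exactly on the line y = 2x + 1 (made up for illustration)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

def cost(slope, intercept):
    """Mean squared error of the line on the toy data."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient(slope, intercept):
    """How steeply the cost rises with each setting (partial derivatives of the MSE)."""
    n = len(xs)
    d_slope = sum(2 * (slope * x + intercept - y) * x for x, y in zip(xs, ys)) / n
    d_intercept = sum(2 * (slope * x + intercept - y) for x, y in zip(xs, ys)) / n
    return d_slope, d_intercept

slope, intercept = 0.0, 0.0        # 1. start somewhere on the hill
learning_rate = 0.05               # assumed step size that suits this data
start_cost = cost(slope, intercept)

for step in range(2000):           # 4. repeat
    d_slope, d_intercept = gradient(slope, intercept)      # 2. look at the slope underfoot
    if abs(d_slope) < 1e-6 and abs(d_intercept) < 1e-6:    # 5. stop when the ground is flat
        break
    slope -= learning_rate * d_slope                       # 3. small step downhill
    intercept -= learning_rate * d_intercept

print('Start cost:', round(start_cost, 2))
print('Final cost:', round(cost(slope, intercept), 6))
print('Slope:', round(slope, 3), 'Intercept:', round(intercept, 3))
```

On this data the loop should settle close to a slope of 2 and an intercept of 1, the line the toy points were generated from. Subtracting the gradient is what makes each step go downhill rather than uphill.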

The learning rate: how big each step is

The learning rate is the size of each downhill step. It matters a lot:

Learning rate	What happens	Result
Too small	Tiny steps	Learns correctly but very slowly
Just right	Sensible steps	Reaches the bottom efficiently
Too big	Giant leaps that overshoot	Bounces around, may never settle

Here is a stripped-down gradient descent guessing a single number (the true value is 10). Watch the guess march toward 10, step by step.

A tiny gradient-descent loop creeping toward the target

target = 10.0
guess = 0.0
learning_rate = 0.3

for step in range(5):
    error = guess - target          # how far off we are
    guess = guess - learning_rate * error   # step downhill
    print('Step', step + 1, '-> guess', round(guess, 2))

Note: Output:
Step 1 -> guess 3.0
Step 2 -> guess 5.1
Step 3 -> guess 6.57
Step 4 -> guess 7.6
Step 5 -> guess 8.32

Each step moves the guess closer to 10: fast at first (big error, big step) and slower as it nears the target. That slowing-down is characteristic of gradient descent, because the step size is proportional to the slope and the slope flattens out near the bottom.
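Why does subtracting learning_rate * error count as a downhill step? One way to see it, assuming the cost being minimized is half the squared error (the loop does not name its cost, so this is an assumption), with g for the guess, t for the target, and η for the learning rate:

```latex
\begin{aligned}
C(g) &= \tfrac{1}{2}(g - t)^2, \qquad \frac{dC}{dg} = g - t \\
g_{\text{new}} &= g - \eta\,(g - t) \\
g_{\text{new}} - t &= (1 - \eta)(g - t)
\end{aligned}
```

Under that assumption the error is exactly the slope of the cost, and each step multiplies the remaining gap by (1 - η). With η = 0.3 the gap shrinks by a factor of 0.7 per step: 10, 7, 4.9, 3.43, 2.40, which matches the guesses 3.0, 5.1, 6.57, 7.6 printed by the loop.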

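Watch out: A learning rate that is too high makes the guess overshoot the target. In the loop above, a rate between 1 and 2 jumps past 10 on every step but still closes in, while a rate above 2 lands further away each step instead of settling. If training “explodes” into huge numbers, lowering the learning rate is the first fix to try.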
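To compare step sizes side by side, the single-number loop can be wrapped in a function and run at several learning rates. The function name run_descent and the four rates tried here are illustrative choices, not part of the original loop.

```python
def run_descent(learning_rate, steps=5, target=10.0, guess=0.0):
    """Run the single-number loop and return the guess after each step."""
    history = []
    for _ in range(steps):
        error = guess - target                   # how far off we are
        guess = guess - learning_rate * error    # step downhill
        history.append(round(guess, 2))          # round only for display
    return history

for rate in (0.05, 0.3, 1.5, 2.5):
    print('learning_rate', rate, '->', run_descent(rate))
```

At 0.05 the guess barely leaves the start after five steps. At 0.3 it follows the same path as the loop above. At 1.5 it jumps past 10, then lands on alternating sides with a smaller miss each time. At 2.5 every jump lands further from 10 than the one before, which is the "exploding" behavior described above.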

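Tip: You will not usually code gradient descent yourself; scikit-learn handles the optimization inside .fit(). Not every estimator uses gradient descent (LinearRegression, for example, solves for the best line directly), but estimators such as SGDRegressor do take downhill steps like these. Knowing that training is “downhill steps on a cost function” demystifies neural networks and the learning_rate you met in boosting.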

Q. What is gradient descent doing during training?

Answer: Gradient descent repeatedly steps in the downhill direction of the cost function, gradually reducing the error until it reaches the lowest point — the best settings.

✍️ Practice

Change learning_rate to 1.5 in the loop and watch the guess overshoot and bounce.
Change it to 0.05 and note how many more steps it would need to reach the target.

🏠 Homework

In your own words, explain the “ball rolling into a valley” picture of gradient descent and what the cost function and learning rate each represent.