
Boosting & Gradient Boosting

Build trees one after another, each fixing the mistakes of the last. This is the technique behind many of the strongest models for tabular data.

What you will learn

Tell bagging apart from boosting
Explain how boosting learns from its own errors
Train a gradient boosting model and compare it to a forest

A different way to combine trees

A random forest grows many trees independently (so they can be built in parallel), each on a random sample of the rows, then lets them vote. That is called bagging. Boosting takes the opposite approach: it grows trees one at a time, in a chain, and each new tree focuses on the rows the previous trees got wrong.

Picture a team marking exam papers. The first marker does a rough job. The second marker does not re-mark everything — they concentrate on the papers the first one messed up. The third fixes what is still wrong, and so on. Each step boosts the result by patching the remaining mistakes. Add up everyone’s corrections and the final marking is excellent.

	Bagging (Random Forest)	Boosting (Gradient Boosting)
Trees built	In parallel, independently	One after another, in a chain
Each tree focuses on	A random slice of data	The previous trees’ mistakes
Main strength	Stable, hard to break	Often the highest accuracy
Main risk	Slightly less accurate	Can overfit if pushed too hard

How gradient boosting learns from errors

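The most popular form is gradient boosting. “Gradient” refers to the gradient of the loss function: after each tree, the model works out in which direction every prediction should move to reduce the loss, and points the next tree at that. For the common squared-error loss, that direction is simply the leftover error. Step by step: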

Make a first simple prediction (often just the average).
Measure the error — how far off each row is (the leftover, called the residual).
Train a small tree to predict that error, and add its correction to the running prediction.
Measure the new, smaller error and repeat — each tree shrinks what is left.
Stop after a set number of trees; the sum of all the corrections is the final model.
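The steps above can be sketched by hand for a regression problem with squared-error loss. This is a minimal illustration, not how scikit-learn implements it internally: the toy data, the `max_depth=2` trees, the tree count and the `learning_rate` value (a scaling factor on each correction) are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data, made up for illustration: y is roughly x squared plus noise.
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 60).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(scale=0.3, size=60)

n_trees = 50          # assumed value
learning_rate = 0.1   # assumed value: how much of each correction to keep

# Step 1: start from a simple prediction, the average of y.
baseline = y.mean()
prediction = np.full(len(y), baseline)
errors = [np.mean((y - prediction) ** 2)]
trees = []

for _ in range(n_trees):
    residual = y - prediction                      # Step 2: leftover error per row
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residual)                          # Step 3: small tree predicts the error
    prediction += learning_rate * tree.predict(X)  # ...and its scaled correction is added
    trees.append(tree)
    errors.append(np.mean((y - prediction) ** 2))  # Step 4: measure again and repeat


def predict(X_new):
    """Step 5: final model = starting average + sum of all scaled corrections."""
    out = np.full(len(X_new), baseline)
    for fitted_tree in trees:
        out += learning_rate * fitted_tree.predict(X_new)
    return out


print(f"Training MSE before any tree: {errors[0]:.3f}")
print(f"Training MSE after {n_trees} trees: {errors[-1]:.3f}")
```

Printing the whole `errors` list shows the training error shrinking tree by tree, which is the "each tree shrinks what is left" behaviour described in the list.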

A learning rate controls how big each correction is. Small, careful steps (a low learning rate) with more trees usually generalise better than a few big leaps.
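To see the learning rate at work, the sketch below fits scikit-learn's `GradientBoostingRegressor` with three different rates on the same synthetic data and prints the training and held-out error for each. The data, the 300 trees and `max_depth=2` are assumptions for illustration. Which rate wins on held-out data depends on the dataset, so the code only prints the numbers rather than promising a result.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data, made up for illustration: a noisy sine wave.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X.ravel()) + rng.normal(scale=0.3, size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

results = {}
for lr in (1.0, 0.1, 0.01):
    model = GradientBoostingRegressor(
        n_estimators=300, learning_rate=lr, max_depth=2, random_state=0)
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    results[lr] = (train_mse, test_mse)
    print(f"learning_rate={lr:<4}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

A large gap between a very low training error and a higher test error is the usual sign that the steps were too big for the number of trees used.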

A worked example

We classify our fruit again, comparing a random forest with GradientBoostingClassifier on the same data.

Comparing bagging (forest) with boosting on the fruit data

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

X = [[7, 150], [7, 170], [6, 140],     # apples
     [8, 110], [9, 120], [8, 100]]     # oranges
y = ['apple','apple','apple',
     'orange','orange','orange']

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
boost  = GradientBoostingClassifier(n_estimators=100,
             learning_rate=0.1, random_state=0).fit(X, y)

mystery = [[7, 160]]
print('Forest ->', forest.predict(mystery)[0])
print('Boost  ->', boost.predict(mystery)[0])

Note: Output:
Forest -> apple
Boost  -> apple
Both call the mystery fruit an apple. On this tiny dataset they agree, but on large, messy real-world tables boosting usually edges ahead, because it keeps drilling into the hard, easily-confused rows the forest treats the same as any other.
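To watch the chain of trees build up its answer on the fruit data, one option is `staged_predict_proba`, which yields the model's class probabilities after each successive tree. The sketch below repeats the data from the example so it runs on its own; the choice of checkpoints (1, 10 and 100 trees) is arbitrary.

```python
from sklearn.ensemble import GradientBoostingClassifier

# Same fruit data as the worked example.
X = [[7, 150], [7, 170], [6, 140],     # apples
     [8, 110], [9, 120], [8, 100]]     # oranges
y = ['apple', 'apple', 'apple',
     'orange', 'orange', 'orange']

boost = GradientBoostingClassifier(n_estimators=100,
                                   learning_rate=0.1, random_state=0).fit(X, y)

# Find which probability column belongs to 'apple'.
apple_col = list(boost.classes_).index('apple')

# One probability per stage: the model's belief after 1 tree, 2 trees, ...
apple_prob = [stage[0][apple_col]
              for stage in boost.staged_predict_proba([[7, 160]])]

for n in (1, 10, 100):
    print(f"after {n:3d} trees: P(apple) = {apple_prob[n - 1]:.3f}")
```

Each tree nudges the probability a little further in the same direction, which is the "small corrections added together" idea in action.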

XGBoost, LightGBM and friends

You will hear names like XGBoost, LightGBM and CatBoost. These are faster, more polished libraries that all do gradient boosting under the hood. They are a staple of winning Kaggle entries and of real-world work on tabular data (rows and columns, like a spreadsheet). scikit-learn’s GradientBoostingClassifier teaches the same core idea; the others add engineering on top, such as faster split finding and extra regularisation, so they scale better to big data.

Watch out: Boosting can overfit if you use too many trees or too high a learning rate, because it will eventually start memorising the noise. Use a modest learning_rate (around 0.05–0.1), watch the score on held-out validation data rather than the training score, and tune the number of trees (the Hyperparameter Tuning lesson shows how).
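One way to watch for this is to score the model on held-out data after every tree and see where the score stops improving. The sketch below does that with `staged_predict`; the synthetic dataset from `make_classification`, the 70/30 split and the 300 trees are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, deliberately noisy data (flip_y mislabels some of the rows).
X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                           flip_y=0.1, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = GradientBoostingClassifier(n_estimators=300, learning_rate=0.1,
                                   random_state=0).fit(X_train, y_train)

# Accuracy after 1 tree, 2 trees, ... on both the training and validation rows.
train_acc = [accuracy_score(y_train, pred) for pred in model.staged_predict(X_train)]
val_acc = [accuracy_score(y_val, pred) for pred in model.staged_predict(X_val)]

best_n = int(np.argmax(val_acc)) + 1   # tree count with the best validation score
print(f"Best validation accuracy {max(val_acc):.3f} at {best_n} trees")
print(f"With all 300 trees: train {train_acc[-1]:.3f}, validation {val_acc[-1]:.3f}")
```

If the training accuracy keeps climbing while the validation accuracy flattens or drops, the extra trees are fitting noise. scikit-learn can also stop automatically via the `n_iter_no_change` and `validation_fraction` parameters of `GradientBoostingClassifier`.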

Tip: Rule of thumb on tabular data: a random forest is the safe, no-fuss default; gradient boosting is what you reach for when you want to squeeze out the last bit of accuracy and are willing to tune it.

Q. How does boosting differ from bagging (random forest)?

Answer: Bagging builds independent trees in parallel and votes. Boosting builds trees sequentially, each one focusing on the errors left by the trees before it.

✍️ Practice

Change learning_rate to 1.0 and then to 0.01 on the fruit data, and note whether the prediction wobbles at the extremes.
Write one sentence each explaining when you would pick a random forest and when you would pick gradient boosting.

🏠 Homework

Explain the “team of exam markers” analogy for boosting in your own words, and say why each new tree focuses on the previous mistakes.