Regularization & the Bias–Variance Tradeoff
Penalise over-complex models to stop overfitting — and learn the framework that names every model error.
What you will learn
- Explain bias and variance in plain words
- Describe how L1 (Lasso) and L2 (Ridge) tame a model
- Train Ridge and Lasso and see them shrink coefficients
Two sources of error: bias and variance
Earlier you met overfitting and underfitting. The bias–variance tradeoff is the formal framework behind them — the language professionals use to diagnose a model.
- Bias = error from being too simple. A high-bias model makes the same kind of mistake everywhere because it cannot capture the real pattern. High bias = underfitting.
- Variance = error from being too sensitive. A high-variance model changes wildly with small data changes because it chases noise. High variance = overfitting.
A dartboard analogy: high bias is darts tightly grouped but far from the bullseye (consistently wrong). High variance is darts scattered widely across the board (unpredictable from one throw to the next). You want them grouped near the centre: low bias and low variance. The tradeoff is that pushing one down often nudges the other up, so you look for the sweet spot between them.
| | High bias | High variance |
|---|---|---|
| Model is | Too simple | Too complex |
| Problem | Underfitting | Overfitting |
| Train score | Low | High |
| Test score | Low | Low |
| Fix | More powerful model / features | Simplify / regularize / more data |
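The train/test pattern in the table can be sketched with a small experiment. This is a minimal illustration under assumed conditions, not part of the lesson's own examples: the sine-shaped data, the noise level, the sample sizes and the polynomial degrees are all made up for the demo, and it reports mean squared error (lower is better) rather than a score (higher is better). Typically the straight line shows high error on both sets, while the very flexible curve shows low train error and noticeably higher test error.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable


def make_data(n):
    """Hypothetical data: a sine-shaped pattern plus random noise."""
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + rng.normal(0, 0.2, n)
    return x, y


def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


x_train, y_train = make_data(15)    # small training set
x_test, y_test = make_data(200)     # fresh rows the model never saw

results = {}
for degree in (1, 3, 10):           # too simple, moderate, very flexible
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train),
                       mse(coeffs, x_test, y_test))
    print(f"degree {degree:>2}: train MSE = {results[degree][0]:.3f}, "
          f"test MSE = {results[degree][1]:.3f}")
```

Training error can only go down as the polynomial gets more flexible, so the train column alone never reveals overfitting. The gap between the train and test columns is the diagnostic.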
Regularization: a penalty for complexity
Regularization is the standard cure for high variance (overfitting). The idea is simple: add a penalty to the cost function for making the model too complex — specifically for letting its coefficients (the per-feature weights) grow large. The model must now balance fitting the data against keeping its weights small, so it stops chasing noise.
- L2 / Ridge — penalises the sum of squared weights. It shrinks all weights toward zero but rarely to exactly zero. Great general-purpose default.
- L1 / Lasso — penalises the sum of absolute weights. It can drive some weights exactly to zero, effectively deleting useless features — so it doubles as automatic feature selection.
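The two penalties above can be written out in a few lines. This is a conceptual sketch only: the helper names, the weights and the prediction errors are hypothetical, and real libraries such as scikit-learn scale the error and penalty terms with their own constants.

```python
def l2_penalty(weights):
    """Ridge penalty: the sum of squared weights."""
    return sum(w ** 2 for w in weights)


def l1_penalty(weights):
    """Lasso penalty: the sum of absolute weights."""
    return sum(abs(w) for w in weights)


def penalised_cost(errors, weights, alpha, penalty):
    """Mean squared error plus alpha times the chosen penalty."""
    mse = sum(e ** 2 for e in errors) / len(errors)
    return mse + alpha * penalty(weights)


weights = [3.0, -0.5, 0.0]        # hypothetical per-feature weights
errors = [1.0, -1.0, 2.0, 0.0]    # hypothetical prediction errors

print(l2_penalty(weights))  # 9.25
print(l1_penalty(weights))  # 3.5
print(penalised_cost(errors, weights, alpha=0.1, penalty=l2_penalty))
```

Squaring makes the L2 penalty care mostly about the largest weight (3.0 contributes 9.0 of the 9.25), while the L1 penalty charges every unit of weight the same, which is part of why it is willing to push small weights all the way to zero.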
A worked example: Ridge and Lasso shrink the weights
We make data where only the first feature truly matters and the other two are noise, then compare plain regression with Ridge and Lasso. The strength of the penalty is set by alpha — bigger alpha means a harsher penalty.
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# feature 1 drives y; features 2 and 3 are random noise
X = [[1, 5, 9], [2, 1, 3], [3, 8, 2], [4, 2, 7], [5, 6, 1]]
y = [10, 20, 30, 40, 50]  # y is exactly 10 * feature1

for name, model in [('Plain', LinearRegression()),
                    ('Ridge', Ridge(alpha=1.0)),
                    ('Lasso', Lasso(alpha=1.0))]:
    model.fit(X, y)
    # coef_ holds one learned weight per feature
    print(name, 'weights:', [round(float(c), 2) for c in model.coef_])

Note: Output (approximate; exact digits can vary slightly with your scikit-learn version):
Plain weights: [10.0, 0.0, 0.0]
Ridge weights: [8.75, -0.01, -0.31]
Lasso weights: [9.46, 0.0, -0.03]
All three found that feature 1 matters most (it has by far the largest weight). Plain regression recovers the weight of 10 because this y is an exact multiple of feature 1. Lasso pushed feature 2 to exactly 0.0, deleting it, and left only a sliver on feature 3. Ridge shrank feature 1's weight but set nothing to exactly zero. That is L1 vs L2 in one run.
Why does this help? Limiting how large the weights can grow makes it harder for a model to memorise flukes in a tiny dataset, so regularized models tend to generalise better to new rows. The bigger you make alpha, the stronger the pull toward small (or zero) weights.
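To see the effect of alpha directly, here is a small sketch that refits Lasso on the same five rows at several penalty strengths. The particular alpha values are arbitrary choices for illustration. With a large enough alpha (30.0 is enough for this data) every weight is forced to zero, and the model can only predict the average of y.

```python
from sklearn.linear_model import Lasso

# Same toy data as the worked example above
X = [[1, 5, 9], [2, 1, 3], [3, 8, 2], [4, 2, 7], [5, 6, 1]]
y = [10, 20, 30, 40, 50]

l1_norms = []  # total absolute weight at each alpha
for alpha in (0.1, 1.0, 5.0, 30.0):
    model = Lasso(alpha=alpha).fit(X, y)
    coefs = [float(c) for c in model.coef_]
    l1_norms.append(sum(abs(c) for c in coefs))
    print(f"alpha={alpha:>4}: weights={[round(c, 2) for c in coefs]}")
```

The total absolute weight never grows as alpha increases. A penalty that is too harsh wipes out the useful feature along with the noise, which is the underfitting side of the tradeoff.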
Watch out: Regularization needs scaled features (see the Feature Scaling lesson), because the penalty is based on the size of the weights, and weight size depends on each feature’s scale. Without scaling, the penalty unfairly punishes features measured in small numbers, since they need large weights to have the same effect.
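One common way to handle this is to put a scaler in front of the model. The sketch below assumes scikit-learn's StandardScaler and make_pipeline, which this lesson has not introduced, and the "feature 1 recorded in units 1000 times smaller" scenario is made up for the demo. It checks whether Ridge's predictions change when only the units of one feature change.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = np.array([[1, 5, 9], [2, 1, 3], [3, 8, 2], [4, 2, 7], [5, 6, 1]], dtype=float)
y = np.array([10, 20, 30, 40, 50], dtype=float)

# Hypothetical: the same data, but feature 1 stored in units 1000x smaller
X_small = X.copy()
X_small[:, 0] /= 1000.0

# Ridge on raw features: the penalty reacts to the change of units
bare_pred = Ridge(alpha=1.0).fit(X, y).predict(X)
bare_pred_small = Ridge(alpha=1.0).fit(X_small, y).predict(X_small)

# Ridge behind a scaler: every feature is standardised before the penalty applies
scaled_pred = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y).predict(X)
scaled_pred_small = (make_pipeline(StandardScaler(), Ridge(alpha=1.0))
                     .fit(X_small, y).predict(X_small))

print("raw Ridge unaffected by units?   ", np.allclose(bare_pred, bare_pred_small))
print("scaled Ridge unaffected by units?", np.allclose(scaled_pred, scaled_pred_small))
```

Without scaling, shrinking feature 1's numbers forces it to need a huge weight, which the penalty then suppresses, so the model largely ignores the one feature that matters. With the scaler in place the choice of units no longer affects the result.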
Tip: Quick guide: reach for Ridge as a safe default to curb overfitting; reach for Lasso when you suspect many features are useless and you want the model to pick the important ones for you.
Q. What is the main difference between L1 (Lasso) and L2 (Ridge) regularization?
✍️ Practice
- Raise Lasso’s alpha to 5.0 and watch even feature 1’s weight shrink; note the danger of too much penalty.
- In one sentence each, give a real model that is high-bias and one that is high-variance.
🏠 Homework
- Explain the dartboard analogy for bias and variance in your own words, then say which type of regularization you would use to delete useless features and why.