
Regularization & the Bias–Variance Tradeoff

Penalise over-complex models to stop overfitting — and learn the framework that names every model error.

What you will learn

Explain bias and variance in plain words
Describe how L1 (Lasso) and L2 (Ridge) tame a model
Train Ridge and Lasso and see them shrink coefficients

Two sources of error: bias and variance

Earlier you met overfitting and underfitting. The bias–variance tradeoff is the formal framework behind them — the language professionals use to diagnose a model.

Bias = error from being too simple. A high-bias model makes the same kind of mistake everywhere because it cannot capture the real pattern. High bias = underfitting.
Variance = error from being too sensitive. A high-variance model changes wildly with small data changes because it chases noise. High variance = overfitting.
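To make the two definitions concrete, here is a minimal simulation sketch. It is not part of the original lesson: the sine-shaped target, the noise level and the polynomial degrees (1 and 9) are illustrative assumptions. It refits each model on many training sets that differ only in their noise, then measures how far the average prediction sits from the truth (bias squared) and how much the predictions move from one refit to the next (variance).

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)           # fixed seed so the run is repeatable
x_train = np.linspace(0, 1, 20)          # same inputs every time; only the noise changes
x_test = np.linspace(0.05, 0.95, 50)     # points where predictions are compared


def true_f(x):
    """Hypothetical 'real pattern' the models are trying to learn."""
    return np.sin(2 * np.pi * x)


def bias_and_variance(degree, n_datasets=200, noise=0.3):
    """Refit a polynomial of the given degree on many noisy copies of the data."""
    preds = np.empty((n_datasets, x_test.size))
    for i in range(n_datasets):
        y_train = true_f(x_train) + rng.normal(0, noise, x_train.size)
        model = Polynomial.fit(x_train, y_train, degree, domain=[0, 1])
        preds[i] = model(x_test)
    avg_pred = preds.mean(axis=0)
    bias_sq = np.mean((avg_pred - true_f(x_test)) ** 2)   # systematic miss
    variance = np.mean(preds.var(axis=0))                  # spread across refits
    return bias_sq, variance


simple_bias, simple_var = bias_and_variance(degree=1)
flexible_bias, flexible_var = bias_and_variance(degree=9)
print(f"degree 1: bias^2={simple_bias:.3f}  variance={simple_var:.3f}")
print(f"degree 9: bias^2={flexible_bias:.3f}  variance={flexible_var:.3f}")
```

Under these assumptions the straight line should show the larger bias (it cannot bend to follow the curve) and the degree-9 polynomial the larger variance (it reacts to every change in the noise).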

A dartboard analogy: high bias is darts tightly grouped but far from the bullseye (consistently wrong in the same direction). High variance is darts scattered widely across the board (each throw lands somewhere different). You want them tightly grouped on the centre: low bias and low variance. The tradeoff is that pushing one down often nudges the other up, so you look for the sweet spot between the two.

	High bias	High variance
Model is	Too simple	Too complex
Problem	Underfitting	Overfitting
Train score	Low	High
Test score	Low (close to train score)	Low (well below train score)
Fix	More powerful model / features	Simplify / regularize / more data
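The table can be read off a model's train and test scores. The sketch below illustrates that diagnosis on made-up data; the noisy curve, the 50/50 split and the polynomial degrees (1 and 12) are assumptions chosen only to produce one too-simple and one too-complex model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(40, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(0, 0.3, size=40)   # made-up noisy curve

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

scores = {}
for degree in (1, 12):   # 1 = likely too simple, 12 = likely too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_r2 = model.score(X_train, y_train)
    test_r2 = model.score(X_test, y_test)
    scores[degree] = (train_r2, test_r2)
    print(f"degree {degree:2d}: train R^2={train_r2:.2f}  test R^2={test_r2:.2f}")
```

A low score on both sets points to high bias; a high train score with a much lower test score points to high variance.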

Regularization: a penalty for complexity

Regularization is the standard cure for high variance (overfitting). The idea is simple: add a penalty to the cost function for making the model too complex — specifically for letting its coefficients (the per-feature weights) grow large. The model must now balance fitting the data against keeping its weights small, so it stops chasing noise.

L2 / Ridge — penalises the sum of squared weights. It shrinks all weights toward zero but rarely to exactly zero. Great general-purpose default.
L1 / Lasso — penalises the sum of absolute weights. It can drive some weights exactly to zero, effectively deleting useless features — so it doubles as automatic feature selection.
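The two penalties can be written out in a few lines. This is a simplified sketch, not scikit-learn's internal code: the function names are hypothetical, and scikit-learn's Ridge and Lasso scale the data-fit term differently from the plain mean squared error used here.

```python
import numpy as np


def l2_penalty(weights, alpha):
    """Ridge-style penalty: alpha times the sum of squared weights."""
    return alpha * np.sum(np.square(weights))


def l1_penalty(weights, alpha):
    """Lasso-style penalty: alpha times the sum of absolute weights."""
    return alpha * np.sum(np.abs(weights))


def penalised_cost(X, y, weights, alpha, penalty):
    """Mean squared error plus the chosen complexity penalty."""
    errors = X @ weights - y
    return np.mean(errors ** 2) + penalty(weights, alpha)


weights = np.array([3.0, -4.0])
print(l2_penalty(weights, alpha=1.0))   # 3^2 + (-4)^2 = 25.0
print(l1_penalty(weights, alpha=1.0))   # |3| + |-4|   = 7.0
```

Notice that the squared penalty punishes the larger weight disproportionately (16 of the 25), while the absolute penalty charges every unit of weight the same. That constant charge is what lets L1 push small weights all the way to zero.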

A worked example: Ridge and Lasso shrink the weights

We make data where only the first feature truly matters and the other two are noise, then compare plain regression with Ridge and Lasso. The strength of the penalty is set by alpha — bigger alpha means a harsher penalty.

Compare the weights from plain regression, Ridge and Lasso

from sklearn.linear_model import LinearRegression, Ridge, Lasso

# feature 1 drives y; features 2 and 3 are random noise
X = [[1, 5, 9], [2, 1, 3], [3, 8, 2], [4, 2, 7], [5, 6, 1]]
y = [10, 20, 30, 40, 50]      # y is exactly 10 * feature 1

for name, model in [('Plain', LinearRegression()),
                    ('Ridge', Ridge(alpha=1.0)),
                    ('Lasso', Lasso(alpha=1.0))]:
    model.fit(X, y)
    # float() keeps the printout clean on newer NumPy versions
    print(name, 'weights:', [round(float(c), 2) for c in model.coef_])

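Note: Output (approximate; the exact digits can vary slightly with your scikit-learn version):
Plain weights: [10.0, 0.0, 0.0]
Ridge weights: [8.75, -0.01, -0.31]
Lasso weights: [9.46, 0.0, -0.03]
All three found that feature 1 matters most (weight near 10). Because y is exactly 10 times feature 1, plain regression recovers that pattern perfectly here. Ridge shrinks the big weight and, since feature 3 happens to be correlated with feature 1 in these five rows, shifts a little of it onto feature 3; none of its weights is exactly zero. Lasso shrinks feature 1 less, sets feature 2 to exactly 0.0 (it deleted that feature) and leaves only a tiny weight on feature 3. That is L1 vs L2 in one run: L1 can produce exact zeros, L2 only shrinks.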

Why does this help? On real, noisy data an unregularized model is free to hand large weights to noise features. The penalty makes large weights costly, so a regularized model is less able to memorise flukes in a small dataset and tends to generalise better to new rows. The bigger you make alpha, the stronger the pull toward small (or zero) weights.
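To see the effect of alpha directly, the sketch below refits Ridge and Lasso on the worked example's data for several penalty strengths. The alpha values are illustrative choices, not recommendations.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Same toy data as the worked example
X = [[1, 5, 9], [2, 1, 3], [3, 8, 2], [4, 2, 7], [5, 6, 1]]
y = [10, 20, 30, 40, 50]

alphas = [0.1, 1.0, 10.0, 100.0]   # illustrative values only
ridge_sizes = []                   # overall size of the Ridge weight vector
lasso_coefs = []                   # Lasso weights for each alpha
for alpha in alphas:
    ridge = Ridge(alpha=alpha).fit(X, y)
    lasso = Lasso(alpha=alpha).fit(X, y)
    ridge_sizes.append(float(np.linalg.norm(ridge.coef_)))
    lasso_coefs.append(lasso.coef_)
    print(f"alpha={alpha:>5}: Ridge {ridge.coef_.round(2).tolist()}  "
          f"Lasso {lasso.coef_.round(2).tolist()}")
```

The Ridge weights should keep shrinking as alpha grows without ever hitting zero, while at the largest alpha here the Lasso weights should all reach zero. At that point the model ignores every feature, which is underfitting caused by too much penalty.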

Watch out: Regularization needs scaled features (the Feature Scaling lesson), because the penalty is based on the size of the weights, and weight size depends on each feature’s scale. Unscaled, the penalty unfairly punishes features measured in small numbers, because they need larger weights to have the same effect on the prediction.
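One common way to handle this is to put the scaler and the model in a single pipeline. The sketch below is an illustration under assumptions: it reuses the worked example's data and invents a unit change (feature 1 divided by 1000) to show how the fit reacts with and without scaling.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = np.array([[1, 5, 9], [2, 1, 3], [3, 8, 2], [4, 2, 7], [5, 6, 1]], dtype=float)
y = np.array([10, 20, 30, 40, 50], dtype=float)

# Hypothetical unit change: the same feature 1, recorded in numbers 1000x smaller
X_rescaled = X.copy()
X_rescaled[:, 0] /= 1000.0


def fit_predict(model, features):
    """Fit on the given features and return predictions for the same rows."""
    return model.fit(features, y).predict(features)


# Without scaling, the unit change alters what Ridge learns
raw_a = fit_predict(Ridge(alpha=1.0), X)
raw_b = fit_predict(Ridge(alpha=1.0), X_rescaled)

# With scaling inside the pipeline, the unit change no longer matters
scaled_a = fit_predict(make_pipeline(StandardScaler(), Ridge(alpha=1.0)), X)
scaled_b = fit_predict(make_pipeline(StandardScaler(), Ridge(alpha=1.0)), X_rescaled)

print("raw predictions match:   ", np.allclose(raw_a, raw_b))
print("scaled predictions match:", np.allclose(scaled_a, scaled_b))
```

The first comparison should come out False and the second True: once feature 1 is stored in tiny numbers, unscaled Ridge would need a huge weight to use it and the penalty blocks that, whereas the scaled pipeline sees identical standardized inputs either way.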

Tip: Quick guide: reach for Ridge as a safe default to curb overfitting; reach for Lasso when you suspect many features are useless and you want the model to pick the important ones for you.

Q. What is the main difference between L1 (Lasso) and L2 (Ridge) regularization?

Answer: Both penalise large weights to fight overfitting. L1/Lasso can set some weights to exactly zero, effectively selecting features; L2/Ridge shrinks all weights smoothly but seldom to zero.

✍️ Practice

Raise Lasso’s alpha to 5.0 and watch even feature 1’s weight shrink — note the danger of too much penalty.
In one sentence each, give a real model that is high-bias and one that is high-variance.

🏠 Homework

Explain the dartboard analogy for bias and variance in your own words, then say which type of regularization you would use to delete useless features and why.