Evaluating Models: Splits, Metrics & Overfitting
A model that looks perfect on its training data can be useless in the wild — so you test it honestly and measure it with the right metric.
What you will learn
- Use a train/test split to get an honest score
- Pick the right metric for classification vs regression
- Recognise overfitting and underfitting
A high training score can be a lie
You trained your first model in an earlier lesson. The danger is trusting it. A model can memorise its training rows and score near-perfectly on them, then fail on new data — like a student who memorised last year’s exam answers but cannot solve a fresh question. Evaluation is how you find out whether a model actually learned the pattern or just memorised. A model you have not honestly evaluated is not trustworthy.
The train/test split — the honest exam
The core trick: before training, split your data. Train on most of it (the “study material”), then test on a held-back slice the model has never seen (the “real exam”). Only the test score is honest.
from sklearn.model_selection import train_test_split
X = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
# Keep 30% of rows aside for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
print('training rows:', len(X_train))
print('test rows :', len(X_test))
Note: Output:
training rows: 7
test rows : 3
test_size=0.3 held back 30% (3 of 10 rows) for testing. The model will learn from the other 7 and be graded on the 3 it never saw. random_state=0 just makes the split repeatable so your results match mine.
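To see the split in use, the sketch below trains a small classifier on the 7 training rows and scores it on both sets. The choice of DecisionTreeClassifier and the names train_acc and test_acc are illustrative assumptions, not something this lesson prescribes; any scikit-learn classifier would be used the same way.

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# Same split as above: 7 rows to learn from, 3 held back
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Assumption: a decision tree stands in for "the model" here
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)  # the model only ever sees the training rows

# For classifiers, .score() returns accuracy
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print('train accuracy:', train_acc)
print('test accuracy :', test_acc)
```

With only three test rows, the test score can change a lot from one split to the next, so treat it as a rough check rather than a precise measurement.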
Metrics for classification: accuracy is not enough
When a model predicts a category (spam / not-spam, sick / healthy), the obvious metric is accuracy — the fraction it got right. But accuracy lies on imbalanced data. If 99% of emails are not-spam, a lazy model that says “not-spam” every time is 99% accurate and 100% useless. So we also use precision and recall.
- Accuracy — of all predictions, how many were correct.
- Precision — of the items it flagged as positive, how many really were (avoids false alarms).
- Recall — of all the real positives, how many it caught (avoids misses).
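The 99% example can be checked by hand. This sketch builds 1,000 made-up emails, 10 of them spam (numbers chosen here to match the 99% figure in the paragraph above), and scores a lazy model that always answers not-spam. It uses plain Python rather than scikit-learn so the arithmetic stays visible.

```python
# Hypothetical inbox: 990 not-spam (0) and 10 spam (1)
y_true = [0] * 990 + [1] * 10
# A lazy "model" that predicts not-spam for every email
y_pred = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: of the real spam, how many did the model catch?
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(t == 1 for t in y_true)
recall = true_positives / actual_positives

print('accuracy:', accuracy)  # 0.99
print('recall  :', recall)    # 0.0
```

Precision is undefined for this model: it never flags anything as positive, so the formula would divide by zero.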
from sklearn.metrics import accuracy_score, precision_score, recall_score
# Truth vs what the model predicted (1 = spam)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print('accuracy :', round(accuracy_score(y_true, y_pred), 2))
print('precision:', round(precision_score(y_true, y_pred), 2))
print('recall :', round(recall_score(y_true, y_pred), 2))
Note: Output:
accuracy : 0.75
precision: 0.75
recall : 0.75
The model got 6 of 8 right (accuracy 0.75). Of the 4 emails it called spam, 3 truly were (precision 0.75). Of the 4 emails that truly were spam, it caught 3 (recall 0.75). For a spam filter you might favour precision (do not bin real mail); for cancer screening you favour recall (do not miss a case).
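All three scores come from four counts: true positives, false positives, false negatives and true negatives. As a minimal sketch (the names tp, fp, fn and tn are chosen here for illustration, and the code is plain Python rather than a scikit-learn call), the same numbers can be rebuilt by counting:

```python
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # spam, flagged as spam
fp = sum(t == 0 and p == 1 for t, p in pairs)  # real mail, flagged as spam
fn = sum(t == 1 and p == 0 for t, p in pairs)  # spam that slipped through
tn = sum(t == 0 and p == 0 for t, p in pairs)  # real mail, left alone

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(tp, fp, fn, tn)               # 3 1 1 3
print(accuracy, precision, recall)  # 0.75 0.75 0.75
```

Precision and recall share the same numerator (tp) and differ only in the denominator: precision divides by everything the model flagged, recall by everything that was truly positive.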
Metrics for regression: how far off, on average
When a model predicts a number (a price, a temperature), you measure the size of its errors. Two common metrics: RMSE (root mean squared error — the typical error in the same units as the target) and R² (how much of the variation the model explains, where 1.0 is perfect).
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
y_true = [100, 150, 200, 250]
y_pred = [110, 140, 210, 240]
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print('RMSE:', round(rmse, 1))
print('R^2 :', round(r2_score(y_true, y_pred), 3))
Note: Output:
RMSE: 10.0
R^2 : 0.968
RMSE of 10 means the model is typically off by about 10 (in the same units as the price). R² of 0.968 means it explains about 97% of the variation, which is a strong fit. Lower RMSE is better; higher R² (up to 1.0) is better.
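Both metrics are short formulas: RMSE is the square root of the mean squared error, and R² is 1 minus the ratio of the model's squared error to the squared error of always guessing the mean. A minimal plain-Python sketch of that arithmetic, with variable names (ss_res, ss_tot) chosen here for illustration:

```python
import math

y_true = [100, 150, 200, 250]
y_pred = [110, 140, 210, 240]
n = len(y_true)

# Sum of squared errors made by the model
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))

# Sum of squared errors made by always guessing the mean of y_true
mean_true = sum(y_true) / n
ss_tot = sum((t - mean_true) ** 2 for t in y_true)

rmse = math.sqrt(ss_res / n)
r2 = 1 - ss_res / ss_tot

print('RMSE:', round(rmse, 1))
print('R^2 :', round(r2, 3))
```

Because R² compares the model against that mean baseline, it can go negative on test data when the model does worse than the baseline.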
Overfitting vs underfitting — the two ways to fail
The whole reason we split data is to catch two opposite problems, spotted by comparing the training score with the test score.
| Problem | What it means | The tell-tale sign |
|---|---|---|
| Overfitting | Memorised the training data, did not generalise | High train score, low test score |
| Underfitting | Too simple; missed the pattern even in training | Low train score AND low test score |
| Good fit | Learned the real pattern | Train and test scores both high and similar |
A worked read: train R² = 0.99 but test R² = 0.55 → a big gap means overfitting (it memorised). Train R² = 0.50 and test R² = 0.48 → both low means underfitting (too simple). Train 0.92 and test 0.89 → both high and close means a good fit.
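The table's rule of thumb can be written as a small helper. This is only a sketch: the function name diagnose_fit and the cut-offs (a score of 0.8 counts as "high", a gap above 0.1 counts as "big") are assumptions made for illustration, and sensible values depend on the problem and the metric.

```python
def diagnose_fit(train_score, test_score, high=0.8, max_gap=0.1):
    """Label a model from its train and test scores (illustrative thresholds)."""
    gap = train_score - test_score
    if train_score < high and test_score < high:
        return 'underfitting'  # low on both sets
    if gap > max_gap:
        return 'overfitting'   # much better on training than on test
    return 'good fit'          # both high and close together

# The three worked reads from the text
print(diagnose_fit(0.99, 0.55))  # overfitting
print(diagnose_fit(0.50, 0.48))  # underfitting
print(diagnose_fit(0.92, 0.89))  # good fit
```

Borderline cases (for example a mediocre train score with a much worse test score) do not fit neatly into one label, so a helper like this is a starting point for a closer look, not a verdict.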
Watch out: Never report the training score as your model’s quality, and never let test data leak into training (e.g. scaling using the whole dataset before splitting). Both make a weak model look great and are classic, costly mistakes.
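One way to avoid the scaling leak is to split first and let the scaler learn its statistics from the training rows only. The sketch below uses scikit-learn's StandardScaler on made-up numbers; the data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up feature column and labels
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0],
              [6.0], [7.0], [8.0], [9.0], [100.0]])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# 1. Split FIRST, so the test rows stay unseen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# 2. Fit the scaler on the training rows only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# 3. Reuse the training mean and std on the test rows (transform, not fit)
X_test_scaled = scaler.transform(X_test)

print('mean learned from training rows:', scaler.mean_[0])
print('mean of the whole dataset      :', X.mean())
```

The leaky version would call fit_transform on all of X before splitting, which lets the test rows influence the mean and standard deviation the model is trained with.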
Tip: Always compare two scores: train and test. A single number cannot tell you whether the model generalised. The gap between them is one of the most useful diagnostics you have.
Q. A model scores R² = 0.98 on its training data but only 0.52 on the held-back test data. What is happening?
✍️ Practice
- Split a dataset with train_test_split using test_size=0.25 and print the size of each set.
- For a classifier you build, print accuracy, precision and recall, and explain in one line which matters most for your problem.
🏠 Homework
- Train any model, then report both its training and test scores. State whether it is overfitting, underfitting, or a good fit, and explain how the two scores told you.