Going Deeper · Pro · 45 min read

Evaluating Models: Splits, Metrics & Overfitting

A model that looks perfect on its training data can be useless in the wild — so you test it honestly and measure it with the right metric.

What you will learn

  • Use a train/test split to get an honest score
  • Pick the right metric for classification vs regression
  • Recognise overfitting and underfitting

A high training score can be a lie

You trained your first model in an earlier lesson. The danger is trusting it. A model can memorise its training rows and score near-perfectly on them, then fail on new data — like a student who memorised last year’s exam answers but cannot solve a fresh question. Evaluation is how you find out whether a model actually learned the pattern or just memorised. A model you have not honestly evaluated is not trustworthy.

The train/test split — the honest exam

The core trick: before training, split your data. Train on most of it (the “study material”), then test on a held-back slice the model has never seen (the “real exam”). Only the test score is honest.

Split data into training and test sets
from sklearn.model_selection import train_test_split

X = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# Keep 30% of rows aside for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

print('training rows:', len(X_train))
print('test rows    :', len(X_test))

Note: Output:
training rows: 7
test rows    : 3
test_size=0.3 held back 30% (3 of 10 rows) for testing. The model will learn from the other 7 and be graded on the 3 it never saw. random_state=0 just makes the split repeatable so your results match mine.
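To carry the split through to an actual score, here is a minimal sketch that trains on the 7 rows and grades on the 3 held-back rows. The choice of DecisionTreeClassifier is an assumption made for illustration; any scikit-learn classifier would follow the same fit-then-score pattern.

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Learn only from the training rows (model choice is illustrative)
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# For classifiers, score() returns accuracy
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)

print('train accuracy:', train_acc)  # graded on rows it has already seen
print('test accuracy :', test_acc)   # graded on rows it never saw
```

With only 3 test rows, a single mistake moves the test accuracy by a third, so treat the number from a toy dataset like this as a demonstration of the workflow rather than a reliable estimate.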

Metrics for classification: accuracy is not enough

When a model predicts a category (spam / not-spam, sick / healthy), the obvious metric is accuracy — the fraction it got right. But accuracy lies on imbalanced data. If 99% of emails are not-spam, a lazy model that says “not-spam” every time is 99% accurate and 100% useless. So we also use precision and recall.

  • Accuracy — of all predictions, how many were correct.
  • Precision — of the items it flagged as positive, how many really were (avoids false alarms).
  • Recall — of all the real positives, how many it caught (avoids misses).
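The lazy "always not-spam" model described above can be checked directly. This is a small sketch with made-up labels (1,000 emails, 1% spam), not real data.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up labels for illustration: 10 spam emails (1) and 990 not-spam (0)
y_true = [1] * 10 + [0] * 990

# A "lazy" model that answers not-spam every single time
y_lazy = [0] * 1000

acc = accuracy_score(y_true, y_lazy)
# zero_division=0 returns 0 (without a warning) when nothing was flagged positive
prec = precision_score(y_true, y_lazy, zero_division=0)
rec = recall_score(y_true, y_lazy)

print('accuracy :', acc)   # 0.99
print('precision:', prec)  # 0.0
print('recall   :', rec)   # 0.0
```

Accuracy looks excellent, yet recall of 0.0 shows the model caught none of the spam.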
Accuracy, precision and recall for a classifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Truth vs what the model predicted (1 = spam)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print('accuracy :', round(accuracy_score(y_true, y_pred), 2))
print('precision:', round(precision_score(y_true, y_pred), 2))
print('recall   :', round(recall_score(y_true, y_pred), 2))

Note: Output:
accuracy : 0.75
precision: 0.75
recall   : 0.75
The model got 6 of 8 right (accuracy 0.75). Of the 4 emails it called spam, 3 truly were (precision 0.75). Of the 4 emails that truly were spam, it caught 3 (recall 0.75). For a spam filter you might favour precision (do not bin real mail); for cancer screening you favour recall (do not miss a case).
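To see where those three numbers come from, the sketch below counts the four outcome types with scikit-learn's confusion_matrix and recomputes each metric by hand from the same labels. The variable names (tn, fp, fn, tp) are just local names chosen here.

```python
from sklearn.metrics import confusion_matrix

# Same labels as the example above (1 = spam)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary 0/1 labels, ravel() yields the counts in this order:
# true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print('TN, FP, FN, TP:', tn, fp, fn, tp)  # 3 1 1 3

accuracy = (tp + tn) / (tp + tn + fp + fn)  # all correct / all predictions
precision = tp / (tp + fp)                  # correct flags / all flags
recall = tp / (tp + fn)                     # caught positives / real positives

print('accuracy :', accuracy)   # 0.75
print('precision:', precision)  # 0.75
print('recall   :', recall)     # 0.75
```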

Metrics for regression: how far off, on average

When a model predicts a number (a price, a temperature), you measure the size of its errors. Two common metrics: RMSE (root mean squared error: the typical error, in the same units as the target) and R² (how much of the variation the model explains, where 1.0 is perfect).

RMSE and R² for a regression model
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

y_true = [100, 150, 200, 250]
y_pred = [110, 140, 210, 240]

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print('RMSE:', round(rmse, 1))
print('R^2 :', round(r2_score(y_true, y_pred), 3))

Note: Output:
RMSE: 10.0
R^2 : 0.968
RMSE of 10 means the model is typically off by about 10 (in the same units as the price). R² of 0.968 means it explains ~97% of the variation, a strong fit. Lower RMSE is better; higher R² (up to 1.0) is better.
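Both metrics can also be worked out by hand, which makes the definitions concrete. This sketch uses only NumPy and the same four values; ss_res and ss_tot are local names for the residual and total sums of squares.

```python
import numpy as np

y_true = np.array([100, 150, 200, 250])
y_pred = np.array([110, 140, 210, 240])

errors = y_true - y_pred  # each prediction is off by 10

# RMSE: square the errors, average them, take the square root
rmse = np.sqrt(np.mean(errors ** 2))

# R^2: 1 minus (model's squared error / squared spread around the mean)
ss_res = np.sum(errors ** 2)                    # 400
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 12500
r2 = 1 - ss_res / ss_tot

print('RMSE:', round(float(rmse), 1))
print('R^2 :', round(float(r2), 3))
```

The R² formula shows why a model that always predicts the mean scores 0: its squared error equals the total spread, so the ratio is 1.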

Overfitting vs underfitting — the two ways to fail

The whole reason we split data is to catch two opposite problems, spotted by comparing the training score with the test score.

Problem      | What it means                                   | The tell-tale sign
Overfitting  | Memorised the training data, did not generalise | High train score, low test score
Underfitting | Too simple; missed the pattern even in training | Low train score AND low test score
Good fit     | Learned the real pattern                        | Train and test scores both high and similar

A worked read: train R² = 0.99 but test R² = 0.55 → a big gap means overfitting (it memorised). Train R² = 0.50 and test R² = 0.48 → both low means underfitting (too simple). Train 0.92 and test 0.89 → both high and close means a good fit.
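The worked read above can be written as a small rule of thumb. The helper diagnose_fit below is hypothetical, and its thresholds (a gap above 0.1 counts as "big", scores below 0.7 count as "low") are assumptions picked for illustration; sensible cut-offs depend on the problem.

```python
def diagnose_fit(train_score, test_score, low=0.7, max_gap=0.1):
    """Rough label for a train/test score pair.

    `low` and `max_gap` are illustrative assumptions, not standard values.
    """
    # A large drop from train to test suggests memorisation
    if train_score - test_score > max_gap:
        return 'overfitting'
    # Both scores poor suggests the model is too simple
    if train_score < low and test_score < low:
        return 'underfitting'
    return 'good fit'


for train, test in [(0.99, 0.55), (0.50, 0.48), (0.92, 0.89)]:
    print(train, test, '->', diagnose_fit(train, test))
```

Run on the three pairs from the worked read, it labels them overfitting, underfitting and good fit, in that order.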

Watch out: Never report the training score as your model’s quality, and never let test data leak into training (e.g. fitting a scaler on the whole dataset before splitting, so the test rows influence the scaling). Both make a weak model look great and are classic, costly mistakes.
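One common way to avoid the scaling leak is to split first and put the scaler inside a scikit-learn pipeline, so it is fitted on the training rows only. The sketch below assumes a synthetic dataset from make_classification and a LogisticRegression model, both chosen purely for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Split FIRST, before any preprocessing touches the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# The pipeline fits the scaler and the model on the training rows only
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# Check: the scaler's learned mean matches the training rows, not the full dataset
scaler = model.named_steps['standardscaler']
uses_train_mean = np.allclose(scaler.mean_, X_train.mean(axis=0))
print('scaler fitted on training rows only:', uses_train_mean)

# At test time the pipeline reuses the training statistics to scale X_test
test_acc = model.score(X_test, y_test)
print('test accuracy:', round(test_acc, 2))
```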

Tip: Always compare two scores: train and test. A single number cannot tell you whether the model generalised. The gap between them is one of the most informative signals you have.
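To watch the train/test gap change, the sketch below fits decision trees of increasing depth to noisy synthetic data and prints both scores for each. The data (a sine curve plus random noise) and the depths tried are assumptions for illustration. Typically the shallowest tree scores low on both sets and the unlimited tree scores perfectly on training with a visible drop on test, but the exact numbers depend on the data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a sine curve with random noise added
rng = np.random.RandomState(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

scores = {}
for depth in (1, 3, None):  # None lets the tree grow until it fits every training row
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    train_r2 = tree.score(X_train, y_train)  # score() returns R^2 for regressors
    test_r2 = tree.score(X_test, y_test)
    scores[depth] = (train_r2, test_r2)
    print(f'max_depth={depth}: train R^2 = {train_r2:.2f}, test R^2 = {test_r2:.2f}')
```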

Q. A model scores R² = 0.98 on its training data but only 0.52 on the held-back test data. What is happening?

Answer: A high training score with a much lower test score is the signature of overfitting: the model learned the training rows specifically instead of the general pattern, so it fails on new data.

✍️ Practice

  1. Split a dataset with train_test_split using test_size=0.25 and print the size of each set.
  2. For a classifier you build, print accuracy, precision and recall, and explain in one line which matters most for your problem.

🏠 Homework

  1. Train any model, then report both its training and test scores. State whether it is overfitting, underfitting, or a good fit, and explain how the two scores told you.