Foundations›Core· 35 min read

Train / Test Split

Hold back some data the model never sees, so you can test it honestly.

What you will learn

Explain why we hold back test data
Split data with traintestsplit
Avoid testing on training data

Why not test on everything?

Imagine a teacher gives students the exact exam questions to practise, then sets the same questions in the real exam. Everyone scores 100% — but did they actually learn? You cannot tell.

ML has the same trap. If you test a model on the same rows it learned from, it can just memorise them and look perfect while being useless on new data. So we keep some data aside.

Split into train and test

We split the data into two parts:

Training set (usually ~70–80%) — the model learns from this.
Test set (the rest) — kept hidden, used only at the end to grade the model on data it has never seen.

Doing the split in scikit-learn

scikit-learn has a one-line helper, train_test_split, that shuffles the data and cuts it for you.

Holding back 25% of the data as a hidden test set

from sklearn.model_selection import train_test_split

X = [[1],[2],[3],[4],[5],[6],[7],[8]]   # 8 examples
y = [ 0,  0,  0,  0,  1,  1,  1,  1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,    # keep 25% (2 of 8) for testing
    random_state=0)    # a fixed seed -> same split every run

print('Train size:', len(X_train))
print('Test size: ', len(X_test))

Note: Output: Train size: 6 Test size: 2 Six examples are used for learning; two are hidden away for the final test. random_state=0 just makes the random split repeatable, so you and a friend get the same rows.

Watch out: The number one beginner mistake is reporting accuracy on the training data. Always grade your model on the test set it has never seen.

Tip: A common split is 80/20 or 75/25. More training data usually helps the model learn; enough test data gives you a trustworthy score. random_state keeps results reproducible.

Q. Why do we keep a separate test set?

Answer: A held-back test set measures how well the model works on new data, instead of rewarding it for memorising the training rows.

✍️ Practice

Change test_size to 0.5 and print the new train and test sizes.
Explain in one sentence what would go wrong if you tested on the training data.

🏠 Homework

In your own words, write the “exam questions” analogy for why we split data — as if explaining to a classmate.