Train / Test Split
Hold back some data the model never sees, so you can test it honestly.
What you will learn
- Explain why we hold back test data
- Split data with train_test_split
- Avoid testing on training data
Why not test on everything?
Imagine a teacher gives students the exact exam questions to practise, then sets the same questions in the real exam. Everyone scores 100% — but did they actually learn? You cannot tell.
ML has the same trap. If you test a model on the same rows it learned from, it can just memorise them and look perfect while being useless on new data. So we keep some data aside.
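To make the trap concrete, here is a small sketch under made-up conditions: the features are random noise and the labels are coin flips, so there is nothing real to learn. The 1-nearest-neighbour model is just one convenient example of a model that memorises its rows; the data, the 150/50 cut and the variable names are all assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))       # 200 rows of pure noise (made-up data)
y = rng.integers(0, 2, size=200)    # labels are coin flips: nothing real to learn

# Simple manual cut: first 150 rows to learn from, last 50 kept aside
X_seen, y_seen = X[:150], y[:150]
X_new, y_new = X[150:], y[150:]

# With n_neighbors=1, every training row is its own nearest neighbour,
# so the model effectively memorises the rows it was given
model = KNeighborsClassifier(n_neighbors=1)
model.fit(X_seen, y_seen)

seen_score = model.score(X_seen, y_seen)
new_score = model.score(X_new, y_new)
print('Accuracy on rows it learned from:  ', seen_score)
print('Accuracy on rows it has never seen:', new_score)
```

The first score comes out perfect because the model is being asked about rows it has stored. The second score is the honest one, and on coin-flip labels it should land somewhere near guessing.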
Split into train and test
We split the data into two parts:
- Training set (usually ~70–80%) — the model learns from this.
- Test set (the rest) — kept hidden, used only at the end to grade the model on data it has never seen.
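The idea behind the two parts can be sketched by hand with plain Python: shuffle the rows, then cut the list at the 80% mark. This is only an illustration of the concept, not how any particular library does it; the row numbers and the 80% figure are placeholders.

```python
import random

rows = list(range(10))        # stand-in for 10 data rows (hypothetical data)

random.seed(0)                # fixed seed so the shuffle is repeatable
random.shuffle(rows)          # mix the order so the cut is not biased by row order

cut = int(len(rows) * 0.8)    # position of the 80% mark
train_rows = rows[:cut]       # the model would learn from these
test_rows = rows[cut:]        # these stay hidden until the end

print('Train rows:', len(train_rows))   # 8
print('Test rows: ', len(test_rows))    # 2
```

Every row ends up in exactly one of the two parts, which is what keeps the final test honest.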
Doing the split in scikit-learn
scikit-learn has a one-line helper, train_test_split, that shuffles the data and cuts it for you.
from sklearn.model_selection import train_test_split
X = [[1],[2],[3],[4],[5],[6],[7],[8]] # 8 examples
y = [ 0, 0, 0, 0, 1, 1, 1, 1]
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,    # keep 25% (2 of 8) for testing
    random_state=0)    # a fixed seed -> same split every run
print('Train size:', len(X_train))
print('Test size:', len(X_test))
Output:
Train size: 6
Test size: 2
Six examples are used for learning; two are hidden away for the final test. random_state=0 just makes the random split repeatable, so you and a friend get the same rows.
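A quick way to check the repeatability claim is to run the same split twice with the same seed and compare the results. This sketch reuses the toy data from above; the names first and second are just placeholders.

```python
from sklearn.model_selection import train_test_split

X = [[1], [2], [3], [4], [5], [6], [7], [8]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# Same data, same test_size, same seed, run twice
first = train_test_split(X, y, test_size=0.25, random_state=0)
second = train_test_split(X, y, test_size=0.25, random_state=0)

# Each result is [X_train, X_test, y_train, y_test]
print('Same split both times:', first == second)
```

If you change random_state in one of the two calls, the rows will usually land in different places, although on a dataset this tiny two seeds can occasionally agree by chance.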
Watch out: A very common beginner mistake is reporting accuracy on the training data. Always grade your model on the test set it has never seen.
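Putting the pieces together, a minimal sketch of the full workflow might look like this. The iris dataset and the decision tree are assumptions chosen only because they ship with scikit-learn; any dataset and model would follow the same pattern.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)   # 150 flower measurements with 3 classes

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)          # learn from the training rows only

train_score = model.score(X_train, y_train)   # flattering: the model has seen these rows
test_score = model.score(X_test, y_test)      # honest: the model has never seen these rows

print('Training accuracy (do not report this):', train_score)
print('Test accuracy (report this):           ', test_score)
```

The model only ever touches X_test and y_test in the final scoring line, which is the habit this lesson is trying to build.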
Tip: A common split is 80/20 or 75/25. More training data usually helps the model learn; enough test data gives you a trustworthy score. random_state keeps results reproducible.
Q. Why do we keep a separate test set?
✍️ Practice
- Change test_size to 0.5 and print the new train and test sizes.
- Explain in one sentence what would go wrong if you tested on the training data.
🏠 Homework
- In your own words, write the “exam questions” analogy for why we split data — as if explaining to a classmate.