Foundations · Core · 35 min read

Train / Test Split

Hold back some data the model never sees, so you can test it honestly.

What you will learn

  • Explain why we hold back test data
  • Split data with train_test_split
  • Avoid testing on training data

Why not test on everything?

Imagine a teacher gives students the exact exam questions to practise, then sets the same questions in the real exam. Everyone scores 100% — but did they actually learn? You cannot tell.

ML has the same trap. If you test a model on the same rows it learned from, it can just memorise them and look perfect while being useless on new data. So we keep some data aside.
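To make the trap concrete, here is a minimal pure-Python sketch. MemoriserModel and accuracy are hypothetical names written only for this illustration, and the toy data is invented. The "model" does nothing except store its training rows, which is the extreme case of memorising.

```python
# A deliberately naive "model" that only memorises its training rows.
# MemoriserModel is a hypothetical class written for this illustration.
class MemoriserModel:
    def fit(self, X, y):
        # Store every training row together with its label.
        self.lookup = {tuple(row): label for row, label in zip(X, y)}
        return self

    def predict(self, X):
        # Return the stored label for a known row, otherwise guess 0.
        return [self.lookup.get(tuple(row), 0) for row in X]


def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true labels.
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Toy data (invented): the label is 1 when the number is 10 or more.
X = [[n] for n in range(20)]
y = [1 if n >= 10 else 0 for n in range(20)]

# Even rows are used for learning, odd rows are kept back as new data.
X_seen, y_seen = X[0::2], y[0::2]
X_new, y_new = X[1::2], y[1::2]

model = MemoriserModel().fit(X_seen, y_seen)

seen_accuracy = accuracy(y_seen, model.predict(X_seen))
new_accuracy = accuracy(y_new, model.predict(X_new))

print('Accuracy on rows it learned from:', seen_accuracy)
print('Accuracy on new rows:            ', new_accuracy)
```

On the rows it learned from, the memoriser scores 1.0. On the rows it has never seen, it scores 0.5, no better than always guessing 0. Only the second number tells you how useful it would be on new data.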

Split into train and test

We split the data into two parts:

  • Training set (usually ~70–80%) — the model learns from this.
  • Test set (the rest) — kept hidden, used only at the end to grade the model on data it has never seen.
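
The idea behind the two parts can be sketched by hand in a few lines of plain Python. simple_split is a hypothetical helper written for this illustration, with a 75% training fraction chosen as an assumption; it is not how any particular library implements splitting.

```python
import random


def simple_split(rows, train_fraction=0.75, seed=0):
    # Hypothetical helper for illustration only.
    shuffled = list(rows)                  # copy, so the original order is untouched
    random.Random(seed).shuffle(shuffled)  # shuffle with a fixed seed
    cut = int(len(shuffled) * train_fraction)
    # Everything before the cut is for training, the rest is for testing.
    return shuffled[:cut], shuffled[cut:]


rows = list(range(1, 9))   # 8 examples
train_rows, test_rows = simple_split(rows)

print('Train size:', len(train_rows))
print('Test size: ', len(test_rows))
print('Rows in both sets:', set(train_rows) & set(test_rows))
```

The important property is that the two sets never share a row: every example is either learned from or tested on, not both.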

Doing the split in scikit-learn

scikit-learn has a one-line helper, train_test_split, that shuffles the data and cuts it for you.

Holding back 25% of the data as a hidden test set
from sklearn.model_selection import train_test_split

X = [[1],[2],[3],[4],[5],[6],[7],[8]]   # 8 examples
y = [ 0,  0,  0,  0,  1,  1,  1,  1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,    # keep 25% (2 of 8) for testing
    random_state=0)    # a fixed seed -> same split every run

print('Train size:', len(X_train))
print('Test size: ', len(X_test))

Note: Output:

Train size: 6
Test size:  2

Six examples are used for learning; two are hidden away for the final test. random_state=0 just makes the random split repeatable, so you and a friend get the same rows.
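
A quick way to check the repeatability claim is to split the same data twice with the same seed and compare the results. This is a small sketch using the same X and y as the lesson; the variable names first and second are only for illustration.

```python
from sklearn.model_selection import train_test_split

X = [[1], [2], [3], [4], [5], [6], [7], [8]]   # 8 examples
y = [0, 0, 0, 0, 1, 1, 1, 1]

# Two separate calls with the same seed.
first = train_test_split(X, y, test_size=0.25, random_state=0)
second = train_test_split(X, y, test_size=0.25, random_state=0)

# Each result is the list [X_train, X_test, y_train, y_test].
print('Same split both times:', first == second)
```

This prints True. If you leave random_state out, the shuffle differs from run to run, so your scores can change each time you run the script.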

Watch out: A very common beginner mistake is reporting accuracy on the training data. Always grade your model on the test set it has never seen.
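
In practice, the pattern looks roughly like this: fit on the training rows only, then score on the test rows. The lesson does not name a model, so DecisionTreeClassifier here is just an assumed stand-in; any scikit-learn classifier with fit and score would follow the same pattern.

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X = [[1], [2], [3], [4], [5], [6], [7], [8]]   # 8 examples
y = [0, 0, 0, 0, 1, 1, 1, 1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Assumed model choice for illustration; the lesson does not specify one.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)          # learn from the training rows only

# Score on both sets to see the difference in what they tell you.
train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

print('Accuracy on training data (do not report this):', train_accuracy)
print('Accuracy on test data (report this):           ', test_accuracy)
```

The training score is 1.0 here because an unrestricted tree can fit six distinct rows perfectly, which says nothing about new data. With only two test rows the test score can only be 0.0, 0.5, or 1.0, so treat it as a demonstration of the workflow rather than a meaningful measurement.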

Tip: A common split is 80/20 or 75/25. More training data usually helps the model learn; enough test data gives you a trustworthy score. random_state keeps results reproducible.

Q. Why do we keep a separate test set?

Answer: A held-back test set measures how well the model works on new data, instead of rewarding it for memorising the training rows.

✍️ Practice

  1. Change test_size to 0.5 and print the new train and test sizes.
  2. Explain in one sentence what would go wrong if you tested on the training data.

🏠 Homework

  1. In your own words, write the “exam questions” analogy for why we split data — as if explaining to a classmate.