
Your First Model with scikit-learn

Use your clean data to make a prediction — train a simple model and test how good it is.

What you will learn

Split data into training and test sets
Train a model with fit() and predict with predict()
Read a model’s prediction and score

From describing to predicting

So far you have described data. Machine learning lets you predict from it: you estimate a value you have not seen. The library is scikit-learn (imported under the name sklearn), and the basic workflow is short and nearly the same for every model.

Features (X) — the inputs you know (e.g. house size).
Label (y) — the answer you want to predict (e.g. price).
Split — keep some rows back to test honestly.
Train — the model learns the pattern (fit).
Predict & score — guess new values (predict) and grade it (score).
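The shapes of X and y trip up many first attempts, so here is a minimal sketch of the layout scikit-learn expects. The three sizes and prices are illustrative values only, and the variable names sizes and X_from_flat are made up for this example.

```python
import numpy as np

# Illustrative values only: three houses, one feature each (size).
X = np.array([[50], [60], [80]])   # 2-D: one row per house, one column per feature
y = np.array([40, 55, 70])         # 1-D: one label per house

print('X shape:', X.shape)
print('y shape:', y.shape)

# A flat list of sizes can be reshaped into the 2-D layout that X needs.
sizes = np.array([50, 60, 80])
X_from_flat = sizes.reshape(-1, 1)   # -1 means "work out the row count for me"
print('Same as X:', np.array_equal(X, X_from_flat))
```

X is two-dimensional even with a single feature, which is why each size sits in its own pair of square brackets.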

Predict house price from size

We will fit a straight-line model (linear regression) that learns the relationship between size and price, which you saw correlate strongly earlier.

Train a linear-regression model and predict a new value

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Features (X) and label (y)
X = [[50], [60], [80], [100], [120], [140]]   # house size
y = [40, 55, 70, 95, 110, 130]                # price (thousands)

# Hold back 1/3 of the data to test honestly
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

# Train the model on the training data
model = LinearRegression()
model.fit(X_train, y_train)

# Predict the price of a brand-new 90-size house
print('Predicted price for size 90:', round(model.predict([[90]])[0], 1))

Note: Output: Predicted price for size 90: 82.4
The model learned the size-to-price pattern from the four training rows and predicts about 82.4k for a 90-size house, sensibly between the prices of the 80-size (70) and 100-size (95) houses in the data.
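To see what "learning a straight line" means, you can read the fitted slope and intercept off the model and rebuild the prediction by hand. This is a minimal sketch that repeats the same data and split as above; the names slope, intercept, manual and from_model are just illustrative.

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Same data and split as the lesson's example
X = [[50], [60], [80], [100], [120], [140]]   # house size
y = [40, 55, 70, 95, 110, 130]                # price (thousands)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

model = LinearRegression().fit(X_train, y_train)

slope = model.coef_[0]         # price change per one unit of size
intercept = model.intercept_   # where the line crosses size 0

# A straight line: price = intercept + slope * size
manual = intercept + slope * 90
from_model = model.predict([[90]])[0]

print('slope:', round(slope, 3), '| intercept:', round(intercept, 3))
print('by hand:', round(manual, 3), '| predict():', round(from_model, 3))
```

The two printed predictions should agree, because predict() is doing exactly this arithmetic for a one-feature linear regression.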

How good is it? Score on unseen data

We held back a test set the model never saw. Scoring on it tells us how well the model generalises. For regression, score returns R-squared: 1.0 is a perfect fit, and 0 means the model does no better than always predicting the average price.

Grade the model on data it never trained on

print('R-squared on test data:', round(model.score(X_test, y_test), 3))

Note: Output: R-squared on test data: 0.994
An R-squared of 0.994 (close to 1.0) means the model explains almost all the variation in price, because size and price really do move together in this data. With only two test rows, treat this as a rough estimate rather than a precise grade.
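If R-squared feels like a black box, it can help to compute it by hand once. The sketch below uses the standard definition, 1 minus (squared prediction errors divided by squared distances from the mean), on the same data and split; the names ss_res, ss_tot and r2_by_hand are illustrative.

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Same data and split as the lesson's example
X = [[50], [60], [80], [100], [120], [140]]   # house size
y = [40, 55, 70, 95, 110, 130]                # price (thousands)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)

# How far off the model's guesses are (residual sum of squares)
ss_res = sum((actual - pred) ** 2 for actual, pred in zip(y_test, predictions))

# How far off we would be if we always guessed the average test price
mean_price = sum(y_test) / len(y_test)
ss_tot = sum((actual - mean_price) ** 2 for actual in y_test)

r2_by_hand = 1 - ss_res / ss_tot
r2_from_score = model.score(X_test, y_test)

print('By hand:', round(r2_by_hand, 3))
print('score():', round(r2_from_score, 3))
```

Both lines should print the same number, which shows that score is comparing the model's errors with the errors of a "just guess the average" baseline.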

Step	Code	What happens
Split	`train_test_split(X, y)`	Hold back test data
Create	`LinearRegression()`	Make an empty model
Train	`model.fit(X_train, y_train)`	Learn the pattern
Predict	`model.predict([[90]])`	Guess a new value
Score	`model.score(X_test, y_test)`	Grade on unseen data

Watch out: Never score a model on the same rows it trained on — it can simply memorise them and look perfect while being useless on new data. The whole point of the test set is an honest score.
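Here is a small sketch of that trap. It swaps in a different model, DecisionTreeRegressor, which is not part of this lesson and is used only because it can memorise training rows exactly. The data and split are the same as before.

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Same data and split as the lesson's example
X = [[50], [60], [80], [100], [120], [140]]   # house size
y = [40, 55, 70, 95, 110, 130]                # price (thousands)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

# An unrestricted decision tree can store every training row exactly
tree = DecisionTreeRegressor(random_state=0)
tree.fit(X_train, y_train)

train_score = tree.score(X_train, y_train)   # graded on rows it has seen
test_score = tree.score(X_test, y_test)      # graded on rows it has not seen

print('Score on training rows:', round(train_score, 3))
print('Score on test rows:    ', round(test_score, 3))
```

The training score comes out perfect because the tree has memorised those rows, while the test score is lower. Only the second number says anything about new houses.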

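Tip: This fit → predict → score pattern is the same for almost every scikit-learn model. Swap LinearRegression for a classifier and the create, fit, predict and score lines barely change; the main difference is that a classifier's score returns accuracy (the fraction of correct guesses) instead of R-squared. Learn the pattern once and you can try many models with very little new code.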
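As a rough sketch of that swap, the example below turns the price into a yes/no label and trains a classifier with the same steps. The "expensive means 90k or more" cutoff is a made-up assumption for illustration, and LogisticRegression is just one of several classifiers you could pick.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = [[50], [60], [80], [100], [120], [140]]   # house size
prices = [40, 55, 70, 95, 110, 130]           # price (thousands)

# Hypothetical label for this sketch: 1 = expensive (90k or more), 0 = not
y = [1 if price >= 90 else 0 for price in prices]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

# Same pattern as before: create, fit, predict, score
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

label = clf.predict([[90]])[0]           # 0 or 1 for a 90-size house
accuracy = clf.score(X_test, y_test)     # fraction of test rows guessed correctly

print('Predicted label for size 90:', label)
print('Accuracy on test data:', accuracy)
```

With so few rows the accuracy can only be 0.0, 0.5 or 1.0, so read it as a demonstration of the pattern rather than a real evaluation.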

Q. Why do we split the data into a training set and a test set?

Answer: A held-back test set measures how well the model predicts new, unseen data — preventing a misleadingly high score from memorised training rows.

✍️ Practice

Re-run the model and predict the price for a size-110 house.
Change test_size to 0.5 and see whether the R-squared score changes.

🏠 Homework

Use the same five-step pipeline on a different idea — e.g. predict a student’s score from hours studied. Write out your X and y, then report a prediction and the score.