Your First Model with scikit-learn
Use your clean data to make a prediction — train a simple model and test how good it is.
What you will learn
- Split data into training and test sets
- Train a model with fit() and predict with predict()
- Read a model’s prediction and score
From describing to predicting
So far you have described data. Machine learning lets you predict from it — guess a value you have not seen. The library is scikit-learn (imported from sklearn), and the workflow is short and always the same.
- Features (X): the inputs you know (e.g. house size).
- Label (y): the answer you want to predict (e.g. price).
- Split: keep some rows back to test honestly.
- Train: the model learns the pattern (fit).
- Predict & score: guess new values (predict) and grade the model (score).
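As a minimal sketch of the first three steps, the snippet below shapes a flat list of sizes into the two-dimensional X that scikit-learn expects (one row per house, one column per feature) and checks how many rows a split holds back. The variable name sizes and the use of NumPy's reshape are illustrative assumptions; the numbers are borrowed from this lesson's house-price example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical flat list of house sizes and matching prices (thousands)
sizes = [50, 60, 80, 100, 120, 140]
y = [40, 55, 70, 95, 110, 130]

# scikit-learn expects X to be 2-D: one row per sample, one column per feature
X = np.array(sizes).reshape(-1, 1)
print(X.shape)  # (6, 1)

# Hold back part of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

# With 6 rows and test_size=0.33, 4 rows train the model and 2 are held back
print(len(X_train), len(X_test))  # 4 2
```

Passing a flat list such as [50, 60, 80] as X would raise an error when fitting, which is why each size sits in its own inner list in the lesson's code.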
Predict house price from size
We will fit a straight-line model (linear regression) that learns the link between size and price, two values you saw correlate strongly earlier.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# Features (X) and label (y)
X = [[50], [60], [80], [100], [120], [140]] # house size
y = [40, 55, 70, 95, 110, 130] # price (thousands)
# Hold back 1/3 of the data to test honestly
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)
# Train the model on the training data
model = LinearRegression()
model.fit(X_train, y_train)
# Predict the price of a brand-new 90-size house
print('Predicted price for size 90:', round(model.predict([[90]])[0], 1))
Output: Predicted price for size 90: 82.5
Note: The model learned the size-to-price pattern and predicts about 82.5k for a 90-size house, sensibly between the 80-size (70) and 100-size (95) houses in the data.
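To see what "learned the pattern" means for a straight-line model, you can inspect the fitted slope and intercept. The sketch below repeats the lesson's setup and checks that predict is just intercept + slope × size. The variable names slope, intercept, by_hand and by_model are illustrative.

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = [[50], [60], [80], [100], [120], [140]]  # house size
y = [40, 55, 70, 95, 110, 130]               # price (thousands)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

model = LinearRegression().fit(X_train, y_train)

slope = model.coef_[0]        # price change per extra unit of size
intercept = model.intercept_  # predicted price at size 0

# A linear model's prediction is intercept + slope * size
by_hand = intercept + slope * 90
by_model = model.predict([[90]])[0]

print('Slope:', round(slope, 3), 'Intercept:', round(intercept, 3))
print('By hand:', round(by_hand, 1), 'By model:', round(by_model, 1))
```

The two printed predictions should match, which shows there is nothing hidden inside the model beyond one slope and one intercept.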
How good is it? Score on unseen data
We held back a test set the model never saw. Scoring on it tells us how well the model generalises. For regression, score returns R-squared, where 1.0 is a perfect fit.
print('R-squared on test data:', round(model.score(X_test, y_test), 3))
Output: R-squared on test data: 0.998
Note: An R-squared of 0.998 (close to 1.0) means the model explains almost all the variation in price on the test rows, because size and price really do move together in this data.
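R-squared compares the model's squared errors with those of a baseline that always guesses the average test price: R² = 1 − SS_res / SS_tot. As a rough sketch, the code below computes it by hand on the same split and compares it with model.score. The helper names ss_res, ss_tot and r2_by_hand are illustrative.

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = [[50], [60], [80], [100], [120], [140]]  # house size
y = [40, 55, 70, 95, 110, 130]               # price (thousands)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)

# Squared error of the model on the test rows
ss_res = sum((actual - pred) ** 2
             for actual, pred in zip(y_test, predictions))

# Squared error of a baseline that always predicts the mean test price
mean_actual = sum(y_test) / len(y_test)
ss_tot = sum((actual - mean_actual) ** 2 for actual in y_test)

r2_by_hand = 1 - ss_res / ss_tot
print('By hand:', round(r2_by_hand, 3))
print('model.score:', round(model.score(X_test, y_test), 3))
```

A score of 0 would mean the model is no better than guessing the average, and the score can even go negative if the model does worse than that.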
| Step | Code | What happens |
|---|---|---|
| Split | train_test_split(X, y) | Hold back test data |
| Create | LinearRegression() | Make an empty model |
| Train | model.fit(X_train, y_train) | Learn the pattern |
| Predict | model.predict([[90]]) | Guess a new value |
| Score | model.score(X_test, y_test) | Grade on unseen data |
Watch out: Never score a model on the same rows it trained on — it can simply memorise them and look perfect while being useless on new data. The whole point of the test set is an honest score.
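To illustrate the memorising problem, the sketch below swaps in a different model, DecisionTreeRegressor, which is not used elsewhere in this lesson. With default settings it can reproduce its training rows exactly, so its training score looks perfect regardless of how it does on new data.

```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

X = [[50], [60], [80], [100], [120], [140]]  # house size
y = [40, 55, 70, 95, 110, 130]               # price (thousands)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

# Swapped-in model for illustration: an unrestricted tree memorises its rows
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

train_score = tree.score(X_train, y_train)  # scored on rows it has seen
test_score = tree.score(X_test, y_test)     # scored on held-back rows

print('Score on training rows:', round(train_score, 3))
print('Score on test rows:', round(test_score, 3))
```

The training score is 1.0 because the tree stores every training row. Only the test score says anything about new houses.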
Tip: This fit → predict → score pattern is identical for almost every scikit-learn model. Swap LinearRegression for a classifier and the code barely changes (for a classifier, score reports accuracy instead of R-squared). That consistency is why the library is so loved.
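As a sketch of that swap, the code below replaces LinearRegression with KNeighborsClassifier, chosen here only as an example classifier. The 0/1 "expensive" labels are hypothetical, invented for illustration, and stratify=y is added so both classes appear in the training set of this tiny dataset.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

X = [[50], [60], [80], [100], [120], [140]]  # house size
# Hypothetical labels: 1 = "expensive" (price of 95k or more), 0 = not
y = [0, 0, 0, 1, 1, 1]

# stratify=y keeps both classes in the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y)

# Same create -> fit -> predict -> score steps as the regression example
model = KNeighborsClassifier(n_neighbors=1)
model.fit(X_train, y_train)

predicted_class = model.predict([[90]])[0]
accuracy = model.score(X_test, y_test)  # fraction of test rows classified correctly

print('Predicted class for size 90:', predicted_class)
print('Accuracy on test data:', accuracy)
```

Only the import, the model line and the meaning of the score changed. The split, fit, predict and score calls are the same as before.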
Q. Why do we split the data into a training set and a test set?
✍️ Practice
- Re-run the model and predict the price for a size-110 house.
- Change test_size to 0.5 and see whether the R-squared score changes.
🏠 Homework
- Use the same five-step pipeline on a different idea — e.g. predict a student’s score from hours studied. Write out your X and y, then report a prediction and the score.