Pipelines: Chaining Steps Safely
Bundle preprocessing and the model into one object so the same steps always run together — no leakage.
What you will learn
- Explain why pipelines prevent data leakage
- Build a Pipeline of scaler + model
- See preprocessing applied automatically at predict time
The problem pipelines solve
Real projects chain several steps: impute missing values, encode categories, scale features, then train the model. If you run these by hand it is easy to make a costly mistake — most often data leakage, where information from the test set sneaks into training and makes your score a lie.
A classic leak: scaling the whole dataset before splitting, so the scaler “saw” the test rows. A Pipeline removes that risk by binding all the steps into one object whose preprocessing is fitted only on the rows you pass to .fit(), which should be the training set.
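To make the leak concrete, here is a minimal sketch that fits one scaler on every row and another on the training rows only, then compares the statistics each one learned. The variable names (leaky_scaler, safe_scaler, mean_gap) are illustrative, and the iris data and split are borrowed from the worked example later in this lesson.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

# Leaky: the scaler is fitted on every row, including the future test rows
leaky_scaler = StandardScaler().fit(X)

# Safe: the scaler is fitted on the training rows only
safe_scaler = StandardScaler().fit(Xtr)

# The per-feature means differ, because the leaky scaler also averaged the test rows
mean_gap = np.abs(leaky_scaler.mean_ - safe_scaler.mean_)
print("Difference in learned means:", mean_gap)
```

Any nonzero gap is information about the test rows that reached the preprocessing step before evaluation.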
What a pipeline is
A Pipeline is a single object that holds an ordered list of steps. When you call .fit() on the pipeline, it runs every preprocessing step and trains the model — in order. When you call .predict(), it runs the same preprocessing on the new data automatically before predicting. You can never forget a step or apply it inconsistently.
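The "ordered list of steps" can be seen directly by building a Pipeline from (name, estimator) pairs. This is a small sketch; the step names "scale" and "model" are arbitrary labels chosen for illustration.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Each step is a (name, estimator) pair; the last step is the model
pipe = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("model", KNeighborsClassifier()),
])

# Steps are stored in the order they will run
step_names = [name for name, _ in pipe.steps]
print(step_names)

# Any step can be looked up by its name
print(type(pipe.named_steps["scale"]).__name__)
```

Nothing is fitted yet at this point; the object only records which steps run and in what order.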
A worked example: scaler + KNN
KNN is distance-based, so features on larger scales can dominate the distance unless you scale them first. Instead of scaling by hand and risking a leak, we put the scaler and the model in one pipeline. make_pipeline chains them in order.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)
# scale THEN classify, as one object
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
pipe.fit(Xtr, ytr) # scaler fits on train only, then KNN trains
print('Test accuracy:', round(pipe.score(Xte, yte), 3))
Output:
Test accuracy: 0.978
One pipe.fit() scaled the training data and trained KNN; pipe.score() then scaled the test data with the same scaler and predicted. The scaler never saw the test rows during fitting — no leakage, and far less code to get wrong.
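One way to convince yourself of this is to repeat the same work by hand and compare. The sketch below assumes the same data, split, and settings as the example above; the names scaler, knn, and manual_pred are illustrative. It relies on make_pipeline naming each step after its lowercased class name, which is how the fitted scaler is looked up.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
pipe.fit(Xtr, ytr)

# The same work done by hand: fit the scaler on train, then reuse it on test
scaler = StandardScaler().fit(Xtr)
knn = KNeighborsClassifier(n_neighbors=3).fit(scaler.transform(Xtr), ytr)
manual_pred = knn.predict(scaler.transform(Xte))

# The pipeline and the manual version make identical predictions
same_predictions = np.array_equal(pipe.predict(Xte), manual_pred)
print("Same predictions:", same_predictions)

# The scaler inside the pipeline learned its means from the training rows only
fitted_scaler = pipe.named_steps["standardscaler"]
print("Means from train only:", np.allclose(fitted_scaler.mean_, Xtr.mean(axis=0)))
```

Both checks should report True: the pipeline does what careful manual code would do, in fewer lines.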
Why this is the professional default
- No leakage — every preprocessing step is fitted only on the data you pass to .fit(), so fitting on the training split keeps the test rows unseen.
- Consistency — the exact same steps run at predict time, so new data is treated identically.
- Works with tuning & cross-validation — a pipeline behaves like a single model, so GridSearchCV and cross_val_score apply the preprocessing correctly inside every fold.
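The cross-validation point can be sketched as follows. Passing the whole pipeline to cross_val_score or GridSearchCV means the scaler is refitted on the training part of each fold. The grid values below are arbitrary choices for illustration. The parameter key follows the step-name__parameter convention, where the step name comes from make_pipeline's automatic lowercase naming.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# In each of the 5 folds the scaler is refitted on that fold's training part only
scores = cross_val_score(pipe, Xtr, ytr, cv=5)
print("Mean CV accuracy:", round(scores.mean(), 3))

# Tune a parameter of the model step; the key is <step name>__<parameter>
param_grid = {"kneighborsclassifier__n_neighbors": [1, 3, 5, 7]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(Xtr, ytr)

print("Best parameters:", search.best_params_)
print("Test accuracy of best pipeline:", round(search.score(Xte, yte), 3))
```

The test split is touched only once, at the very end, after tuning has finished on the training data.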
When different columns need different treatment — say, impute + scale numbers but one-hot encode categories — you wrap those branches in a ColumnTransformer and drop that into the pipeline. The principle is identical: one object, steps applied in order, fit on training data only.
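A minimal sketch of that idea, using a made-up toy array rather than real data: columns 0 and 1 are numeric with one missing value, and column 2 is a category stored as an integer code. The data, the branch names "num" and "cat", and the choice of LogisticRegression are all assumptions for illustration. The model is fitted on all six rows only to show the wiring, not to measure accuracy.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two numeric columns and one category code (0, 1 or 2)
X = np.array([
    [25.0, 40000.0, 0],
    [32.0, np.nan, 1],
    [47.0, 82000.0, 2],
    [51.0, 90000.0, 1],
    [23.0, 35000.0, 0],
    [38.0, 61000.0, 2],
])
y = np.array([0, 0, 1, 1, 0, 1])

# Numeric branch: impute missing values, then scale
numeric_branch = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
# Categorical branch: one-hot encode the category code
categorical_branch = OneHotEncoder(handle_unknown="ignore")

preprocess = ColumnTransformer([
    ("num", numeric_branch, [0, 1]),
    ("cat", categorical_branch, [2]),
])

# The ColumnTransformer is just another step in the pipeline
model = make_pipeline(preprocess, LogisticRegression())
model.fit(X, y)

# 2 scaled numeric columns + 3 one-hot columns = 5 features reach the model
Xt = model.named_steps["columntransformer"].transform(X)
print(Xt.shape)
print(model.predict(X[:2]))
```

With a pandas DataFrame you would normally pass column names instead of the integer positions used here.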
Tip: Once you are past tiny demos, always wrap preprocessing and the model in a pipeline. It is the single best habit for clean, leak-free scikit-learn code — and reviewers expect it.
Q. What is the main reason to use a scikit-learn Pipeline?
✍️ Practice
- Swap KNeighborsClassifier for LogisticRegression(max_iter=200) in the pipeline and compare the test accuracy.
- Explain in one sentence how scaling the whole dataset before splitting leaks information.
🏠 Homework
- Describe, in your own words, a data-leakage mistake you could make by preprocessing by hand, and how a pipeline prevents it.