Pipelines: Chaining Steps Safely

Bundle preprocessing and the model into one object so the same steps always run together — no leakage.

What you will learn

  • Explain why pipelines prevent data leakage
  • Build a Pipeline of scaler + model
  • See preprocessing applied automatically at predict time

The problem pipelines solve

Real projects chain several steps: impute missing values, encode categories, scale features, then train the model. If you run these by hand it is easy to make a costly mistake — most often data leakage, where information from the test set sneaks into training and makes your score a lie.

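A classic leak: scaling the whole dataset before splitting, so the scaler “saw” the test rows. A Pipeline removes that risk by binding all the steps into one object. Every step is fitted only on the data you pass to .fit(), so if you split first and fit the pipeline on the training set, the test rows never influence the preprocessing.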
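To make the classic leak concrete, here is a minimal sketch that fits one scaler on all rows and another on the training rows only. The variable names (leaky_scaler, safe_scaler, means_differ) are chosen for illustration; the dataset and split settings simply mirror the ones used later in this lesson.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

# Leaky: the scaler's mean and std are computed from ALL rows, test rows included
leaky_scaler = StandardScaler().fit(X)

# Safe: the scaler's statistics come from the training rows only
safe_scaler = StandardScaler().fit(Xtr)

# Different learned means show that the test rows influenced the leaky scaler
print("Mean learned from all rows:  ", leaky_scaler.mean_.round(3))
print("Mean learned from train rows:", safe_scaler.mean_.round(3))
means_differ = not np.allclose(leaky_scaler.mean_, safe_scaler.mean_)
print("Statistics differ:", means_differ)
```

The difference in the learned statistics is usually small on a clean dataset like iris, but it is exactly the information that should not reach the training step.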

What a pipeline is

A Pipeline is a single object that holds an ordered list of steps. When you call .fit() on the pipeline, it fits each preprocessing step, transforms the data with it, and passes the result to the next step, finishing by training the model. When you call .predict(), it applies the same already-fitted preprocessing to the new data (transforming only, never refitting) before predicting. You cannot forget a step or apply it inconsistently.
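One way to see what those two calls do is to write them out by hand and compare. The sketch below assumes a scaler followed by a KNN classifier (the same pair the worked example uses); the step names "scale" and "knn" are arbitrary labels picked for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

# Pipeline version: an ordered list of (name, step) pairs
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=3)),
])
pipe.fit(Xtr, ytr)
pipe_pred = pipe.predict(Xte)

# Manual version of the same two calls
scaler = StandardScaler()
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(scaler.fit_transform(Xtr), ytr)            # fit: fit + transform, then train
manual_pred = knn.predict(scaler.transform(Xte))   # predict: transform only, then predict

same = np.array_equal(pipe_pred, manual_pred)
print("Identical predictions:", same)
```

The manual version works, but it relies on you remembering to call transform (not fit_transform) on the new data every time; the pipeline makes that the only possible behaviour.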

A worked example: scaler + KNN

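KNN classifies by distance, so a feature with a large numeric range would dominate the result unless the features are scaled. Instead of scaling by hand and risking a leak, we put the scaler and the model in one pipeline. make_pipeline chains them in order and names each step automatically.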

A pipeline that scales then classifies, leak-free
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

# scale THEN classify, as one object
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
pipe.fit(Xtr, ytr)              # scaler fits on train only, then KNN trains

print('Test accuracy:', round(pipe.score(Xte, yte), 3))

Output: Test accuracy: 0.978

Note: One pipe.fit() scaled the training data and trained KNN; pipe.score() then scaled the test data with the same scaler and predicted. The scaler never saw the test rows during fitting, so there is no leakage, and far less code to get wrong.
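If you want to check that claim rather than take it on trust, you can look inside the fitted pipeline. This is a small sketch that rebuilds the same pipeline and inspects it through named_steps; the helper variables (step_names, matches_train, matches_full) are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
pipe.fit(Xtr, ytr)

# make_pipeline names each step after its class, in lowercase
step_names = list(pipe.named_steps)
print("Step names:", step_names)

# The fitted scaler's mean matches the training rows, not the full dataset
fitted_scaler = pipe.named_steps["standardscaler"]
matches_train = np.allclose(fitted_scaler.mean_, Xtr.mean(axis=0))
matches_full = np.allclose(fitted_scaler.mean_, X.mean(axis=0))
print("Mean matches training rows:", matches_train)
print("Mean matches full dataset: ", matches_full)
```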

Why this is the professional default

  • No leakage: every preprocessing step is fitted only on the data passed to .fit(), so fitting the pipeline on the training set keeps the test rows out automatically.
  • Consistency: the exact same fitted steps run at predict time, so new data is treated identically.
  • Works with tuning & cross-validation: a pipeline behaves like a single model, so GridSearchCV and cross_val_score refit the preprocessing on the training part of every fold.
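The last point can be sketched as follows. The fold count and the candidate values for n_neighbors are arbitrary choices made for this example, not recommendations; the parameter name uses scikit-learn's step__parameter convention with the step name that make_pipeline generates.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))

# Each of the 5 folds refits the scaler on that fold's training part only
scores = cross_val_score(pipe, Xtr, ytr, cv=5)
print("Fold accuracies:", scores.round(3))

# Step parameters are addressed as <step name>__<parameter name>
param_grid = {"kneighborsclassifier__n_neighbors": [1, 3, 5, 7]}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(Xtr, ytr)
best_k = grid.best_params_["kneighborsclassifier__n_neighbors"]
print("Best n_neighbors:", best_k)
print("Test accuracy of tuned pipeline:", round(grid.score(Xte, yte), 3))
```

Tuning happens on the training split only, and the held-out test rows are scored once at the end, so the final number is not influenced by the search.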

When different columns need different treatment — say, impute + scale numbers but one-hot encode categories — you wrap those branches in a ColumnTransformer and drop that into the pipeline. The principle is identical: one object, steps applied in order, fit on training data only.
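A minimal sketch of that arrangement is below. The tiny DataFrame, its column names, and the labels are hypothetical data invented purely for illustration, and the choice of median imputation and LogisticRegression is an assumption made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data, invented for illustration only
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 38, 29],
    "income": [30000, 52000, 47000, np.nan, 61000, 35000],
    "city": ["Delhi", "Mumbai", "Delhi", "Pune", "Mumbai", "Pune"],
})
y = [0, 1, 0, 1, 1, 0]

# Numeric branch: fill missing values with the median, then scale
numeric = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

# Each branch is applied only to the columns listed for it
preprocess = ColumnTransformer([
    ("num", numeric, ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

# The ColumnTransformer is just another step in the pipeline
model = make_pipeline(preprocess, LogisticRegression())
model.fit(df, y)

# 2 scaled numeric columns + 3 one-hot city columns
n_features = model.named_steps["columntransformer"].transform(df).shape[1]
print("Features after preprocessing:", n_features)
print("Predictions:", model.predict(df))
```

On real data you would split into train and test sets first and call model.fit on the training rows only, exactly as in the KNN example.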

Tip: Once you are past tiny demos, always wrap preprocessing and the model in a pipeline. It is the single best habit for clean, leak-free scikit-learn code — and reviewers expect it.

Q. What is the main reason to use a scikit-learn Pipeline?

Answer: A pipeline bundles preprocessing and the model so the same steps run in order and fit only on training data, which prevents leakage and keeps predict-time behaviour consistent.

✍️ Practice

  1. Swap KNeighborsClassifier for LogisticRegression(max_iter=200) in the pipeline and compare the test accuracy.
  2. Explain in one sentence how scaling the whole dataset before splitting leaks information.

🏠 Homework

  1. Describe, in your own words, a data-leakage mistake you could make by preprocessing by hand, and how a pipeline prevents it.