Working with Real Data · Extra · 35 min read

Handling Missing Data (Imputation)

Real datasets have holes. Fill them sensibly before they reach a model that refuses to run on missing values.

What you will learn

  • Understand why missing values break models
  • Fill gaps with SimpleImputer
  • Choose mean, median or most-frequent filling

Real data is messy and full of holes

The clean toy datasets so far were a fantasy. Real data — a survey, an export from a shop, medical records — is full of missing values: a blank age, an empty income, a survey question left unanswered. In code these show up as NaN (short for Not a Number, the marker for “no value here”).

Watch out: Most scikit-learn models crash if you feed them a NaN. You cannot just ignore missing values — you must deal with them before training. This is one of the very first steps on any real dataset.
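To see the failure for yourself, here is a minimal sketch. It assumes a LinearRegression model and a made-up "age to spending" dataset; neither comes from this lesson, they are only there to trigger the error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: one age is missing
X = np.array([[25.0], [np.nan], [35.0], [40.0]])
y = np.array([100.0, 120.0, 150.0, 160.0])

try:
    LinearRegression().fit(X, y)
    crashed = False
except ValueError as err:
    # scikit-learn validates the input and refuses to train on NaN
    crashed = True
    print('Refused to train:', type(err).__name__)
```

The model never starts training: the input check stops it as soon as it finds the NaN.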

Two ways to deal with a hole

  • Drop the row (or column) with the missing value. Simple, but you throw away data — bad if many rows have a hole somewhere.
  • Impute — fill the hole with a sensible guess. Usually better, because you keep all the rows. Imputation just means “filling in missing values”.
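The cost of dropping can be sketched with plain NumPy. The survey below is invented for illustration (the columns are assumed to be age, income and household size); each hole sits in a different row, which is the situation where dropping hurts most.

```python
import numpy as np

# Hypothetical survey: columns are age, income, household size
X = np.array([
    [25.0,   30000.0, 2.0],
    [np.nan, 42000.0, 3.0],
    [35.0,   np.nan,  1.0],
    [40.0,   51000.0, np.nan],
    [52.0,   60000.0, 4.0],
])

# keep only the rows where no column is NaN
complete = ~np.isnan(X).any(axis=1)
X_dropped = X[complete]

print('Rows before:', X.shape[0])
print('Rows after dropping:', X_dropped.shape[0])
```

Only three cells out of fifteen are blank, yet dropping removes three of the five rows.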

For a numeric column, a common fill is the column’s mean (average) or median (middle value). For a category column (like “city”), you fill with the most frequent value.
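For a category column, the same idea might look like the sketch below. The city names are made up, and the column is stored as a NumPy object array with np.nan marking the blank, which is one way (not the only way) to hold text data with holes.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical city column: the second entry is missing
cities = np.array([['London'], [np.nan], ['Paris'], ['London']], dtype=object)

imputer = SimpleImputer(strategy='most_frequent')
cities_filled = imputer.fit_transform(cities)

# 'London' appears most often, so it fills the blank
print([row[0] for row in cities_filled])
```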

A worked example with SimpleImputer

scikit-learn’s SimpleImputer fills the holes for you. Here one of four people has a missing age (np.nan). We fill that gap with the mean of the ages that are present.

Filling a missing age with the column mean
import numpy as np
from sklearn.impute import SimpleImputer

# the second person's age is missing
X = [[25.0], [np.nan], [35.0], [40.0]]

imputer = SimpleImputer(strategy='mean')
X_filled = imputer.fit_transform(X)

# float() keeps the printed list clean on newer NumPy versions
print('Filled values:', [round(float(v[0]), 2) for v in X_filled])
print('The mean used:', round(imputer.statistics_[0], 2))

Note: Output:
Filled values: [25.0, 33.33, 35.0, 40.0]
The mean used: 33.33

The known ages are 25, 35 and 40, whose mean is 33.33, and that is exactly what dropped into the empty slot. The other three ages are untouched. The model can now run without crashing.

Mean, median or most-frequent?

strategy        | Fills with               | Best for
'mean'          | The column average       | Numbers with no wild outliers
'median'        | The middle value         | Numbers with outliers (a salary of 9,000,000 will not skew it)
'most_frequent' | The commonest value      | Categories like city or colour
'constant'      | A fixed value you choose | When “missing” itself is meaningful

A concrete reason to prefer median sometimes: if incomes are 20k, 25k, 30k and one is 5,000k, the mean is dragged up to about 1,269k by the outlier and would fill blanks with an unrealistic value. The median (the middle income, here 27.5k) ignores that extreme, so it fills more sensibly.
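The income example can be checked with a short sketch. It uses the four incomes from the paragraph above (in thousands) plus one blank entry added here so there is something to fill.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# incomes in thousands; the last one is missing (added for illustration)
incomes = [[20.0], [25.0], [30.0], [5000.0], [np.nan]]

mean_fill = SimpleImputer(strategy='mean').fit(incomes).statistics_[0]
median_fill = SimpleImputer(strategy='median').fit(incomes).statistics_[0]

print('Mean would fill with:  ', float(mean_fill))    # 1268.75
print('Median would fill with:', float(median_fill))  # 27.5
```

A fill of 1268.75 looks like none of the three typical incomes, while 27.5 sits right among them.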

Watch out: Fit the imputer on the training data only, then apply it to the test data — exactly like the scaler. Computing the mean using the test set leaks information and inflates your score. (Pipelines, two lessons on, make this automatic.)
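A minimal sketch of that rule, assuming a tiny hand-made split (the numbers are invented; in practice the split would come from something like train_test_split):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical split: both parts have a hole
X_train = np.array([[10.0], [np.nan], [30.0]])
X_test = np.array([[np.nan], [100.0]])

imputer = SimpleImputer(strategy='mean')
imputer.fit(X_train)                      # learn the mean from training data only

X_train_filled = imputer.transform(X_train)
X_test_filled = imputer.transform(X_test)  # reuse the training mean, do not refit

print('Mean learned from train:', float(imputer.statistics_[0]))
print('Test hole filled with:', float(X_test_filled[0][0]))
```

The test hole is filled with 20.0, the training mean. The test value 100.0 plays no part in the fill.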

Tip: Before imputing, it is worth asking why a value is missing. Sometimes “missing” carries meaning (a blank “date returned” may mean “not returned yet”), in which case a flag column or 'constant' fill beats a blind average.
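One way to keep the "this was missing" signal is sketched below. The column (days until an item was returned) and the placeholder value -1.0 are assumptions chosen for illustration; add_indicator=True asks SimpleImputer to append a flag column marking which rows were blank.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical column: days until the item was returned; NaN = not returned yet
days_out = np.array([[3.0], [np.nan], [7.0], [np.nan]])

imputer = SimpleImputer(strategy='constant', fill_value=-1.0, add_indicator=True)
result = imputer.fit_transform(days_out)

# column 0: the filled values, column 1: 1.0 where the value was missing
print(result.tolist())
```

The model then sees both a placeholder and an explicit flag, so it can learn from the fact that a value was missing.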

Q. Why is filling missing numbers with the median sometimes better than the mean?

Answer: A few extreme outliers drag the mean far from the typical value, so blanks get unrealistic fills. The median is the middle value and ignores those extremes, giving a more sensible fill.

✍️ Practice

  1. Change the strategy to 'median' on the age example and confirm the filled value.
  2. Make a small column with one outlier and compare the value mean-fill vs median-fill would insert.

🏠 Homework

  1. Take any small dataset idea with a few blank cells and write, for each column, whether you would drop, mean-fill, median-fill or most-frequent-fill, and why.