Cleaning Data: Missing Values
Real data has blanks. Before you analyse it, you find the missing values and decide whether to fill them or drop them.
What you will learn
- Find missing values with isnull().sum()
- Fill blanks with fillna()
- Decide when to drop instead of fill
Cleaning is most of the job
Data scientists often spend more time cleaning data than on any other task. Garbage in, garbage out: if the data is wrong, every chart and conclusion built on it is wrong too. The most common problem of all is missing values: blank cells where a number or word should be.
In Pandas a missing value shows up as NaN (short for “Not a Number”), which is how Pandas marks a blank cell. The small example table built later in this section contains one.
Handling missing values is a small three-step routine. We will walk through each step below.
- Find the blanks: count how many each column has, with isnull().sum().
- Decide what to do: fill the blanks with a sensible value, or drop the rows that have them.
- Fix and check: apply your choice, then run isnull().sum() again to confirm no blanks remain.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'name': ['Asha', 'Ravi', 'Meera', 'Karan'],
    'age': [25, np.nan, 28, 32]  # Ravi's age is missing (NaN)
})
print(df)
Note: Output:
    name   age
0   Asha  25.0
1   Ravi   NaN
2  Meera  28.0
3  Karan  32.0
Ravi’s age is NaN, a blank. If we tried to use this column for maths or a chart without fixing it, NaN would cause trouble.
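To make "NaN would cause trouble" concrete, here is a minimal sketch using the same four-row table. The variable names (next_year, pandas_total, strict_total) are illustrative only. It shows two behaviours worth knowing: arithmetic on a blank stays blank, and Pandas skips NaN by default when summing, so a total can quietly cover fewer rows than you expect.

```python
import numpy as np
import pandas as pd

# Same table as in the lesson: Ravi's age is missing
df = pd.DataFrame({
    'name': ['Asha', 'Ravi', 'Meera', 'Karan'],
    'age': [25, np.nan, 28, 32],
})

# Arithmetic on a blank stays blank: Ravi's value is still NaN
next_year = df['age'] + 1
print(next_year)

# By default Pandas skips NaN, so this total covers 3 people, not 4
pandas_total = df['age'].sum()
print(pandas_total)  # 85.0

# With skipping turned off, the single blank makes the whole result NaN
strict_total = df['age'].sum(skipna=False)
print(strict_total)  # nan
```

Neither result is an error message, which is part of the problem: the blank can slip through unnoticed unless you look for it.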
Step 1 — find the blanks
Count missing values per column with isnull().sum(). It is the first thing to run on any new dataset.
print(df.isnull().sum())  # how many blanks in each column?
Note: Output:
name    0
age     1
dtype: int64
Name has no blanks; age has exactly 1. Now we know precisely what to fix before going further.
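Counting tells you how many blanks there are; it can also help to see which rows they sit in. The sketch below is one possible way to do that, assuming the same example table. The name rows_with_blanks is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['Asha', 'Ravi', 'Meera', 'Karan'],
    'age': [25, np.nan, 28, 32],
})

# True for every row that has at least one blank in any column
has_blank = df.isnull().any(axis=1)

# Keep only those rows so you can inspect them
rows_with_blanks = df[has_blank]
print(rows_with_blanks['name'].tolist())  # ['Ravi']
```

Looking at the affected rows before choosing a fix can reveal patterns, such as blanks clustered in one group of records.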
Step 2 — fill the gap
A common, gentle fix is to fill the blank with the column’s average, so we keep every row. Use fillna.
df['age'] = df['age'].fillna(df['age'].mean()) # fill blank with the average
print(df['age'])
Note: Output:
0    25.000000
1    28.333333
2    28.000000
3    32.000000
Name: age, dtype: float64
Ravi’s blank became 28.33, the average of the other three ages (25, 28, 32). No gaps remain, and we kept all four people.
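The third step of the routine, fix and check, can be sketched as follows. This is a minimal example on the same table, not the only way to do it; blanks_before, mean_age and blanks_after are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['Asha', 'Ravi', 'Meera', 'Karan'],
    'age': [25, np.nan, 28, 32],
})

# Find: count the blanks in the age column before fixing anything
blanks_before = df['age'].isnull().sum()
print(blanks_before)  # 1

# Fix: mean() ignores NaN, so this is the average of 25, 28 and 32
mean_age = df['age'].mean()
df['age'] = df['age'].fillna(mean_age)

# Check: count again per column; every count should now be 0
blanks_after = df.isnull().sum()
print(blanks_after)
```

Running the count a second time is cheap, and it catches mistakes such as filling the wrong column.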
Fill or drop?
Instead of filling, you can drop rows that have blanks with dropna(). Which is right depends on how much data is missing.
| Approach | Code | Use when |
|---|---|---|
| Fill with average | fillna(df['age'].mean()) | Numbers, few blanks |
| Fill with a value | fillna(0) or fillna('Unknown') | A sensible default exists |
| Drop the rows | df.dropna() | Blanks are rare and you can spare them |
Watch out: Dropping rows with dropna() throws away data — fine if blanks are rare, risky if many rows have gaps. Think before you delete; filling is often safer.
Tip: Always run isnull().sum() first. You cannot decide how to handle missing values until you know which columns have them and how many.
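As a rough illustration of the fill-or-drop choice, the sketch below applies both approaches to the same table and compares what is left. The names share_missing, filled and dropped are illustrative. Both fillna() and dropna() return new objects here, so the original df is untouched.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['Asha', 'Ravi', 'Meera', 'Karan'],
    'age': [25, np.nan, 28, 32],
})

# Fraction of blanks per column: 1 of 4 ages is missing, so age is 0.25
share_missing = df.isnull().mean()
print(share_missing)

# Option A: fill the age column with its average and keep every row
filled = df.fillna({'age': df['age'].mean()})

# Option B: drop any row that contains a blank
dropped = df.dropna()

# Compare how many rows survive each approach
print(len(df), len(filled), len(dropped))  # 4 4 3
```

Here dropping costs one row out of four, a quarter of the data, which is a lot for such a small table. On a large dataset with only a handful of blanks, the same call would cost very little.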
Q. What does df['age'].fillna(df['age'].mean()) do?
✍️ Practice
- Make a small DataFrame with one missing value and count blanks with isnull().sum().
- Fill that blank with the column mean, then try dropna() on a fresh copy and compare the row counts.
🏠 Homework
- Take a real CSV, run isnull().sum(), and handle any missing values: fill the numeric columns and decide whether to drop or fill the rest. Note what you chose and why.