Core Statistics: Mean, Median, Std & Correlation
A few simple statistics describe a whole column of data — its middle, its spread, and how two columns relate.
What you will learn
- Tell mean from median (and why it matters)
- Read the standard deviation as “spread”
- Interpret a correlation between two columns
Describing data with a few numbers
You cannot eyeball a thousand rows. Instead, a handful of statistics summarise a column: where its middle is, how spread out it is, and how it relates to another column. Pandas computes them all for you.
Mean vs median — both are “the middle”
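The mean is the average: the sum of the values divided by how many there are. The median is the middle value when the data are sorted (with an even count, the average of the two middle values). For roughly symmetric data the two land close together, but one extreme value can drag the mean far away, while the median barely moves.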
import pandas as pd
salaries = pd.Series([30, 32, 35, 38, 40, 500]) # one huge outlier!
print('mean :', salaries.mean())
print('median:', salaries.median())
Output:
mean : 112.5
median: 36.5
Note: The single 500 yanks the mean up to 112.5, far above almost everyone. The median (36.5) ignores the outlier and describes a typical person far better. This is why news reports often quote the “median” salary.
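To see how much of that gap comes from the one outlier, you can recompute both statistics with it removed. This is a small sketch reusing the salaries above; the names `without_outlier`, `mean_shift` and `median_shift` are made up for illustration.

```python
import pandas as pd

salaries = pd.Series([30, 32, 35, 38, 40, 500])  # same data as above
without_outlier = salaries[salaries < 500]        # keep everything except the 500

# How far does each statistic move when the outlier is present?
mean_shift = salaries.mean() - without_outlier.mean()
median_shift = salaries.median() - without_outlier.median()

print('mean shift  :', mean_shift)    # 112.5 - 35.0 = 77.5
print('median shift:', median_shift)  # 36.5 - 35.0 = 1.5
```

One value out of six moves the mean by 77.5 but the median by only 1.5.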
Standard deviation — the spread
The standard deviation (std) measures how spread out the values are. Small std = values huddle near the mean; large std = they are scattered widely.
tight = pd.Series([48, 50, 52, 50]) # close together
spread = pd.Series([10, 90, 30, 70]) # far apart
print('tight std:', round(tight.std(), 1))
print('spread std:', round(spread.std(), 1))
Output:
tight std: 1.6
spread std: 36.5
Note: Both series have the same mean (50), but the tight one has a tiny std (1.6) and the scattered one a big std (36.5). The std captures “how spread out” in a single figure.
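It can help to compute the std by hand once. The sketch below follows the usual recipe (subtract the mean, square, average, take the square root) on the tight series. Pandas `.std()` defaults to the sample version, which divides by n - 1 (`ddof=1`). Passing `ddof=0` divides by n instead. The variable names are illustrative.

```python
import math
import pandas as pd

tight = pd.Series([48, 50, 52, 50])

mean = tight.mean()                                  # 50.0
squared_devs = (tight - mean) ** 2                   # 4, 0, 4, 0
sample_var = squared_devs.sum() / (len(tight) - 1)   # divide by n - 1
manual_std = math.sqrt(sample_var)

print(round(manual_std, 1))          # 1.6, same as tight.std()
print(round(tight.std(ddof=0), 1))   # 1.4, population std (divide by n)
```

The difference between the two versions shrinks as the column gets longer, so for large datasets it rarely changes the story.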
Correlation — do two columns move together?
Correlation is a number from -1 to +1 saying how closely two columns move together along a straight line. Near +1: they rise together. Near -1: one rises as the other falls. Near 0: no clear linear link.
df = pd.DataFrame({
    'size': [50, 60, 80, 100, 120],
    'price': [40, 55, 70, 95, 110]
})
print(df.corr())
Output:
           size     price
size   1.000000  0.995421
price  0.995421  1.000000
Note: Size and price correlate at about 0.995, almost perfectly positive: bigger houses cost more. (Every column correlates 1.0 with itself, hence the diagonal.)
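The three cases (+1, -1 and 0) are easier to trust once you have seen them come out of the formula. The sketch below uses made-up columns and a hand-written `pearson` helper (a hypothetical name, not a pandas function), then checks it against pandas' own `Series.corr`, which uses the Pearson method by default.

```python
import pandas as pd

def pearson(a, b):
    """Pearson correlation computed from deviations around each mean."""
    da = a - a.mean()
    db = b - b.mean()
    return (da * db).sum() / ((da ** 2).sum() * (db ** 2).sum()) ** 0.5

x = pd.Series([1, 2, 3, 4, 5])
rises = pd.Series([12, 14, 16, 18, 20])   # goes up in step with x
falls = pd.Series([10, 8, 6, 4, 2])       # goes down as x goes up
no_link = pd.Series([2, 1, 3, 1, 2])      # no straight-line pattern

for name, col in [('rises', rises), ('falls', falls), ('no_link', no_link)]:
    # The hand-written value and pandas' value should agree
    print(name, round(pearson(x, col), 3), round(x.corr(col), 3))
```

The three columns come out at +1, -1 and 0 (up to floating-point rounding), matching the description above.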
| Statistic | Tells you | Pandas |
|---|---|---|
| Mean | The average | .mean() |
| Median | The middle (outlier-resistant) | .median() |
| Std | How spread out | .std() |
| Correlation | How two columns relate | .corr() |
Watch out: Correlation is not causation. Ice-cream sales and sunburns correlate — but ice cream does not cause sunburn; hot, sunny weather drives both. A high correlation shows a link, not a cause.
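Here is how that trap looks in a table. The numbers are invented purely for illustration (they are not real measurements), and the column names are assumptions: temperature is the hidden driver, and both other columns simply follow it.

```python
import pandas as pd

# Made-up daily figures, for illustration only.
days = pd.DataFrame({
    'temperature': [18, 21, 24, 27, 30, 33],   # the hidden driver
    'ice_cream':   [40, 52, 58, 71, 80, 95],   # sales rise on hot days
    'sunburns':    [1, 2, 4, 5, 7, 8],         # so do sunburn cases
})

corr = days.corr()
print(corr.round(2))
```

All three pairs come out strongly positive, including ice cream versus sunburns. The correlation matrix alone cannot tell you that temperature is the common cause; that takes knowledge of the situation.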
Tip: When the mean and median differ a lot, you have outliers or a lopsided distribution. Comparing the two is a quick, free check on every numeric column.
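One way to run that check on every numeric column at once is to put the means and medians side by side. This is a minimal sketch with made-up columns; `summary` and `gap` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'salary': [30, 32, 35, 38, 40, 500],   # contains one outlier
    'age':    [25, 30, 35, 40, 45, 50],    # evenly spread, no outlier
})

# One row per column: mean, median, and the gap between them
summary = pd.DataFrame({'mean': df.mean(), 'median': df.median()})
summary['gap'] = summary['mean'] - summary['median']
print(summary)
```

Here the salary gap is 76.0 while the age gap is 0.0, so salary is the column worth a closer look.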
Q. A dataset of salaries has mean 112 and median 36. What does that gap most likely mean?
✍️ Practice
- Make a Series with one big outlier and show how the mean and median differ.
- Build a DataFrame of two columns you expect to relate (e.g. study hours and scores) and read their .corr().
🏠 Homework
- On a real dataset, pick a numeric column and report its mean, median and std. Then pick two columns and report their correlation, saying in words what it means.