Stats & ML · Core · 35 min read

Core Statistics: Mean, Median, Std & Correlation

A few simple statistics describe a whole column of data — its middle, its spread, and how two columns relate.

What you will learn

  • Tell mean from median (and why it matters)
  • Read the standard deviation as “spread”
  • Interpret a correlation between two columns

Describing data with a few numbers

You cannot eyeball a thousand rows. Instead, a handful of statistics summarise a column: where its middle is, how spread out it is, and how it relates to another column. Pandas computes them all for you.

Mean vs median — both are “the middle”

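The mean is the average: the sum of the values divided by how many there are. The median is the middle value once the data is sorted (with an even number of values, it is the average of the two middle ones). For roughly symmetric data the two land close together, but a single extreme value can drag the mean far away while the median barely moves.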

Mean and median when one value is extreme
import pandas as pd

salaries = pd.Series([30, 32, 35, 38, 40, 500])   # one huge outlier!

print('mean  :', salaries.mean())
print('median:', salaries.median())

Note: Output:
mean  : 112.5
median: 36.5
The single 500 pulls the mean up to 112.5, far above what almost everyone earns. The median (36.5) is barely affected by the outlier and describes a typical person far better. This is why salary statistics are usually reported as a median.
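To see what those two calls compute, here is a minimal plain-Python sketch. The helper names `mean` and `median` are written for this illustration and are not the pandas implementation; they only follow the definitions given above. The same salaries are used, once with and once without the outlier.

```python
# Plain-Python sketch of the two definitions (illustrative helpers, not pandas internals).
def mean(values):
    """Sum of the values divided by how many there are."""
    return sum(values) / len(values)

def median(values):
    """Middle value of the sorted data; with an even count, average the two middle values."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

salaries = [30, 32, 35, 38, 40, 500]
without_outlier = salaries[:-1]   # same data with the 500 dropped

print(mean(salaries), median(salaries))                 # 112.5 36.5
print(mean(without_outlier), median(without_outlier))   # 35.0 35
```

Dropping one value moves the mean from 112.5 to 35.0, while the median only shifts from 36.5 to 35.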

Standard deviation — the spread

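The standard deviation (std) measures how spread out the values are: roughly, the typical distance of a value from the mean. Small std = values huddle near the mean; large std = they are scattered widely. By default, pandas' .std() returns the sample standard deviation, which divides by n - 1 rather than n.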

Standard deviation compares two spreads
tight  = pd.Series([48, 50, 52, 50])     # close together
spread = pd.Series([10, 90, 30, 70])     # far apart

print('tight  std:', round(tight.std(), 1))
print('spread std:', round(spread.std(), 1))

Note: Output:
tight  std: 1.6
spread std: 36.5
Both series have the same mean (50), but the tight one has a tiny std (1.6) and the scattered one a big std (36.5). The std number captures “how spread out” in a single figure.
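The calculation behind a standard deviation can be sketched in a few lines of plain Python. The helper `std` below is an illustration, not pandas code; its `ddof` argument is named after the parameter of the same name on `Series.std()`, whose default of 1 gives the sample version. Setting it to 0 gives the population version, which is NumPy's default.

```python
import math

def std(values, ddof=1):
    """Standard deviation: ddof=1 is the sample version, ddof=0 the population version."""
    n = len(values)
    centre = sum(values) / n
    # Square each value's distance from the mean and add the results up.
    squared_gaps = sum((v - centre) ** 2 for v in values)
    return math.sqrt(squared_gaps / (n - ddof))

tight = [48, 50, 52, 50]
spread = [10, 90, 30, 70]

print(round(std(tight), 1), round(std(spread), 1))                   # 1.6 36.5
print(round(std(tight, ddof=0), 1), round(std(spread, ddof=0), 1))   # 1.4 31.6
```

The choice of divisor matters most for small samples like these. With thousands of rows the two versions are nearly identical.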

Correlation — do two columns move together?

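Correlation (by default, pandas computes the Pearson correlation) is a number from -1 to +1 saying how closely two columns follow a straight-line relationship. Near +1: they rise together. Near -1: one rises as the other falls. Near 0: no clear linear link, although a curved relationship can still exist.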

Correlation between house size and price
df = pd.DataFrame({
    'size':  [50, 60, 80, 100, 120],
    'price': [40, 55, 70, 95, 110]
})
print(df.corr())

Note: Output:
           size     price
size   1.000000  0.995421
price  0.995421  1.000000
Size and price correlate about 0.995, almost perfectly positive: bigger houses cost more. (Every column correlates 1.0 with itself, hence the diagonal.)
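For intuition about where that number comes from, the Pearson formula can be sketched by hand. The function `pearson` below is an illustrative helper rather than pandas code. It divides how much the two columns vary together by how much each varies on its own.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists of numbers."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]   # each value's distance from its column mean
    dy = [y - mean_y for y in ys]
    shared = sum(a * b for a, b in zip(dx, dy))
    scale = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return shared / scale

size = [50, 60, 80, 100, 120]
price = [40, 55, 70, 95, 110]

print(round(pearson(size, price), 4))                 # 0.9954
print(round(pearson(size, [-p for p in price]), 4))   # -0.9954
print(round(pearson(size, size), 4))                  # 1.0
```

Flipping the sign of one column flips the sign of the correlation, and a column compared with itself gives exactly 1.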

Statistic    Tells you                            Pandas
Mean         The average                          .mean()
Median       The middle (resistant to outliers)   .median()
Std          How spread out                       .std()
Correlation  How two columns relate               .corr()

Watch out: Correlation is not causation. Ice-cream sales and sunburns correlate — but ice cream does not cause sunburn; hot, sunny weather drives both. A high correlation shows a link, not a cause.
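The ice-cream example can be sketched with numbers. All figures below are invented purely for illustration (they are not real measurements), and `pearson` is a hand-written helper. Temperature is the hidden driver: both other columns follow it, so they also follow each other.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists of numbers."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    shared = sum(a * b for a, b in zip(dx, dy))
    return shared / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

# Made-up daily figures, for illustration only.
temperature = [18, 22, 25, 29, 33, 36]   # degrees C (the common cause)
ice_creams = [40, 52, 58, 75, 84, 95]    # cones sold
sunburns = [2, 4, 5, 8, 9, 12]           # cases reported

print('ice cream vs sunburn    :', round(pearson(ice_creams, sunburns), 2))
print('temperature vs ice cream:', round(pearson(temperature, ice_creams), 2))
print('temperature vs sunburn  :', round(pearson(temperature, sunburns), 2))
```

All three correlations come out strongly positive. The number alone cannot tell you which column, if any, is doing the causing.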

Tip: When the mean and median differ a lot, you have outliers or a lopsided distribution. Comparing the two is a quick, free check on every numeric column.

Q. A dataset of salaries has mean 112 and median 36. What does that gap most likely mean?

Answer: The mean is sensitive to extreme values; a mean far above the median signals a few large outliers dragging the average up.

✍️ Practice

  1. Make a Series with one big outlier and show how the mean and median differ.
  2. Build a DataFrame of two columns you expect to relate (e.g. study hours and scores) and read their .corr().

🏠 Homework

  1. On a real dataset, pick a numeric column and report its mean, median and std. Then pick two columns and report their correlation, saying in words what it means.