
Inferential Statistics: From Sample to Population

You almost never have all the data — so you measure a sample and reason carefully about the whole, using distributions and confidence intervals.

What you will learn

Tell a sample from a population, and descriptive from inferential statistics
Explain the normal distribution and the Central Limit Theorem in plain words
Read a confidence interval as a range of plausible values

Why we infer instead of just describe

Earlier you learned descriptive statistics — mean, median, std — which describe the data in front of you. But you rarely have everyone. To learn the average height of India, you cannot measure 1.4 billion people; you measure a sample of, say, 1,000 and reason about the population (everyone). That leap — from a sample to a confident statement about the whole — is inferential statistics. It is the actual “science” in data science.

Term	Plain meaning	Example
Population	Everyone/everything you care about	All Indian adults
Sample	The few you actually measured	1,000 surveyed adults
Descriptive stats	Summarise the sample	This sample’s average
Inferential stats	Conclude about the population	India’s likely average
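To make the table concrete, here is a minimal sketch that simulates a population and then draws a sample from it. The heights (mean 165 cm, standard deviation 8 cm) and the population size are made-up numbers for illustration, not real survey data. In real work the population mean is unknown; we can only see it here because we generated the population ourselves.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: 1,000,000 adult heights in cm (invented numbers)
population = rng.normal(loc=165, scale=8, size=1_000_000)

# The sample: 1,000 people drawn at random, without replacement
sample = rng.choice(population, size=1000, replace=False)

sample_mean = sample.mean()          # descriptive: summarises the sample only
population_mean = population.mean()  # usually unknown; visible here because we simulated it

print('sample mean    :', round(sample_mean, 2))
print('population mean:', round(population_mean, 2))
```

The two numbers should land close together but not match exactly. Inferential statistics is about saying how close you can expect them to be.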

The normal distribution — the famous bell curve

Many real measurements (heights, exam scores, errors) cluster around an average, with fewer and fewer values as you move away — a shape called the normal distribution, or the bell curve. Its handy rule of thumb: about 68% of values fall within one standard deviation of the mean, and about 95% within two.

Generate normal data and check the 95% rule

import numpy as np

# 10,000 exam scores: average 70, standard deviation 10
np.random.seed(0)
scores = np.random.normal(loc=70, scale=10, size=10000)

within_2_std = ((scores > 50) & (scores < 90)).mean()
print('mean of sample :', round(scores.mean(), 1))
print('fraction within 50-90 (±2 std):', round(within_2_std, 3))

Note: Output:
mean of sample : 70.0
fraction within 50-90 (±2 std): 0.954

We drew scores from a bell curve centred at 70 with std 10. About 95.4% landed within two standard deviations (50 to 90), matching the “95% within 2 std” rule. That rule is what lets us turn a spread into a probability.
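The 68% and 95% figures are rounded. As a small sketch, the exact fractions for an ideal normal distribution can be read off its cumulative distribution function, `scipy.stats.norm.cdf`. The helper name `fraction_within` is made up for this example.

```python
from scipy import stats

def fraction_within(k):
    """Probability that a normal value lies within k standard deviations of its mean."""
    return stats.norm.cdf(k) - stats.norm.cdf(-k)

for k in (1, 2, 3):
    print(f'within {k} std: {fraction_within(k):.4f}')
```

This prints roughly 0.6827, 0.9545 and 0.9973, which is why the rule is often quoted as "68-95-99.7". A simulated sample will only approximate these values.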

The Central Limit Theorem — why averages are trustworthy

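Here is the idea that makes inference work, the Central Limit Theorem (CLT): even if your raw data is not bell-shaped, the averages of reasonably large random samples are approximately bell-shaped and centred on the true population mean. (The theorem assumes independent draws from a population with finite variance; a common rule of thumb is 30 or more values per sample, with more needed for heavily skewed data.) Bigger samples also give averages that hug the true mean more tightly: their spread, called the standard error, is roughly the population standard deviation divided by the square root of the sample size. In plain words: averages are stable and predictable even when individual values are wild. That is why a poll of 1,000 people can estimate millions.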

Sample means form a tight bell curve even from skewed data

import numpy as np

np.random.seed(1)
# A wildly skewed population (NOT bell-shaped)
population = np.random.exponential(scale=10, size=100000)

# Take 1000 samples of 50 (np.random.choice draws with replacement by default), record each sample's mean
sample_means = [np.random.choice(population, 50).mean() for _ in range(1000)]
print('population mean :', round(population.mean(), 2))
print('mean of sample means:', round(np.mean(sample_means), 2))
print('std of sample means :', round(np.std(sample_means), 2))

Note: Output:
population mean : 10.01
mean of sample means: 10.02
std of sample means : 1.39

The raw population is lopsided, yet the 1,000 sample-means cluster tightly around the true mean (10) with a small spread (1.39). That is the CLT: sample averages are well-behaved and bell-shaped, so we can trust an estimate from a sample.
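How small should that spread be? Standard theory says the spread of sample means is about the population standard deviation divided by the square root of the sample size, σ/√n. The sketch below checks this for three sample sizes. It uses NumPy's newer `default_rng` generator, so its numbers will not match the listing above, and the sample sizes 10, 50 and 200 are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Same kind of skewed population as before
population = rng.exponential(scale=10, size=100_000)
sigma = population.std()

results = {}
for n in (10, 50, 200):
    # 2,000 samples of size n (drawn with replacement), one mean per sample
    means = rng.choice(population, size=(2000, n)).mean(axis=1)
    observed = means.std()
    predicted = sigma / np.sqrt(n)
    results[n] = (observed, predicted)
    print(f'n={n:>3}  observed spread={observed:.2f}  sigma/sqrt(n)={predicted:.2f}')
```

The observed and predicted columns should agree closely, and both shrink as n grows. Going from 50 to 200 values per sample (four times as many) roughly halves the spread.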

Confidence intervals — an honest range, not a single guess

A single sample mean is one guess and is almost never exactly right. A confidence interval is more honest: instead of one number, it gives a range of plausible values for the true population mean. A “95% confidence interval” means: if we repeated the sampling many times, about 95% of the intervals we built would contain the true value.

A 95% confidence interval for the true average

import numpy as np
from scipy import stats

np.random.seed(2)
sample = np.random.normal(loc=70, scale=10, size=100)   # 100 measured students

mean = sample.mean()
sem  = stats.sem(sample)                       # standard error of the mean
low, high = stats.t.interval(0.95, len(sample)-1, loc=mean, scale=sem)
print('sample mean        :', round(mean, 1))
print('95% confidence interval:', round(low, 1), 'to', round(high, 1))

Note: Output:
sample mean : 69.5
95% confidence interval: 67.6 to 71.5

Our 100 students averaged 69.5, but the honest claim is: the true average for all students is plausibly between 67.6 and 71.5. We report the range, not just 69.5, because a sample is never the last word.
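It can help to see what `stats.t.interval` does under the hood. A minimal sketch, assuming the usual t-based formula: interval = mean ± t* × s/√n, where s is the sample standard deviation and t* is the critical value from the t distribution with n − 1 degrees of freedom. It uses a different random generator from the listing above, so the sample (and the interval) will differ.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
sample = rng.normal(loc=70, scale=10, size=100)

n = len(sample)
mean = sample.mean()
s = sample.std(ddof=1)               # sample standard deviation (divides by n - 1)
sem = s / np.sqrt(n)                 # standard error of the mean, same as stats.sem(sample)
t_crit = stats.t.ppf(0.975, n - 1)   # two-sided 95% critical value

low, high = mean - t_crit * sem, mean + t_crit * sem

# The library call should give the same interval
lib_low, lib_high = stats.t.interval(0.95, n - 1, loc=mean, scale=sem)
print('by hand :', round(low, 2), 'to', round(high, 2))
print('library :', round(lib_low, 2), 'to', round(lib_high, 2))
```

For n = 100 the critical value is close to 2, so a quick mental estimate of a 95% interval is "mean plus or minus two standard errors".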

Tip: A wider confidence interval means more uncertainty; a narrower one means more certainty. The most direct way to narrow it is a bigger sample, because the standard error shrinks like 1/√n: to halve the width you need roughly four times as many observations.
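To see how sample size drives the width, the sketch below builds a 95% interval for samples of 100, 400 and 1,600 values drawn from the same simulated population. The helper `ci_width` and the chosen sizes are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def ci_width(n):
    """Width of a 95% t-interval for the mean of one simulated sample of size n."""
    sample = rng.normal(loc=70, scale=10, size=n)
    low, high = stats.t.interval(0.95, n - 1, loc=sample.mean(), scale=stats.sem(sample))
    return high - low

widths = {n: ci_width(n) for n in (100, 400, 1600)}
for n, width in widths.items():
    print(f'n={n:>4}  interval width={width:.2f}')
```

Each fourfold increase in n should roughly halve the width, not quarter it. Precision gets more expensive as you go: past a certain point, extra data buys very little extra certainty.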

Watch out: A 95% confidence interval does not mean “95% of the data lies here” or “95% chance the true value is in this exact interval”. It means the method captures the true value 95% of the time across many repeats. Subtle, but it is the classic interview trap.
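The "many repeats" reading can be checked by simulation. In this sketch we know the true mean (70) because we generate the data ourselves. We draw 2,000 independent samples, build a 95% interval from each, and count how often the interval contains 70. The number of repeats and the sample size are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
true_mean, n, repeats = 70, 100, 2000

# Each row is one independent sample of n values
samples = rng.normal(loc=true_mean, scale=10, size=(repeats, n))
means = samples.mean(axis=1)
sems = samples.std(axis=1, ddof=1) / np.sqrt(n)
t_crit = stats.t.ppf(0.975, n - 1)

low = means - t_crit * sems
high = means + t_crit * sems

# Fraction of intervals that captured the true mean
coverage = ((low <= true_mean) & (true_mean <= high)).mean()
print('fraction of intervals containing the true mean:', round(coverage, 3))
```

The fraction should come out near 0.95. Any single interval either contains 70 or it does not; the 95% describes how often the procedure succeeds, not one particular interval.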

Q. What is the Central Limit Theorem’s key promise?

Answer: The CLT is about sample means, not raw data: for reasonably large samples, the averages form an approximate bell curve centred on the true mean, and that curve gets tighter as samples get bigger. That is why a sample can estimate a whole population.

✍️ Practice

Generate 5,000 normal values with np.random.normal and check what fraction fall within one standard deviation of the mean (expect ~68%).
Draw a sample of 200 from any data and compute its mean and a 95% confidence interval; write the interval in a sentence.

🏠 Homework

On a real numeric column, take a sample of 100 rows, compute its mean and a 95% confidence interval, then compare to the full-column mean. Write 3 lines on whether the interval captured the true mean and why.