Inferential Statistics: From Sample to Population
You almost never have all the data — so you measure a sample and reason carefully about the whole, using distributions and confidence intervals.
What you will learn
- Tell a sample from a population, and descriptive from inferential statistics
- Explain the normal distribution and the Central Limit Theorem in plain words
- Read a confidence interval as a range of plausible values
Why we infer instead of just describe
Earlier you learned descriptive statistics (mean, median, standard deviation), which describe the data in front of you. But you rarely have everyone. To learn the average height in India, you cannot measure all 1.4 billion people; you measure a sample of, say, 1,000 and reason about the population (everyone). That leap, from a sample to a confident statement about the whole, is inferential statistics. It is the actual “science” in data science.
| Term | Plain meaning | Example |
|---|---|---|
| Population | Everyone/everything you care about | All Indian adults |
| Sample | The few you actually measured | 1,000 surveyed adults |
| Descriptive stats | Summarise the sample | This sample’s average |
| Inferential stats | Conclude about the population | India’s likely average |
The normal distribution — the famous bell curve
Many real measurements (heights, exam scores, errors) cluster around an average, with fewer and fewer values as you move away — a shape called the normal distribution, or the bell curve. Its handy rule of thumb: about 68% of values fall within one standard deviation of the mean, and about 95% within two.
import numpy as np
# 10,000 exam scores: average 70, standard deviation 10
np.random.seed(0)
scores = np.random.normal(loc=70, scale=10, size=10000)
within_2_std = ((scores > 50) & (scores < 90)).mean()
print('mean of sample :', round(scores.mean(), 1))
print('fraction within 50-90 (±2 std):', round(within_2_std, 3))
Note: Output:
mean of sample : 70.0
fraction within 50-90 (±2 std): 0.954
We drew scores from a bell curve centred at 70 with std 10. About 95.4% landed within two standard deviations (50 to 90), matching the “95% within 2 std” rule. That rule is what lets us turn a spread into a probability.
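The 68% and 95% figures are rounded versions of exact values that come from the normal distribution itself, not from any particular dataset. A minimal sketch, assuming SciPy is installed; the helper name `fraction_within` is made up for this illustration:

```python
from scipy import stats

# Standard normal distribution: mean 0, standard deviation 1.
# Any normal distribution gives the same fractions once distances
# are measured in standard deviations.
z = stats.norm(loc=0, scale=1)

def fraction_within(k):
    """Theoretical fraction of a normal distribution within k std of the mean."""
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2, 3):
    print(f"within ±{k} std: {fraction_within(k):.4f}")
# within ±1 std: 0.6827
# within ±2 std: 0.9545
# within ±3 std: 0.9973
```

A simulated sample, like the 10,000 scores above, will land close to these values but not exactly on them.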
The Central Limit Theorem — why averages are trustworthy
Here is the idea that makes inference work, the Central Limit Theorem (CLT): even if your raw data is not bell-shaped, the averages of repeated, reasonably large random samples follow an approximately bell-shaped distribution centred on the true population mean. Bigger samples give averages that hug that mean more tightly: their spread shrinks in proportion to 1/√n, where n is the sample size. In plain words: averages are stable and predictable even when individual values are wild. That is why a poll of 1,000 people can estimate millions.
import numpy as np
np.random.seed(1)
# A wildly skewed population (NOT bell-shaped)
population = np.random.exponential(scale=10, size=100000)
# Take 1000 samples of 50, record each sample's mean
sample_means = [np.random.choice(population, 50).mean() for _ in range(1000)]
print('population mean :', round(population.mean(), 2))
print('mean of sample means:', round(np.mean(sample_means), 2))
print('std of sample means :', round(np.std(sample_means), 2))
Note: Output:
population mean : 10.01
mean of sample means: 10.02
std of sample means : 1.39
The raw population is lopsided, yet the 1,000 sample means cluster tightly around the true mean (10) with a small spread (1.39), close to the 10/√50 ≈ 1.41 that theory predicts for samples of 50. That is the CLT: sample averages are well-behaved and bell-shaped, so we can trust an estimate from a sample.
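How quickly the spread shrinks can be checked directly: theory says the standard deviation of sample means is roughly σ/√n, where σ is the population standard deviation and n the sample size. The sketch below is one way to compare that prediction with simulation. The helper `spread_of_sample_means`, the sample sizes, and the number of repeats are choices made for this illustration, and it uses NumPy's newer `default_rng` generator rather than `np.random.seed`.

```python
import numpy as np

rng = np.random.default_rng(1)  # seeded generator so reruns match

# Same kind of skewed population as in the example above
population = rng.exponential(scale=10, size=100_000)
sigma = population.std()

def spread_of_sample_means(n, repeats=2000):
    """Std of the means of `repeats` random samples, each of size n."""
    samples = rng.choice(population, size=(repeats, n))  # drawn with replacement
    return samples.mean(axis=1).std()

results = {}
for n in (10, 50, 200):
    observed = spread_of_sample_means(n)
    predicted = sigma / np.sqrt(n)
    results[n] = (observed, predicted)
    print(f"n={n:>3}  observed spread={observed:.2f}  sigma/sqrt(n)={predicted:.2f}")
```

The observed and predicted columns should agree closely, and each roughly fourfold increase in n should cut the spread about in half.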
Confidence intervals — an honest range, not a single guess
A single sample mean is one guess and is almost never exactly right. A confidence interval is more honest: instead of one number, it gives a range of plausible values for the true population mean. A “95% confidence interval” means: if we repeated the sampling many times, about 95% of the intervals we built would contain the true value.
import numpy as np
from scipy import stats
np.random.seed(2)
sample = np.random.normal(loc=70, scale=10, size=100) # 100 measured students
mean = sample.mean()
sem = stats.sem(sample) # standard error of the mean
low, high = stats.t.interval(0.95, len(sample)-1, loc=mean, scale=sem)
print('sample mean :', round(mean, 1))
print('95% confidence interval:', round(low, 1), 'to', round(high, 1))
Note: Output:
sample mean : 69.5
95% confidence interval: 67.6 to 71.5
Our 100 students averaged 69.5, but the honest claim is: the true average for all students is plausibly between 67.6 and 71.5. We report the range, not just 69.5, because a sample is never the last word.
Tip: A wider confidence interval means a less precise estimate; a narrower one means a more precise estimate. The most direct way to narrow it is a bigger sample: the standard error shrinks in proportion to 1/√n, so quadrupling the sample size roughly halves the width.
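The link between sample size and width can be read off the standard t-based interval for a mean, which is what the `stats.t.interval` call computes when given `scale=sem`. The symbols are notation introduced here for illustration: x̄ is the sample mean, s the sample standard deviation, n the sample size, and t* the critical value of the t distribution with n − 1 degrees of freedom.

```latex
\bar{x} \;\pm\; t^{*}_{n-1}\,\frac{s}{\sqrt{n}},
\qquad
\text{width} = 2\,t^{*}_{n-1}\,\frac{s}{\sqrt{n}}
```

As a rough check against the example, assume s is close to 10 (the sample was drawn with `scale=10`). With n = 100 and t* ≈ 1.98, the half-width is about 1.98 × 10/10 ≈ 2, in line with an interval about 4 points wide.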
Watch out: A 95% confidence interval does not mean “95% of the data lies here” or “95% chance the true value is in this exact interval”. It means the method captures the true value 95% of the time across many repeats. Subtle, but it is the classic interview trap.
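The "95% of the time across many repeats" reading can be tested by simulation, because in a simulation we know the true mean. This is a minimal sketch under stated assumptions: the true mean of 70, the 2,000 repeats, and the seed are arbitrary choices for illustration, and the interval is built with the same `stats.t.interval` call used earlier.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
TRUE_MEAN = 70          # known only because we simulate the population ourselves
n, repeats = 100, 2000  # sample size and number of repeated experiments

hits = 0
for _ in range(repeats):
    sample = rng.normal(loc=TRUE_MEAN, scale=10, size=n)
    low, high = stats.t.interval(
        0.95, n - 1, loc=sample.mean(), scale=stats.sem(sample)
    )
    # Count this experiment if its interval contains the true mean
    hits += int(low <= TRUE_MEAN <= high)

coverage = hits / repeats
print("fraction of intervals containing the true mean:", round(coverage, 3))
```

The printed fraction should sit near 0.95. Any single interval either contains 70 or it does not; the 95% describes how often the procedure succeeds.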
Q. What is the Central Limit Theorem’s key promise?
✍️ Practice
- Generate 5,000 normal values with `np.random.normal` and check what fraction fall within one standard deviation of the mean (expect ~68%).
- Draw a sample of 200 from any data and compute its mean and a 95% confidence interval; write the interval in a sentence.
🏠 Homework
- On a real numeric column, take a sample of 100 rows, compute its mean and a 95% confidence interval, then compare to the full-column mean. Write 3 lines on whether the interval captured the true mean and why.