A/B Testing: Statistical Significance
A higher number does not always mean a real winner — statistical significance is how you tell a true improvement from random luck.
What you will learn
- Explain why a small A/B difference can be pure luck
- Define statistical significance and the 95% confidence idea
- Use sample size and a simple check before declaring a winner
The trap: luck looks like a win
In the A/B testing lesson you compared two versions and kept the higher one. But there is a hidden danger. Flip a fair coin 10 times for "version A" and 10 times for "version B" — one will probably get more heads, purely by chance. If you call that the "winner", you have learned nothing. The same thing happens with small A/B tests: a tiny lead can be random luck, not a real difference. Statistical significance is the tool that tells the two apart.
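The coin-flip thought experiment is easy to try for yourself. The sketch below is only an illustration: the function name, the number of trials, and the random seed are assumptions chosen for this example, not part of any A/B testing tool. It flips two identical fair coins 10 times each, many times over, and counts how often one of them ends up "ahead".

```python
import random

def chance_of_a_false_winner(flips=10, trials=10_000, seed=1):
    """Share of trials where two identical fair coins end with different head counts."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    different = 0
    for _ in range(trials):
        heads_a = sum(rng.random() < 0.5 for _ in range(flips))
        heads_b = sum(rng.random() < 0.5 for _ in range(flips))
        if heads_a != heads_b:
            different += 1
    return different / trials

share = chance_of_a_false_winner()
print(f"One coin looks like the 'winner' in {share:.0%} of trials")
```

Even though the coins are identical, one of them comes out ahead in roughly four trials out of five. A lead on its own is therefore no evidence of a real difference.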
What statistical significance means
Statistical significance answers one question: "If A and B were actually identical, how likely is it we would see a gap this big just by chance?" If that chance is very small (commonly under 5%), we say the result is significant — we are confident the difference is real, not luck.
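The flip side is confidence, usually set at 95%. A "95% confidence" result means: if nothing were really different, we would see a gap this large less than 5% of the time. That makes luck an unlikely explanation, so we accept B as the winner. Strictly speaking, this is not the same as being 95% sure that B beats A, but it is the standard bar for acting on a test. Below that bar, we treat the test as "no clear winner yet".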
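The question "how likely is a gap this big just by chance?" can be answered directly by simulation. The sketch below assumes a hypothetical test with 1,000 visitors per version and 40 vs 50 conversions. It pretends both versions share the same true rate (4.5%, the average of the two) and counts how often they still end up at least 10 conversions apart. The function name and all numbers are assumptions for illustration only.

```python
import random

def simulated_chance_of_gap(visitors, shared_rate, observed_gap, trials=1_000, seed=7):
    """Estimate how often two identical versions differ by at least observed_gap conversions."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Both versions convert at the same true rate: any gap is pure luck.
        conv_a = sum(rng.random() < shared_rate for _ in range(visitors))
        conv_b = sum(rng.random() < shared_rate for _ in range(visitors))
        if abs(conv_a - conv_b) >= observed_gap:
            hits += 1
    return hits / trials

# Hypothetical test: 1,000 visitors each, 40 vs 50 conversions (a gap of 10).
chance = simulated_chance_of_gap(visitors=1_000, shared_rate=0.045, observed_gap=10)
print(f"Chance of a gap this big with no real difference: {chance:.0%}")
print(f"Confidence: {1 - chance:.0%}")
```

With these inputs, luck alone produces a gap this large in roughly three runs out of ten, so the confidence sits far below the 95% bar. Online calculators reach a similar answer with a formula instead of a simulation.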
Why sample size is everything
The single biggest factor is how many people saw each version: the sample size. Small samples are noisy; big samples are trustworthy. Watch the same 1-point gap (4% vs 5%) become believable as the numbers grow:
| Visitors per version | A converts | B converts | Verdict |
|---|---|---|---|
| 100 | 4 (4%) | 5 (5%) | Meaningless — could easily be luck |
| 1,000 | 40 (4%) | 50 (5%) | Promising, but still short of 95% confidence |
| 10,000 | 400 (4%) | 500 (5%) | Significant — a real, trustworthy win |
The conversion rates are identical in all three rows (4% vs 5%). What changes is your confidence: 4 sales vs 5 sales proves nothing, but 400 vs 500 over 20,000 visitors is a result you can bank on.
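To see why the verdict changes while the rates stay the same, it helps to look at the noise level. The sketch below uses a pooled two-proportion z-test, which is one common way to do this check and is an assumption here, since the lesson does not name a specific method. The z-score is the gap divided by its standard error, and a z-score of about 1.96 or more corresponds to 95% confidence on a two-sided test.

```python
import math

def z_score(visitors, conv_a, conv_b):
    """z-score of the gap between two versions with the same number of visitors."""
    rate_a = conv_a / visitors
    rate_b = conv_b / visitors
    pooled = (conv_a + conv_b) / (2 * visitors)  # combined conversion rate
    standard_error = math.sqrt(pooled * (1 - pooled) * 2 / visitors)
    return (rate_b - rate_a) / standard_error

# The three rows of the table: same 4% vs 5%, growing sample size.
for visitors, conv_a, conv_b in [(100, 4, 5), (1_000, 40, 50), (10_000, 400, 500)]:
    z = z_score(visitors, conv_a, conv_b)
    verdict = "significant" if abs(z) >= 1.96 else "not significant"
    print(f"{visitors:>6} visitors: z = {z:.2f} -> {verdict}")
```

The z-scores come out at about 0.34, 1.08, and 3.41. The gap never changes, but the standard error shrinks as the sample grows, so only the 10,000-visitor row clears the 1.96 bar.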
A worked example
The clothing store tests a new checkout button. Here is how to judge it properly rather than just eyeballing the bigger number:
Version A: 5,000 visitors, 150 purchases -> 3.0%
Version B: 5,000 visitors, 200 purchases -> 4.0%
Step 1 Is the gap big? 4.0% vs 3.0% = +1.0 point (a 33% lift)
Step 2 Is the sample big? 5,000 each, 150-200 conversions -> yes
Step 3 Significance check (calculator): 99% confidence
Step 4 Verdict: significant -> roll out B
Note: The gap is large, each version had thousands of visitors and hundreds of conversions, and a significance calculator reports 99% confidence (above the 95% bar). All three boxes are ticked, so this is a real win: ship version B with confidence.
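The checkout-button numbers above can be checked with a few lines of code. This is a minimal sketch, assuming the calculator uses a two-sided pooled two-proportion z-test. Real calculators may use a one-sided test or a different method and report a slightly different figure. The function name `confidence_level` is made up for this example.

```python
import math

def confidence_level(visitors_a, conv_a, visitors_b, conv_b):
    """Two-sided confidence (1 - p-value) from a pooled two-proportion z-test."""
    rate_a = conv_a / visitors_a
    rate_b = conv_b / visitors_b
    pooled = (conv_a + conv_b) / (visitors_a + visitors_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return 1 - p_value

rate_a = 150 / 5_000
rate_b = 200 / 5_000
lift = (rate_b - rate_a) / rate_a
conf = confidence_level(5_000, 150, 5_000, 200)

print(f"A: {rate_a:.1%}  B: {rate_b:.1%}  lift: {lift:.0%}")
print(f"Confidence: {conf:.0%}")
```

Under these assumptions the confidence comes out just above 99%, in line with the calculator result in Step 3.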
How you actually check significance
You do not compute the maths by hand. You type four numbers — visitors and conversions for A and B — into a free A/B test significance calculator (many exist online) and it tells you the confidence percentage. Your job is to know what it means and to refuse to ship anything below 95%.
- Run the test long enough to gather a healthy sample (hundreds of conversions per version, usually 1 to 2 weeks).
- Enter visitors and conversions for A and B into a significance calculator.
- Read the confidence level. If it is 95% or higher, treat the winner as real.
- If it is below 95%, either keep running to gather more data, or accept there is no clear difference.
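The checklist above can be written down as a small decision rule. The sketch below is hypothetical: the function name, the returned messages, and the `min_conversions=100` threshold (one reading of "hundreds of conversions per version") are assumptions. The confidence value is whatever your significance calculator reported.

```python
def ship_decision(confidence, conversions_a, conversions_b,
                  min_confidence=0.95, min_conversions=100):
    """Turn a calculator's confidence and the conversion counts into a next step."""
    if min(conversions_a, conversions_b) < min_conversions:
        # Too little data to trust any confidence figure yet.
        return "keep running: sample too small"
    if confidence >= min_confidence:
        return "ship the winner"
    return "no clear winner: keep running or call it a tie"

print(ship_decision(0.99, 150, 200))
print(ship_decision(0.88, 300, 330))
print(ship_decision(0.97, 12, 20))
```

Checking the sample size first is a deliberate choice in this sketch: a high confidence figure from a handful of conversions is exactly the kind of lucky early lead the lesson warns about.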
The mistakes significance protects you from
| Tempting mistake | What significance tells you instead |
|---|---|
| "B is ahead after 2 days — ship it!" | Too few people yet; the lead may vanish |
| "102 vs 100 conversions, B wins" | A coin-flip difference, not significant |
| "We peeked and B was up, so we stopped" | Stopping early on a lucky peak fakes a win |
Tip: Decide your sample size and end date before you start the test, and do not stop the moment B pulls ahead. "Peeking" and stopping at a lucky high point is the most common way teams fool themselves into shipping a fake winner.
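A simulation makes the cost of peeking visible. The sketch below runs many hypothetical tests in which A and B are truly identical, so every "win" is fake. It compares two habits: checking once at the planned end date, and checking every day and stopping the first time the result looks significant. The traffic numbers, the 10% conversion rate, and the z-test used for the check are all assumptions for illustration.

```python
import math
import random

def is_significant(visitors, conv_a, conv_b, z_threshold=1.96):
    """Pooled two-proportion z-test at 95% (two-sided), equal visitors per version."""
    pooled = (conv_a + conv_b) / (2 * visitors)
    if pooled in (0, 1):
        return False  # no variation yet, nothing to test
    standard_error = math.sqrt(pooled * (1 - pooled) * 2 / visitors)
    return abs(conv_b - conv_a) / visitors / standard_error >= z_threshold

def false_win_rates(tests=400, days=14, daily_visitors=200, rate=0.10, seed=3):
    """Return (fake-win rate when peeking daily, fake-win rate when checking once)."""
    rng = random.Random(seed)
    peeking_wins = 0
    planned_wins = 0
    for _ in range(tests):
        conv_a = conv_b = 0
        significant_on_any_day = False
        for day in range(1, days + 1):
            # A and B share the same true rate, so any lead is luck.
            conv_a += sum(rng.random() < rate for _ in range(daily_visitors))
            conv_b += sum(rng.random() < rate for _ in range(daily_visitors))
            significant_today = is_significant(day * daily_visitors, conv_a, conv_b)
            significant_on_any_day = significant_on_any_day or significant_today
        peeking_wins += significant_on_any_day  # would have stopped on some day
        planned_wins += significant_today       # result on the final day only
    return peeking_wins / tests, planned_wins / tests

peeking, planned = false_win_rates()
print(f"Fake winners when checking once at the end: {planned:.0%}")
print(f"Fake winners when peeking every day:        {peeking:.0%}")
```

Checking once keeps fake winners near the expected 5%. Peeking every day gives luck fourteen chances to cross the line, so the fake-win rate climbs well above that.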
Watch out: A result below 95% confidence is not a "small win" — it is no proven win at all. Shipping it is a guess dressed up as data. Either gather more data or call it a tie and move on to a bolder test.
Q. Version A gets 100 conversions and version B gets 103, each from a few hundred visitors. A significance calculator reports 60% confidence. What should you do?
✍️ Practice
- Version A: 8,000 visitors, 320 conversions. Version B: 8,000 visitors, 360 conversions. Calculate both conversion rates and the percentage lift.
- An A/B test shows B ahead at 88% confidence after one week. Write one sentence on whether you would ship it and why.
🏠 Homework
- Find a free online A/B significance calculator. Invent visitor and conversion numbers for two versions, enter them, and write down the confidence level and your ship / do-not-ship decision. Then change the numbers to larger samples and note how the confidence changes.