A/B Testing Sample Size: How to Calculate What You Need
Master the art of calculating sample sizes for A/B tests. Learn the formulas, use practical examples, and avoid common pitfalls.
Why Getting Sample Size Right Is Non-Negotiable
Imagine flipping a coin twice, getting heads both times, and concluding the coin is rigged. That is essentially what happens when you run an A/B test without proper sample size calculations. You end up making confident decisions based on insufficient evidence.
Getting sample size right is not just a statistical nicety. It is the difference between reliable insights and expensive guesswork.
The Four Pillars of Sample Size Calculation
Before diving into formulas, you need to understand the building blocks:
Statistical Significance (The Alpha Level)
This is your tolerance for false positives. The industry standard is 95% confidence, which translates to an alpha of 0.05. In plain English: you are accepting a 5% chance of seeing a difference when none actually exists.
Should you aim higher? Sometimes. For high-stakes decisions like pricing changes, you might want 99% confidence. For lower-risk tests, 90% might suffice.
Statistical Power
Power measures your ability to detect real effects. The standard is 80%, meaning that if your variation truly delivers at least the improvement you are looking for, your test has an 80% chance of flagging it as statistically significant.
Why not 100%? Because higher power demands rapidly growing sample sizes, and pushing toward certainty sends requirements toward infinity. The 80% threshold represents a practical balance between certainty and feasibility.
Minimum Detectable Effect (MDE)
This is where strategy meets statistics. MDE represents the smallest improvement you consider worth detecting.
Want to catch a 5% improvement? You will need a massive sample. Only willing to detect improvements of 25% or more? Your sample requirements shrink dramatically.
Choose your MDE based on business impact. If a 5% improvement translates to millions in revenue, it is worth the traffic investment to detect it reliably.
Baseline Conversion Rate
Your current conversion rate directly influences sample size requirements. Higher baseline rates generally require smaller samples because there is more signal relative to noise.
This number must come from reliable historical data. Using a rough estimate here undermines everything that follows.
The Formula (And a Simpler Alternative)
For the statistically curious, here is the full formula for a two-tailed test at 95% confidence and 80% power:
n = 2 × (1.96 + 0.84)² × p × (1 - p) / delta²
Where:
n = sample size per variation
1.96 = z-score for 95% confidence
0.84 = z-score for 80% power
p = baseline conversion rate
delta = absolute MDE (baseline multiplied by relative MDE)
For those who prefer practicality over precision, this simplified version gets you close:
n = 16 × p × (1 - p) / delta²
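If you would rather let code do the arithmetic, here is a minimal Python sketch of both formulas. The function names and defaults are ours, not from any particular library; swap in your own baseline rate and relative MDE.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variation(baseline_rate, relative_mde, alpha=0.05, power=0.80):
    """Full formula: visitors per variation for a two-tailed test on a conversion rate."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # 1.96 at 95% confidence
    z_beta = NormalDist().inv_cdf(power)            # 0.84 at 80% power
    p = baseline_rate
    delta = p * relative_mde                        # absolute MDE
    return ceil(2 * (z_alpha + z_beta) ** 2 * p * (1 - p) / delta ** 2)

def sample_size_simplified(baseline_rate, relative_mde):
    """Rule-of-thumb version: n = 16 * p * (1 - p) / delta^2."""
    p = baseline_rate
    delta = p * relative_mde
    return ceil(16 * p * (1 - p) / delta ** 2)
```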
Real-World Calculations
Let us work through three scenarios you might actually encounter.
Scenario One: Optimizing E-commerce Checkout
Your checkout converts at 3%, and you want to detect a 15% relative improvement (meaning a lift from 3% to 3.45%).
The absolute MDE is 0.45% (or 0.0045 as a decimal).
Plugging into our simplified formula:
n = 16 × 0.03 × 0.97 / 0.0045²
n = 16 × 0.0291 / 0.00002025
n ≈ 23,000 per variation
Bottom line: You need roughly 46,000 total visitors to run this test reliably.
Scenario Two: SaaS Free Trial Optimization
Your trial signup page converts at 8%, and you are looking for a 20% relative lift (from 8% to 9.6%).
The absolute MDE is 1.6% (or 0.016).
n = 16 × 0.08 × 0.92 / 0.016²
n = 16 × 0.0736 / 0.000256
n ≈ 4,600 per variation
Bottom line: About 9,200 total visitors gets you there.
Scenario Three: Newsletter Signup
Your newsletter form converts at 2%, and you want to detect a 25% relative improvement (from 2% to 2.5%).
The absolute MDE is 0.5% (or 0.005).
n = 16 × 0.02 × 0.98 / 0.005²
n = 16 × 0.0196 / 0.000025
n ≈ 12,500 per variation
Bottom line: Plan for 25,000 total visitors.
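As a sanity check, the simplified helper sketched earlier reproduces all three scenarios. The results are per variation and match the rounded figures above:

```python
print(sample_size_simplified(0.03, 0.15))  # 22993 -> roughly 23,000 (checkout)
print(sample_size_simplified(0.08, 0.20))  # 4600 (free trial)
print(sample_size_simplified(0.02, 0.25))  # 12544 -> roughly 12,500 (newsletter)
```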
Quick Reference: Sample Sizes at a Glance
This table shows the total visitors needed across both variations (divide by two for the per-variation number) at 95% confidence and 80% power:
| Baseline Rate | 10% Relative Lift | 20% Relative Lift | 30% Relative Lift |
|---|---|---|---|
| 1% | 315,000 | 79,000 | 35,000 |
| 3% | 103,000 | 26,000 | 11,500 |
| 5% | 61,000 | 15,000 | 6,800 |
| 10% | 29,000 | 7,200 | 3,200 |
| 20% | 13,000 | 3,200 | 1,400 |
Notice the pattern: higher baseline rates and larger expected lifts dramatically reduce your sample requirements.
The Most Expensive Mistakes in A/B Testing
The Peeking Problem
You launch a test on Monday. By Wednesday, results look promising. By Thursday, you are tempted to call it. Do not.
Every time you check results and make a decision, you inflate your false positive rate. A test showing 95% significance on day three might stabilize at 60% by day fourteen. The "winner" you declared prematurely could easily be a loser.
If you absolutely must peek, use sequential testing methods designed for ongoing analysis. Otherwise, commit to your predetermined sample size and resist the temptation.
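If you would rather see the inflation than take it on faith, here is a small simulation sketch. The traffic numbers and seed are made up, and the test is an A/A comparison where both variations are identical, so every "winner" it declares is a false positive.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(42)

def p_value(successes_a, n_a, successes_b, n_b):
    """Two-tailed z-test p-value for the difference of two proportions."""
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 1.0
    z = abs(successes_a / n_a - successes_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(z))

TRUE_RATE = 0.05        # both variations convert identically
DAILY_VISITORS = 1000   # hypothetical traffic per variation per day
DAYS = 14
RUNS = 2000

stopped_early = 0
for _ in range(RUNS):
    a = b = 0
    for day in range(DAYS):
        a += rng.binomial(DAILY_VISITORS, TRUE_RATE)
        b += rng.binomial(DAILY_VISITORS, TRUE_RATE)
        n = (day + 1) * DAILY_VISITORS
        if p_value(a, n, b, n) < 0.05:   # peek and "call" the test
            stopped_early += 1
            break

print(f"False positives with daily peeking: {stopped_early / RUNS:.1%}")
# The nominal rate is 5%; with 14 daily looks it typically lands well above that.
```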
Using Contaminated Baselines
Your baseline conversion rate needs to come from the same traffic source, measured over a similar time period, with seasonal factors accounted for.
Using last year's Black Friday conversion rate as your baseline for a February test? That is a recipe for miscalculated sample sizes and unreliable results.
The Multiple Testing Trap
Running three tests simultaneously sounds efficient. But without proper corrections, you are dramatically increasing your false positive rate.
If you run five tests at 95% confidence, you have roughly a 23% chance that at least one will show a false positive. Statistical corrections like Bonferroni adjustments exist for this reason, but they require even larger sample sizes.
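The arithmetic behind that 23% figure, plus the Bonferroni adjustment, fits in a few lines:

```python
alpha = 0.05
for k in (1, 3, 5, 10):
    # Chance that at least one of k independent tests shows a false positive
    print(k, round(1 - (1 - alpha) ** k, 3))   # 0.05, 0.143, 0.226, 0.401

k = 5
print(alpha / k)  # Bonferroni: run each of the 5 tests at 0.01 instead of 0.05
```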
Forgetting That Segments Need Their Own Math
Planning to analyze results for mobile users aged 25-34? That segment needs its own sample size calculation.
Your site might get 100,000 monthly visitors, but if your target segment is 5% of that traffic, your effective sample is just 5,000. Many tests become infeasible once you account for the actual audience you want to study.
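A quick back-of-the-envelope sketch, with hypothetical numbers, shows how fast a segment analysis stretches a test's timeline:

```python
from math import ceil

monthly_visitors = 100_000   # hypothetical site traffic
segment_share = 0.05         # e.g. mobile users aged 25-34
needed_total = 46_000        # both variations, from your sample size calculation

segment_per_month = monthly_visitors * segment_share      # 5,000 visitors/month
months = needed_total / segment_per_month                  # 9.2
print(f"~{ceil(months)} months of traffic needed for this segment")   # ~10 months
```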
Advanced Decisions Worth Understanding
One-Tailed Versus Two-Tailed Tests
Two-tailed tests (the default) ask whether variation B is different from A. It could be better or worse. This is the more conservative approach and the industry standard.
One-tailed tests ask only whether B is better than A. They require about 20% less sample but carry a crucial assumption: you are certain B cannot perform worse. In practice, this certainty is rare.
Stick with two-tailed unless you have a compelling reason not to.
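You can verify the "about 20% less sample" claim directly: sample size scales with the square of the summed z-scores, so the ratio of the two variants tells the whole story.

```python
from statistics import NormalDist

z_beta = NormalDist().inv_cdf(0.80)          # 0.84 for 80% power
z_two = NormalDist().inv_cdf(1 - 0.05 / 2)   # 1.96, two-tailed at 95% confidence
z_one = NormalDist().inv_cdf(1 - 0.05)       # 1.645, one-tailed

print(round((z_one + z_beta) ** 2 / (z_two + z_beta) ** 2, 2))  # 0.79 -> ~21% smaller sample
```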
Frequentist Versus Bayesian Approaches
Frequentist testing (the traditional approach) requires calculating sample size upfront and running the test to completion. It provides clear stopping rules and is well-understood across the industry.
Bayesian testing allows for more flexibility. You can legitimately stop when confidence reaches a predetermined threshold. But it requires deeper statistical expertise and is easier to misapply.
Most organizations should start with frequentist methods and explore Bayesian approaches as their testing program matures.
Recommended Tools and Calculators
Free and Reliable
Evan Miller's Calculator is simple, accurate, and has been the go-to resource for years.
Optimizely's Calculator offers a friendlier interface with helpful explanations.
AB Test Guide Calculator adds test duration estimates based on your traffic.
Built Into Testing Platforms
Most A/B testing platforms include sample size calculators. Google Optimize, VWO, Optimizely, and Convert all provide this functionality. Use them, but verify their assumptions match yours.
When Sample Size Requirements Seem Impossible
Sometimes the math tells you that you need 500,000 visitors for the test you have in mind. Here are your options:
Accept a Larger MDE
Instead of detecting 5% improvements, target 15% or 20%. This shifts your focus toward bolder, higher-impact tests. Sometimes that is exactly what you should be testing anyway.
Lower Your Confidence Threshold
Dropping from 95% to 90% confidence reduces sample requirements. This trade-off makes sense for exploratory tests where the stakes are lower. For revenue-critical decisions, stick with 95%.
Reduce Statistical Power
Moving from 80% to 70% power means accepting a higher chance of missing real effects. This is acceptable for initial screening tests where you plan follow-up validation.
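To see how much each lever buys you, here is a quick comparison using the sample_size_per_variation helper sketched earlier, with a 3% baseline assumed purely for illustration:

```python
baseline = 0.03
print(sample_size_per_variation(baseline, 0.05))              # 5% MDE:  ~203,000 per variation
print(sample_size_per_variation(baseline, 0.20))              # 20% MDE: ~12,700
print(sample_size_per_variation(baseline, 0.20, alpha=0.10))  # 90% confidence: ~10,000
print(sample_size_per_variation(baseline, 0.20, power=0.70))  # 70% power: ~10,000
```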
Test on Higher-Traffic Pages
That product page you wanted to test gets 500 visitors per month. Your homepage gets 50,000. Sometimes the right move is to test where the traffic is.
The Takeaway
Sample size calculations are not optional. They are the foundation of trustworthy A/B testing. Before you launch any test, know the answer to these questions:
- What is my baseline conversion rate? Use reliable, recent data.
- What is the smallest improvement worth detecting? Be realistic about business impact.
- How many visitors do I need per variation? Do the math or use a calculator.
- How long will that take at my current traffic level? If it is six months, reconsider your approach.
When results seem too good to be true after three days, they probably are. When in doubt, err on the side of larger samples. An inconclusive test is frustrating, but a wrong conclusion is expensive.
Related Posts
How Much Monthly Traffic Do You Need for A/B Testing?
Learn the minimum traffic requirements for statistically significant A/B tests and how to optimize testing with lower traffic.
Where to Start with A/B Tests: A Practical Guide for Beginners
Not sure where to begin with A/B testing? This guide walks you through prioritizing tests, setting up your first experiment, and avoiding beginner mistakes.