Is Your Test Significant?
Drop in your control + variant numbers. Get conversion rates, lift, z-score, p-value, and the sample size you need for 80% power.
What this calculator does
This tool answers two questions in one screen. First: given my control and variant conversion data, is the observed lift statistically significant or is it noise? Second: given my baseline conversion rate and the minimum effect I'd care about, how many users per arm do I need to run a properly powered test? Both questions get the same standard frequentist treatment used by Optimizely, VWO, Google Optimize (RIP), and most in-house experimentation platforms.
The math behind it
Significance: a two-proportion z-test. The z-score measures how many standard errors separate the two conversion rates; the p-value is the probability of seeing a gap that large (or larger) if the variants were actually identical. Convention: p < 0.05 (95% confidence) is the publish threshold.
Sample size: derived from the baseline rate, the minimum detectable effect (MDE), and the desired statistical power (typically 80%) using Cohen's sample-size formula.
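To make the significance half concrete, here is a minimal Python sketch of a two-proportion z-test. It assumes the textbook pooled standard error and a two-sided p-value; the calculator's exact variant isn't stated here, and the function name and the returned keys are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the gap between two conversion rates.

    Hypothetical helper: assumes both arms have users and that the
    control arm has at least one conversion.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate: the single shared rate assumed if the variants are identical.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value: chance of a gap at least this large in either direction.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return {
        "rate_a": rate_a,
        "rate_b": rate_b,
        "lift": (rate_b - rate_a) / rate_a,  # relative lift over control
        "z": z,
        "p_value": p_value,
    }


# Control: 200 conversions from 10,000 users. Variant: 250 from 10,000.
result = two_proportion_z_test(200, 10_000, 250, 10_000)
print(result)  # z is about 2.38, p-value about 0.017, relative lift 25%
```

In this example the p-value clears the 0.05 bar, so the 25% lift would count as significant, provided the sample size was fixed before the test started.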
Common mistakes the calculator helps avoid
Peeking. If you check significance every day and stop the test the moment p < 0.05, you'll declare wins that don't exist. The reported p-value assumes a single check at a pre-committed sample size. Use the sample-size estimator first, then don't look until you hit it.
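If you want to see the peeking problem for yourself, a small simulation helps. The sketch below runs A/A tests (both arms share the same true rate, so every "win" is a false positive) and compares stopping at the first significant look against a single check at the end. The run count, number of looks, traffic per look, and 10% rate are arbitrary assumptions chosen to keep it fast.

```python
import random
from math import sqrt
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided 5% threshold, about 1.96


def is_significant(conv_a, conv_b, n):
    """Pooled two-proportion z-test with n users in each arm."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled in (0, 1):
        return False  # no variance yet, so nothing to test
    std_err = sqrt(pooled * (1 - pooled) * 2 / n)
    return abs(conv_b - conv_a) / n / std_err > Z_CRIT


def simulate_aa_tests(runs=500, looks=10, users_per_look=200, rate=0.10, seed=1):
    """Return (false-positive rate with peeking, false-positive rate at one final look)."""
    rng = random.Random(seed)
    wins_with_peeking = 0
    wins_at_final_look = 0
    for _ in range(runs):
        conv_a = conv_b = n = 0
        ever_significant = False
        for _ in range(looks):
            # Add one "day" of traffic to each arm, then peek.
            conv_a += sum(rng.random() < rate for _ in range(users_per_look))
            conv_b += sum(rng.random() < rate for _ in range(users_per_look))
            n += users_per_look
            significant_now = is_significant(conv_a, conv_b, n)
            ever_significant = ever_significant or significant_now
        wins_with_peeking += ever_significant      # stopped at any significant look
        wins_at_final_look += significant_now      # checked only once, at the end
    return wins_with_peeking / runs, wins_at_final_look / runs


peeking_rate, single_look_rate = simulate_aa_tests()
print(f"peeking: {peeking_rate:.1%}, single look: {single_look_rate:.1%}")
```

The single-look rate should sit near the nominal 5%, while the peeking rate typically lands several times higher, even though nothing about the variants differs.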
Under-powered tests. A 1% baseline conversion rate with a 10% relative MDE needs roughly 163,000 users per arm for 80% power at 95% confidence; even a 5% baseline needs about 31,000. Most landing-page tests are run with 2-3K users per arm and declared inconclusive; the test was simply too small to find the effect even if it existed.
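The per-arm requirement can be estimated with a few lines of Python. This sketch uses one common normal-approximation formula for a two-sided test on two proportions; that choice is an assumption, and the calculator's own formula may differ slightly. The function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate users needed in each arm (normal approximation, two-sided)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)  # rate you want to be able to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for 95% confidence
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


print(sample_size_per_arm(0.01, 0.10))  # 1% baseline, 10% relative MDE
print(sample_size_per_arm(0.05, 0.10))  # 5% baseline, 10% relative MDE
```

Under these assumptions, a 1% baseline with a 10% relative MDE comes out at roughly 163,000 users per arm, and a 5% baseline at roughly 31,000. Required traffic rises sharply as the baseline rate or the MDE shrinks, which is why small tests so often come back inconclusive.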
Multiple-comparison inflation. Testing five variants against a control at 95% confidence each gives you a ~23% chance of at least one false positive (1 - 0.95^5). Bonferroni-correct or pick one variant.
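The inflation is simple arithmetic, sketched below. It treats the comparisons as independent, which is a simplification when every variant shares one control; both function names are illustrative.

```python
def familywise_error_rate(num_comparisons, alpha=0.05):
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** num_comparisons


def bonferroni_alpha(num_comparisons, alpha=0.05):
    """Per-comparison threshold that keeps the overall rate at or below alpha."""
    return alpha / num_comparisons


uncorrected = familywise_error_rate(5)                   # about 0.226
per_test = bonferroni_alpha(5)                           # 0.05 / 5, i.e. 0.01
corrected = familywise_error_rate(5, per_test)           # about 0.049
print(f"{uncorrected:.3f} {per_test:.3f} {corrected:.3f}")
```

With the Bonferroni threshold, each variant has to reach p < 0.01 rather than p < 0.05, which also raises the sample size each arm needs.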
When to run an A/B test (and when not to)
A/B tests pay off when (a) you have enough traffic for the math to work out in under a month, (b) the change is reversible if the test fails, and (c) the cost of being wrong is meaningful. Skip the test when traffic is below a few thousand conversions per month total: you'll get a faster signal by talking to users than by running an under-powered test for 8 weeks. For early-stage product decisions, the Lean Startup validated-learning loop beats A/B testing every time.
Two further references
For a longer treatment of the statistics, see Evan Miller's A/B testing essays, the most-cited practitioner reference on the topic. For Bayesian alternatives to the frequentist approach this calculator uses, see Split.io's framework.
Embed this calculator
Add this to your blog, course, or internal docs. It's free to use; the attribution link back to gtm-labs.co stays in place.