How to Know if Your Split Test is Valid (Hint: Statistics Can Lie)

Did you know that conversion optimization can actually decrease your profit?

When you run tests, it’s easier than you might think to misinterpret the results and make faulty conclusions.

The “winner” that you pick might end up being the long-term loser if you’re not careful.

Equally bad, if the lessons you take from that test are incorrect, you will multiply your losses as you apply tactics based on those bad conclusions elsewhere in your business.

If you’ve never taken an introductory statistics course, don’t worry — I have you covered. We’ll also go over the ways statistics in your split tests can be deceiving.

Stats 101: A Basic Crash Course

If you never had to take Stats 101 at college or university, you missed out on some exciting stuff… um, not really.

However, there are still a few things from that course you need to know before split testing will make any real sense.

I’m going to go over these concepts quickly now so you’re not confused in the future. If you are already a stats pro, just skip down to the next section.

What in the World Is a Confidence Interval?

Whether you use a conversion tool like Optimizely, or a simple web app like IsValid, you’ll notice that conversion rates always come with a range.

For example, check out this screen shot of a sample test:

The conversion rate is currently 4.3%, but there’s a range underneath it from 0.6% to 8.0%. This range is the confidence interval: based on the data collected so far, the true long-run conversion rate could plausibly fall anywhere in it.

Now, this doesn’t mean that the far ends (0.6% or 8.0%) are likely; it just means that they are possible.
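To make the idea of a range concrete, here is a minimal sketch of one common way to estimate it, the normal approximation. The function name and the visitor counts are made up for illustration, and your testing tool may well use a different method.

```python
import math

def conversion_interval(conversions, visitors, z=1.96):
    """Approximate 95% confidence interval for a conversion rate.

    Uses the normal approximation; z=1.96 corresponds to 95% confidence.
    """
    rate = conversions / visitors
    std_err = math.sqrt(rate * (1 - rate) / visitors)
    low = max(0.0, rate - z * std_err)   # a rate cannot go below 0%
    high = min(1.0, rate + z * std_err)  # or above 100%
    return rate, low, high

# Hypothetical counts, not taken from the screenshot above
rate, low, high = conversion_interval(conversions=6, visitors=140)
print(f"{rate:.1%} (range {low:.1%} to {high:.1%})")
```

With only a handful of conversions the range is wide. Feed in larger counts with the same rate and it narrows.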

How Important is Significance?

Statistically significant — have you heard that term before?

The significance of a test tells us how confident we should be that we have the right result when we are picking from 2 or more options.

When you run a basic A/B test, you’ll have a confidence interval for each option. In many cases, these 2 confidence intervals will overlap.

Take the example below. The original could have a conversion rate of up to 5.6%, while the variation (the current winner) could have a conversion rate as low as 0.6%.

Does this mean the current results are useless? No, not at all.

But it means that we need to calculate the significance of the test in order to determine how confident we can be when we choose the variation as the winner.

According to the tool, the significance is currently 91.1%. Roughly speaking, this means we can be about 91.1% confident that the variation is the better-performing option. However, that leaves about an 8.9% chance that the original is actually the best.

In practice, tests are typically run until a significance of 95% or higher is achieved. Even at 95%, roughly 1 in 20 tests will end up with you picking the worst option. While it would be ideal to test everything to a 99%+ significance level, it’s not always possible due to traffic or time limitations.

A note on significance: If you can only get to 95% significance in most tests, that’s not ideal, but it’s okay. Just understand that not every lesson you learn is going to be correct, and that you should expect a conflicting result once in a while.
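If you’re curious how a number like that can be produced, here is a rough sketch using a pooled two-proportion z-test. This is one common textbook approach, not necessarily what Optimizely or IsValid do internally, and the function names and visitor counts are hypothetical.

```python
import math

def normal_cdf(x):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def significance(conv_a, n_a, conv_b, n_b):
    """One-sided confidence that option B truly converts better than A.

    Pooled two-proportion z-test; returns a value between 0 and 1.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    return normal_cdf(z)

# Hypothetical test: original converts 40 of 1,000, variation 55 of 1,000
confidence = significance(conv_a=40, n_a=1000, conv_b=55, n_b=1000)
print(f"Significance: {confidence:.1%}")
print("Reached 95%?", confidence >= 0.95)
```

Two options with identical results come out at 50%, which is another way of saying the test has told you nothing yet.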

A Critical Variable: Sample Size

Flip a coin 10 times, and you’re fairly likely to get lopsided results, like 3 heads (30%) and 7 tails (70%), even though in theory, they should be split 50/50.

Flip that coin 100 times, and you’ll get closer to the real probability, something like 48 heads and 52 tails.

See where I’m going with this?

The larger the sample size you have when running a test, the more accurate the results are.
Your sample size is one of the most important factors in determining the significance of a test.
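The coin example can be put into numbers. This small sketch (the function name is my own, and it assumes the same normal approximation that many calculators use) shows how far an observed percentage can reasonably stray from the true 50% at different sample sizes.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for an observed proportion."""
    return z * math.sqrt(p * (1 - p) / n)

for flips in (10, 100, 1000, 10000):
    moe = margin_of_error(0.5, flips)
    print(f"{flips:>6} flips: 50% +/- {moe:.1%}")
```

The margin shrinks from roughly 31 percentage points at 10 flips to roughly 10 points at 100 flips, which is why 3 heads out of 10 is nothing unusual. Note that you need 100 times the sample to cut the margin by a factor of 10.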

There are plenty of simple sample-size calculators out there that you can use for free. Just about every conversion optimization tool has a calculator built in as well.

Here’s a look at Optimizely’s free web calculator:

In this case, you’d need to run the test until you had 10,170 samples (views) for each option.
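For a feel of what such a calculator does under the hood, here is a sketch of a standard textbook formula for comparing two conversion rates. It is an assumption on my part that this resembles any particular tool; it will not necessarily reproduce Optimizely’s figure, and the baseline rate, lift, and parameter names below are hypothetical.

```python
import math
from statistics import NormalDist

def sample_size_per_variation(baseline, relative_lift, alpha=0.05, power=0.8):
    """Visitors needed per option to detect a given relative lift.

    baseline:      current conversion rate, e.g. 0.03 for 3%
    relative_lift: smallest improvement worth detecting, e.g. 0.2 for +20%
    alpha:         1 - significance level (0.05 means 95% significance)
    power:         chance of detecting the lift if it really exists
    """
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)

# Hypothetical inputs: 3% baseline, looking for a 20% relative improvement
print(sample_size_per_variation(baseline=0.03, relative_lift=0.2))
```

The smaller the improvement you want to detect, or the higher the significance you demand, the more visitors each option needs.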

So that’s Stats 101 in about 5 minutes. Let’s move on to determining if your split test results are actually valid.

Sample Size is Not Always Accurate — Here’s Why…

Here’s what most business owners do when split testing:

1. Calculate the required sample size
2. Run the test until it reaches that sample size
3. Pick a winner from the results

That doesn’t seem crazy, does it?

But there are some serious flaws that could have negative effects on your bottom line.

You MUST Segment Your Traffic

Segmenting simply means to divide something up.

In the case of web traffic, you can segment in three main ways:

By source: Traffic comes from different places. Google, Bing, social media, email links and more. Visitors from different sources of traffic tend to behave and convert differently.
By behavior: Did they come to the test page from a certain page on your site? Do some of your visitors read 3+ pages on your site on their first visit or visit at least 5 times a month?
By outcome (conversion): Which visitors convert the best? If applicable, which of those later buy your more expensive service or product?

There are occasions where you can segment by 2 or 3 of the above types all at once. It just depends on how detailed you’d like to get.

Getting back to testing validity, the point is that your results can be invalid if you do not pay attention to segments.

Example time…

Your sample size calculator says you need 10,000 visitors for each variation. You do that, and see that one side is the clear winner. However, after digging a bit deeper, you see that the winner had an extra 2,000 visitors from search engines (because of variance, just like flipping a coin). You find that search-engine visitors convert really well on your site, thus skewing the results.

After running the test until the amount of traffic from each traffic source levels out, you see that the original is actually the best — mistake avoided.
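Here is a small sketch of that scenario with made-up numbers (every count below is hypothetical). It compares the overall conversion rate of each option with the rate inside each traffic segment.

```python
# (visitors, conversions) per traffic source; all figures are invented
results = {
    "original":  {"search": (3000, 240), "other": (7000, 210)},
    "variation": {"search": (5000, 375), "other": (5000, 140)},
}

def overall_rate(segments):
    """Conversion rate across all segments combined."""
    visitors = sum(v for v, _ in segments.values())
    conversions = sum(c for _, c in segments.values())
    return conversions / visitors

def segment_rates(segments):
    """Conversion rate within each individual segment."""
    return {name: c / v for name, (v, c) in segments.items()}

for option, segments in results.items():
    per_segment = {k: f"{r:.1%}" for k, r in segment_rates(segments).items()}
    print(f"{option}: overall {overall_rate(segments):.2%}, by source {per_segment}")
```

In this made-up data the variation wins overall (5.15% vs. 4.5%) even though the original converts better within both segments. The variation simply received 2,000 more of the high-converting search visitors.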

You need to consider variation in the most important segments for your business and test before declaring a winner. It may be the sources of traffic, certain behaviors, country, or more. Learn how to segment your visitors with Google Analytics.

Sample Size Does Not Always Reflect a Business Cycle

Check out just about anyone’s analytics reports and what do you see? Massive variation in traffic numbers based on the day. Usually a peak during the week, followed by a massive dip on the weekend.

However, it’s not always just numbers. You get different visitors based on the day. If you dig a bit deeper, you’ll likely notice that the traffic you get from different sources also changes a bit from day to day.

A “business cycle” typically refers to one week for most businesses, although it may be different for yours. Whatever time period encompasses most types of your typical visitors is a business cycle.

What happens when your sample size calculator says to run a test for 10,000 impressions, and you have 20,000 visitors in one day?

You finish the test in one day. But this doesn’t take your business cycle into account. You might have a valid result for visitors on a Monday, but not necessarily for visitors overall.

This isn’t usually a problem unless you have very high traffic numbers. Nonetheless, be aware of it.

Takeaway:
Always run a test for at least 1 business cycle.
You can always test more variables (multivariate testing) if you have excess traffic.

Assessing the Validity of a Test

If you’re feeling a bit overwhelmed, don’t worry. We can simplify this process into 3 main steps:

1. Calculate Your Minimum Sample Size

Determine what level of confidence (significance) you’d like in your test’s results, and calculate a sample size based on it. This will be the minimum number of impressions/visitors that your variations need.

2. Check for Discrepancies in Segments

Before the test is complete, you should already know how to segment the visitors to your website. Once the minimum sample size is reached, dig deeper to determine whether there are any major discrepancies between segments. If so, keep the test running.

3. Assess Your Business Cycle

Your tests need to run for a whole number interval of business cycles. If the minimum sample size is up after half a cycle, or 1.75 cycles, keep it running until the next whole number (1 or 2 cycles respectively in this example).
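The rounding rule is simple enough to sketch in a few lines. The function name is hypothetical, and the 7-day default assumes the one-week business cycle described earlier; swap in your own cycle length.

```python
import math

def planned_duration_days(days_to_reach_sample, cycle_days=7):
    """Round a test's duration up to a whole number of business cycles."""
    cycles = max(1, math.ceil(days_to_reach_sample / cycle_days))
    return cycles * cycle_days

# Half a cycle (3.5 days) and 1.75 cycles (12.25 days), as in the example above
print(planned_duration_days(3.5))    # 7 days, i.e. 1 full cycle
print(planned_duration_days(12.25))  # 14 days, i.e. 2 full cycles
```

The `max(1, ...)` guard means that even a test that hits its sample size within hours still runs for one full cycle.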

That’s all there is to it: 3 fairly easy steps. Statistics are your friends, as long as you understand them.

Do you have any questions? Leave them below and I’ll answer them the best I can.

Read other Crazy Egg posts by Dale Cudmore
