5 common pitfalls in A/B testing

Written by Anton Nordström | Jun 10, 2024 10:00:00 AM

Going with your gut can only get you so far. When reality gets complicated, when variables pile up and experience isn't enough - well, then we have to rely on the scientific method.

Or, at least, as much of it as is applicable to us marketers. It often comes in a form we've all heard before: the 'A/B test'.

Armed with the A/B test, we can fight arbitrariness in decision-making and constantly improve our own work.

The problem arises if, like me, you're not a data analyst (or didn't have math as a favorite subject in school). Because it's easy to trust the system you're working in or a tool you've found - without understanding what you're doing and why.

If you get it wrong, it's even worse than if you hadn't tested at all, because now it looks like you have reliable data. In fact, you knew more before you started testing. Back then you at least knew with 100% certainty... what you didn't know.

Let me tell you about 5 common pitfalls in A/B testing. So you can avoid falling into them.

Pitfall 1: Sample size and timing in A/B testing

Just because I took two courses in statistics a few years ago doesn't mean I remember (let alone understand) statistical methods.

A common mistake I've made myself is this: not understanding what an A/B test is and, because of that, ending the experiment too early.

"+10% it says in green, great. Then we're done"

Or

"What?! We have to wait 2 MONTHS? Isn't two weeks enough?"

No, it's not!

What we do is called 'hypothesis testing', and the idea is this: to be able to attribute any change to the experiment rather than to pure chance.

To do this, we need a sufficiently large sample from a population - a random sample.

Its size is determined by three factors.

  1. Significance level: the risk of a false positive that we are willing to accept. The default is to set it to 0.05, or 5%.

  2. Power: how likely we are to identify the effect if it really exists. A common default is 0.8, or 80%.

  3. Minimum detectable effect (MDE): the smallest difference between the experimental group and the control group that we want to be able to measure.

And the relationship between them and the sample size looks like this:

Significance level decreases → Larger sample size

Power increases → Larger sample size

MDE decreases → Larger sample size

In summary, the more certain we want to be of the result and the smaller the differences we want to see, the larger the sample needs to be.

But how does this relate to ending A/B tests too early?

If we do the calculations and find that we need a sample of 10,000 people in total, 5,000 each in the control and experimental groups, then we can only be "sure" of the result when we reach this number.

For example, if I have a new onboarding journey I want to test against the old one, then 5,000 people need to go through the new journey and another 5,000 through the old one before I can confidently look at the numbers and say which performs best.
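If you prefer to do the calculation yourself, here is a minimal sketch in Python using statsmodels. The 10% baseline conversion rate and the 2-percentage-point MDE are my own illustrative numbers, not a recommendation:

```python
# Sample size per group for a two-proportion A/B test (sketch).
# Assumed numbers: baseline conversion 10%, MDE of 2 percentage points,
# significance level 0.05 and power 0.80.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10   # conversion rate of the old onboarding journey (assumed)
mde = 0.02        # smallest absolute lift we care about detecting (assumed)
alpha = 0.05      # significance level: accepted false positive risk
power = 0.80      # chance of detecting the effect if it really exists

# Cohen's h effect size for the two proportions
effect_size = abs(proportion_effectsize(baseline, baseline + mde))

n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=alpha,
    power=power,
    alternative="two-sided",
)
print(f"Required sample size: {n_per_group:.0f} per group")
```

Lower the MDE or the significance level, or raise the power, and you will see the required sample grow, exactly as in the list above.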

A good tip for finding the right sample size is to use an easy-to-use tool. I personally like Optimizely's "Sample Size Calculator".

Pitfall 2: Randomization and its challenges

When I first got acquainted with A/B testing, I somewhat naively assumed that it would be easy to just divide users randomly into different groups. But in fact, it can be more complicated.

To illustrate this, imagine that we divide the participants of a cooking class into two groups because we want to compare different teaching methods. If by chance we put all, or a majority, of the experienced cooks in one group and the beginners in another, how can we tell whether it is the teaching method or the participants' prior knowledge that produces the results?

In a marketing context, this is even more important. Imagine you have a group of 100 users and 5 of them account for 30% of all traffic on your website. If by chance these 'heavy users' end up in the same test group, it will skew the results considerably.

If you are a data scientist, you may feel that this is not too complex, but for the marketer it can be an unexpected problem. A problem that we don't always have the solution to.

Therefore, my recommendation is: if the test is important and you have good data, do the selection together with your analytics team. Relying on the sample split from some CRM/MA system is not always enough, so make sure to consult the experts.
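One common approach, which I would still run past your analytics team, is stratified randomization: split users into strata (for example heavy vs. light users) and randomize within each stratum, so both groups get a similar mix. A minimal sketch, where the field names and the 20-session cutoff are made up:

```python
# Stratified randomization sketch: randomize within each stratum so that
# heavy users are spread evenly across control and experiment.
# The 'sessions_last_30d' field and the 20-session cutoff are illustrative.
import random

random.seed(42)  # fixed seed so the split is reproducible

users = [
    {"user_id": f"u{i}", "sessions_last_30d": random.randint(1, 60)}
    for i in range(100)
]

def stratum(user):
    # Hypothetical cutoff: 20+ sessions in 30 days counts as a heavy user.
    return "heavy" if user["sessions_last_30d"] >= 20 else "light"

strata = {}
for user in users:
    strata.setdefault(stratum(user), []).append(user)

groups = {"control": [], "experiment": []}
for members in strata.values():
    random.shuffle(members)
    half = len(members) // 2
    groups["control"].extend(members[:half])
    groups["experiment"].extend(members[half:])

print({name: len(members) for name, members in groups.items()})
```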

Another related difficulty is to make sure that users stay in the same group (experiment or control) and do not jump between them.

For example: you have set up a test where you measure engagement over a series of 3 pop-ups on your website. The experiment and control groups differ in the timing of these messages. If it takes more than one session to see all the pop-ups, it is important that users are not randomly re-assigned to experiment or control in their next session. They need to remain in the same group until the test is completed. Otherwise, the results risk being undermined.
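A common pattern for keeping the assignment 'sticky' is to derive the group from a stable user ID instead of rolling the dice every session, for example by hashing the ID together with the experiment name. A minimal sketch, assuming you have a stable user_id available:

```python
# Deterministic, "sticky" group assignment: the same user_id always lands
# in the same group for a given experiment, across sessions.
import hashlib

def assign_group(user_id: str, experiment: str) -> str:
    # Hash the user id together with the experiment name, so the same user
    # can still land in different groups for different experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # a number from 0 to 99
    return "experiment" if bucket < 50 else "control"

# The assignment never changes between sessions:
print(assign_group("user-1234", "popup-timing-test"))  # always the same result
```

The same user always lands in the same bucket, no matter how many sessions it takes to see all three pop-ups.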

For many tools, this is not a problem. But don't assume that every MA/CRM system can keep the groups separate for larger tests. Read up!

Pitfall 3: Minimizing variables in A/B testing

When performing A/B testing, it is important that we minimize the influence of other variables as much as possible. One that easily sneaks under the radar is load time.

Let's take another pop-up example. Imagine you want to test the effect of using a video instead of a still image in a pop-up that is important for your business. You've heard that WebM is the hot new format that offers great quality in a smaller file.

It's important to remember that the video (no matter how efficient the format) will still have a longer loading time than a regular image. This introduces a new variable - the loading time - which is not directly related to the content of the video or to the difference between moving and static content.

If the aim of the test is to compare the effectiveness of video vs. still image, without the loading time having an impact, you should consider introducing an equivalent delay for the control group (the one with the image). This will allow you to compare the results more fairly and see if it is the content of the video, rather than its loading time, that influences user behavior.

But it is rare (indeed, almost impossible) to be able to account for all variables. In reality, the scope for clinically controlled lab experiments is quite limited.

We can get around this by repeating the experiment several times and seeing if the results roughly hold up. A clear trend? Yes, then we can put more faith in the result.

Ultimately, it's about looking out for and accounting for variables that can distort the results. This is the only way to ensure that your A/B tests can be trusted.

Pitfall 4: Treating all segments equally

'Average' is a word to be careful with. It belongs to a tricky group of words and ideas that conceal and mystify more than they describe.

Life expectancy in Sweden is 85 years for women and 81 years for men, and it is slowly increasing [1]. But the richest tenth live nine years longer than the poorest tenth [2].

"Average" is not the whole answer.

So we should beware of that trap when testing. We can easily imagine that an experiment has a large positive effect on new users but a negative impact on the majority of mature users. Looking at the whole population then hides all the differences behind a big number that says nothing about anyone.

Therefore, it is important to be clear in the design of your test. What is our hypothesis? Are we interested in seeing an effect on the whole population - our whole customer/user base or parts of it?

Let's run one more example.

Suppose you are conducting an A/B test to increase engagement among users of your podcast app. You test informing users more aggressively about new podcasts and episodes. The customer base can be divided into two user groups: frequent listeners and those who listen more sporadically.

Now, if we just look at the averages across the entire user base after the test, we don't get the full story. Say the average shows a small increase in listening time. That sounds good, but it hides the nuances of how different groups actually reacted to the change.

For frequent listeners, the increase might be significant - they might appreciate the new content and listen even more. But for occasional listeners, it might not make a difference, or worse, they might listen less. The average tells us nothing about this.

It's like saying "the average temperature of patients is normal" without noticing that some of them have a fever while others are freezing to death.

That's why it's so important to break down the data and look at the segment level. Or create the test with a clear target group in mind. It helps us see which strategies work for which users.
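What that breakdown can look like in practice: here is a minimal sketch in Python with pandas, where the segments, column names and numbers are made up for the podcast example:

```python
# Segment-level breakdown sketch for the podcast example.
# The 'segment', 'variant' and 'listen_minutes' columns and values are made up.
import pandas as pd

results = pd.DataFrame({
    "segment":        ["frequent"] * 4 + ["sporadic"] * 4,
    "variant":        ["control", "control", "experiment", "experiment"] * 2,
    "listen_minutes": [50, 54, 66, 70,   # frequent listeners like the change
                       12, 10,  8,  9],  # sporadic listeners listen less
})

# The overall average hides that the two segments move in opposite directions.
overall = results.groupby("variant")["listen_minutes"].mean()
by_segment = results.groupby(["segment", "variant"])["listen_minutes"].mean()

print(overall)     # looks like a clear win on average...
print(by_segment)  # ...but sporadic listeners actually got worse
```

In this made-up data the overall average shows a clear lift, while the per-segment view reveals that sporadic listeners actually got worse.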

👉 Bonus tip: Braze has a great feature called 'Personalized Variant', where an initial mailing is used to match customers, based on various attributes, with the version of a mailing that is likely to work best for them. Check it out!

Pitfall 5: The importance of documentation in A/B testing

Without documenting what you plan, what you did and what the results were, it's hard to succeed. No, let me put it this way: it's nearly superhuman.

If, like me, you don't have a magical memory, you need one thing above all - documentation.

Easier said than done, because the documentation needs to be maintained. It also needs to be integrated into the way you already work.

To avoid it becoming just another nice initiative that fizzles out, you need to decide that, within the framework of how you already work, you will regularly produce and discuss the results. I'm not talking about once a quarter, but once a month or maybe every two weeks.

On one of my assignments, we went through the test documentation at the beginning of each new month, together with all of the previous month's campaigns. Only then did it go from aspiration to reality.

Summary

In summary, we have reviewed five common pitfalls in A/B testing and how to avoid them: from the importance of understanding sample size and timing, to the challenges of randomization, minimizing variables, looking at segments separately instead of only at averages, and finally, the importance of thorough documentation.

As a marketer, this gives you a good foundation to approach testing with confidence. I also hope it will make cooperation with any analytics functions easier.

Happy testing!