A/B testing in software development
Picture this scenario: you have just made some changes to your web application, and would like to know how effective they are in improving the user experience. One option is to wait until a satisfactory number of users have interacted with the new version of the web application and compare their behaviour with the average user behaviour before the changes were implemented, but this raises two issues:
- you do not know whether the population that is currently visiting your website has the same distribution as the population that was using it before the changes, and therefore cannot know whether differences in the behaviour are due to the new modifications or differences in the population;
- you might be showing a worse version of the website, and have no way of quickly detecting this issue.
In this scenario, your best option might be A/B testing. A/B testing, also known as split testing or bucket testing, consists of testing two different versions of a piece of content on your audience. In practice, you create two versions of the content, and randomly show one version to half of your audience and the other version to the other half. If you want to compare more than two versions, you can run an A/B/n test; multivariate testing, a related approach, changes several elements at the same time.
"The A/B test can be considered the most basic kind of randomised controlled experiment", says Kaiser Fung, the man behind several books including Number Sense: How to Use Big Data to Your Advantage. "In its simplest form, there are two treatments and one acts as the control for the other", he adds. Make sure to correctly estimate the size of your sample so that the result is correct and not due to background noise.
Originally used mostly in marketing, A/B testing is now commonly employed in many domains such as software engineering and data science. The best thing about A/B testing is that it does not matter what kind of product or service you offer, nor which specific content you want to modify and evaluate: you can always use it to learn more about your audience and make changes so that you are reaching them in the best way possible. You can study whether the change under evaluation had a positive, negative, or neutral effect on visitor behaviour, and you can apply this to almost any step of software development. A/B testing is also commonly used in many large companies. Google ran its first test in 2000, analysing the "best" number of results per page, and they still actively use it (they reportedly ran over 7,000 tests in 2011). Other big names such as Booking.com, Facebook, and Amazon also regularly conduct controlled experiments.
In this blog post we will go through the benefits and limitations of A/B testing in software development, provide some practical guidelines, and show you why you should always develop with A/B testing in mind.
There are several benefits to A/B testing, and they are mostly related to the increased knowledge that you get about your audience and what they like. Indeed, A/B testing lets you increase users' engagement, create content more effectively, and implement machine learning models that are more trusted by the users.
Better know your audience
This is probably the most important point: by performing A/B testing, you can evaluate how your audience is interacting with your web application, their habits and needs. You can study their behaviour and understand what keeps them coming back to the platform and what, on the other hand, is not relevant to them.
Makes analyses easier
If your application is A/B testing-ready, analysing the effects of any change becomes much easier. You can focus on the specific aspect that might be causing a given user behaviour. For instance, you might use A/B testing to find out what causes buyers to abandon a cart in an e-commerce application: there can be a variety of reasons, such as hidden costs, poor layout, or a slow application. With A/B testing, you can more easily find the real cause and target it.
This is a direct consequence of the first point: knowing your audience and their preferences better enables you to keep them more engaged with your application. You can study which type of content the audience prefers and concentrate on it, without spending time, money, and energy on content that the audience does not really consume.
Helps identify issues
Building a web application - or any piece of software - is a complex task. There are many components that have to fit together, and it might be difficult to have a coherent vision of everything that appears on the platform. A/B testing allows you to identify issues such as poor UX design or specific components that do not fit well with the rest of the platform.
With A/B testing you can evaluate essentially any piece of content that you put on your platform, and keep only the most effective ones (i.e. the ones that work better for your audience). It is quite intuitive to think about this for blog posts, or any other content that the users consume, but it also applies to other components of your application, such as machine learning models. Machine learning models are generally evaluated with offline metrics but, for customisation models, there is no better way to evaluate them than by directly studying how the audience interacts with them. A/B testing enables you to do just that. This blog post from Headspace describing how they evaluated the performance of their recommender system is an interesting read about A/B testing for evaluating machine learning models.
We have seen some of the major benefits of A/B testing but, unfortunately, it has some limitations of its own as well, and it is important to be aware of them.
Data drift is a term commonly used in machine learning to describe how the data distribution might change over time, so that the performance of a machine learning model in production slowly gets worse. This is a problem for A/B testing as well: the audience's preferences might change over time, so it is important to keep testing in order to detect such changes as soon as they happen.
No fine-grained customisation
A/B testing is very helpful in understanding the preferences of your specific audience, but it only studies general trends, and it cannot provide fine-grained customisation or fine-grained analysis (e.g., user profiling). If that is your objective, it might be better to have a look at recommender systems as well.
Limited exploration of alternatives
A/B testing is focused on comparing two (or a few) predefined alternatives. This helps in optimising existing options, but it may not provide an opportunity for exploring radically different ideas that go beyond the tested variations. For this reason, exploratory research and user feedback should be used to complement A/B testing to foster more innovation.
A/B testing is explicitly designed to evaluate specific changes or variations in a controlled environment. However, it might not be suitable for studying more complex or interconnected factors; in other words, A/B testing is generally not capable of capturing the potential inter-dependencies between multiple variables.
Duration, external factors, and sample size
A/B testing is a constant trade-off between two forces: on one hand, you want to have tests that run as long as possible to better evaluate the changes under testing; on the other, if the experiments run for too long, there might be changes in the user behaviour that affect the results (e.g. seasonality, market trends, etc.). For this reason, it is important to perform retesting, in order to have an evaluation that is as robust as possible to external factors.
Also, A/B testing requires a sufficiently large sample to provide statistically significant results, and this might force you to keep the experiment running for longer, as obtaining a large enough sample size can be challenging, especially for niche or specialised products.
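To get a feeling for how large "sufficiently large" is, the sketch below uses the standard normal approximation for comparing two conversion rates. The function name, the default significance level (5%) and the default power (80%) are assumptions chosen for illustration, not values prescribed in this post.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, min_detectable_effect,
                            alpha=0.05, power=0.8):
    """Approximate number of users needed in EACH variant.

    Uses the normal approximation for a two-sided test on two proportions.
    All parameter names and defaults are illustrative assumptions.
    """
    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_effect
    # Critical values for the chosen significance level and power
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    # Sum of the variances of the two Bernoulli outcomes
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)


# Hypothetical example: 10% baseline conversion, detect a lift of 2 points
print(sample_size_per_variant(0.10, 0.02))
```

Halving the effect you want to detect roughly quadruples the required sample, which is why small improvements on niche products can take a long time to confirm.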
When should you use A/B testing?
Short answer: *always*; realistically, whenever you can. Although it is unrealistic to commit to A/B testing on everything you put out, it would be ideal to conduct continual A/B testing, and it is certainly crucial to perform it on changes that might have bigger impacts on the audience.
Pick one variable to test
A/B testing works best if only one element differs between the versions. Otherwise, it might be difficult to understand which component is responsible for the effect on the audience.
Divide your audience equally and randomly when possible
This is an important step to reduce as much as possible the differences between the populations that receive the two versions under test. If the two populations are different, such differences might be the cause of any difference visible in the outcome of the A/B test (instead of the change in the content). In some cases, you might want to use a deterministic split, but this is the better choice only if you are trying to study how users from a specific demographic react to a change.
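An equal, random split can be sketched in a few lines. The function name `split_audience` and the idea of working on an in-memory list of user IDs are assumptions for illustration; a production system would more likely assign users as they arrive.

```python
import random


def split_audience(user_ids, seed=None):
    """Randomly split users into two groups of (almost) equal size.

    `seed` is optional and only there to make the split reproducible.
    """
    rng = random.Random(seed)
    shuffled = list(user_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    # With an odd number of users, group B receives the extra one
    return {"A": shuffled[:half], "B": shuffled[half:]}


groups = split_audience(range(100), seed=42)
print(len(groups["A"]), len(groups["B"]))
```

Shuffling before cutting the list in half means that no property of the users (sign-up order, ID range, and so on) decides which version they see.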
Possibly use blocking
Completely randomised splits can cause some issues, such as one set containing more mobile users than the other, which might affect the accuracy of the evaluation. The best way to avoid such biases is to first divide visitors into groups, e.g. desktop and mobile, and then randomly assign the visitors within each group to the sets. This technique, referred to as blocking, should be applied to any aspect that might bias the results of the evaluation.
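Blocking can be sketched as follows: users are first grouped by the blocking variable and then randomised within each group. The `device` field and the function name `blocked_assignment` are hypothetical and only serve the example.

```python
import random
from collections import defaultdict


def blocked_assignment(users, block_key, seed=None):
    """Randomly assign users to A or B separately within each block."""
    rng = random.Random(seed)
    # Group users by the blocking variable (e.g. device type)
    blocks = defaultdict(list)
    for user in users:
        blocks[block_key(user)].append(user)

    assignment = {"A": [], "B": []}
    for members in blocks.values():
        rng.shuffle(members)
        half = len(members) // 2
        # Each block contributes (almost) equally to both variants
        assignment["A"].extend(members[:half])
        assignment["B"].extend(members[half:])
    return assignment


# Hypothetical audience: one third mobile, two thirds desktop
users = [{"id": i, "device": "mobile" if i % 3 == 0 else "desktop"}
         for i in range(60)]
result = blocked_assignment(users, block_key=lambda u: u["device"], seed=7)
mobile_in_a = sum(u["device"] == "mobile" for u in result["A"])
mobile_in_b = sum(u["device"] == "mobile" for u in result["B"])
print(mobile_in_a, mobile_in_b)
```

Because the split happens inside each block, both variants end up with the same share of mobile and desktop users, whatever the random draw.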
Test versions simultaneously, and give tests time to run
It is important to test versions simultaneously (whenever possible) in order to minimise the differences between the populations that consume the two versions of the content.
Also, you must remember to allow your tests to run for enough time to produce useful results.
Measure your results, and take action based on them
It might seem obvious, but you should always allow for enough time to extensively evaluate and measure the results of your A/B testing. In order to do this, it might be best to decide in advance the metrics to use at evaluation time.
Also, you should always use what you learn, even if it is that the original version performed better than any of your other tested versions. Apply what you learn about your platform and your audience, not only to the component under evaluation, but also to other components which might exhibit similar behaviours.
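As an illustration of measuring a result, the sketch below compares the conversion rates of two versions with a two-proportion z-test. This is one common choice among several, not the only valid analysis; the function name and the visitor counts are made up for the example.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(conversions_a, visitors_a,
                          conversions_b, visitors_b):
    """Return (z, two-sided p-value) for the difference in conversion rate.

    Assumes samples large enough for the normal approximation and at
    least one conversion and one non-conversion overall.
    """
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    # Pooled rate under the null hypothesis that A and B perform equally
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    std_err = math.sqrt(pooled * (1 - pooled)
                        * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: 200/2000 conversions for A, 260/2000 for B
z, p = two_proportion_z_test(200, 2000, 260, 2000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

Deciding on the metric (here, conversion rate) and the significance threshold before the test starts avoids the temptation to pick whichever number looks best afterwards.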
As we have seen above, users' preferences might change, and the results of your tests might be affected by seasonality or market trends. It is important to perform retesting, to understand whether the results keep holding true or something is changing.
Beware spurious correlations
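Complex tests are useful, but they are sometimes not efficient. Looking at too many metrics at a time can result in spurious correlations: the more metrics you compare, the more likely it is that at least one of them shows a difference purely by chance.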
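One simple guard when several metrics are evaluated in the same test is the Bonferroni correction, which divides the significance threshold by the number of comparisons. It is deliberately conservative, and other corrections exist. The metric names and p-values below are invented for illustration.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag which metrics remain significant after Bonferroni correction.

    `p_values` maps a metric name to its (uncorrected) p-value.
    """
    # Each metric must clear a stricter threshold: alpha / number of metrics
    threshold = alpha / len(p_values)
    return {metric: p <= threshold for metric, p in p_values.items()}


# Hypothetical p-values for five metrics tracked in the same experiment
observed = {
    "conversion": 0.004,
    "time_on_page": 0.03,
    "bounce_rate": 0.20,
    "clicks": 0.04,
    "signups": 0.60,
}
print(bonferroni_significant(observed))
```

In this made-up example, `time_on_page` and `clicks` would look significant at the usual 5% level on their own, but no longer pass once the five comparisons are accounted for.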
In this blog post we have seen the benefits and challenges of using A/B testing in software development. We hope that you found this helpful and informative, and will give A/B testing a shot when developing new components for your application! It requires some initial effort, but can certainly provide useful insights into your audience and application.