How A/B Testing Works in SMS Messaging
Have you ever wondered if your SMS messaging results could be any better? Have you ever seen a huge change in your results after changing something? Making changes without knowing what will happen can be scary. But you can take the stress – and the scariness – out of the process.
All it takes is a little planning, sending some SMS messages to your list, and some maths. Don’t worry though, I’m going to show you some online tools that do most of the maths for you. The method is called A/B testing, and it’s used in all sorts of industries.
In my past job-life, one of my main responsibilities was “designing experiments” to evaluate communications systems. It probably isn’t anything like the images that just popped up in your head when you read that – no beakers or lab coats. It involved looking at lots of different parts of our project, figuring out which ones to “tweak”, then doing a whole lot of maths to figure out how to test which “tweak” was better. If that sounds complicated, it was. Thankfully though, doing it for your messages will be much easier.
Introduction to A/B Testing
A/B testing sounds vague, but it’s a very specific thing. It involves looking at two versions of something, one called A and the other called B, and deciding which one is better based on rigorous maths. So, in terms of SMS messaging, A is one version of a message and B is a slightly modified version of the same message.
For example, here is a message we’ll call A:
XYZ Co: 10% off all coats until Friday. Use coupon OUTER10 at checkout.
Here would be a version called B:
XYZ Co: 10% off all coats today only. Use coupon OUTER10 at checkout.
See what I did there? I changed the offer to be more limited – today instead of Friday. It was a simple change that I’d expect would increase the urgency people felt when reading it. Ideally, that would result in higher redemption rates for the coupon. But would it? The only way to know is to do an A/B test.
Defining an A/B Test
So, A/B testing, also called split testing, is a method used to optimise results. In SMS messaging, those results could be redemption rates, opt-in rates, or attendance rates. It’s the objective you’re trying to achieve with your messaging.
To use this method, you’ll need to know some terminology. The basis for it all is statistical maths, so the terms could be intimidating if you don’t use them on a regular basis. But that’s ok, I promise to make it easy to understand and use, even if you have arithmophobia.
A sample is a small portion of a larger thing. Technically, a piece of a cake is a sample of the cake. In your messaging, the sample will be a small subset of your total list. For example, if you have 1000 people on your list, a sample could be 50, 100, or 238 of them. That number, whatever it is, is your sample size.
There are a couple of details you need to know about your sample size:
- The people in it need to be random. The maths behind this method are built on the assumption that the sample is random. In SMS messaging, this means you want to eliminate bias as much as possible. For example, let’s say you have people aged 18-78 on your list. If your sample somehow ended up containing only people over the age of 50, that could bias the results towards that demographic. A message that did really well with that group might not do as well with the larger list, which also includes younger people.
- You need to calculate the size (don’t worry, it’s not hard). While the mix of people in the sample needs to be random, the sample size cannot be. It needs to be large enough that you know the test gives valid results, but you also want to keep it as small as possible. This is because you’ll send the optimised version (whichever message, A or B, gets the better result) to the remaining list. Once they see the better version of the message, you’ll hopefully see the same improvement you found in the A/B test.
Later, I’ll share some online resources that will help you find the right sample size for your list.
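To make the randomness part concrete, here’s a minimal Python sketch of drawing an unbiased sample from a contact list. The phone numbers and the sample size of 278 are placeholders I made up for illustration; `random.sample` picks without replacement, so every contact has an equal chance and nobody is chosen twice.

```python
import random

# Hypothetical contact list -- stand-ins for the numbers on your real list.
contact_list = [f"+44 7700 900{n:03d}" for n in range(1000)]

sample_size = 278  # from a sample size calculator, for a 1000-person list

# random.sample draws without replacement: no duplicates, no bias.
sample = random.sample(contact_list, sample_size)

print(len(sample))       # 278
print(len(set(sample)))  # 278 -- everyone in the sample is unique
```

If your list lives in a spreadsheet export, the same two lines work on any Python list of rows.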
If you’ve done any kind of test marketing, you’ll know the control is the message you’ve been using. It’s the original one that’s been doing well, but now you want to improve it. It is the standard you’ll use for determining if the new version of the message does better, or worse.
For example, if the current message gets a 20% redemption rate, you’ll want to see something better than that in the new version. If it doesn’t do better, then the control is still the best version (and still the control). But if you find the new one gets a 30% conversion, it now becomes the control you’ll use to test other changes against.
The variation is the name given to the message you changed. It’s called that because you “varied” something from the original, or control. In the example above, I changed the words related to the expiration of the offer (to make it more urgent).
The change could also be other things, like a coupon code in the control, and a link to a mobile coupon in the variation. It’s important to remember that a variation should only have one thing different from the control. With just a single change, you’ll know the reason why a message did better or worse than a control: It was obviously what you changed. If you change two things, you’ll never know which of the two things is responsible for the results, or if the message would have been better with just one of them.
Now, the word significant here has a dual meaning. It means what you think it means from normal usage – important or noteworthy. But it also has a mathematical meaning. And in A/B testing, it is the mathematical significance that matters. It’s what tells you if the change you made really mattered.
Here’s an analogy that might make it easier to understand. You probably know that if you flip a coin, you have a 50/50 chance of it landing face up or face down (heads or tails). If you flip a coin 100 times, you’d expect it to land face up about 50 times, and face down about 50 times. Now if you actually did the experiment with a coin, you might get it face up 52 times, and face down 48 times.
Would that mean something strange was happening, or is that difference something normal? From our everyday experience, I think you’d agree that it’s normal. But if you got 75 heads and 25 tails, then you’d say something is weird. Maybe the coin is weighted or has some imperfection that is causing it. But you’d know the difference in results was significant, and start looking for a cause.
The same is true with your A/B testing. You might see a 10% increase in redemptions, but before you jump for joy, you’ll want to be sure that the increase was real, and not just a random effect. That’s where the mathematical determination of significance comes in.
It takes some complicated maths to determine it, though. But thankfully there are some online calculators that make it easy. You enter your information and they will tell you if the result is significant. If it is, you know you’ll see the same type of result when you send the message to the entire list (not just the sample). If it isn’t, that means the result you got isn’t reflective of what the rest of the list would do.
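If you’re curious what those calculators do under the hood, most of them run some form of a two-proportion z-test. Here’s a rough Python sketch using the made-up 20% vs 30% redemption example from earlier, with 278 people per group; the function name and figures are my own assumptions, not taken from any particular calculator.

```python
import math

def is_significant(n_a, conversions_a, n_b, conversions_b, confidence=0.95):
    """Two-proportion z-test -- essentially the check most online
    A/B significance calculators perform."""
    p_a = conversions_a / n_a
    p_b = conversions_b / n_b
    pooled = (conversions_a + conversions_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the normal distribution.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return p_value < (1 - confidence), p_value

# 278 people saw each message: A redeemed ~20% (56), B redeemed ~30% (84).
significant, p = is_significant(278, 56, 278, 84)
print(significant)  # True -- a 10-point lift on this sample size is unlikely to be chance
```

Had message B redeemed only slightly better (say 58 instead of 84), the same test would return `False` – the difference would be within normal random variation, like the 52-heads coin toss.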
I’ll list those online calculators at the end of the blog, with some advice on how to use them.
Standard or Confidence Level
Both of these terms mean the same thing. It is the percentage confidence (in the common sense of the word) you want to have in the results of your test. For most A/B testing I’ve seen, 95% is usually used. That means the determination of significance is made with 95% confidence. Another way to look at it is that you can be 95% sure that the answer you get is significant and meaningful.
Remember, the heart of all this is statistical maths. It deals with probabilities, so you can never be 100% certain. You can try and get 99.999% certain, but usually 95% is enough for marketing.
A term related to the confidence is the confidence interval. If you’ve ever looked at survey results, you’ve probably noticed they often have a “±” and some number in the small print. This is the error, or the confidence interval, for that survey. For example, 48% of people surveyed said they support making ice cream mandatory for dessert on Tuesdays (not really, I made this one up). If the confidence interval for this made-up survey is 4%, we can only be sure that somewhere between 44% and 52% (48 ± 4) of people actually agree ice cream should be mandatory.
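Here’s a quick Python sketch of how that “±” figure is worked out for a surveyed proportion. The 600 respondents are an assumption I picked so the numbers line up with the made-up ice cream example:

```python
import math

def margin_of_error(p, n, z=1.96):
    """Margin of error (the '±' figure) for a surveyed proportion.
    z = 1.96 corresponds to 95% confidence."""
    return z * math.sqrt(p * (1 - p) / n)

# 48% support, from a hypothetical survey of 600 people.
moe = margin_of_error(0.48, 600)
print(round(moe * 100))  # 4 -- i.e. 48% ± 4%
```

Notice that surveying more people shrinks the “±”: the same formula with 6000 respondents gives a much tighter interval.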
So, for your A/B testing, the confidence interval is the amount of error you’re willing to accept. The smaller the number, the larger your sample size will need to be. The larger the interval, the smaller your sample size can be. Again, you don’t need to know how to calculate it, or even understand it right now. It will all make sense once you’ve played around with the online tools.
Now that you understand the terminology, we can look at some online resources that will make A/B testing easy for you to do.
There are two sets of online resources I’ll share here. The first can be used if you haven’t run any tests yet, or you’re starting fresh with new tests. The second set can be applied to the results of tests you’ve already run and have results for, but aren’t sure how to interpret them.
Before Your First Test
If you haven’t run any tests yet, the first thing you’ll want to do is work out what your sample size needs to be. By doing this, you’ll know right from the beginning whether or not the results you get are significant. This is good because you won’t waste a lot of time running various versions of messages only to find out the results aren’t repeatable, or don’t get you the response you expected.
Before you start, you’ll need to have this information handy:
- Your total list size
- The confidence you want (usually 95%)
- The confidence interval (error) you’re willing to accept
Once you have those numbers, visit this page by Creative Research Systems. Scroll down the page and you’ll see the Sample Size Calculator.
You can select either 95% or 99% confidence. For marketing, 99% is overkill, but feel free to play with the options to see how it affects the results. Next, you enter your confidence interval. For this calculator, the number needs to be between .1 and 50 (this is in percent, so 5=5% and .1=.1%). The final number you need is your total list size, which is called the population in this form.
As an example, I found the sample size for a list of 1000. I chose a confidence of 95% and an interval of 5. The result is a sample size of 278.
This means I need a total of 556 (278 x 2) people to conduct an A/B test with 95% confidence. I would send the control message, A, to 278 people, and the variation, message B, to a different 278 people.
If message B received 10% more redemptions, then I could interpret the results like this:
I can be 95% certain that message B would get between 5% and 15% more redemptions if I sent it to the entire list. The range is 5-15 because I chose 5 as my confidence interval, which corresponds to a ±5% error in the results.
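If you’d like to check the calculator’s answer yourself, the standard sample-size formula (assuming the worst-case 50/50 response split, with a correction for a finite list) can be sketched in Python. Treat this as an approximation of what such calculators do, not their exact code:

```python
import math

def sample_size(population, interval, z=1.96):
    """Required sample size for a given list size and confidence
    interval (entered in percent, e.g. 5 for ±5%). z = 1.96 gives
    95% confidence; p = 0.5 is the worst-case response split."""
    e = interval / 100
    ss = (z ** 2) * 0.25 / (e ** 2)              # infinite-population size
    adjusted = ss / (1 + (ss - 1) / population)  # finite population correction
    return math.ceil(adjusted)

print(sample_size(1000, 5))  # 278 -- matches the worked example above
```

Try it with a bigger list: for 10,000 people the sample size only grows to about 370, which is why testing stays practical even for large lists.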
Another website with an online calculator is SurveyMonkey. The form is similar to the one above, and if you scroll down the page, they share the maths behind it too.
As I mentioned, the benefit of determining your sample size first is that you’re guaranteed to get significant results if you do. But if you’ve already sent a bunch of different messages trying to improve your response, you can still use some tools to figure out what they all mean.
Are Your Tests Valid?
So maybe you’ve been tweaking your messages for a while now, but you aren’t really sure about what the results mean. Maybe you found something you thought made an improvement, but in the end it didn’t do as well as you thought it would.
Hopefully you saved all your data because you can use it to see if any of the tests you did are valid. First pick the message you used as a control and the one you consider the variation. Then find the following information for each message:
- The number of people who received each message (the sample size)
- The number of conversions (raw number)
- Conversion rate
The first calculator is on the easycalculation.com website. You’ll need to enter the sample size for your control and your variation. Note that for this calculator, the two sample sizes do not have to be equal. Then you enter the response rate (conversion rate) for each message.
Click calculate and check the answer in the “significance” box. If the answer is “yes,” then you can be sure that the result you saw is valid and not a random or expected result. If the answer is “no” then the test can’t be considered valid, as the results are within what you’d expect (like getting 52 heads on the coin toss rather than exactly 50).
A second tool for checking your test results is on vwo.com. The terminology on this calculator is a little different. It requests “Number of visitors” rather than sample size. This is because the site is focused on website A/B testing, so the relevant term is visitors. But for our purposes, it means the same thing as sample size.
Another difference is that this tool requests the raw number of conversions rather than a percentage. But the rest is just the same. Click the “Calculate Significance” button. You should see a “Yes!” if your test is significant, and a “No” if it isn’t. That answer is all that matters, so don’t worry about what it says about the P-value (if you want to know more about it though, you can read this article).
Wrapping It Up
Phew! There’s a lot to know about A/B testing. Thankfully though, just knowing these terms and tools is enough to get you optimising your results fast without needing a degree in maths.
But I have a word of caution too. There are many online resources to help with the testing. If you Google around you may find ones you like better – but make sure you read how to use them. Some may use different terminology, or make different assumptions about your data that may not be true. Just educate yourself on the tools you decide to keep using.
Also, come back for the next blog which will cover the sorts of things you can test in SMS messaging and how to implement testing in your business.