The key to setting up any testing program is to ensure that no biases are introduced. It’s crucial to test with a clean framework where the treatment and control sets of users are comparable. Without comparable audiences, the test will not produce valid, reliable results.
The importance of randomized controlled experiments
When most people think of randomized controlled experiments, clinical trials, in which one group receives drug treatment and another group receives a placebo, come to mind. This type of experiment is very common in medical trials, where the goal is to understand whether or not a newly discovered medication is effective.
In the context of online marketing and ad effectiveness, randomized controlled experiments have become common practice. Marketers are finding that these experiments consistently offer reliable results by reducing noise. In other words, they deliver reliable outcomes by controlling and accounting for any variables that could impact the measured data.
Standard split-testing frameworks
Sequential testing is executed by turning a channel on or off for a certain amount of time to measure the impact on the overall baseline, typically the website conversion rate. This type of testing intends to reflect whether or not the channel within the marketing mix is having the desired positive impact.
The treatment set is the group that is exposed to the marketing channel; in contrast, the control set is not exposed to that channel. Marketers then measure the difference in conversion rate between treatment and control to determine lift. Sequential testing was the first way of measuring ad effectiveness, as it’s easily accessible to any marketer and doesn’t require a vendor.
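The lift calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function names are hypothetical, the input numbers are made up, and reporting lift relative to the control conversion rate is an assumption, since the text only says that lift comes from the difference between the two rates.

```python
def conversion_rate(conversions: int, users: int) -> float:
    """Share of users who converted."""
    if users <= 0:
        raise ValueError("users must be positive")
    return conversions / users


def lift(treat_conversions: int, treat_users: int,
         ctrl_conversions: int, ctrl_users: int) -> tuple[float, float]:
    """Return (absolute lift, relative lift) of treatment over control.

    Absolute lift is the difference in conversion rate.
    Relative lift divides that difference by the control rate (an assumption).
    """
    treat_rate = conversion_rate(treat_conversions, treat_users)
    ctrl_rate = conversion_rate(ctrl_conversions, ctrl_users)
    absolute = treat_rate - ctrl_rate
    # Relative lift is undefined when the control group has no conversions.
    relative = absolute / ctrl_rate if ctrl_rate > 0 else float("nan")
    return absolute, relative


# Illustrative (made-up) numbers: 120 vs. 100 conversions per 10,000 users.
absolute, relative = lift(120, 10_000, 100, 10_000)
print(f"absolute lift: {absolute:.4f}, relative lift: {relative:.1%}")
```

The arithmetic is the same for every framework in this article. What changes is how trustworthy the treatment and control counts fed into it are.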
Nevertheless, it’s highly unreliable because it doesn’t remove the noise of external factors, such as other marketing channels and seasonality, and it introduces risk by fully turning off a channel that could be driving valuable revenue. This noise means that the groups aren’t comparable, which biases the results. Because the treatment and control are not exposed to the same variables at the same time, sequential testing doesn’t meet the criteria for a randomized controlled experiment.
Public service announcement (PSA) testing is one of the most commonly used frameworks. This testing framework splits the segmented audience into treatment and control groups. The treatment group sees brand- or product-related ads, while the control group sees PSA ads.
PSA testing’s biggest flaw is that bidding and optimization prevent the groups from being truly randomized. While the treatment and control groups are generally split at random, the machine-learning algorithms the vendor applies (often conversion and/or click optimization) will frequently favor the treatment group, since users in the treatment group are more likely to click, convert, or click-convert.
In addition to often reflecting inflated positive metrics, favoring the treatment group has the unintended effect of producing unaccounted-for variables, which results in a lot of noise.
Randomized split-testing frameworks
Intent-to-treat / Bid opportunity testing
Intent-to-treat (ITT) is a method that creates a holdout at the point where the ad platform intends to treat the user with an ad, known in real-time bidding (RTB) as the bid opportunity level. Once a bid opportunity occurs, the ad platform conducts a randomized split within its RTB infrastructure. Unlike in PSA testing, the split happens post-targeting: checks such as geotargeting, segmentation, and cookie duration are confirmed before the audience is split. This removes a lot of bias and ensures comparable audiences.
The key benefit of ITT is that it’s an unbiased way of conducting testing because it compares like audiences. Additionally, this type of testing removes noise, meaning anything that should be accounted for but isn’t and that therefore biases the outcome. However, because this framework doesn’t ensure an actual bid or ad treatment, it doesn’t remove the noise of internal and external auction behavior, which causes fluctuations in lift results.
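To make the ordering concrete, here is a minimal Python sketch of a split at the bid opportunity level. It rests on several assumptions: `BidOpportunity`, `assign_group`, `handle_opportunity`, and the 10% holdout share are hypothetical names and values, and hashing the user ID is just one common way to get a stable, effectively random assignment. Real RTB infrastructure will differ.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BidOpportunity:
    user_id: str
    # True only if geotargeting, segmentation, and cookie checks all passed.
    passes_targeting: bool


def assign_group(user_id: str, experiment: str, holdout_share: float = 0.1) -> str:
    """Map a user to 'control' or 'treatment' via a salted hash (assumed approach)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "control" if bucket < holdout_share else "treatment"


def handle_opportunity(opp: BidOpportunity, experiment: str) -> Optional[str]:
    """Split only after targeting checks, so both groups are comparable."""
    if not opp.passes_targeting:
        return None  # never enters the experiment
    return assign_group(opp.user_id, experiment)


print(handle_opportunity(BidOpportunity("user-123", True), "lift-test"))
print(handle_opportunity(BidOpportunity("user-456", False), "lift-test"))
```

Because the split runs before any bid is placed, some users in the treatment group are never actually bid on or shown an ad. That is the source of the auction noise mentioned above.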
Ghost bid testing
In many ways, the implementation of ghost bids is similar to the intent-to-treat framework. The key difference between the two models is the stage where logging occurs. While intent-to-treat implements the split at the bid opportunity, ghost bids conduct the split at the actual bid level. This means that with ghost bids, the auction, pricing, and optimization of an ad happen before the split. Because the split is conducted later, heavy fluctuations are removed, which allows for large-scale testing and quick results.
A ghost is the logging event of the bid on the control set. While users are excluded from receiving an actual bid and ad, the logging event is still present. Logging the event allows vendors to either swap the bid with another advertiser in the auction or just use it to keep track of every potential bid that has been suppressed. The counts of these logs are called ghosts.
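The difference from intent-to-treat can be sketched as follows. This is a hedged Python illustration under assumptions, not a vendor implementation: `in_control`, `process_bid`, the `Counter`-based log, and the holdout share are all hypothetical. The bid price stands in for the result of the auction, pricing, and optimization steps, and the hash-based split is an assumed technique.

```python
import hashlib
from collections import Counter
from typing import Optional


def in_control(user_id: str, experiment: str, holdout_share: float) -> bool:
    """Stable pseudo-random assignment via a salted hash (assumed approach)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) / 0x100000000 < holdout_share


def process_bid(user_id: str, bid_price: Optional[float], experiment: str,
                log: Counter, holdout_share: float = 0.1) -> Optional[float]:
    """Apply the split only once the bidder has decided to bid.

    bid_price is None when pricing and optimization decided not to bid.
    Returns the price to submit, or None if no bid goes out.
    """
    if bid_price is None:
        return None  # no bid decision, so nothing to split or log
    if in_control(user_id, experiment, holdout_share):
        log["ghost"] += 1  # bid suppressed, but the event is still logged
        return None
    log["bid"] += 1
    return bid_price


log: Counter = Counter()
for i in range(1000):
    process_bid(f"user-{i}", 1.25, "lift-test", log)
print(dict(log))
```

Control users with a logged ghost can then be compared with treatment users who received a real bid, since both groups passed through the same auction, pricing, and optimization decisions.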
While intent-to-treat frameworks are reliable, ghost bid testing acknowledges and solves for ITT’s downside, making it one of the most reliable and sophisticated ways to measure ad effectiveness.
Want to learn more about incremental lift testing?
Whether you’re just beginning to learn about testing frameworks or are looking to increase your existing knowledge, we’ve created a quick-start handbook that will help you measure the effectiveness of your advertising campaigns.