Trustworthy Online Controlled Experiments by Ron Kohavi, Diane Tang, Ya Xu
Almost 3 years in, I have a good sense of which parts of my job I enjoy. Running experiments is definitely one of them. That’s when we have the most agency and can explicitly find causal effects, as opposed to conducting pure analysis, which often results in circular logic. On the surface, experimentation is the most scientific part of the data science purview. In reality, it’s more art than science. Even though I’ve run over a dozen experiments, each new one still has its own unique problems. I’ve always wanted to write a go-to doc for experimentation, but there are truly so many edge cases that there are no edge cases. Going into this book, I was unsure how much I’d learn. Getting an experiment right is all about getting the smallest details right.
1) At Slack, only 30% of monetization experiments show positive results.
Putting aside the accuracy of this metric, the broader point is that most experiments should and do fail. There is a strong incentive to only run experiments that will succeed or to spin results in a positive light. It takes a lot to admit that some ideas don’t work.
2) Bing spent two years and $25 million integrating social media and failed.
One meta aspect of experimentation is knowing when to move on and when to tweak.
3) Some metrics have growing variance, so running an experiment longer doesn’t necessarily help.
I’ve never conceptualized this before.
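To make this concrete for myself, here’s a toy simulation (my own made-up parameters, not from the book): for a cumulative count metric like clicks per user, persistent user-to-user differences make the variance grow with the square of the observation window, so the relative standard error of the mean barely shrinks as the experiment runs longer.

```python
import numpy as np

# Toy illustration (made-up parameters): a cumulative metric like clicks per
# user, where users have persistently different daily rates. The variance
# across users grows with the observation window, so the relative standard
# error of the mean barely improves with a longer experiment.
rng = np.random.default_rng(0)
n_users = 10_000
daily_rate = rng.gamma(shape=2.0, scale=1.0, size=n_users)  # per-user click rate

for days in (7, 14, 28):
    clicks = rng.poisson(daily_rate * days)  # cumulative clicks after `days` days
    rel_se = clicks.std(ddof=1) / np.sqrt(n_users) / clicks.mean()
    print(f"{days:>2} days: var={clicks.var(ddof=1):8.1f}  relative SE={rel_se:.4f}")
```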
4) Twyman’s law states that the more unusual the data, the more likely it’s wrong.
This is the painful reality of data “insights.” In most cases, insights are just common sense. When you find something weird, you have to triple check your work because most likely it’s wrong.
5) For Bing, over 50% of US traffic and over 90% of Chinese and Russian traffic comes from bots.
I’m biased to think that rideshare experimentation is the most difficult because of the interference effects, but this book gave me a new appreciation for every platform’s unique challenges. At the end of the day, experimentation comes down to proper counting, and proper counting is very difficult no matter the domain.
6) Kaiwei Ni made an Instagram ad with a fake piece of hair.
The literal clickbait.
7) In Hanoi, a rat tail bounty program led to rat farming.
The classic analogy in experimentation is counting clicks. The goal is to incentivize the desired behavior. The hard part is defining what that behavior is.
8) Interleaving experiments can lead to faster algorithm iterations.
Netflix has a blog post on interleaving. The gist is that interleaving has higher sensitivity: in a standard A/B experiment, power users can easily skew results, but by blending results from both algorithms into a single list for each user, Netflix can measure which one wins much more quickly (rough sketch below).
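For my own reference, here’s a sketch of the team-draft variant of interleaving (a standard scheme; this toy implementation is mine, not Netflix’s code): the two rankers take turns “drafting” their top remaining result into a single list, and each click is later credited to the team that drafted the clicked item.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, k, seed=None):
    """Team-draft interleaving (toy version): merge two rankings into one list
    shown to the user, remembering which ranker contributed each slot so that
    clicks can later be credited to team A or team B."""
    rng = random.Random(seed)
    interleaved, credit, used = [], [], set()
    count_a = count_b = 0
    idx_a = idx_b = 0

    def advance(ranking, idx):
        # Skip items already placed by the other team.
        while idx < len(ranking) and ranking[idx] in used:
            idx += 1
        return idx

    while len(interleaved) < k:
        idx_a = advance(ranking_a, idx_a)
        idx_b = advance(ranking_b, idx_b)
        if idx_a >= len(ranking_a) and idx_b >= len(ranking_b):
            break  # both rankings exhausted
        # The team with fewer picks drafts next; ties are broken by coin flip.
        a_turn = count_a < count_b or (count_a == count_b and rng.random() < 0.5)
        if a_turn and idx_a < len(ranking_a):
            item, tag = ranking_a[idx_a], "A"
            count_a += 1
        elif idx_b < len(ranking_b):
            item, tag = ranking_b[idx_b], "B"
            count_b += 1
        else:
            item, tag = ranking_a[idx_a], "A"
            count_a += 1
        interleaved.append(item)
        credit.append(tag)
        used.add(item)
    return interleaved, credit

# At analysis time, each click is credited to the team that drafted the item;
# the algorithm whose team wins more sessions is preferred.
shown, credit = team_draft_interleave(["x", "y", "z", "w"], ["y", "q", "x", "r"], k=4, seed=7)
print(list(zip(shown, credit)))
```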
9) Binary variables tend to have lower variance and are generally better metrics.
While not always applicable, yes/no variables are indeed easier to work with. Unbounded metrics add so much variance that they muddle the analysis.
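A quick toy comparison (made-up numbers) of why this matters in practice: the coefficient of variation of a conversion flag is far smaller than that of a heavy-tailed revenue metric, and the sample size needed to detect a given relative lift scales with that quantity squared.

```python
import numpy as np

# Toy comparison (made-up parameters): a binary conversion flag vs. a
# heavy-tailed revenue-per-user metric. The much higher coefficient of
# variation of the unbounded metric is what blows up the required sample size.
rng = np.random.default_rng(1)
n = 100_000

converted = rng.random(n) < 0.05  # Bernoulli: variance = p*(1-p) <= 0.25
revenue = np.where(converted, rng.lognormal(mean=3.0, sigma=1.5, size=n), 0.0)

for name, x in [("conversion", converted.astype(float)), ("revenue", revenue)]:
    cv = x.std(ddof=1) / x.mean()  # coefficient of variation
    print(f"{name:>10}: mean={x.mean():.3f}  var={x.var(ddof=1):8.3f}  CV={cv:.2f}")
```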
10) n ≈ 16σ²/δ², where σ² is the metric’s variance and δ is the minimum detectable effect.
Power analysis is an art, but this formula helps.
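As a sanity check, here’s the rule of thumb in a few lines of Python (the 16 comes from roughly 80% power at a 5% significance level; the conversion-rate numbers below are just an example I picked):

```python
def sample_size_per_group(variance, mde):
    """Rule-of-thumb users per variant for ~80% power at alpha = 0.05:
    n ~= 16 * variance / mde^2, with mde in absolute units of the metric."""
    return 16 * variance / mde**2

# Example: conversion rate p = 0.05, so variance = p*(1-p) = 0.0475,
# and we want to detect an absolute lift of 0.5 percentage points.
p = 0.05
print(round(sample_size_per_group(p * (1 - p), 0.005)))  # -> 30400 per variant
```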
The first half of this book, as explicitly stated, is aimed at a more general audience. Even there, I learned some useful tidbits. The second half homes in on the intricacies of experimentation. It highlights practical issues like instrumentation, exposures, metric variance, etc. More than anything, it validates the idea that experimentation – unsurprisingly – is full of unclear tradeoffs. Best practices are necessary but insufficient for good experimentation.