[Technically dispatch] what is A/B testing and what did LinkedIn do wrong

What experimentation is and why people are yelling about it

Sep 26, 2022

(The New York Times recently released an article critiquing LinkedIn for its A/B testing and experimentation practices. This will be a shorter-form post that just explains what’s going on with the technical side, for those curious.)


Welcome to a Technically dispatch, where we quickly explain major events in the news that relate to software and technical things. The NYT released an “investigation” tearing down LinkedIn for their A/B testing this week, and everyone on Twitter has been talking about it. LinkedIn wasn’t shy about this, and didn’t try to hide it: they literally published an academic paper analyzing the results this month.

So what exactly is A/B testing and experimentation? How does it work? And what’s the controversy here?

Experimentation helps companies understand the impact of their work

Engineers, product teams, and before product teams exist, founders, are responsible for figuring out what features and products they should be building. But how do you know what your customers want? And what if they’re wrong?

Experimentation is one of the ways that companies understand if their product efforts are working and, by extension, what they should be spending time on. The basic idea is that instead of trying to understand what your users might want before you build it, you:

Try building a small version of it,
Launch it to a small percentage of your existing users,
Measure what impact it has on their behavior,
Learn and try again

You may have heard about A/B testing in the context of changing the color of a button on a site, but this is a very reductive view of things. Companies experiment on plenty of big, meaningful features too. It’s simply a way of trying new stuff out without rolling it out to your entire user base. The early feedback you get helps you understand if it’s a good feature, a bad one, or most commonly, something in the middle.

An example – imagine you’re a product manager at Amazon, and you want to understand what impact a new price comparison tool might have on purchase behavior. Would viewers have a better experience and buy more stuff, quicker? Or would it make them think more about cost, and they’d buy less? Or maybe one of a million other possibilities you can’t foresee?

To experiment, you’d build a basic version of this tool and launch it to a small percentage of traffic to Amazon.com. You’d then measure purchase behavior (total $$, average $$, or something like that) for the group that got the tool and the group that didn’t, and compare them. If the group that used the tool bought more, that’s a good signal you should invest more in building it out.
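To make the comparison concrete, here's a minimal sketch in Python. The order totals are made up for illustration, and the `summarize` helper is hypothetical (it is not anything Amazon actually runs); it just computes the kinds of numbers mentioned above for each group.

```python
from statistics import mean

# Made-up order totals (in dollars), one entry per user. 0.0 means no purchase.
control_orders = [20.0, 0.0, 35.5, 0.0, 12.0, 48.0]    # did not see the tool
treatment_orders = [25.0, 0.0, 40.0, 18.0, 0.0, 52.0]  # saw the tool


def summarize(orders):
    """Return total spend, average spend per user, and share of users who bought."""
    buyers = [amount for amount in orders if amount > 0]
    return {
        "total": sum(orders),
        "average_per_user": mean(orders),
        "conversion_rate": len(buyers) / len(orders),
    }


control = summarize(control_orders)
treatment = summarize(treatment_orders)

# Relative lift in average spend: how much more the treatment group spent per user.
lift = (
    treatment["average_per_user"] - control["average_per_user"]
) / control["average_per_user"]

print(f"control avg: ${control['average_per_user']:.2f}")
print(f"treatment avg: ${treatment['average_per_user']:.2f}")
print(f"lift: {lift:.1%}")
```

With six users per group, a difference like this could easily be noise, which is why sample size matters so much in practice.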

Because of the statistics of sampling, you usually need a meaningfully sized base of users to run proper, statistically significant experiments. So this practice is usually reserved for mid to later stage companies, not early-early-early stage startups. But the philosophy of experimentation still holds for those companies: sometimes it’s worth building a basic version of something and seeing how people react to it (even if it’s just a focus group).
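For a rough feel of what "meaningfully sized" means, here's a sketch of a standard sample size approximation for comparing two conversion rates. The function name, the 5% and 6% rates, and the defaults (a 5% significance level and 80% power) are assumptions picked for illustration, not numbers from any real company.

```python
from math import ceil
from statistics import NormalDist


def users_needed_per_group(baseline_rate, expected_rate, alpha=0.05, power=0.80):
    """Approximate users needed in each group to detect a change in conversion rate.

    Uses the usual normal-approximation formula for a two-sided test
    comparing two proportions.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # extra margin for desired power
    variance = baseline_rate * (1 - baseline_rate) + expected_rate * (1 - expected_rate)
    effect = expected_rate - baseline_rate
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Detecting a jump from 5% to 6% conversion vs. a subtler 5% to 5.5%.
big_change = users_needed_per_group(0.05, 0.06)
small_change = users_needed_per_group(0.05, 0.055)
print(big_change, small_change)
```

The takeaway is that even a fairly visible change needs thousands of users per group, and halving the size of the effect you're looking for roughly quadruples that.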

Experimentation logistics: feature flagging and metrics

Experimentation sounds easy in theory, but requires some serious work to do right.

Feature flagging and randomization

Once your feature is ready to test, you need to figure out a way to make sure that only a small percentage of your user base actually sees that feature. Who sees the feature and who doesn’t needs to be completely random, or you can introduce bias into your experiment results. And you also need to keep track of who sees it and who doesn’t, so you can tie that data back to subsequent user behavior and see what impact the experiment had.
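One common way to get assignment that is effectively random but also repeatable is to hash the user ID together with the experiment name. The sketch below is an illustration under that assumption, not how LaunchDarkly or any specific tool works internally; `assign_variant`, `log_exposure`, and the experiment name are all hypothetical.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.05) -> str:
    """Deterministically place a user in 'treatment' or 'control'.

    Hashing experiment + user ID gives each user a stable, evenly spread
    position in [0, 1), so the same user always gets the same answer.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    position = int(digest[:8], 16) / 0x100000000  # first 32 bits scaled to [0, 1)
    return "treatment" if position < treatment_share else "control"


# Keep a record of who saw what, so behavior can be tied back to the variant later.
exposure_log = []


def log_exposure(user_id: str, experiment: str) -> str:
    variant = assign_variant(user_id, experiment)
    exposure_log.append({"user_id": user_id, "experiment": experiment, "variant": variant})
    return variant


if log_exposure("user-123", "price_comparison_tool") == "treatment":
    print("show the new feature")
else:
    print("show the existing experience")
```

Including the experiment name in the hash means a user who lands in the treatment group for one experiment isn't automatically in the treatment group for every other one.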

Basically, there’s some real engineering work involved here. It’s popular for engineering teams to outsource their feature flagging to tools like LaunchDarkly.

Measuring metrics

As part of setting up an experiment properly, you need to figure out how you’ll know if it’s working. Most experiments will have a top level metric that you’ll look to. For our Amazon example above, it’s something like purchase volume. For other experiments it could be something like time on page, conversion rate, whether or not a user upgraded their plan, etc.

They say you shouldn’t look at your experiment metrics until enough users have seen it and it reaches statistical significance, but…most data teams I’ve worked with do.

Run time and statistical significance

When you’re running an experiment, you’re doing inferential statistics: you’re basically assuming that whatever your little group of test subjects does will be representative of your entire user base. And with that assumption comes some statistical baggage, namely that you need this “sample” to be big enough. If you hear teams talking about “waiting to reach statistical significance,” what they mean is waiting until the sample size is large enough to be representative of the whole population (all users) [1]. Experimentation and statistics are two sides of the same coin.
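Here's a small sketch of what a significance check can look like, using a two-proportion z-test (one standard choice among several; real experimentation platforms may use different methods). The function name and the conversion counts are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conversions_a, users_a, conversions_b, users_b):
    """Two-sided p-value for the difference between two conversion rates."""
    rate_a = conversions_a / users_a
    rate_b = conversions_b / users_b
    # Pooled rate: the conversion rate if both groups actually behaved the same.
    pooled = (conversions_a + conversions_b) / (users_a + users_b)
    std_error = sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    z = (rate_b - rate_a) / std_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


# The same 5% vs. 6% conversion rates, at two very different sample sizes.
small_sample = two_proportion_p_value(50, 1_000, 60, 1_000)
large_sample = two_proportion_p_value(5_000, 100_000, 6_000, 100_000)

print(f"1,000 users per group:   p = {small_sample:.3f}")
print(f"100,000 users per group: p = {large_sample:.3g}")
```

Same difference in rates, very different conclusions: with 1,000 users per group the gap is indistinguishable from noise at the usual 0.05 threshold, while with 100,000 per group it clears that threshold easily.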

Managing experiment volume

When you have one experiment going on at once, things are straightforward enough. But when you have multiple going on at once, driven by different teams, you need some system to know which users are in which experiments. Ideally you want a user to only see one experiment at a time, but even that has complications.
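One way to keep a user in at most one experiment at a time is to give every user a fixed position in a shared "layer" and let each experiment own a slice of it. This is a simplified sketch of that idea; the experiment names, shares, and function names are hypothetical, and real systems handle plenty of cases this ignores.

```python
import hashlib

# Hypothetical experiments and the share of all users each one owns.
# Shares add up to 0.40, so the remaining 60% of users are in no experiment.
LAYER = [
    ("new_checkout", 0.10),
    ("price_comparison", 0.10),
    ("search_ranking", 0.20),
]


def position_in_layer(user_id: str, salt: str = "layer-1") -> float:
    """Map a user to a stable position in [0, 1) within the layer."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) / 0x100000000


def experiment_for(user_id: str):
    """Return the single experiment this user belongs to, or None."""
    position = position_in_layer(user_id)
    upper_bound = 0.0
    for name, share in LAYER:
        upper_bound += share
        if position < upper_bound:
            return name
    return None


for user in ["user-1", "user-2", "user-3"]:
    print(user, experiment_for(user))
```

Because the slices don't overlap, no user can land in two of these experiments at once. Within their slice, each experiment would still need to split its users into treatment and control.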

These are a few of the basics, but there’s a lot more to worry about, especially as you have more and more experiments running in parallel. Products like Eppo help data and product teams manage and analyze their experiments, but many teams build these kinds of tools in house.

The LinkedIn controversy

So LinkedIn was running experiments – so is every single startup (and probably public company) with enough users. Airbnb has written extensively about their experimentation culture, as have Uber, Lyft, and many, many others. None of these companies inform their users that they’re part of an experiment, as that would basically defeat the purpose. So what’s the big deal?

The issue that the Times brings up is the ethical implications of such large scale experimentation:

But the changes made by LinkedIn are indicative of how such tweaks to widely used algorithms can become social engineering experiments with potentially life-altering consequences for many people. Experts who study the societal impacts of computing said conducting long, large-scale experiments on people that could affect their job prospects, in ways that are invisible to them, raised questions about industry transparency and research oversight.

The idea is that in the pursuit of improving the product, companies like LinkedIn try things that could have serious impacts on the career prospects of their users. With a company like Airbnb or Uber, the stakes are lower. But for social media, the stakes are higher – a profile view or a new connection could mean the difference between getting a great new job, or being stuck in the one you hate. This rings true for me for sure.

But more rigorously, I’m not sure the logic really holds, because that would mean LinkedIn shouldn’t really launch or improve anything ever (what if it negatively impacts someone’s career prospects?), and without experimentation there’s really no way for them to understand what impact anything has. I’m not sure these are “experiments on people” as much as they’re small tweaks to a free software product that you voluntarily choose to use. But this newsletter just explains the technology, it doesn’t opine on it, so I’ll cut myself off here 😉.

What did you think of this issue of Technically? And this post format in general?

If you have any questions of your own, just reply to this email or send them to justin@technically.dev.


[1] They may also mean that the detected impact, or difference in the metric between the two groups, is big enough relative to the size of the sample. But this is very in the weeds already.