Have you seen this situation before? There's a team that has recently launched a new feature. That team is also told they need to pay attention to the “health” of their features. To fill this need, the data team has been asked to provide a “simple” dashboard of standard health metrics like “percent of people successfully using the feature”, “time to completion”, and so on. So far, this is pretty typical procedure.
Then things start going off the rails. The feature turns out to be rather niche and unpopular: only a handful of people use it on any given day. So the dashboard of metrics winds up being very noisy, because the sample size is small and the numbers bounce around.
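To make that concrete, here's a minimal sketch (the traffic level and completion rate are made up purely for illustration) of what a daily completion-rate metric looks like when only about a dozen people touch the feature each day and the true underlying rate never changes at all:

```python
# A minimal sketch: simulating a "completion rate" metric for a niche
# feature with ~12 users/day and a fixed true rate (numbers are made up).
import random

random.seed(42)

TRUE_RATE = 0.70   # hypothetical, unchanging "real" completion rate
DAILY_USERS = 12   # hypothetical niche-feature traffic

for day in range(1, 15):
    completions = sum(random.random() < TRUE_RATE for _ in range(DAILY_USERS))
    observed = completions / DAILY_USERS
    print(f"day {day:2d}: {observed:.0%} completion ({completions}/{DAILY_USERS})")

# Even though nothing about the product changed, the daily number can
# easily swing between roughly 50% and 90% -- exactly the kind of "dip"
# that sets off alarms on a dashboard.
```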
Now the chaos starts. The team can understand that “the sample size is small and it's noisy, so dips in the metrics might not be 'real' or worth worrying over”. But then they naturally ask: when SHOULD we be concerned about a dip? How do we know when there's a problem we have to address? Should we be doing anything now?
It’s really really difficult to answer these questions!
Over the years, I’ve seen variations on this basic theme happen over and over again. The solutions that people use to address these problems also vary a huge amount.
Problem — leaning on stats to make a decision
It feels a bit silly to write the above header because one major pillar of the field of statistics is to study the use of data to make decisions under uncertainty (e.g. inferential methods). The notion of hypothesis testing entering common business use is why many of us have jobs to begin with!
But in the situation I outlined, the problem is that the team is trying to rely on statistics to help them make a critical decision: given our data, should we be concerned enough about our product's “health” to make changes? The trouble is that the team doesn't have the statistical tools and knowledge to make that judgement.
It’s easy for people who don’t do stats to start mixing up “significant” in the technical sense with “significant” in the everyday sense. Teams using the term sometimes aren’t even sure which sense they mean when they ask questions. Since they aren’t experts, they rightfully lean on the data team to help them navigate their most pressing questions.
The teams are under time pressure — next sprint/quarter/year planning is happening. They need to know if there’s a problem as soon as possible because a problem could be bleeding an unknown number of potential customers every day (and the team has no way to quantify that loss without the data team’s help). Fixing “the issue” might lead to a huge boost to the business (and commensurate rewards for the team).
Meanwhile, if there’s no actual problem, then the team can go off and build that new feature that everyone has been wanting. So while it might seem silly to be jumpy over every little twitch of a metric, that jumpiness is grounded in very practical business concerns.
There are tons of technical solution ideas
If you're a data scientist, I'm sure a million ideas have already sprung to mind about how to help this hypothetical team make better decisions in this data-poor situation.
Maybe they just need to wait until there’s enough data. It might take 8 months, but if that’s what it takes, we should wait. Right?
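For a sense of scale, here's a rough back-of-the-envelope sketch (every number in it, from the daily traffic to the effect size, is made up) using the standard two-proportion sample-size formula. With a dozen users a day, even detecting a modest change can take the better part of a year of waiting:

```python
# Back-of-the-envelope sample size for detecting a drop in completion
# rate from 70% to 65% with a two-proportion test (hypothetical numbers).
from math import ceil

z_alpha = 1.96   # two-sided test, alpha = 0.05
z_beta = 0.84    # 80% power

p1, p2 = 0.70, 0.65   # hypothetical "before" and "after" completion rates
daily_users = 12      # hypothetical niche-feature traffic

variance = p1 * (1 - p1) + p2 * (1 - p2)
n_per_group = ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# Two comparison periods' worth of users, collected at ~12/day.
days_needed = ceil(2 * n_per_group / daily_users)
print(f"~{n_per_group} users per period, ~{days_needed} days of traffic total")
```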
Maybe the team just needs to understand the data better. Bring on the plots with confidence intervals or overlapping distributions. Maybe we aggregate the data until there's “enough” of it to show when differences are “real”. We can come up with various ways to hide data that isn’t ready yet. Colorize, highlight, or otherwise detect and point out that “yes, this is an anomaly”. Perhaps we can redefine the metrics to get better signal and dodge the problem entirely. Maybe there are more powerful statistical models that fit the particular situation. There are entire startups built around “anomaly detection” and this general problem space. It’s one giant nerd snipe, because there seem to be so many “obvious” ways to help; surely we’ll find one that actually helps.
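As one concrete instance of the “plots with confidence intervals” idea, here's a small sketch (the counts are hypothetical) using a Wilson score interval. The punchline is that with this little data, the intervals are so wide that an alarming-looking day-over-day drop is indistinguishable from noise:

```python
# A rough sketch of the "put a confidence interval on it" idea, using
# a Wilson score interval for a proportion (counts are made up).
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    if n == 0:
        return (0.0, 1.0)
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return (center - half, center + half)

# "Yesterday" vs "today" on a niche feature: 9/12 vs 6/11 completions.
for label, successes, n in [("yesterday", 9, 12), ("today", 6, 11)]:
    lo, hi = wilson_interval(successes, n)
    print(f"{label}: {successes}/{n} -> {lo:.0%} to {hi:.0%}")

# The intervals overlap heavily, so the scary-looking "drop" from 75%
# to 55% is well within what random noise alone can produce.
```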
If you don’t feel like going full tech on the problem, there’s plenty of other stuff to try. Perhaps we just need to train the team to work with the tools and the uncertainty better. That way they'd understand that the sample size isn't big enough yet, and they'd calmly wait a couple of months before we revisit the question with more data.
If teaching isn’t your cup of tea, then maybe we can just do some bespoke analysis to figure something out. There’s gold out there in the data mines… somewhere.
Or maybe we should just “solve the problem” at the source, and get more users to use our thing so that our simple stats reports will work. Surely, that must be doable, right?
There's so much that can be done!
I don’t think this is a technical problem
Over time, I’ve seen this situation play out over and over again, with different teams coming up with different approaches to the problem. There’s never a home-run solution, but some methods turn out to be more effective than others.
Treating the whole thing as a technical problem to be solved tends to go pretty poorly. The myriad clever ways to surface “the right answer at the right time” to people all wind up devolving into a game of whack-a-mole with the newest edge case. Add the fact that you’ll be attempting these methods under a certain amount of time pressure (because the teams that want this info are under pressure themselves), and you’ll rarely have a moment’s peace to plan and execute a big solution. This is very clearly not a recipe for success.
Instead, I feel this is a social and process problem that can be mitigated at the human level. If we take a moment and ask ourselves why exactly we're in this mess to begin with, we can perhaps avoid the worst of it altogether.
Hold up, does it even matter?
Sometimes, the whole problem of “is this drop statistically significant?” can be completely sidestepped by pausing to ask whether the drop would even matter if we assumed it was real, regardless of what the stats say. Would we be doing anything differently? Does the overall effect size even make a difference in the grand scheme of things?
Countless teams have come to conclusions like “Oh no! Only 25% of the people who start down the path of buying our widget actually buy it! We have a problem with our product pages!” and forget that only 5% of the entire userbase even has the problem the widget solves. Even if everyone who could buy the widget did so, it would be a rounding error in our budget and we should just leave it alone. The poor decision was made long ago, when we chose to work on this feature in the first place.
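A quick back-of-the-envelope check is usually enough to see this. Here's a sketch with entirely made-up numbers for the user base, revenue, and rates:

```python
# Back-of-the-envelope ceiling on the widget's impact
# (all numbers are hypothetical, purely for illustration).
total_users = 100_000
addressable_share = 0.05      # only 5% of users have the problem the widget solves
current_completion = 0.25     # today's "alarming" completion rate
widget_revenue_each = 20      # hypothetical revenue per widget sold

current_buyers = total_users * addressable_share * current_completion
best_case_buyers = total_users * addressable_share * 1.0   # 100% completion

upside = (best_case_buyers - current_buyers) * widget_revenue_each
print(f"current buyers:   {current_buyers:,.0f}")
print(f"best-case buyers: {best_case_buyers:,.0f}")
print(f"maximum possible upside: ${upside:,.0f}")
# If even that ceiling is a rounding error next to the rest of the
# business, no amount of statistics on the completion rate will make
# the dip matter.
```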
If nothing about it actually matters, we can throw the whole situation into a giant nihilist bonfire and happily move on to something more interesting. Hooray!
If it actually does matter (because sometimes it does), then we can talk with every stakeholder involved about what we have in the toolbox to make the best out of a (statistically) poor situation.
What does “healthy” mean anyway?
Before we reach into the toolbox and start yanking out methods for dealing with low-n situations, it’s worth questioning what it even means to be “healthy”. Business people only really care about the health of a metric or system indirectly. They’re here to maximize profit in some sense, whether short or long term, via direct sales now or sales later. Health comes into the picture only to the extent that a good user experience tends to align with more sales: having a perfect, easy-to-use experience for people to pay money to be punched in the face by a robot won’t be making anyone a millionaire (I hope).
“Health” is a subjective judgement call.
A 25% completion rate can be declared unhealthy, or perfectly healthy, by decree. We can all choose to care or not with the flip of a switch.
We actually wind up getting into more trouble when we try to obscure that subjectivity with a thin veneer of objective statistics. Objective-sounding health bars like “all flows must have a completion rate of > 60% to be considered healthy” run into problems because there may be situations where getting above 60% is flat-out impossible. It’s the arbitrary bar that’s the problem, not the product.
Finally, work together and figure things out
I’m sure some people have been thinking this whole time that low-n situations are perfect for qualitative methods. If you’ve only got 15 customers, track down a few of them and talk to them or observe them! You’ll get more usable insight out of a few hours of conversation than out of a month of data mining.
Other times, it pays to take a step back and see where the panic of the day is relative to everything else. Having a big scary drop on a dashboard is a very big distraction, and it can take a lot of discussion before everyone is convinced that the “big drop” is not particularly big, nor particularly important.
Maybe it’ll take a few smaller analysis projects to verify that things aren’t important. But the nicest thing about going down this path is that you’ll be building relationships and trust so that when the next inevitable random blip comes up, everyone can learn to take a breath and react calmly.
Standing offer: If you created something and would like me to review or share it w/ the data community — my mailbox and Twitter DMs are open.
About this newsletter
I’m Randy Au, Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. Counting Stuff is a weekly newsletter about the less-than-sexy aspects of data science, UX research and tech. With excursions into other fun topics.
Curated archive of evergreen posts can be found at randyau.com.
Join the Approaching Significance Discord, where data folk hang out and can talk a bit about data, and a bit about everything else.
All photos/drawings used are taken/created by Randy unless otherwise noted.
Supporting this newsletter:
This newsletter is free, share it with your friends without guilt! But if you like the content and want to send some love, here’s some options:
Tweet me - Comments and questions are always welcome, they often inspire new posts
A small one-time donation at Ko-fi - Thanks to everyone who’s sent a small donation! I read every single note!
If shirts and swag are more your style there’s some here - There’s a plane w/ dots shirt available!