A couple of months ago, someone pinged me with a question that was… complicated… to answer. So I held off a bit to think about it, and then time and other topics popped up. But now that the concept has had some time to soak, I think I can do it some justice.
The question: measuring impact on people and communities. Specifically, Cameron was interested in this from the perspective of something like a government agency or an NGO implementing some sort of program, and then trying to measure its impact and effects.
To frame it another way, if your intention is to do something that is supposed to impact people and communities in some fashion, how does one go about understanding that impact? How “broad” of a view are we talking about here?
Let me say up front: this problem is HARD. It is an extremely thorny and difficult one. It’s not an issue of “can we solve it” but “to what extent are we willing to expend time and resources to try to understand it”.
To start off, let’s poke at the various things that get in our way.
First there’s Goodhart’s Law
I’ve written about setting metrics off and on here, and Goodhart’s Law always comes up. Lightly paraphrased, it’s a statement about how a measure/metric ceases to be a good metric when it becomes elevated to a target. There are lots of examples where individuals lose sight (or never had sight) of the overall goal and heavily game the target metrics for self-gain at the expense of the whole.
As a quick review, there are many ways Goodhart’s law can manifest itself to ruin your day:
You might not fully understand the causal graph of the thing you’re trying to impact, so your incentives have unintended side effects
Your metric isn’t perfectly in line with your goal, so they diverge at the long tail (see the sketch after this list)
Your metric was fine until you hit new extreme values you’ve never seen before and it fails badly there
You’ve got bad actors that will manipulate your metric and incentives to their own ends
Your incentive worked and everyone shifted their behavior… which shifted the causal landscape so much that your model and metric now fail
And probably more…
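To make that long-tail divergence concrete, here’s a minimal simulation sketch in Python. Everything in it is made up for illustration: a proxy metric that correlates decently with the true goal across the whole population, but stops tracking it in exactly the extreme range you land in once people start optimizing for the proxy.

```python
import numpy as np

# A made-up world: the goal is latent, the metric is a noisy proxy of it.
rng = np.random.default_rng(42)
n = 100_000
goal = rng.normal(0, 1, n)            # the thing we actually care about
metric = goal + rng.normal(0, 1, n)   # a noisy proxy of the goal

print(f"Correlation overall: {np.corrcoef(goal, metric)[0, 1]:.2f}")

# "Optimize the metric": keep only the top 1% of metric scores.
top = metric >= np.quantile(metric, 0.99)
print(f"Correlation in the top 1%: {np.corrcoef(goal[top], metric[top])[0, 1]:.2f}")
print(f"Mean metric in the top 1%: {metric[top].mean():.2f}")
print(f"Mean goal in the top 1%:   {goal[top].mean():.2f}")  # roughly half the metric value
```

The gap between those last two numbers is Goodhart in miniature: the cases that look best on the metric are, on average, nowhere near as good on the underlying goal.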
It’s never really possible to completely stop Goodhart’s Law because the universe can always just spin up more unique situations and examples until our models break. We also can’t simply declare that metrics and measurement are impossible and thus not worth bothering with, because all endeavors need some sort of feedback mechanism to continue on.
So the only thing we can do is keep an eye open to the extremely real possibility that our efforts are causing unintended negative consequences, and stay vigilant by triangulating with alternative metrics. We’ll also always have to be willing to adjust course and metrics when we find out things aren’t going as intended. There can’t be any sacred cows here.
We’re limited in our knowledge
Just as we typically struggle to fully understand the causal graph of how a treatment influences our desired variable, we’re even worse at understanding effects further downstream when we look more broadly. Adding more layers of complexity and causality to an already murky system can’t possibly make things better. If we just barely give ourselves the statistical power to observe an effect that we’re directly manipulating (which is close to how we do things for a traditional experiment), there’s almost certainly not going to be enough statistical power left to find a much more confounded, more diluted downstream effect.
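To put rough numbers on that, here’s a sketch using statsmodels’ power calculations. The effect sizes are pure assumptions on my part: a direct effect we deliberately sized the study around, and a downstream version of it diluted to a fifth the size.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = 100
direct_effect = 0.40      # standardized effect the study was designed to detect
downstream_effect = 0.08  # the same effect, diluted 5x by confounds and intermediaries

# Power to detect each effect with the sample we actually have
print(f"Power, direct effect:     {analysis.power(direct_effect, n_per_group, alpha=0.05):.0%}")
print(f"Power, downstream effect: {analysis.power(downstream_effect, n_per_group, alpha=0.05):.0%}")

# Sample size needed to get back to 80% power on the downstream effect
n_needed = analysis.solve_power(effect_size=downstream_effect, power=0.8, alpha=0.05)
print(f"n per group needed downstream: {n_needed:.0f}")
```

With these particular made-up numbers, power drops from about 80% to under 10%, and recovering it would take roughly 25x the sample, since required sample size grows with the square of the dilution.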
For example, imagine we’re testing out some public policy experiment where we give random people $1 million, tax free, with the only requirement being that they do an interview every month for a few years to check in on how their life is going, and we compare them to a similarly randomized control group. With a treatment that size, it’s likely we can see some impact on every individual’s life, and we should expect to see a group effect versus the control group. There’s probably huge variation in what people choose to do with the money: pay down debt, spend it all, save it, etc., but we can come up with a list of things we care about and collect data on them.
But what if we wanted to zoom out and see what impact the treatment had on the communities of those individuals? How do you define community? By local geography? What if they give the money to family or people far away, or they travel a lot? What if someone took their $1MM and immediately donated all of it to a bunch of international charities? Is that $0 of community impact, $1MM, or some other measure of impact (vaccines distributed, food given, animals saved)? Multiply that complexity by how every individual chooses to use their money. It’s too much variation.
Not only is there all sorts of dilution of traceable causality even just one step removed, you also run into issues of shared impact and attribution. As the treatment money changes hands, it gets mingled with other funds and agendas, and you lose the ability to follow the money clearly. This is just like how attribution can be a very difficult problem in marketing (where you’re trying to attribute whether someone made a purchase due to a TV ad versus a print ad versus a billboard).
Even if you were stubborn and created some methodology to do attribution, other people could argue for competing methods pretty convincingly. It’s a strong sign that we’re on weak epistemological grounds here.
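As a toy illustration of how reasonable-sounding methods disagree, here’s a sketch crediting the same four invented conversion paths under three common attribution rules. None of this is anyone’s real methodology; the point is that each rule tells a different story about the same data.

```python
from collections import Counter

# Four invented paths a customer took before converting
paths = [
    ["tv", "print", "billboard"],
    ["print", "billboard"],
    ["tv", "tv", "billboard"],
    ["billboard"],
]

def last_touch(paths):
    # All credit goes to the final touchpoint before conversion
    return Counter(p[-1] for p in paths)

def first_touch(paths):
    # All credit goes to whatever started the journey
    return Counter(p[0] for p in paths)

def linear(paths):
    # Credit is split evenly across every touchpoint on the path
    credit = Counter()
    for p in paths:
        for channel in p:
            credit[channel] += 1 / len(p)
    return credit

for name, rule in [("last-touch", last_touch), ("first-touch", first_touch), ("linear", linear)]:
    print(f"{name}: {dict(rule(paths))}")
```

Last-touch hands everything to the billboard, first-touch favors TV, and linear spreads the credit around. Each is defensible, which is exactly the problem.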
But what about those big academic impact studies?
Every so often you see a “The effect of [some service/economic intervention/whatever] on [some big piece of society]” type of paper. These studies often rely on a hypothesis generated long after things have been implemented, and require a large number of data points to even start building a case towards a conclusion.
A recent example off the top of my head is “Auditing radicalization pathways on YouTube”, where the authors analyzed the videos, recommendations, and comments of over 330k videos to build a case that users who view certain kinds of content get sucked into more radicalizing sorts of videos. It’s a safe assumption that the individuals working on YouTube had not intended to create such a system, and they likely weren’t devoting metrics and analytics time to the question, but with the power of hindsight and lots of data, academics can put together such analyses. (Note: I haven’t read the paper and its methods in minute detail, so I’m not making any claims for/against the specific findings; it was just top of mind today.)
I think it would be nearly impossible to do such research looking forward instead of backwards. There are just too many possibilities to consider, and the majority would probably never come to pass. How would you know a priori to choose to analyze radicalization and not, say, acquiring new skills for jobs, or promoting dangerous new pseudoscience diet fads? All of these things are possible, and one could argue that any one is more or less likely than the others.
So I wouldn’t count on these sorts of studies to save your causality. If anything, by the time you have the data for such an analysis, your funding would’ve probably long run out.
Despite everything, our primary tool is direct measurement
Despite all the potential for issues, directly measuring expected impact at the tangible, “zoomed in” level is still going to be our primary tool. This is because we’re going to be devoting resources to measuring whatever it is we care about. We get to pick operational details like methods, controls and comparison groups, as well as any randomization/assignment criteria. All this maximizes our chances of finding a measurable effect because we’re actively looking in that area. Any extra analysis we might do later to find broader effects is just bonus material (or p-hacking, depending on how you think about it).
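For contrast with all the murkiness above, the directly measured version is almost boring. Here’s a bare-bones sketch on simulated data, assuming we randomized assignment ourselves and declared one primary metric before looking:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 500

# Randomization we control, decided before any data comes in
treated = rng.random(n) < 0.5

# Simulated primary metric: baseline noise plus the direct effect on the treated
outcome = rng.normal(0, 1, n)
outcome[treated] += 0.3

t, p = stats.ttest_ind(outcome[treated], outcome[~treated])
diff = outcome[treated].mean() - outcome[~treated].mean()
print(f"Estimated effect on the primary metric: {diff:.2f} (p = {p:.4f})")
```

Because we designed the comparison, the effect we’re directly manipulating shows up in the planned analysis; nothing downstream of it enjoys that luxury.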
All this talk about Goodhart is a warning about HOW we pick our primary effect measurements. We need to pay attention to a lot of details and think ahead far more than it would initially seem. But it does NOT mean that we abandon the whole measurement process altogether in a nihilistic fit.
When all else is murky, there are qualitative methods to help
We’re always cautioned against using anecdotes as “data”, because there’s lots of room for bias and self-selection to creep into those data points. But all the same, they’re a form of data point. Less-quantitative fields such as Anthropology and Sociology still find ways to use qualitative methods like interviews and journal studies to find interesting results. There’s scientific value in those methods. The only thing to keep in mind is that those findings are of a different character than quantitative ones, and thus the conclusions we’re able to draw are different.
For example, I looked up some details on the most recent iteration of a game I translated about 16 years ago and updated a few years back. Over many years the translated version has been downloaded somewhere between 150k and 250k times from multiple stores and sources.
If you asked me whether that work has had an impact upon people, I’d have no doubt that there is an effect. From scanning unsolicited reviews and the very occasional email or interaction from a random fan, I get to hear stories about how the game was exactly what someone needed at a certain point in their life. All I have is a biased collection of stories from people who happened to speak up, each unique in its own way. Even if I don’t know the “True Number” for these effects, having heard tens of stories over the years, I know that they’ve all been positive, and the count places the absolute minimum positivity rate at 0.01% of the population or better.
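The arithmetic behind that floor is trivial, but worth writing down. The specific counts here are stand-ins I picked from the ranges above, not exact figures:

```python
# Stand-in numbers: ~20 unsolicited positive stories against ~200k downloads
positive_stories = 20
total_downloads = 200_000

# Self-selection means this is only a floor, never an estimate of the true rate
floor_rate = positive_stories / total_downloads
print(f"Minimum positivity rate: {floor_rate:.4%}")  # prints 0.0100%
```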
If you asked me whether the work has had a negative effect on people, I’d have more trouble answering. I never heard anything to that effect because people often self-select away from giving that sort of negative feedback in places I can see. But given the huge population and the law of large numbers, I wouldn’t be surprised if a handful of people attributed negative experiences to it.
In the end, these qualitative data points will help me make some kind of evaluation of impact, just with a bunch of caveats as to the generalizability of the findings. Patterns within the qualitative data could suggest hypotheses that can then be checked with quantitative data in another research project or a future new project. Quotes could be cited when sharing details about the project with others.
Would this be enough data to present to people higher up to convince them to continue funding the project? Probably not on its own. This is why you still need to have your primary metrics set well and measured. All these “further down” metrics, by their difficult nature, are never going to be as convincing a signal as the primary experimental data. But there’s still useful information and patterns contained within piles of anecdotes that could potentially tip the scales in an unclear situation.
So is that it?
I know it’s not satisfying that our primary means of measuring impact is to define and measure it directly at the outset, with all the potential flaws and issues that entails. But in terms of the scientific method, it’s the best method we have available to us right now. If in the future we come up with better methods for dealing with these situations, we as researchers should adopt those methods where appropriate. That’s how science works.
We can take heart that there are people who think and work on these problems, for example the people doing research into Goodhart’s Law. There’s continued work in this space, but I’m not sure we’ll see giant breakthroughs in methodology any time soon. The changes will likely be subtle and (ha ha) hard to measure.
About this newsletter
I’m Randy Au, currently a Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. The Counting Stuff newsletter is a weekly data/tech blog about the less-than-sexy aspects of data science, UX research and tech, with occasional excursions into other fun topics.
All photos/drawings used are taken/created by Randy unless otherwise noted.
Supporting this newsletter:
This newsletter is free, share it with your friends without guilt! But if you like the content and want to send some love, here’s some options:
Tweet me - comments and questions are always welcome; they often inspire new posts