We take our units of analysis for granted

Because it's so easy

Jun 22, 2022

Long driving trips make it hard to do writing. Sending it out at 11PM counts as being on the same day… right?

Typically, when the term “unit of analysis” comes up in a research methods context, we're talking about defining the specific thing we are analyzing to answer our research question. If you care about how people perceive your new flavor of ice cream, you'll likely want to analyze how individuals rate the flavor. Meanwhile, if you want to compare regional differences in preference for your ice cream, you'll want to be comparing aggregate ratings from areas (though you’ll be collecting data at the individual level).

The key point is that the unit of analysis is expected to vary based upon your research question — they’re intimately linked. In the ideal situation, the selection of the unit is independent of any experiments prior or expected in the future. You pick the best thing for the question at hand.
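To make the ice cream example concrete, here's a minimal sketch of the same ratings analyzed at two different units. The data, field names, and regions are all made up for illustration; the only point is that the observations you end up comparing change with the question.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical survey responses: one row per person (made-up numbers).
responses = [
    {"person": "a", "region": "north", "rating": 5},
    {"person": "b", "region": "north", "rating": 4},
    {"person": "c", "region": "south", "rating": 2},
    {"person": "d", "region": "south", "rating": 3},
    {"person": "e", "region": "south", "rating": 1},
]

# Unit of analysis = individual: each person's rating is one observation.
individual_ratings = [r["rating"] for r in responses]
overall_mean = mean(individual_ratings)

# Unit of analysis = region: aggregate the individual rows first,
# then treat each region's average as one observation.
by_region = defaultdict(list)
for r in responses:
    by_region[r["region"]].append(r["rating"])
regional_means = {region: mean(vals) for region, vals in by_region.items()}

print(len(individual_ratings), overall_mean)
print(len(regional_means), regional_means)
```

Same five rows of collected data, but one question works with five observations and the other with two.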

In practice, things aren’t ever that independent. We make compromises for various reasons. We collect data at suboptimal levels because it's much cheaper and easier to do, or because it'll be useful for other research later on. Or maybe we can only use proxy measures for the question we care about. Or the data had already been collected for another use case (aka “found data”) and we are just leveraging the dataset. Or we’ve been using the same units of analysis for years and forgotten why we’re talking about a specific thing and not anything else.

This week, I was reading SeattleDataGuy’s newsletter that discusses various technical reasons as to why “simple questions” like “how many active users do we have?” are surprisingly difficult. Reasons could include things like constantly shifting business logic, weird edge cases, merging multiple data systems, or just having ambiguous definitions to begin with. Those are all very true reasons for why simple measurements can become mind-blowingly difficult tasks.

But as a researcher, the first answer that popped into my head wasn't really covered: we don't know what we're doing, we might be asking the wrong research question, our unit of analysis is likely wrong in some way. This line of thinking is what leads to today's post. There's some weirdness in how we pick our units of analysis in industry and I want to prod at the idea a bit.

Business units of analysis start off intuitive, then not so much

At the start of a business, when people are just trying to figure out what they need to ask of their data, they'll naturally pick the most “obvious” things: revenue, users, contracts, and similar fundamental units. They make sense starting out because if those fundamental units go to zero, the business dies. There's also not much data available so it's not possible to segment the data into more detailed things. Given that there’s not much else that’s possible, the business makes use of the information it has available to make decisions as best it can. They see which actions correlate with those metrics going up and learn from them.

But just like there’s never really a single “user profile” that fits every single user of anything, monolithic metrics are really poor at giving all but the coarsest levels of information. As time goes on and the business grows, we start bumping into those limitations. We want to start optimizing our business but it’s not clear what information will allow us to proceed.

There’s a growing intuition that these monolithic metrics that we inherited from earlier, simpler, times are not enough — some users are happy while others aren’t. Some contracts wind up being losses while others are very profitable. There are obviously subgroups within our basic metric that might hold meaning, and further success, for us.

This is where we start reflexively trying to segment the units of measure we’ve inherited from simpler times. While it might be obvious that we should be trying to understand fundamental objects like “users” or “contracts”, we’ve actually embarked on the process of figuring out what objects actually matter to our goals. Is it “active” users (whatever “active” means)? Is it contracts from certain industries?
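As a rough sketch of why “whatever ‘active’ means” matters, here is the same made-up event log counted under three different definitions of “active”. The log, the `active_users` helper, the time windows, and the event types are all hypothetical choices for illustration, not anyone's real business logic.

```python
from datetime import date, timedelta

# Hypothetical event log: (user_id, event_type, day). Made-up data.
events = [
    ("u1", "login", date(2022, 6, 20)),
    ("u1", "purchase", date(2022, 6, 21)),
    ("u2", "login", date(2022, 6, 1)),
    ("u3", "login", date(2022, 6, 22)),
    ("u4", "purchase", date(2022, 5, 15)),
]
today = date(2022, 6, 22)


def active_users(events, today, window_days, qualifying_events):
    """Users with at least one qualifying event in the last window_days days."""
    cutoff = today - timedelta(days=window_days)
    return {
        user
        for user, kind, day in events
        if kind in qualifying_events and day > cutoff
    }


any_event = {"login", "purchase"}
weekly_any = active_users(events, today, 7, any_event)
monthly_any = active_users(events, today, 30, any_event)
monthly_buyers = active_users(events, today, 30, {"purchase"})

print(len(weekly_any), len(monthly_any), len(monthly_buyers))
```

Each definition gives a different set of “active users” from identical data, and nothing in the data itself tells you which one matters to your goals.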

We fail to realize that we’re not simply “looking for useful segments” but searching for new fundamental unit definitions

Doing this “search for the relevant segments” work is extremely difficult. It’s often considered a “trap project” for people starting out because it appears deceptively simple (just cluster users by features somehow and find magic!). But since it’s very hard to predict a priori what slice of the population will actually make a difference in analysis, months of time are easily wasted going in circles.

We’re often lulled into a sense of overconfidence about such projects because it feels like we’ve been working with the same unit of analysis, users, as we have always been. We’re not trying something new, so our past experience and knowledge must be relevant and make things easier, right?

Sadly, that’s an illusion because the work needed to establish a segment is effectively identical to the work that’s needed in creating a completely new unit of analysis from scratch. You still have to hypothesize what a useful segment is, make a clear operationalized definition of the segment, then validate that it has the analytical power that you’re looking for. It’s

Existing units of analysis can act as blinders

Many years ago, I was listening to a founder talk about a product he was working on. They were targeting a specific market segment with their tool and their metrics just weren’t showing much traction in that regard. Meanwhile, they had younger kids and teenagers constantly sending in small amounts of money, sometimes in envelopes with coins inside, clamoring to use their tool. The team actually deliberately tried to prevent kids from using the service since dealing with minors and small payments had its own unique hurdles (this story predates most of the laws surrounding providing internet services to minors).

Thanks to them actively trying to avoid this market segment, they were ignoring the data involved. They had simply refused to treat “is this group of kids our target market?” as a legitimate question. Until eventually, the founders realized that the people who were actively trying to give them money should actually be looked at more closely as a business opportunity, regardless of the hurdles involved.

This situation is surprisingly common in my experience. It’s so easy to continue looking at the segments of fundamental units that have a long history, the “active users”, “paying customers”, “profitable contracts” that are already validated and part of day-to-day business. They're familiar, we understand how they behave and what drives them.

But that comfort encourages us to lose sight of the fact that these segments are arbitrary definitions created in a certain time and context and can be replaced at any time with something else. They can trap us into not asking important research questions.

What we should be analyzing should be in a state of constant flux

As I mentioned near the top, we honestly don't know what we are doing and what we should be focusing our analysis on at any given point. At best we make educated guesses about what's important, and then confirmation-bias ourselves into believing that those are the things we should be researching.

Since industry changes rapidly according to strategic goals and market conditions, it is always a good time to question our assumptions about what we should be analyzing. It’s important to have the self awareness to

We just need to make sure that when we do embark on a journey to define a brand new unit of analysis, we also admit that it is a massive undertaking.

Standing offer: If you created something and would like me to review or share it w/ the data community — my mailbox and Twitter DMs are open.

About this newsletter

I’m Randy Au, Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. Counting Stuff is a weekly newsletter about the less-than-sexy aspects of data science, UX research and tech. With excursions into other fun topics.

Curated archive of evergreen posts can be found at randyau.com.
Join the Approaching Significance Discord, where data folk hang out and can talk a bit about data, and a bit about everything else.

All photos/drawings used are taken/created by Randy unless otherwise noted.

Supporting this newsletter:

This newsletter is free, share it with your friends without guilt! But if you like the content and want to send some love, here are some options:

Tweet me - Comments and questions are always welcome, they often inspire new posts
A small one-time donation at Ko-fi - Thanks to everyone who’s sent a small donation! I read every single note!
If shirts and swag are more your style there’s some here - There’s a plane w/ dots shirt available!