Attention: As of January 2024, we have moved to counting-stuff.com. Subscribe there, not here on Substack, if you want to receive weekly posts.
Lots of videos this week! Normconf’s Lightning Talks have been made public! The main conference is on the 15th and is free! My talk is about how it’s ok to release less-than-perfect code. Also, I had a really fun 40min chat with Jesse Mostipak at Baseten about data science, hobbies, and counting. Depending on how you count, we almost averaged 1 topic a minute.
[1:30pm Update: fixed the 40min chat link to point to the right video]
For a long while now, I've been struggling with a fundamental problem — we often say that data analysts pull patterns and stories out of data that become hypothesis fuel for further research and decision making. But how do we actually do that? Can we describe it in a way that's useful for teaching someone who is learning this weird art/science, instead of just telling them to “finish drawing the owl”?
I ask this because when I stop to think about what I'm doing when I sift through tables and charts and somehow manage to pull a coherent story or two out of that mess, I’m not exactly sure what process I’m using to accomplish that. There’s a complex dance going on in my head, bouncing between domain/business knowledge, knowledge about what sorts of questions are likely to be interesting, and flipping around in the data trying to see if there’s support for those various interesting ideas. If those ideas work out, they may constitute a story.
The typical story for how analysis is done is that the analyst “finds patterns” within the data. But how do you even teach someone to find patterns? None of us trained by sitting in front of a grid of numbers until we could spot the Fibonacci spiral hidden within.
That's not actually how it works, because one of the most annoying things you can do to a data person is give them a pile of data and ask them to “find interesting stuff”. We're not robots p-hacking our way through every single arbitrary correlation in a dataset and then rearranging everything into some kind of slide deck. “Data mining” as a concept has luckily faded from general usage now due to how ineffective it was.
What’s it mean to pull a story from data?
When people say that data is being used to tell a story, it’s often framed in something of a journalism context — as in “we’ve found this set of relationships within the data that leads us to draw these conclusions.” The presentation style lends a certain amount of inevitability to the conclusion. This is probably an incorrect way of presenting data analysis in the context of data science, since there are usually multitudes of relationships that can be tied together into multiple conflicting stories. Given enough data and willingness to cherry pick data points, we can tell any story we want.
Analytics shouldn’t really have an axe to grind about a particular data set. Unless we’re employing much more complex techniques involving statistics and experimentation, it’s likely not possible to definitively show that something is “true” in any meaningful sense outside of “well, we found this specific relationship in this particular data set”.
So instead, data analysts are relying on their domain knowledge to generate a set of hypotheses that fit the data that is available. Sorta like inductive reasoning, an analyst must come up with a formula that fits the pattern of the n observations of data in front of them. When new data stays consistent with the formula, they have increasing confidence that their story is on the right track. BUT there can always be a single counter-example that pops up 50,000 entries later and disproves the story. There are no guarantees of truth.
Under that backdrop, much of the skill of a data analyst comes from being able to generate hypotheses that are both plausible in fitting the available data, while also being of interest to business stakeholders. Then they somehow look for threads within the dataset that support the hypotheses, and see if they can triangulate similar findings using other measurements. If it pulls together into a coherent whole that resists falling apart upon closer inspection, they’ve done it and created a story out of data.
But that still doesn’t explain how someone actually DOES analysis
I’m not sure how other people do it, and I imagine that there’s a lot of variation and personal experience involved. But I’ll try to give an example of how I work with a data set in an attempt to pull an analysis out of it.
As an example, let’s say that we’re working at an e-commerce web site as the data person. We’ve been asked a very common question by some product lead or executive: “We recently had more item sales than we’d normally expect. Please go figure out why and whether we should expect the sales increase to be temporary or more permanent.”
1- First check the premise is real
The first thing I would check is just the simple truth of the statement “we’ve sold more items recently than we’d normally expect.” Sometimes it’s OK to take such statements at face value, but my personal experience is that it’s a good idea to verify that the underlying premise is true. One reason is that if it turns out to be false, I can possibly stop working right there. Or it can turn into a whole separate, utterly fascinating project: understanding why someone thought our sales were higher recently when in fact they’re not.
But another reason to check the premise is that even if it is actually true, you need to know exactly what “increase” the stakeholder is talking about before you can start your analysis. Is it that sudden spike in sales last Tuesday that quickly went away? Or is it the subtle general increase in sales that trended up over the last 8 weeks? The answers for “why” likely differ between the two, and knowing exactly what the request is makes everyone happier.
I work as a UX Researcher, and we habitually pay a lot of attention to this step because people often come to us with a specific request (for a survey or some other analysis) when the actual research question they want answered is much better served by a different line of questioning and methodology. We need to work with them to verify that they’re actually asking for the thing they want.
For this example, let’s just say that we’re seeing the total count of items being sold has gone up steadily over time, so that we’re now selling 20% more than the exact same time period a year ago. We observe that the increase was relatively steady over time, with no surprising spikes or dips that stand out as being peculiar.
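To make that premise check concrete, here’s a minimal sketch of what it might look like in pandas. The file and column names (orders.csv, order_date, item_count) are invented for the example; the point is just to compare recent volume against the same window a year earlier and to eyeball the weekly series for spikes.

```python
import pandas as pd

# Hypothetical orders table: one row per order, with the order date and the
# number of items in that order.
orders = pd.read_csv("orders.csv", parse_dates=["order_date"])

# Weekly item counts make gradual trends and one-off spikes easy to spot.
weekly = (
    orders.set_index("order_date")["item_count"]
          .resample("W")
          .sum()
)

# Compare the latest 8 weeks against the same 8 weeks a year earlier.
end = weekly.index.max()
recent = weekly.loc[end - pd.Timedelta(weeks=8):end].sum()
prior = weekly.loc[end - pd.Timedelta(weeks=60):end - pd.Timedelta(weeks=52)].sum()

print(f"Items sold, last 8 weeks: {recent}")
print(f"Same 8 weeks a year ago:  {prior}")
print(f"Year-over-year change:    {recent / prior - 1:.1%}")

# Print (or plot) the raw weekly series to check whether the increase is
# steady or driven by a single suspicious spike.
print(weekly.tail(60))
```

None of this is fancy; the value is that it pins down exactly which “increase” everyone is talking about before any real investigation starts.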
2- Think of possible causes for the effect observed
We can’t just blindly dive into the data and start searching for correlations. There’s too much data and not enough time to make sense of it all. Here’s where the domain expertise comes in.
If sales are up, there are typically two caricature explanations for it — 1) more people came for some reason and bought stuff, or 2) the same number of people bought stuff, but everyone bought more per order. Either one would “explain” the observed phenomenon of “total sales of widgets went up”. From there, it might be worth trying to track down just who’s buying these items and where they’re coming from.
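One way to tell those two explanations apart is to decompose total items sold into distinct buyers times items per buyer and see which factor actually moved. Here’s a rough sketch, again with hypothetical table and column names (customer_id, item_count) and made-up comparison windows:

```python
import pandas as pd

orders = pd.read_csv("orders.csv", parse_dates=["order_date"])

def decompose(df: pd.DataFrame) -> pd.Series:
    """Total items sold = distinct buyers x items per buyer."""
    buyers = df["customer_id"].nunique()
    items = df["item_count"].sum()
    return pd.Series(
        {"buyers": buyers, "items_per_buyer": items / buyers, "total_items": items}
    )

# Hypothetical comparison windows: this quarter vs. the same quarter last year.
this_year = orders[orders["order_date"].between("2022-10-01", "2022-12-31")]
last_year = orders[orders["order_date"].between("2021-10-01", "2021-12-31")]

comparison = pd.concat(
    {"last_year": decompose(last_year), "this_year": decompose(this_year)}, axis=1
)
comparison["pct_change"] = comparison["this_year"] / comparison["last_year"] - 1
print(comparison)
```

If the buyer count grew while items per buyer stayed flat, the “more people came” story gains support; the reverse points at order sizes instead.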
Now we’re in the most difficult part of the process. You’d think that things would get easier with a pair of hypotheses in hand that we can go investigate, but no. Things are about to get exponentially more complicated.
Let’s say we check the “did more people buy items vs the same people bought more items” situation and find out that more people came to buy stuff. Now this triggers all sorts of potential follow-up questions.
Where did these new customers come from? Did we run an ad campaign? Or is it all “organic”?
Did all these new customers come from a new country/market?
What’s the average order size in items and dollars? Are people buying lots of inexpensive items and we’re actually making fewer dollars per order?
What are the items being ordered? Is it one or a handful of items, or everything across the board?
etc…
You’ll have to decide whether these follow-up questions are interesting enough to spend time investigating. Oftentimes, you need to anticipate how likely a given branch is to yield interesting results and consider what further follow-ups it might lead to. Answering these questions very likely takes one-off work to pull and analyze the data. Many of these questions will lead to uninteresting dead ends and cost time. You’re going to have to pick a stopping point for your work somehow.
In my view, really strong data analysts learn to quickly ask interesting follow-up questions while finding clever ways to quickly answer them in ways that help triangulate findings and build a clearer picture of the world. It’s a lot of saying: “If this fact I found is true, then another thing in a separate but connected system should also be true.”
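As a toy illustration of that habit: if item sales really did go up, an independently collected system (say, fulfillment logs) should show a matching rise in packages shipped. The tables and columns below (shipments.csv, shipped_at, package_count) are invented for the sketch.

```python
import pandas as pd

# Two independently collected sources (hypothetical schemas).
orders = pd.read_csv("orders.csv", parse_dates=["order_date"])
shipments = pd.read_csv("shipments.csv", parse_dates=["shipped_at"])

# If item sales are genuinely up, shipment volume should move with them.
monthly_items = orders.resample("MS", on="order_date")["item_count"].sum()
monthly_packages = shipments.resample("MS", on="shipped_at")["package_count"].sum()

check = pd.DataFrame(
    {"items_sold": monthly_items, "packages_shipped": monthly_packages}
).dropna()

# Crude agreement check: do month-over-month changes move together?
print(check.pct_change().tail(6))
print("correlation of monthly changes:",
      round(check.pct_change().corr().iloc[0, 1], 2))
```

If the two sources disagree, that’s interesting in itself and usually means either a data problem or a genuinely weird phenomenon worth chasing.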
I find this process to be the hardest thing to explain about doing data analysis. Domain knowledge is required to even come up with many of these questions, and a student trying to learn analysis very likely does not have any relevant domain knowledge for most of the datasets they touch. I’ve actually had people ask me how I come up with streams of hypotheses and the only explanation was “I’ve seen something similar before”.
While getting domain knowledge seems daunting, imagine approaching the data without any domain knowledge. You’d be haphazardly dividing every metric by every other metric in hopes of finding some kind of interesting ratio or trend. And how would you tie it all into a cohesive story when you don’t know how the individual parts interact?
The only way I learned how to ask a lot of these follow-up questions is from talking to domain experts about my analysis and having them help generate ideas. They’ll helpfully chime in with things that they want to know that I haven’t encountered before. They often point out the implications of a change in a certain metric, like if item sales go up, revenue is expected to go up too. Domain experts will also point out when metrics seem off and don’t behave according to their highly refined mental models, which can be a signal to dig deeper in an investigation.
Imagination alone is a poor substitute.
3- Pull the threads together into a set of stories
After investigating a bunch of ideas and leads from the previous part, at some point you start feeling like you’ve run out of ideas to investigate and a more-or-less coherent picture is beginning to form.
Here we can invoke a big assumption: that the original problem of “why are we selling so much more stuff?” is best answered by a story that fits all the observed phenomena we’ve been finding in our investigation. The users seem to be coming from a couple of large cities that we had recently expanded service to. Those customers are ordering mostly the same kinds of items, in the same order sizes, as other places (with some regional differences). We’re making similar amounts of money per order as anywhere else. We’re also not paying advertising dollars for this extra traffic, so there’s no campaign we could accidentally turn off and watch sales drop.
The various steps in the investigation wind up contributing a little to the story along the way. Here, domain and business knowledge is also important in crafting the story as a whole. The previous paragraph is largely aimed at giving a decision-maker some confidence in the situation. They don’t have to worry that there’s some new weird fraud or loophole where people are forcing the company to lose money. They’re reassured that the end result seems good for the business, that expanding to those markets seems to be helping, and that the growth isn’t likely to suddenly disappear.
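Concretely, the “growth is coming from a couple of newly added cities” part of that story might be supported by a cut like this. The city column and the list of launch cities are assumptions made up for the sketch, not anything from a real schema.

```python
import pandas as pd

orders = pd.read_csv("orders.csv", parse_dates=["order_date"])

# Hypothetical: cities where service launched recently.
new_markets = {"Springfield", "Shelbyville"}

orders["period"] = orders["order_date"].dt.year  # or finer time buckets
orders["market"] = orders["city"].isin(new_markets).map(
    {True: "new markets", False: "existing markets"}
)

# How much of the item growth is attributable to the new markets?
growth = (
    orders.groupby(["period", "market"])["item_count"]
          .sum()
          .unstack("market")
)
print(growth)
print(growth.pct_change())  # year-over-year growth by market group
```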
Sometimes there are different hypotheses that all fit the data to a similar extent and there’s no way to tell which is the valid one. In that case, it makes sense to present them as alternatives, with some explanation as to which might be more or less likely.
Either way, it takes experience to know that a decision-maker wants to know these sorts of details. If you don’t give them up front, they’re eventually going to wind up asking you for them.
It’s also super important to note that there are no guarantees that any of the stories we weave out of data have any actual truthful basis! It just so happens that the things we found seem to point in a similar direction. It’s possible that there’s some hidden bias in the data, missing data, or a completely unknown confounding factor that is the actual cause. It’s very easy to fall into the trap of finding some initial results that confirm a story, then getting sucked into confirmation bias and becoming blind to alternative hypotheses.
4- Handling conflicting/messy narratives
It’s very rare to have a clear and coherent story that magically works out all the way. Due to how data and complex systems work, you’re very likely to encounter conflicting or inconclusive signals in the data.
Imagine if the number of units sold is up, but revenue hasn’t increased by the same proportion. Maybe the mix of items people are purchasing has shifted. Maybe a popular large item is out of stock and people are buying smaller packages of a cheaper brand as a substitute. Maybe we introduced a pricing bug into the system and everyone’s been getting an unplanned discount.
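One quick way to start narrowing that down is to look at revenue per item sold, overall and per product: a mix shift tends to show up as roughly stable per-product prices with a changing mix, while a pricing bug tends to show up as a drop in specific products’ average price. A rough sketch, with assumed column names (product_id, revenue, item_count):

```python
import pandas as pd

# Hypothetical order-lines table: one row per product per order.
lines = pd.read_csv("order_lines.csv", parse_dates=["order_date"])

# Overall revenue per item, by month: a decline means units and revenue diverged.
monthly = (
    lines.resample("MS", on="order_date")[["revenue", "item_count"]]
         .sum()
)
monthly["revenue_per_item"] = monthly["revenue"] / monthly["item_count"]
print(monthly.tail(6))

# Average price per product per month: flat prices with a shifting mix points
# at substitution; falling prices on specific products points at a pricing
# change (or bug) worth chasing down.
per_product = (
    lines.groupby([pd.Grouper(key="order_date", freq="MS"), "product_id"])
         .agg(revenue=("revenue", "sum"), items=("item_count", "sum"))
)
per_product["avg_price"] = per_product["revenue"] / per_product["items"]
print(per_product["avg_price"].unstack("product_id").pct_change().tail(3))
```

A breakdown like this only narrows the list of suspects; it doesn’t prove which explanation is the right one.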
Maybe this is happening while there’s a second story going on, like a large TV campaign being run in Canada just as supply chain issues force a bunch of popular items to be out of stock. Now there’s no easy way (or any way) to tell what is going on. Sometimes it’s just not possible to know post hoc.
What’s a data analyst to do?
Sometimes it might be possible to collect new data in a way that teases the effects apart. But other times? The only way forward is to report that “these two or three events happened all at once and likely had an effect, but we can’t tease them out given the data we have”.
5- Practice
Finally, the only way to get better at analyzing data is to actually analyze data and be challenged by other people on your results. Other people with different perspectives will always come up with surprising questions.
So keep at it.
Please reference these crowdsourced spreadsheets and feel free to contribute to them.
A list of data hangouts - Mostly Slack and Discord servers where data folk hang out
A crowdsourced list of Mastodon accounts of Data Twitter folk - it’s a big list of accounts that people have contributed to of data folk who are now on Mastodon that you can import and auto-follow to reboot your timeline
Standing offer: If you created something and would like me to review or share it w/ the data community — my mailbox and Twitter DMs are open.
New thing: I’m also considering occasionally hosting guest posts written by other people. If you’re interested in writing a data-related post to show off work or share an experience, or if you need help coming up with a topic, please contact me.
About this newsletter
I’m Randy Au, Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. Counting Stuff is a weekly newsletter about the less-than-sexy aspects of data science, UX research and tech. With some excursions into other fun topics.
All photos/drawings used are taken/created by Randy unless otherwise credited.
randyau.com — Curated archive of evergreen posts.
Approaching Significance Discord —where data folk hang out and can talk a bit about data, and a bit about everything else. Randy moderates the discord.
Support the newsletter:
This newsletter is free and will continue to stay that way every Tuesday, so share it with your friends without guilt! But if you like the content and want to send some love, here are some options:
Share posts with other people
Consider a paid Substack subscription or a small one-time Ko-fi donation
Tweet me with comments and questions
Get merch! If shirts and stickers are more your style — There’s a survivorship bias shirt!