Analyzing things at scale is one of the biggest reasons data scientists have jobs. Much of our skillset is dedicated towards smashing math and tech together to find patterns within a sea of badly formatted data points. There’s tons of work and value to be done in this field and it’ll only get more complicated as time progresses.
The thing about the whole Big Data “revolution” is that it’s not about finding correlations in a data set. The technology and algorithms to trawl through massive tables and uncover spurious correlations have existed for decades. The whole point of the modern data science movement is to leverage machines to do things that are actually meaningful to humans. Then we do it quickly, for the benefit of lots of humans.
Sometimes, like in the case of all the cool neural network advances of the past decade, advances in technology and algorithms do the heavy lifting. These sorts of solutions wind up creating artifacts that people use, whether it’s categorizing and identifying objects, chess and go bots stronger than any human, uncanny text generators, or very realistic fake images and videos. All this stuff in its final consumer form involves no human interpretation.
But the other side of data science, the “help people make better decisions” part, is all about using data and tools to pull a narrative out of the dataset. That narrative could be descriptive, it could be about correlations, it could be causal. It might be predictive and forward looking, or hypothetical ‘what-ifs’. All of these types of narratives require very different tools, techniques, and data collection methods. But at the same time, all of these stories also don’t create anything besides knowledge that someone could potentially act upon.
With all the focus on analytical methods and tools in data science, I feel that it’s very easy for everyone to forget what JD points out in the tweet above: that we’re (usually) telling stories about real people in the real world. The decisions we’re making will affect them in some tangible way. We’re not shuffling numbers around like 1024 or managing universal paperclips.
The problem is that many of the methods employed in data science wind up burying those stories. Today, I’d like to encourage everyone to remember to put more effort into seeking out human stories within our data, and to refine the skills needed to delve into them.
How I got addicted to analyzing for story
Many years ago, I had just graduated from my MS in Communications (a social science, definitely not engineering-related). By some miracle, I had managed to convince an interior design consultancy to take me on as an in-house general-purpose data analyst/problem solver for my first job.
The company was focused on helping organizations redesign their offices, from school libraries to big corporations, governments, and banks. Lots of places had offices designed as depressing cube-farms in the 70s and 80s, and they wanted to revamp into more modern floor plans.
Their biggest claim to fame was that they’d not only do surveys and interviews with clients, but would send people to observe and walk through the office for an entire week to gather tangible data on occupancy of meeting rooms, desks, etc. That let them make strong recommendations on how many meeting rooms an office actually needed, as well as tell people how many desks stood empty throughout the day because everyone was at meetings. They had hard numbers that said “your meeting rooms were occupied only 3 hours/day for a week of observations, you don’t need more rooms”.
The job was fascinating for me because I was essentially this guy with a hodgepodge mix of math, programming, and analysis skills thrown into a group of primarily architecture and interior design professionals. Even now I very distinctly remember the two projects that made it finally click in my mind how awesome it was to be able to read the story within a data set.
Ghastly air at a chocolate shop
The first was while working on a project helping a very major brand of chocolate and candy redesign their headquarters. The existing layout dated from the 70s, and the consultants who visited used the words “grey” and “depressing”.
The initial site survey that was sent out to everyone in the building asked lots of questions of the form “What do you feel the quality of X is?”, where X is some building property like air quality, lighting, noise, access to restrooms, etc. Then it asked “How important is X to you?” Both of these were on Likert scales (5-point ones, IIRC).
One focus for the survey was to find things that people deemed important but rated as low quality, since those were the most important places to look into. The survey also included open-ended responses in various places, collected location/team data, pretty standard stuff. It normally would let the team know whether different people experienced different pain points or had different needs.
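As a sketch of what that “important but low quality” check looks like in practice, assuming pandas and entirely made-up column names and responses (none of this is the actual client data):

```python
import pandas as pd

# Hypothetical survey responses: 5-point "quality" and "importance"
# ratings per respondent, plus an optional free-text comment.
# Column names and values are illustrative only.
df = pd.DataFrame({
    "respondent": [1, 2, 3, 4],
    "air_quality_rating": [2, 1, 3, 2],
    "air_quality_importance": [5, 4, 5, 4],
    "air_quality_comment": [
        "Vents leave black stains on the ceiling tiles",
        "Strange odor near my desk",
        "",
        "Musty smell in the mornings",
    ],
})

# Gap = mean importance minus mean quality; a big positive gap flags
# an attribute people care about but rate poorly.
gap = df["air_quality_importance"].mean() - df["air_quality_rating"].mean()
print(f"importance-quality gap: {gap:.2f}")  # prints: importance-quality gap: 2.50

# The crucial follow-up step: read the free text behind the low ratings.
low = df[df["air_quality_rating"] <= 2]
for comment in low["air_quality_comment"]:
    if comment:
        print("-", comment)
```

The gap score only tells you where to dig; the comment loop at the end is where the actual story lives.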
While analyzing that data set, certain themes started to pop up. One of those was air quality: people said it was important to them, and it was generally rated poor or neutral by most people. That by itself was pretty useful, but when the big gap for air quality showed up, I started reading through the open-ended responses about air quality to see what was going on.
What I found were complaints about the air vents causing black stains on nearby ceiling tiles, smells and odors, and other horror stories. It really drove home that a 5-point Likert scale only lets you say that air quality is a 1 or 2 when you’re dissatisfied. It’s hard to express “this thing might be spraying mold or soot into my face” with just numbers. Also, some people (myself included) almost never use the most extreme values of a Likert scale. If I were an occupant of that office, I might’ve put in a ‘2’ for air quality because “It’s disgusting, but I don’t smell anything at my desk yet?”
I showed this finding to the consultants on-site and it resonated with them. They then showed me some of the pictures they had taken of the office vents that they noticed were dirty, and that corroborated the stories. Fixing the HVAC issues became one of the big recommendations for that project, regardless of whether the client went forward with redoing the floor plan.
No peace in the library
On another project, I was analyzing a survey for a large university looking to redesign a student library. The survey went out to students who used the library. At first, it was hard to make sense of things: noise and room to study were important, as expected of a student library. Responses didn’t provide a clear “FIX THIS NOW!” signal, just a broad spectrum of complaints and dissatisfaction.
Without context about what was going on in the library itself, it was hard to figure out what was happening just from these survey responses. So I started delving into the free responses again, and a picture started forming. It sounded like the place was overcrowded, maybe over capacity. Students were saying that they, or other students, would stack up books around them at shared tables to create makeshift sound/privacy barriers. Those book stacks would of course contribute to the lack of space on shared tables for groups that needed to do group work (which of course contributes to the noise).
Other students mentioned how individual study desks located within the stacks were considered valuable, again for noise and privacy reasons. Students would apparently attempt to claim them for long stretches of time with stacks of books and personal items, and other students would notice and complain.
Eventually, I wound up building a picture in my head of students tripping over each other in an over-utilized location. Again, the open responses were extremely important in coming to that conclusion. As a junior analyst with zero architecture experience, I had no idea what to do about this information, so I handed it to the designer in charge of the project and they incorporated it into one of their potential design recommendations.
I believe that recommendation was to relocate some of the book stacks to a different location to free up space for more student-friendly study areas, as well as enclosed meeting rooms that students could use for group work. Storage lockers were also suggested so students could temporarily stash items, hopefully encouraging them not to leave things at a desk to claim it long term.
Numbers tell a mind-bogglingly tiny slice of any story
These experiences early in my career were what convinced me that it takes time and care to pull a story out of a data set. More specifically, it takes actual human brain reading and processing time examining and deciphering messy context.
It’s very quick to clean up a CSV and find that 70% of people rated air quality poorly, or 45% of students complained about noise, but it took that deeper, more qualitative look to truly give weight to the actual severity of the situation.
It’s very tempting, especially early on, to choose to ignore open-ended responses because they’re not easily analyzed. Numeric scales and measurements are… numeric… we can shove them into a wide array of tools to get results that we want. Reading hundreds of individual lines of text seems just so inefficient, with no guaranteed payoff at the end.
Obviously, I don’t agree with that sentiment. But this isn’t a call to say that we need to always collect and read open-ended response data.
This is a call that we need to ALWAYS try to look at the problem from many angles, many sources, many different metrics and measurement methods, before we stand even the slightest chance of understanding a fraction of the “true” story. Anything less means willingly submitting ourselves to the illusion that a handful of 5-point scales, a couple of “Strongly Agrees”, a percent of a segment, was the complete story.
We need to seek out the full story
But how do we go about that? Is this a call to bring in the qualitative researchers? Well, yes, if we have access to qual skills, but there’s plenty of stuff to do when we don’t.
The key is to remember that our primary enemy in this situation is loss of detail.
When data is compressed, averaged, bucketed, aggregated, or summarized, we lose important context that can suddenly click to make a full narrative. It’s like a more generalized form of Simpson’s Paradox. It’s like one of those mystery stories where the author presents you one explanation of events, then pulls a “But actually…” and changes everything by adding some extra details.
Somewhere, there’s information stored in open-ended comments, a sub-group, a data-collection anecdote, or some unique demographic within the data that lets you view the results with the lens needed to pull the whole narrative together into a convincing argument.
This is why we need to be very careful in the data collection and cleaning stages of work. We don’t want to accidentally fail to collect this information, or accidentally destroy the information before we realize how important it is. The temptation to normalize and standardize data is very high in data science because some minimal level is required for our tools to operate.
We need some theory
But wait, how can we anticipate what data to collect and preserve before we even work with it? The answer is theory, or at least a set of hypotheses!
A theory is a model of the world and the process we’re interested in that tells us what’s supposed to be important (assuming the theory holds). Theories, like the ones we generate in our line of work, can be untested. They’re just our proposed mechanisms and relationships. We’ll test them later.
Similarly, having an understanding of the tools we’re using to measure, like how users will put important information in optional open-ended response fields if you give them the chance, helps you cover your blind spots better.
This process of having a theory and figuring out ways to measure it in practice is actually very similar to the process of “Operationalization” that’s taught in many social science research methods courses. Since there’s no direct meter available to measure abstract constructs like “Happiness”, we have to theorize what Happiness is, then create proxy measures that in turn need to be shown, based on theory, to be good measures of the concept.
Maybe we ask people to self-report happiness on a scale from 1 to 10. Maybe we stick a probe on their head and look for activity in certain parts of the brain. Maybe we count how many smiles they make in an hour during an activity. None of these actually measure the thing we care about, but we hope to show that our metrics consistently measure Happiness and only Happiness.
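One minimal, hedged sketch of that “do our proxies consistently measure the same thing?” check, using simulated data (the latent happiness variable and noise levels are invented for illustration): proxies of the same construct should at least correlate with each other.

```python
import numpy as np

# Three hypothetical proxy measures of "Happiness" for the same 200
# people: self-report, smiles per hour, and a made-up brain activity
# score. None of them IS happiness; each is latent happiness plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=200)  # unobserved "true" happiness (simulated)

self_report = latent + rng.normal(scale=0.5, size=200)
smiles = latent + rng.normal(scale=0.8, size=200)
brain_score = latent + rng.normal(scale=0.6, size=200)

# Convergent-validity sketch: compute the correlation matrix between
# the proxies. If they barely correlate, they probably aren't all
# measuring the same underlying concept.
measures = np.vstack([self_report, smiles, brain_score])
corr = np.corrcoef(measures)
print(corr.round(2))
```

This is only one piece of validation; real operationalization work also has to argue from theory that the proxies measure Happiness and not some confounded neighbor of it.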
The key point of all this is that our measurements of things are usually proxies. No one’s judgement of air quality fits onto a 5-point scale, and there can be other, arguably more objective, ways to measure air quality, such as using particle counters.
Collect your own data when you can
Very often in data science, we use what data we have available — primarily because there’s so damn much of it to begin with. Entire products are instrumented and spitting out logging event telemetry without much thought into what gets instrumented because “it might be useful in the future”.
But without intentionally collecting things for a purpose, it’s very likely that whatever is collected doesn’t really match up to an operationalization of our true metric. It’s always a little sketchy, a little slapdash, and lots of “we’ll make do”.
We can’t do this for everything, but once we abdicate the responsibility for clearly defining what gets collected and what doesn't, it creates a blind spot where we stop questioning those details out of laziness. It's easy to miss places where things have been skipped over.
In the air quality example above, why didn’t we go back with objective air quality testing equipment once we realized it was a problem? Because clients weren’t going to pay for it and we could make a strong case without it. For a web store, why don’t we adjust our categories? Too much stuff depends on the existing ones and we don’t have a migration plan.
As you can see, there’s ALWAYS going to be friction involved in collecting bespoke data. But it’s important that we always keep the option open. It forces us to think about what data we actually need, and perhaps we can do it next time around.
We want to make sure to capture as much useful variation as we can. This means including funky open-ended responses. It also means resisting the impulse to only collect the bare minimum, and going out of our way to capture things that only MIGHT matter in certain situations.
So what, are we p-hacking now?
This thought may have passed your mind earlier when I mentioned that there's possibly some small segment hiding within the data that provides the narrative lens that pulls a whole analytical story together. This sounds horribly like mining a dataset until we finally find a significant result, effectively p-hacking.
We don't want to do that! At most this is hypothesis generation. We can’t and shouldn’t sell these results as “True” based on a mere statistical significance test. That’s not the point of the exercise. It’s uncovering factors that warrant closer inspection. Those closer inspections might show that it’s actual truth, or not.
That’s why it’s important to have different, unrelated measures all pointing the same way. Lacking a follow-up study to give us any confidence, we need triangulation and a theoretical framework for why a relationship should exist. Can we see a mechanism for why a relationship exists? Can we see other things that imply this relationship is true?
Other ways to allow us to read between the rows
Three things: qualitative research, domain knowledge, and experience with the tools.
Qualitative research, which in this context very often boils down to “asking and/or observing people”, usually provides very rich anecdotal data. While individual experiences are unique, patterns often emerge even after a handful of subjects. There’s no reason us quantitatively inclined folk can’t learn to apply qualitative methods to get a fuller story; many of us also have qualitative research counterparts exactly for these situations.
Domain knowledge, either within ourselves or within others. Having experience with things will usually offer insight into what is happening. Much of that insight helps you generate better hypotheses as to what’s going on, what responses are out of place, what common baselines for other responses are.
Finally, knowing what your tools typically do can help. When you use a survey multiple times, you can get a sense of how people normally respond. A/B tests on certain things usually fall within certain ranges. You can use this knowledge to spot anomalous subgroups and other quirks. Having a reference point from previous runs is very helpful.
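As a tiny sketch of using that kind of reference point, with entirely invented numbers: compare this run’s subgroup averages against a baseline built from previous waves of the same survey item, and flag subgroups that sit far outside the usual range.

```python
# Hypothetical baseline for one survey item, summarized from past runs.
baseline_mean, baseline_sd = 3.4, 0.3

# This wave's mean rating per subgroup (all values made up).
current_by_group = {
    "engineering": 3.5,
    "sales": 3.2,
    "facilities": 2.1,  # far below what this item usually scores
}

# Crude anomaly check: how many baseline standard deviations away is
# each subgroup? Anything beyond ~2 is worth reading the free text for.
for group, mean in current_by_group.items():
    z = (mean - baseline_mean) / baseline_sd
    flag = "  <- investigate" if abs(z) > 2 else ""
    print(f"{group:12s} mean={mean:.1f} z={z:+.1f}{flag}")
```

The flag isn’t a conclusion; it’s just a pointer telling you which subgroup’s rows to go read between.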
You’ll notice that I’m very vague with all of these, mostly on the level of “it will help you spot things that are out of place.” That’s because reading between the lines is a subtle endeavor. You’re typically adding color and interpretation to a largely fixed narrative backbone built around the existing data. The context is what makes the data meaningful. It’s only in rare instances that you find a subgroup that blows the narrative up.
Due to the inherent challenges involved, I don’t think it’s possible to master this skill. However, I think it’s possible to train ourselves into the habit of making an honest attempt to read between the rows of data as much as we can. Ask yourself: who are the people clicking “agree” on this item? Why are there conflicting responses to what you thought was a universally-loved feature? What’s the life of the user who needs to use an obscure feature like?
Getting closer to those answers can point you to the right place.
About this newsletter
I’m Randy Au, currently a quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. The Counting Stuff newsletter is a weekly data/tech blog about the less-than-sexy aspects about data science, UX research and tech. With occasional excursions into other fun topics.
Comments and questions are always welcome, they often give me inspiration for new posts. Tweet me. Always feel free to share these free newsletter posts with others.
All photos/drawings used are taken/created by Randy unless otherwise noted.