Data scientist, working without data
Because of course it'll happen some day
Here’s a situation that comes up pretty often in usability work: we make a product, and after it’s finally released to the initial wave of users, we get complaints that there are a lot of failures happening. For example, lots of people buy a piece of software, but 30% of them utterly fail to get it to function as intended, and now they’re complaining. Oddly enough, this type of error never came up in prototype testing. It somehow survived the entire development phase despite us being very careful with our research and testing.
As data people, I’m sure we’re all reaching for an identical playbook right now: can we get a count of all the error codes happening so we know what the problem is, and then decide what to explore from there? It’s a simple measure-and-debug task.
But, what if there’s no tracking set up just yet because everything is new and things weren’t built the right way before you came on board?
Suddenly, instead of one relatively straightforward data problem, you’re faced with a minimum of two problems. First, we need to implement tracking: what do we track, and how do we implement it? That’s a whole engineering problem that I won’t go into here today.
Instead, there’s the second problem: what can we do in the meantime to figure out what’s wrong without being stuck waiting on instrumentation? What’s available when there’s little to no data to work with?
Oftentimes it’s not JUST about analysis of existing data
It’s become very common nowadays to think of data science as the stuff that happens to piles of data that’s already being collected. A lot of the hype and tooling for data science is about analyzing and otherwise making use of data. Years ago, there was a bit more discussion around creating and collecting data, via things like creating experiments, but that part of the discipline seems to have faded into the background as table stakes. It doesn’t seem particularly notable any more. I’m honestly not sure why.
So I wouldn’t be surprised if some people would see this situation and think that, aside from setting up tracking and collecting data going forward, there’s not much else we can do under the umbrella of “data science”. After all, if there’s no data, most of our tools can’t be applied.
But it seems quite silly to define ourselves by our toolkits, especially when there are still ways to help figure out what is going on sooner rather than later. If you take the “data” out of “data science”, you’re still left with the very important word: science.
First up, the tools of “qualitative” research are always available. Usability tests, observations, and interview studies can all be used to see what users are doing with the software and to build hypotheses as to what’s going on. Yes, it involves having to talk to other people (which I’m most definitely terrible at), but even with a tiny sample of just a handful of users, you’d expect a problem that hits 30% of users to show up pretty quickly. You can even use data methods to identify who you might want to interview by isolating specific error cases in the data.
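To put a rough number on "pretty quickly", here is a minimal sketch of the arithmetic. It assumes each user you observe independently has a 30% chance of hitting the problem, which real recruiting rarely guarantees, and the function name `prob_at_least` is made up for illustration.

```python
from math import comb


def prob_at_least(k, n_users, failure_rate=0.30):
    """Chance that at least k of n_users hit the failure.

    Assumes a simple binomial model: users are independent and each
    fails with the same probability (an assumption, not a given).
    """
    p_fewer = sum(
        comb(n_users, i) * failure_rate**i * (1 - failure_rate) ** (n_users - i)
        for i in range(k)
    )
    return 1 - p_fewer


# How likely are we to see the problem once, or twice, in a small study?
for n in (3, 5, 8, 12):
    print(n, round(prob_at_least(1, n), 3), round(prob_at_least(2, n), 3))
```

Under those assumptions, five users already give you better than a four-in-five chance of watching the failure happen at least once.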
I put “qualitative” in quotes because, honestly, these are just more methods to have in the toolbox. You can use quantitative tools as part of the methodology with fairly small sample sizes and still squeeze some useful information out of them. Just take a look at all the social science research that has n < 100. You can accept really wide confidence intervals for measured statistics and still get meaningful use out of where those bounds land. You can count how often people mention the same problem without any prompting and use it as a data point.
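As a sketch of what a wide but still useful interval looks like, the snippet below computes a Wilson score interval for a proportion. The Wilson method is my pick for illustration because it behaves reasonably at small n; the counts (4 of 12 interviewees mentioning the same problem) are invented, not real data.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Hypothetical study: 4 of 12 interviewees bring up the same problem unprompted.
low, high = wilson_interval(4, 12)
print(f"plausible share of users affected: {low:.0%} to {high:.0%}")
```

The interval is wide, but even its lower bound sits well above zero, which is often enough to decide that the problem deserves attention.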
The biggest difference is you’re going to have to get up and collect the data, probably by hand instead of with a computer.
Predicting where issues are going to be beforehand
But even if you don’t resort to direct “talking to people” types of methods, there are still creative ways to narrow down what might be causing the trouble.
For example, consider a traditional “30% of people are returning an item because they say it doesn’t work, even though we’ve tested it ourselves” issue. The traditional method is to count up the various ways that people try to use the widget, match that up against the failures, and the failures hopefully cluster around one specific path that you then blame for the problem and fix.
Now imagine the same scenario, but you don’t have the ability to count up the ways people use the widget. You merely see the start and end points. What you can do is look at the most probable paths that a user will take. What are the default settings? What are the easiest combinations of buttons that a confused user might try? Because of how humans typically do things, the default settings almost always get the most users, and usage of other variations drops off very steeply. There just aren’t that many routes through a thing that 30% of all users would even see. The problem almost has to involve a tiny handful of common routes. Things involving slight variations of the default settings are likely to be the culprit.
So just knowing a priori that user paths are going to follow a power law-like distribution lets you make some educated guesses about the situation. Those guesses can be followed up on quickly, without spending time looking at smaller routes that aren’t likely to matter even if they were all completely broken. I’ve never seen the long tail of such distributions come close to covering 30% of everything.
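To make that reasoning concrete, here is a minimal sketch that assumes path popularity follows a Zipf-style power law. The number of paths (1000) and the exponent (2.0) are guesses I made up for illustration. A flatter exponent puts much more weight in the tail, so plug in whatever you believe about your own product.

```python
def path_shares(n_paths, exponent=2.0):
    """Share of users on each path, assuming share is proportional to 1 / rank**exponent."""
    weights = [1 / rank**exponent for rank in range(1, n_paths + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def top_k_coverage(shares, k):
    """Fraction of all users who take one of the k most popular paths."""
    return sum(shares[:k])


def paths_needed(shares, target):
    """Smallest number of top paths whose combined share reaches target."""
    covered = 0.0
    for count, share in enumerate(shares, start=1):
        covered += share
        if covered >= target:
            return count
    return len(shares)


shares = path_shares(1000, exponent=2.0)  # hypothetical product with 1000 possible paths
print("top 5 paths cover:", round(top_k_coverage(shares, 5), 3))
print("paths needed to cover 70% of users:", paths_needed(shares, 0.70))
```

If the top handful of paths covers most users, as it does under these assumptions, a failure hitting 30% of users cannot be hiding entirely in the long tail, so those top paths are where to look first.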
While these narrowing-down actions alone aren’t the final answer and might not seem satisfying from a pure “here’s the answer!” perspective, they’re still extremely valuable in situations where people need to make decisions quickly. After all, it might be the only information even available.
It’s possible to use our skills despite not having data
But it might not be apparent to people. It sometimes isn’t even apparent to ourselves. Since “data” is in our job titles, people can very quickly assume that we won’t be useful when data isn’t available to be used. In my experience, it’s usually not the case. There is usually something that we can do, whether it’s using our past experiences to provide background information, or pulling data that can be used to test out hypotheses. We can even create data that we need to make decisions if we have the time to collect it.
So it’s really unfortunate if you get pigeonholed into a specific class of problems due to having a single word in your job title. It’s something that I’ve had to occasionally fight against in various organizations I’ve been in. I’d contribute something to a situation and people would be surprised and say “I didn’t even know you could do that!”. People aren’t going to ask you to do things they aren’t even able to imagine you doing. You have to make an effort to show it yourself. For someone who’s not super outgoing at all, it’s hard work.
But it’s worth it.
If you’re looking to (re)connect with Data Twitter
Please reference these crowdsourced spreadsheets and feel free to contribute to them.
A list of data hangouts - Mostly Slack and Discord servers where data folk hang out
A crowdsourced list of Mastodon accounts of Data Twitter folk - A big, community-contributed list of data folk who are now on Mastodon, which you can import and auto-follow to reboot your timeline
Standing offer: If you created something and would like me to review or share it w/ the data community — my mailbox and Twitter DMs are open.
Guest posts: If you’re interested in writing a data-related post to show off work or share an experience, or you need help coming up with a topic, please contact me. You don’t need any special credentials or credibility to do so.
About this newsletter
I’m Randy Au, Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. Counting Stuff is a weekly newsletter about the less-than-sexy aspects of data science, UX research and tech. With some excursions into other fun topics.
All photos/drawings used are taken/created by Randy unless otherwise credited.
randyau.com — Curated archive of evergreen posts.
Approaching Significance Discord: where data folk hang out and can talk a bit about data, and a bit about everything else. Randy moderates the Discord.