As data science has grown increasingly popular over the past decade, that popularity has brought change: what used to be a somewhat esoteric title is collectively taking its fumbling steps towards being “a profession”, much like “programmer” slowly evolved into “software engineer”.
One of the more subtle changes I’ve noticed is that there is very little talk about collecting data. The conversation is dominated by ML/AI-related chatter, tooling (programming languages, software, etc.), and lots of beginner “how to become ~~” talk. Those topics tend to draw broader interest, and so social feedback loops bubble them to the top.
The grievance in the above tweet triggered this week’s post because Cat’s right: collecting data has somehow fallen out of the conversation. That isn’t healthy for our own profession, or for the world at large.
“Found Data” is everywhere
Data used for projects generally comes in two forms: data that is intentionally collected for a specific purpose, and data that was collected for one reason but then repurposed and analyzed for a completely separate purpose.
The former, less-discussed form of data doesn’t seem to have a special name; it’s just “data”, the original. It’s your purpose-built stuff. It’s what you collect when you design and specify an A/B test. It’s the survey you wrote and sent out. It usually takes resources to collect or generate, and can be a lot of hard work, work that many people aren’t interested in dealing with when they’re ostensibly trying to “learn data science stuff!”.
The latter is often called “found data”, and the majority of data science conversation nowadays centers on handling it. A huge chunk of the data being used in data science falls into this category. It’s the logs generated when people interact with a web site, the public data sets available for free on the internet, the data sets at the core of Kaggle, and just about every other bit of data you’ve probably touched. It’s ubiquitous.
The fact that it’s so easily obtained with just a download means that this is the primary vehicle for teaching data science. You can practice everything from loading data and creating pipelines, to rolling ML models out with just about any dataset you can get your hands on. It’s also effectively zero cost to the user.
The asymmetry of cost/benefit between the two types of data makes it obvious which one people gravitate to. No one is realistically going to think “Let me spend a week or three writing code to collect data just so I can learn how to load it into pandas and make some charts”. I know I wouldn’t.
We don’t teach data collection very well, if at all
Data science education generally teaches data cleaning very poorly, often amounting to “Well, you know what clean data looks like, just make your ugly data look like that!”. In comparison, we teach data collection slightly better, because we teach methods, which can sometimes be effective ways to learn how to collect data for a specific use. You can’t teach a regression method, or even an AI method, without at least some discussion of where the data is sourced, what assumptions need to be fulfilled, and the common pitfalls.
So while it’s probably not as good as if we specifically called out data collection as a critical skill, the same goal can be accomplished if people study enough different methods.
Thus, I get it, found data is great for learning the mechanics of data science and data analysis in general. You can go pretty far, and do quite a bit with just those skills alone. So just like with all the content that’s aimed at beginners, this sort of content gets a lot of attention. It’s just not enough to stop there.
Go autopilot on your found data at your own risk
The problem with found data is that if you’re not constantly thinking critically about the details, you can do stupid, incorrect, and sometimes downright dangerous things. For example, the WSJ just put out a piece on college enrollment being lower for men across the board. That article included a chart linking enrollment to family income. Except, as some economists on my timeline pointed out, the census data used for that analysis, which is a foundational point for the whole piece, is wrong.
The link to the Brookings article in the tweet about the Current Population Survey data used for the chart goes into detail about why you can’t link family income to college achievement. (tl;dr: family income is only reported if the student is effectively treated as a dependent of a parent; the moment a student forms their own household in the government’s eyes, the data is unlinked. This introduces a big selection bias, enough to make the analysis impossible.)
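To make that mechanism concrete, here’s a minimal simulation sketch in Python, with entirely made-up numbers rather than the actual CPS data: if lower-income students are more likely to form their own households and get unlinked, the students you can still see look both richer and more enrolled than the real population.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Hypothetical population: family income (in $k) and enrollment,
# where enrollment probability rises with income. All numbers invented.
family_income = rng.lognormal(mean=4.0, sigma=0.6, size=n)      # ~$55k median
p_enroll = 1 / (1 + np.exp(-(family_income - 60) / 30))         # logistic in income
enrolled = rng.random(n) < p_enroll

# Selection mechanism: lower-income students are more likely to form
# their own household, at which point family income is no longer linked.
p_linked = 1 / (1 + np.exp(-(family_income - 40) / 25))
linked = rng.random(n) < p_linked

print(f"True mean family income:   {family_income.mean():6.1f}k")
print(f"Mean income, linked only:  {family_income[linked].mean():6.1f}k")
print(f"True enrollment rate:      {enrolled.mean():.3f}")
print(f"Enrollment, linked only:   {enrolled[linked].mean():.3f}")
```

The size of the gap here is arbitrary; the point is that dropping records based on household status changes the answer, which is the kind of distortion the Brookings piece describes.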
To grow as a data scientist, you MUST learn to collect data
I try not to use absolutes, but this is one of the very few times one is warranted. Collecting data is important because it makes you think very hard about what should, and shouldn’t, be collected within whatever constraints you have. It’s one of the primary concerns of academic research because science would be nearly impossible without this step. In industry we seem to have deemphasized it to the point where some might feel it’s an optional step (which is bad).
If we don’t go out and identify the places where we should be collecting data instead of relying on found data, we’re not really doing the “science” part of the title. We’d effectively be doing the work of an advanced data analyst: able to work and speak only within the confines of the data set as-is. Every conclusion would have to be prefaced with “within our dataset and sample population,” and we couldn’t say anything about generalizing or making inferences.
The experience of learning to collect data, and inevitably messing up multiple times along the way, is invaluable because it forces you to learn just how important data quality and collection are to the end results. This is something you can’t learn with found data, because you will never know all the design choices that went into collecting the data in the first place. You’ll never know the inevitable biases and compromises that are encoded within. Those biases didn’t matter to the original research question the data was generated for, so they were allowed to persist, but they might matter to whatever it is you’re planning on doing.
The only way to learn how biases get encoded into a dataset is to make datasets and encode your own biases into them. More importantly, you need to be very honest with yourself about what biases you’re likely injecting. Who’s included or excluded from the data? What tech is used to collect the data, and how does that mess with things? The list of things to consider and worry about is endless, and I’m not sure there’s any way to learn how to do this without actually trying it. There are very few instances where people think this hard about a found dataset, and those instances usually involve a government data set where it’s not possible to collect your own data short of working for the collecting agency.
A/B tests are a data collection methodology, but I’m not sure if people think about them as a “study”
I’d be exaggerating if I said that many data scientists don’t collect any data at all. We mustn’t forget the poster child of data science, one of the keystone methodologies back in the mid-2000s and early 2010s that was the hallmark of being a “data driven organization”: A/B tests.
The A/B pattern has been popularized to the point that most people, even people who don’t work with data or research all the time, know the basics of how to execute one. But even with that popularity, I’m not sure how many of those people understand the methodology well enough to apply it properly.
Sure, you can use A/B tests to see which option is better, but there are lots of subtle ways to screw one up tactically, from messing up the sampling/randomization, to peeking at results and calling the experiment before it’s finished. More insidious, though, is that people very often don’t learn from A/B experiments. While you can test what color button to use until the end of time, it’s much more useful to test concepts that have broader generality: does having a human face on the screen help conversion? Does important information need to sit in a big colored box, or is a modal better? Very often, these bigger, more powerful questions can only be answered with a series of tests, and very few people take it to that level. Sometimes it takes an A/B test, sometimes it takes interviews, surveys, log analysis, or everything combined to come to an answer. So while the initial experiment itself is an act of data collection, when it becomes too routine and mechanical, you stop thinking.
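To show how costly just the peeking mistake can be, here’s a toy A/A simulation, a sketch with assumed parameters rather than a production power calculation: both variants have the identical conversion rate, yet stopping at the first look where |z| > 1.96 declares a “winner” far more often than the nominal 5%.

```python
import numpy as np

rng = np.random.default_rng(7)

def peeking_false_positive_rate(n_experiments=2000, batches=20,
                                batch_size=500, base_rate=0.10):
    """Simulate A/A tests (no real difference) where we run a pooled
    two-proportion z-test after every batch and stop as soon as
    |z| > 1.96. Returns the share of experiments that falsely
    declare a winner at some peek."""
    false_positives = 0
    for _ in range(n_experiments):
        conv_a = conv_b = n_a = n_b = 0
        for _ in range(batches):
            conv_a += rng.binomial(batch_size, base_rate)
            conv_b += rng.binomial(batch_size, base_rate)
            n_a += batch_size
            n_b += batch_size
            p_pool = (conv_a + conv_b) / (n_a + n_b)
            se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
            z = (conv_a / n_a - conv_b / n_b) / se
            if abs(z) > 1.96:        # "significant!" -- stop and ship
                false_positives += 1
                break
    return false_positives / n_experiments

print(f"False-positive rate with peeking: {peeking_false_positive_rate():.2%}")
# With a single look at the end, this would sit near 5%; repeated peeking
# pushes it far higher.
```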
We need to raise the importance of good data collection and validation work
Besides the fact that data collection skills require hands-on practice to learn, collection and validation work is often not incentivized enough. Collection and validation are seen as a cost: time and money spent on the promise (not a guarantee) of getting a payoff in the form of a result at the end.
Like most cost centers, people are incentivized to cut costs down to the minimum. How many corners can we cut to get a similar result? Can we just reuse this data we’ve already collected? Can we use a proxy/indirect metric out of our existing stuff? Can we not do the expensive large randomized sample and use this convenient snowball sample instead? Surely we don’t need to spend time to validate our measures because it makes intuitive sense?
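As a rough sketch of why the last couple of corners are risky, here’s a hypothetical example, assuming response probability scales with user engagement: a convenience or snowball sample that over-represents your most engaged users quietly moves the estimate away from what a random sample would give. All numbers are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Hypothetical population: engagement score and a satisfaction metric
# that is correlated with engagement. Entirely made-up numbers.
engagement = rng.gamma(shape=2.0, scale=1.0, size=n)
satisfaction = np.clip(3.0 + 0.8 * engagement + rng.normal(0, 1, n), 1, 10)

# Simple random sample of 1,000 users.
srs = rng.choice(n, size=1000, replace=False)

# "Convenience" sample: probability of responding scales with engagement,
# the way an in-app survey or a snowball sample of power users might.
p_respond = engagement / engagement.sum()
convenience = rng.choice(n, size=1000, replace=False, p=p_respond)

print(f"True mean satisfaction:      {satisfaction.mean():.2f}")
print(f"Random-sample estimate:      {satisfaction[srs].mean():.2f}")
print(f"Convenience-sample estimate: {satisfaction[convenience].mean():.2f}")
```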
We need to bring data collection and validation back to being a prominent tenet of what we do, even if it means slowing down the pace of our work to do it. It’s a necessary cost of doing business, not an optional one. Failing to take these steps will just undermine trust in the future, when decisions made on flawed analysis don’t play out because the premise was broken. This is an unquantified risk in data science right now, so we as a field can get away with it without anyone calling out those errors. I’m doubtful this illusion can persist forever.
My sense is that we need to revive the appendix/footnote that we’ve trained ourselves out of including, the “how sure are we about what we’re reporting here” part that lists out all the caveats: “We think this is true because our data is biased in this way, but that bias shouldn’t affect this specific analysis because of these other assumptions we’re making.” We originally stopped doing it because it was long-winded and no one listened, but just because no one else needs to see that sentence doesn’t mean it’s not important for us to put in the work of writing it down.