Why's it hard to teach data cleaning?
Vicki Boykis did a great post about ghost knowledge this past week. “Ghost knowledge” is the term she cites from David MacIver for the unwritten knowledge that’s only known to people within a field, like whether a certain journal has a good reputation or not. It’s stuff that’s only learned from being immersed in an area.
In that post she lists things that represent ghost knowledge in the data sciences to her, and one of her examples is how cleaning data isn’t (or can’t really be) taught. It’s something everyone seems to learn from pained experience. I don’t disagree with that idea, because I can’t think of a good way to teach someone how to clean data.
But I’ve never thought about WHY things are this way. Do they have to be?
So here’s me thinking about that.
First, a 5-second recap on data cleaning
To avoid rehashing things: I already wrote an epic-length post about the nature of data cleaning, with some rough advice, in Data cleaning is analysis, not grunt work. Here’s the tl;dr of that post:
Data cleaning is transforming data with intent and purpose, that purpose being to complete an analysis
Data transformations span a spectrum of reusability/generalizability: stripping malicious SQL injection input out of form data is pretty reusable as a drop-in, while reweighting a survey sample to account for an unexpected bias is less so (see the sketch after this list)
The more repeatable and generalizable the transformation operation is, the more people are likely to write it off as uninteresting “cleaning”
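To make that spectrum concrete, here’s a minimal sketch in Python. All the names and the survey setup are my own invention for illustration (and proper SQL injection defense really belongs in parameterized queries, not string scrubbing):

```python
# A sketch of the reusability spectrum; function names and the
# "segment" schema are hypothetical.
import pandas as pd

def scrub_form_field(text: str) -> str:
    """The reusable end: drops non-printable characters from any
    free-text form field, in any project, no domain knowledge needed."""
    return "".join(ch for ch in text if ch.isprintable()).strip()

def reweight_survey(df: pd.DataFrame, population_share: dict) -> pd.DataFrame:
    """The bespoke end: these weights only make sense for THIS
    survey, THIS sampling bias, and THIS analysis."""
    sample_share = df["segment"].value_counts(normalize=True)
    out = df.copy()
    out["weight"] = out["segment"].map(
        lambda seg: population_share[seg] / sample_share[seg]
    )
    return out
```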
Against this backdrop, let’s start poking at why it’s hard to teach how to clean data.
Cleaning is directly tied to the specific analysis
Under my proposed framework for thinking about data cleaning as actually being analysis, you’re forced to accept that what you’re doing to clean and prep your data for use in analysis is inextricably tied to the analysis that you’re doing. There’s no way to separate the two.
Because I need to do a certain, very specific analysis on my data, I need my data to have certain properties.
If I’m going to be doing a time series analysis on, say, temperature data, I’m at the very least going to have to check for, and correct, any issues that mess with the steady flow of time in my data. Meanwhile, if I have no plans on utilizing the time dimension, but instead care about location, I’d be spending my energy making sure the location data is sensible. If I cared about BOTH dimensions, I’d have to make sure both were functioning properly.
And “properly” here is a very important key word. Only I, the analyst, would know what looks appropriate or not for a certain bit of data. If my temperature sensors are firmly attached to a permanently fixed structure, I should be very concerned if the location data indicated it was moving (outside of GPS jitter, etc). The reverse is true if the sensor was attached to a living migratory bird during migration season.
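As a hedged sketch of what those two checks might look like (the column names, schema, and jitter threshold are all assumptions of mine), note that the code is trivial to write; knowing which check even applies is the domain knowledge:

```python
# Two sanity checks on hypothetical sensor data with "timestamp",
# "lat", and "lon" columns. Which check is valid depends entirely
# on domain knowledge, not on the code.
import pandas as pd

def check_time_flow(df: pd.DataFrame) -> None:
    """Only matters because the planned analysis is a time series."""
    ts = df["timestamp"].sort_values()
    print(f"duplicate timestamps: {ts.duplicated().sum()}")
    print(f"largest gap between readings: {ts.diff().max()}")

def check_fixed_sensor(df: pd.DataFrame, jitter_deg: float = 0.001) -> None:
    """Valid only under MY assumption that the sensor never moves;
    for a sensor on a migratory bird, this would flag good data."""
    if (df[["lat", "lon"]].std() > jitter_deg).any():
        print("warning: location drift exceeds expected GPS jitter")
```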
There’s no algorithm in the world that can tell me this specific bit of domain knowledge that’s unique to my exact circumstance.
So one reason why everyone winds up painfully learning how to clean data themselves, and learning to appreciate data cleanliness, is that every different analysis requires the data to be a certain unique way. Software packages need data formatted a certain way. Algorithms might crash unless the data types are just right. One algorithm needs missing values handled one way and another needs them handled a different way.
It’s complete chaos.
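As a toy illustration of that chaos (the data and the choices are invented by me), the very same missing values can demand opposite treatments depending on what comes next:

```python
# One series of readings, two incompatible "correct" cleanings.
import numpy as np
import pandas as pd

temps = pd.Series([20.1, np.nan, 20.4, np.nan, 21.0])

# A time series method can't tolerate dropped rows (that would break
# the steady flow of time), so interpolate across the gaps:
ts_ready = temps.interpolate()

# An algorithm that merely crashes on NaN, and doesn't care about
# ordering, may be happiest with mean imputation or dropped rows:
model_ready = temps.fillna(temps.mean())
```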
Analysis, and thus cleaning, is tied to specific domains
After well over a decade of cleaning data in the web/tech/startup space, I’ve gotten pretty good at it. I have a pretty good sense of what my data needs to look like for me to work effectively with it. This means I know what to look for early on, and can anticipate problems before they become an issue. This is extremely helpful now that I spend more and more time figuring out how to collect data, not just analyze what’s available.
But this knowledge is domain specific.
Sure, the domain of web/tech/startup data is immense, and so I’m not going to be running out of potential employers anytime soon. But if you sit me down in front of something outside that domain, like healthcare telemetry, astronomy, particle physics, or financial data, I’m going to be largely lost.
The reason why I’d be lost even in the data cleaning step for such fields isn’t because I don’t know how numbers behave. Nor is it that I don’t have any knowledge of very general data cleaning strategies that would likely work. It’s because I don’t know what the correct analysis looks like!
One field might hang me if I said “I threw it into a regression and these are the factors that fell out,” while another would be perfectly convinced by it. The standards of what constitutes sufficient evidence differ because it’s a socially-constructed decision. Some fields allow quasi-experiments, and some demand randomized controlled trials.
Even if multiple fields use the same methods, like, say, a linear regression that takes a boring table of data points, the devilish details of what’s considered a good variable will differ. Some fields have well-researched factors that are used for certain concepts, and you’d get looked at sideways if you naively invented your own.
Without specific domain knowledge about what the analysis is supposed to look like at the end, I’m at a serious disadvantage when it comes to figuring out how to get my data into a form that will support the desired outcome.
Since I think most people talking and writing about data these days are primarily focused upon single industries with shared contexts, the problem of differences between domains has been understated. Very few people hop between entire industries, and very few of the people who do wind up mentioning it.
Come to think of it, the people swapping industries might not even realize that they’re re-learning how to clean and interpret data during their transition. On the surface, it pretty much feels like just “learning how to do analysis to new standards”.
Data collection is tied to the specific domain
Thus far I’ve been harping on how things “downstream” of the data cleaning operation have a strong influence upon what cleaning actually needs to be done. If only it were that simple! Because things upstream have profound ramifications too!
In many situations, data collection processes are intimately tied to the base research question. In fact, a concept has been created specifically to describe situations where the data collection isn’t tied to the question at hand. That unique situation is called “found data,” and there are unique challenges in using that sort of data properly in an analysis.
So you’re working in one of two situations. One, you’re working on data that’s built for the analysis/research at hand, meaning you need to understand the original intent, verify that the intent was actually achieved, and fix anything that seems wrong there. Or, you’re working with found data, and now have to understand the mismatches between the original data collection intent and the research question’s needs.
Either one requires domain knowledge. It’s not advisable to just pick things to clean arbitrarily.
So to sum up, “data cleaning” is deeply embedded in an unbroken chain of meaningful decisions that starts at the research question, runs through data collection, and ends at data analysis and results interpretation. That’s why it feels impossible to teach just “cleaning”: there’s no clear demarcation of where to slice it out.
So, is it possible to teach data cleaning?
I think it’s near impossible in the general case, because even experts in handling data can’t be blindly thrown into arbitrary fields and successfully clean without context and help. But we should be able to teach people a couple of things that would prepare them for a life of handling messy data.
First, people need to decide what domain they care about
Since it’s obvious we can’t teach how to clean in the general case, people need to decide what they want to work with sooner rather than later. Pick the domain you want to slowly master, and realize that the further you stray from that domain, the less your knowledge may apply.
Next, we need to expose people to real, messy, data
One of the more common complaints I hear about various courses and bootcamps that teach data science related topics is that they focus heavily upon modeling and tech. You see students learning how to build their machine learning models off of cleaned, academic data sets.
Pedagogically, I can understand why that is. The academic data sets were created for research purposes, and that research is focused primarily upon algorithm development. They’re not really intended to be an exercise in data cleaning, so they come in a fairly nice form already. The classes that use these data sets are also ostensibly about machine learning algorithms, not handling dirty data. It’s seen as a waste of precious student time to have them reinvent the cleaning wheel before they can get on with the rest of their homework. All of those are pretty good reasons for why machine learning is taught that way. But people have been repurposing that coursework as an example of what data work is like, when it’s only a snapshot of one phase of data work. It doesn’t teach you how to deal with crappy data.
At the same time, there are actually very few publicly available data sets that are truly dirty and ugly. I recently wrote about using Wikipedia data as a source of dirty Big Data because it’s one of the few places such data is available. Most other dirty data sources are hidden away within companies.
So I think we need to be better at leading people to messy data, or encouraging them to generate their own data somehow.
Then, people need to do end-to-end analysis, with some help from experienced folk
As anyone who works with data will attest, data cleaning isn’t a discrete step. It’s an endless loop of trying to do an analysis, hitting a roadblock, going back to clean away the issue, redoing the analysis, and repeating. Occasionally, in a nightmare scenario, the analysis reveals that no amount of cleaning can make things right and new data must be collected.
People learning to clean data need to experience this loop. It’s the only way.
They need to own an analysis, have a research question they must answer, and struggle to find their way to the answer. They need people in the field, both upstream and downstream stakeholders, to interact with so they can absorb some expertise. We can’t shield them from this responsibility and ownership.
Instead, experienced people should be on hand to give advice on how to handle the data issues. It’s very easy to go down weird rabbit holes when you’re starting out, and we don’t want learners doing that too much.
But can someone learn it on their own?
Yes, but I think it takes significantly more effort than you’d expect.
Essentially you’d have to be willing to think really really hard about the entire data analysis pipeline, not just “the cleaning part”.
That’d include anticipating what stakeholders want, which is very hard without domain knowledge. Where would one get such domain knowledge? Very often it’s by working in a position that’s a stakeholder or partner of the data analyst. You’d either have the perspective needed to know what stakeholders want, or have seen things unfold.
I think this is actually why we see a lot of people successfully move from adjacent fields into data science. It’s the engineers who paid attention and leveraged their interest in data. It’s the designers or researchers who expanded their seat at the product design table and wound up leveraging more and more data. While there’s some skill overlap, the added insight of anticipating those stakeholder needs is extremely valuable.
Luckily, there are lots of people out there on the internet, on data twitter, who love to be helpful. And there’s tons of knowledge that’s necessary to learn but doesn’t even sit under the “data science” label. There’s lots of technical writing by engineers, for engineers, that explains how various systems work. There’s lots of material out there that teaches armies of graduate students how to ask good research questions and collect data.
Finally, when you have no other choice, there’s probably academic work available that shows some parts of its data processing in the methods section. You won’t see every single little detail, but you should be able to follow the general form of the process.