12 Comments

Fantastic post! Think I'll set myself a calendar reminder to read it again (more than once!) in the future!

Expand full comment
Sep 20, 2020Liked by Randy Au

Wow this was a a very well written article. While I was reading it felt as if you were directly speaking to me. Not only that, the eloquent way you broke down each point helped me to feel your struggle even though I know very little about this topic.

Never had I known that there was this data cleaning subculture within the tech community. As you said, if there is some way this could be automated then it would provide a lot of value to many people.

Great job!

Expand full comment

I skimmed, looking at the headers, but I'll come back to it. I have to clean data using SQL Server. If cleaning data is so important, and it is, why are there no courses on how to use SQL for ETL. Maybe I'm just missing them. I have found posts covering topics such as removing non-ANSI characters, but in general, I feel it's a good-luck-learn-on-your-own topic. You see dozens of SQL courses for Data Science, Data Analytics, etc. If anyone has seen or taken such a course, please let me know. Thanks.

Expand full comment

I know this article was written a while back but just seeing it now. Great depth on an under-thought and over-talked topic. I wonder, what are your thoughts on an article I wrote in 2018 that cleanliness is, by definition, subjective? https://www.linkedin.com/pulse/practical-definition-data-cleansing-aka-wrangling-matt-wong-cfa/

Expand full comment

My last job was essentially a data cleaner and then a data cleaner trainer (although with a different title, thankfully). I actually quite enjoyed it (in between hating it and lots of frustration, of course!) and loved training people to open their eyes up to creating an intimate relationship with their data while learning, cleaning, and constantly questioning! :)

Expand full comment

The https://en.wikipedia.org/wiki/Climatic_Research_Unit_email_controversy highlighted how much expertise goes into data cleaning in Climate Research

Expand full comment

This is fantastic. Finally someone talking about this awful, horrible and utterly necessary step in nearly any data scenario. I loved the Philly spelling grid! I have several of those types of maps built, but that particular one really underscores how unpredictable data patterns can be, especially considering human entered input via UI when the UI is tantamount to a drunken goalie willing to let nearly any value pass by.

And to your point, you can't accommodate every cleaning pattern, there will always be one-offs and plenty of them, but you can accommodate MANY and in doing so studiously along the way, over time there is a lot less hassle and heartache involved when considering taking on a large scale project when you know you have only to come up with 'custom' handlers for data,

Gained a reader here - very nicely done.

Expand full comment

Nice Article on Data Cleaning versus Analysis. Keep it up Randy.

Expand full comment

Thank you for this great post. It changed my mind! I noticed a typo. Isn't it "an unpleasant" instead of "a unpleasant"? I may be wrong (i'm not a native speaker :)).

Expand full comment