12 Comments

Fantastic post! Think I'll set myself a calendar reminder to read it again (more than once!) in the future!

Expand full comment
Sep 20, 2020Liked by Randy Au

Wow this was a a very well written article. While I was reading it felt as if you were directly speaking to me. Not only that, the eloquent way you broke down each point helped me to feel your struggle even though I know very little about this topic.

Never had I known that there was this data cleaning subculture within the tech community. As you said, if there is some way this could be automated then it would provide a lot of value to many people.

Great job!

Expand full comment

I skimmed, looking at the headers, but I'll come back to it. I have to clean data using SQL Server. If cleaning data is so important, and it is, why are there no courses on how to use SQL for ETL. Maybe I'm just missing them. I have found posts covering topics such as removing non-ANSI characters, but in general, I feel it's a good-luck-learn-on-your-own topic. You see dozens of SQL courses for Data Science, Data Analytics, etc. If anyone has seen or taken such a course, please let me know. Thanks.

Expand full comment
author

The short answer to why you don't see direct guides for "Use SQL for ETL" is because SQL, by itself, is insufficient to do ETL. SQL allows you to access data and then specify the shape of it as its returned. You can even save that output into other tables via append or creation commands. There's often no tutorials for how to do transformations in SQL because it all depends on what you need to do. There's math/aggregation functions to crunch numbers, date/time functions to handle dates, string functions to change strings. You learn the functions one by one as you discover a need for that operation.

BUT even if you master those functions, SQL isn't enough. There's not many mechanisms within SQL to trigger when to run your ETL query at a certain time or event. There's no facility to handle dependencies between queries in a multi-step transformation job. There's no facility for handling failures and restarts. For all of those features and more, you need the help of other software -- either a framework like Airflow or whatever, or a bunch of custom scripts that execute the ETL in the way that's needed. Most of those frameworks have ways within them that allow you to use SQL to do the data work.

Expand full comment

Thank you, Randy, for the long "short answer." Most of my "ETL" is manipulating strings, dates, running aggregations, updating tables, creating tables, etc. on imported .txt or .csv files, so it's not too hairy, but it has been a "groan-as-you-go" process. I've watched developers use Informatica, but it looked awful with all those tables and arrows, with more features and tools than I think I'd ever need. I'll take a look at Airflow though. All the best.

Expand full comment

I know this article was written a while back but just seeing it now. Great depth on an under-thought and over-talked topic. I wonder, what are your thoughts on an article I wrote in 2018 that cleanliness is, by definition, subjective? https://www.linkedin.com/pulse/practical-definition-data-cleansing-aka-wrangling-matt-wong-cfa/

Expand full comment

My last job was essentially a data cleaner and then a data cleaner trainer (although with a different title, thankfully). I actually quite enjoyed it (in between hating it and lots of frustration, of course!) and loved training people to open their eyes up to creating an intimate relationship with their data while learning, cleaning, and constantly questioning! :)

Expand full comment

The https://en.wikipedia.org/wiki/Climatic_Research_Unit_email_controversy highlighted how much expertise goes into data cleaning in Climate Research

Expand full comment

This is fantastic. Finally someone talking about this awful, horrible and utterly necessary step in nearly any data scenario. I loved the Philly spelling grid! I have several of those types of maps built, but that particular one really underscores how unpredictable data patterns can be, especially considering human entered input via UI when the UI is tantamount to a drunken goalie willing to let nearly any value pass by.

And to your point, you can't accommodate every cleaning pattern, there will always be one-offs and plenty of them, but you can accommodate MANY and in doing so studiously along the way, over time there is a lot less hassle and heartache involved when considering taking on a large scale project when you know you have only to come up with 'custom' handlers for data,

Gained a reader here - very nicely done.

Expand full comment

Nice Article on Data Cleaning versus Analysis. Keep it up Randy.

Expand full comment

Thank you for this great post. It changed my mind! I noticed a typo. Isn't it "an unpleasant" instead of "a unpleasant"? I may be wrong (i'm not a native speaker :)).

Expand full comment
author

No you're correct, it's a typo from a previous edit

Expand full comment