Fantastic post! Think I'll set myself a calendar reminder to read it again (more than once!) in the future!
Wow, this was a very well written article. While I was reading, it felt as if you were speaking directly to me. Not only that, the eloquent way you broke down each point helped me feel your struggle even though I know very little about this topic.
I never knew there was a data cleaning subculture within the tech community. As you said, if some way could be found to automate this, it would provide a lot of value to many people.
Great job!
I skimmed, looking at the headers, but I'll come back to it. I have to clean data using SQL Server. If cleaning data is so important, and it is, why are there no courses on how to use SQL for ETL? Maybe I'm just missing them. I have found posts covering topics such as removing non-ANSI characters, but in general, I feel it's a good-luck-learn-on-your-own topic. You see dozens of SQL courses for Data Science, Data Analytics, etc. If anyone has seen or taken such a course, please let me know. Thanks.
The short answer to why you don't see direct guides for "Use SQL for ETL" is that SQL, by itself, is insufficient to do ETL. SQL allows you to access data and specify the shape of it as it's returned. You can even save that output into other tables via append or creation commands. There are often no tutorials for how to do transformations in SQL because it all depends on what you need to do. There are math/aggregation functions to crunch numbers, date/time functions to handle dates, string functions to change strings. You learn the functions one by one as you discover a need for that operation.
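To make the point concrete, here's a minimal sketch of that kind of in-query cleanup, using Python's built-in sqlite3 so it can run anywhere. The table and column names are invented for illustration; string functions normalize a messy city field, a date function buckets rows by month, and an aggregate crunches the numbers, all in one statement.

```python
import sqlite3

# Hypothetical raw table: inconsistent casing/whitespace in `city`,
# ISO date strings in `ordered_at`.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (city TEXT, ordered_at TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [("  philadelphia", "2023-01-15", 10.0),
     ("PHILADELPHIA ", "2023-01-20", 5.5),
     ("Philadelphia", "2023-02-01", 7.25)],
)

# String functions clean the key, a date function derives the month,
# and SUM aggregates -- transformation expressed purely in SQL.
rows = conn.execute("""
    SELECT UPPER(TRIM(city))             AS city,
           STRFTIME('%Y-%m', ordered_at) AS month,
           SUM(amount)                   AS total
    FROM raw_orders
    GROUP BY 1, 2
    ORDER BY month
""").fetchall()
```

The same idea carries over to SQL Server with its own function names (`LTRIM`/`RTRIM`, `FORMAT`, etc.); the dialect differs but the pattern doesn't.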
BUT even if you master those functions, SQL isn't enough. There aren't many mechanisms within SQL to trigger your ETL query at a certain time or on a certain event. There's no facility for handling dependencies between queries in a multi-step transformation job. There's no facility for handling failures and restarts. For all of those features and more, you need the help of other software -- either a framework like Airflow, or a bunch of custom scripts that execute the ETL in the way that's needed. Most of those frameworks have ways within them that allow you to use SQL to do the data work.
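For a sense of what the "bunch of custom scripts" option looks like, here's a toy sketch of an orchestrator that supplies the three things plain SQL lacks: step ordering via declared dependencies, and simple retry-on-failure. Everything here (step names, the SQL, the retry count) is invented for illustration; real frameworks like Airflow do the same job with far more robustness.

```python
import sqlite3
import time

# Each step is (sql, [dependencies]); names and SQL are made up.
STEPS = {
    "extract":   ("CREATE TABLE staging AS SELECT 1 AS n UNION SELECT 2", []),
    "transform": ("CREATE TABLE clean AS SELECT n * 10 AS n FROM staging", ["extract"]),
    "load":      ("CREATE TABLE final AS SELECT SUM(n) AS total FROM clean", ["transform"]),
}

def run_pipeline(conn, steps, max_retries=2):
    done = set()

    def run(name):
        if name in done:
            return
        sql, deps = steps[name]
        for dep in deps:          # resolve dependencies first
            run(dep)
        for attempt in range(max_retries + 1):
            try:
                conn.execute(sql)
                done.add(name)
                return
            except sqlite3.OperationalError:
                if attempt == max_retries:
                    raise         # give up after the retries are exhausted
                time.sleep(0)     # real code would back off here

    for name in steps:
        run(name)

conn = sqlite3.connect(":memory:")
run_pipeline(conn, STEPS)
total = conn.execute("SELECT total FROM final").fetchone()[0]
```

The SQL still does all of the data work; the surrounding Python only decides what runs, in what order, and what happens on failure -- which is exactly the division of labor the frameworks formalize.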
Thank you, Randy, for the long "short answer." Most of my "ETL" is manipulating strings, dates, running aggregations, updating tables, creating tables, etc. on imported .txt or .csv files, so it's not too hairy, but it has been a "groan-as-you-go" process. I've watched developers use Informatica, but it looked awful with all those tables and arrows, with more features and tools than I think I'd ever need. I'll take a look at Airflow though. All the best.
I know this article was written a while back, but I'm just seeing it now. Great depth on an under-thought and over-talked topic. I wonder, what are your thoughts on an article I wrote in 2018 arguing that cleanliness is, by definition, subjective? https://www.linkedin.com/pulse/practical-definition-data-cleansing-aka-wrangling-matt-wong-cfa/
My last job was essentially a data cleaner and then a data cleaner trainer (although with a different title, thankfully). I actually quite enjoyed it (in between hating it and lots of frustration, of course!) and loved training people to open their eyes up to creating an intimate relationship with their data while learning, cleaning, and constantly questioning! :)
The Climatic Research Unit email controversy (https://en.wikipedia.org/wiki/Climatic_Research_Unit_email_controversy) highlighted how much expertise goes into data cleaning in climate research.
This is fantastic. Finally someone is talking about this awful, horrible, and utterly necessary step in nearly any data scenario. I loved the Philly spelling grid! I have several of those types of maps built, but that particular one really underscores how unpredictable data patterns can be, especially with human-entered input via a UI, when the UI is tantamount to a drunken goalie willing to let nearly any value pass by.
And to your point, you can't accommodate every cleaning pattern; there will always be one-offs, and plenty of them. But you can accommodate MANY, and by doing so studiously along the way, over time there is a lot less hassle and heartache involved in taking on a large-scale project, when you know you only have to come up with 'custom' handlers for the exceptions.
Gained a reader here - very nicely done.
Nice article on data cleaning versus analysis. Keep it up, Randy.
Thank you for this great post. It changed my mind! I noticed a typo. Isn't it "an unpleasant" instead of "a unpleasant"? I may be wrong (I'm not a native speaker :)).
No, you're correct; it's a typo from a previous edit.