Attention: As of January 2024, we have moved to counting-stuff.com. Subscribe there, not here on Substack, if you want to receive weekly posts.
[Quick note from Randy] A pair of guest posts on data quality this week! Part one is today; part two will be sent this Thursday.
Hey! My name is Ivan Zhang and I’ve been a long-time reader of Randy’s blog. It’s nice to meet you all. I’ve been reading about and talking to a lot of experienced data engineers and scientists in the data testing/monitoring space for a while now. I am also building a tool called Panda Patrol around the same set of issues. I wrote this post to consolidate some of my perspectives on the subject, as influenced by my software development and ML background. I hope you enjoy it.
The Big Bad Data Monster
I still reminisce about the early days when I first became acquainted with data. Data was the magical life force that went into my models and programs and in turn powered beautiful visualizations as well as useful and genuinely accurate insights. But data is not the person you meet and become life-long friends with. It’s more like the friend of a friend who you meet for the first time and have a good time with.
And then it goes downhill. You slowly lose the rose-colored glasses: this stranger sneezes into their hands (instead of their elbows), leaves a mess wherever they go, and does other minor and/or major things that create an internal sense of anger and distaste. You slowly but surely start to see data’s many flaws. You built a cool customer data dashboard — well, now it’s broken and it’s your responsibility to find the problem and fix it.
How do I fix it? Maybe the data just didn’t come in. Maybe someone changed the name of a column. Maybe someone entered an extra zero or a random character in one of the columns. Oh god, I really hope it isn’t some random, hard-to-find upstream problem that will make me want to pull all my hair out, cry, and take at least several days to find and fix. (Unfortunately, it’s probably the last case.)
The funny (and unfortunate) thing is that this is just the tip of the iceberg. Imagine how much more traumatizing this all is when you layer on more data sources that are each very diverse and complicated. Now imagine you don’t own some of those data sources (i.e. they’re coming in from external third-party providers) and someone at some Random Data Provider decides they want to denominate payments in cents instead of dollars. Better still, they decide to do it at 2 PM, which is coincidentally 2 AM in your timezone. Everything breaks, you get pinged, you have a terrible day, and you slowly but surely start to hate the data monster.
This problem has existed ever since people started using data, but it has become even more acute as people use more data from more sources. To get at the pot of gold guarded by the data monster, people have devised similarly powerful weapons to attack this messy beast.
This is actually not a 100% correct analogy. The data monster is more akin to a Hydra: fix one problem, introduce many others. Specifically, it would be foolish for these weapons to attack the monster head-on; automated fixes on erroneous data have a high likelihood of introducing more problems and “giving the data monster more heads.” Rather, these tools are more defensive — limiting the damage caused by a data issue through monitoring and faster resolution.
Playing Defense
Not all data tools are created equal. The same applies to the set of tools that people use to tackle data issues. Some of these tools provide general protections against data issues — imagine you put a barrier between you and the monster. The next level of tools provide more nuanced protections that you might define yourself — imagine installing systems that go off whenever the monster starts moving or shows signs that it’s about to attack. The final level of tools are even more nuanced to the point where you’re monitoring for the important and specific details that could signal trouble to come — imagine monitoring and detecting the pattern of the monster’s movement, its heart rate, or even its facial expressions. (That is the end of this hopefully not too drawn out and somewhat useful monster analogy).
The next few sections will explore these defenses. The first section will compare the techniques that data practitioners use to monitor for data issues with the techniques that engineers use to monitor for infrastructure and program issues. The sections after that will explore the different categories of tools used to monitor data issues — starting with tools that tackle higher-level data issues (i.e. data definition and schema) and narrowing all the way down to tools that validate data at a per-row level. These sections will summarize each category of tools and provide a high-level overview of what each category aims to tackle. Each summary is somewhat opinionated (as with all cool data topics 🙃) but is motivated by the synthesized conversations that I’ve recently had with data quality practitioners who are way more experienced and smarter than I am (if you’re reading this, thank you!).
Similar but not the Same
Data monitoring as categorized above shares many parallels with the way that programs and code are monitored and tested. I’d like to think that data observability is akin to infrastructure monitoring, where instead of monitoring your servers, you’re monitoring your tables and databases. This level of monitoring is similarly general: it casts a wide net over all the potential issues rather than targeting particular aspects that could break. The next level, data quality, shares many attributes with integration testing. Both data quality and integration testing tools allow developers to better home in on and define their potential issues, which in turn gives you more context and information if something goes wrong. But potential issues can still sneak their way through these checks, so you reach for an even more fine-grained tool. Like unit tests, data integrity tests are implemented to stomp out specific and nuanced issues that might arise in the data. It’s easy to assume that testing data is exactly like testing code. In fact, many of the tools used for testing code (i.e. infrastructure monitoring) are hacked to be compatible with testing data. However, this comparison goes about as far as saying that data analysts are the same as data scientists — which is to say that it goes far but falls short of the target.
To draw parallels between testing code and data, I’m making the gross simplification of interchangeably using “monitoring data” and “testing data.” Data is monitored — not tested. This is the first core difference between code and data. Code remains static once it’s deployed. You can test this code and be relatively confident that the code won’t change on its own. You don’t have to continuously monitor code. Data is dynamic. Data coming in from other sources can change in unpredictable and varying ways. Thus the initial tests you have on the data have to become monitors that continuously check the incoming data. In this respect, testing data combines characteristics of unit testing and infrastructure monitoring.
Another major difference exists: monitoring data is a lot harder than testing code. When you test code, you have a general sense of the edge cases and scenarios to test for, and you can get a view of how well your code is tested through metrics like test coverage. Monitoring data is a lot more undefined. First, you have to define what “good data” means. What is the expected range? What is the expected format? Are the relationships between data points accurate? And so on, through more complex and nuanced questions specific to the data you’re using. After defining these attributes of good data, you can start defining monitors. There is a general set of things you can monitor for — out-of-range values, invalid values, null values, and so forth. Then there is an infinite set of equally valid but more specific monitors. However, when defining these monitors, there is an underlying challenge of balancing usefulness and specificity. Specific data monitors will catch specific issues, but they are also more likely to throw false positives. The combination of false positives and the alerting systems attached to data monitors may bring about an alert hell that you never escape from. This is all to say that monitoring data is a lot harder than testing code and has to be dealt with on many different levels — proceed with caution.
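To make the test-versus-monitor distinction concrete, here is a minimal Python sketch of the kind of checks described above. The column name (`sale_amount`) and thresholds are hypothetical, and a real system would run `run_monitors` on every incoming batch rather than once — that continuous re-running is exactly what separates a monitor from a one-off test:

```python
def check_not_null(rows, column):
    """General monitor: flag row indices where a required column is missing."""
    return [i for i, row in enumerate(rows) if row.get(column) is None]

def check_in_range(rows, column, lo, hi):
    """More specific monitor: flag rows whose value falls outside [lo, hi].
    More specific means more informative failures, but also more false positives
    if the business legitimately changes (e.g. a new, pricier product)."""
    return [
        i for i, row in enumerate(rows)
        if row.get(column) is not None and not (lo <= row[column] <= hi)
    ]

def run_monitors(rows):
    """Run every monitor on a batch; in production this would be triggered on
    each ingestion, since data (unlike deployed code) keeps changing."""
    failures = {
        "null_sale_amount": check_not_null(rows, "sale_amount"),
        "sale_amount_range": check_in_range(rows, "sale_amount", 12, 5000),
    }
    return {name: idx for name, idx in failures.items() if idx}

batch = [
    {"sale_amount": 25},
    {"sale_amount": None},
    {"sale_amount": 7500},  # out of range: a data bug, or just a false positive?
]
print(run_monitors(batch))  # → {'null_sale_amount': [1], 'sale_amount_range': [2]}
```

Note that the range monitor can only tell you *that* index 2 is out of range — not whether the value is genuinely wrong, which is the false-positive tension described above.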
Data Observability - General
The first way people monitor data is through data observability tools. Data observability is like infrastructure observability, but tailored specifically to keep tabs on the quality of a company’s data sources, pipelines, and infrastructure and to detect data issues. These tools cover a lot of the basic and common problems through techniques like automated monitoring, root cause analysis, and data health metrics. More specifically, data observability usually takes the form of operational checks like pipeline status and delays, general dataset monitors like the latest arrival date, and summary-level insights on columns and rows.
That is to say, data observability goes a long way toward addressing some of the more major issues when it comes to data quality — things like “is the data being ingested,” “are the column values the expected type,” “do the rows contain the required fields,” and so forth. Unlike highly specific unit tests, these checks look at the fundamental properties of data and are still general enough to be valuable and applicable to many different types of data. However, this generality comes with tradeoffs. One tradeoff is that these monitors detect the obvious problems that someone could have easily discovered by looking at their databases. Another is that they reveal less information relative to more specific monitors. For example, you find out the values in a column are out of range but don’t know exactly why: was the original data unexpectedly high, did the denomination of the payments change, or was it a data issue that is far more nefarious and undetectable? Data that passes these general observability monitors is not necessarily good-quality data. By casting a wide net, you catch all the big fish but lose out on the smaller ones — and the smaller ones are more likely to make you want to pull your hair out. To start catching these more nuanced issues, data practitioners turn to less general monitors.
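As a hedged sketch of what these observability-level checks look like in practice, here are freshness (“is the data being ingested?”) and volume checks in plain Python. The thresholds (24 hours, 1,000 rows) are hypothetical defaults, not values from any particular tool:

```python
from datetime import datetime, timedelta, timezone

def freshness_ok(latest_ingested_at, max_lag=timedelta(hours=24)):
    """General monitor: has any data arrived within the expected window?"""
    return datetime.now(timezone.utc) - latest_ingested_at <= max_lag

def volume_ok(row_count, expected_min=1000):
    """General monitor: did roughly the expected amount of data arrive?"""
    return row_count >= expected_min

# A table last updated two hours ago with 5,000 new rows passes both checks;
# one last updated three days ago with 12 rows fails both.
recent = datetime.now(timezone.utc) - timedelta(hours=2)
stale = datetime.now(timezone.utc) - timedelta(days=3)
print(freshness_ok(recent), volume_ok(5000))  # → True True
print(freshness_ok(stale), volume_ok(12))     # → False False
```

Notice how little these checks say about the *content* of the rows — which is precisely the generality tradeoff described above.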
Data Quality - Less General
For some, data quality is encapsulated by data observability, and they wouldn’t necessarily be wrong. (I feel like many data debates are like political or religious debates, where no one is necessarily right or wrong but everyone is very opinionated about their stance nevertheless.) Data observability, data quality, and data integrity (detailed in the next section) exist on a spectrum, and the intention here is to define some useful frameworks for bucketing different tools.
Data quality is a more focused approach towards validating data by surfacing aggregate metrics. It provides useful axes through which data practitioners can more carefully and intentionally define ways to monitor their data. There are, unfortunately, many different axes out there that people use. Here is a lengthy but not necessarily comprehensive list (where I sometimes take a shortcut and use the word in its definition):
Accuracy: How well data reflects reality
Completeness: How complete (i.e. values exist) each column is
Consistency: If values are the same across different teams, tables, etc.
Uniqueness: How unique the values in a column are (i.e. particularly for id columns)
Freshness: The latest arrival/ingestion time of the data
Volume: The amount of data that is ingested
Schema: If columns are added and/or removed
Distribution: Whether the data’s distribution changes (i.e. values go out of range, data that is expected to be uniform no longer is, etc.)
Correctness: Whether the data gives the right/expected values after being plugged into a formula/algorithm
Just listing these axes doesn't detect these nuanced data issues. Rather tools operating within this category define more narrowly-scoped models and techniques to monitor data along several of these axes. This often comes in the form of SQL statements that come pre-built and/or are user-defined. For instance, imagine that you own a toy store and you know that your cheapest toy is $12 and the most expensive is $5000. To test the distribution of your transactions data, you can use the following SQL query:
SELECT
    MIN(sale_amount) AS min_sale_amount,
    MAX(sale_amount) AS max_sale_amount,
    CASE
        WHEN MIN(sale_amount) < 12 AND MAX(sale_amount) > 5000
            THEN 'Both bounds out of range'
        WHEN MIN(sale_amount) < 12 THEN 'Minimum below range'
        WHEN MAX(sale_amount) > 5000 THEN 'Maximum above range'
        ELSE 'Within range'
    END AS range_check
FROM sales_data;
These SQL statements can be seen as gatekeepers that observe data and watch for malicious data points trying to sneak into your databases. In contrast to data observability tools, data quality tools let practitioners really start to home in on bad-data detection, taking a further step toward ensuring that the data in the pipelines is of good quality. Data that passes data quality monitors is fit to be used; users of this data can be confident along the same axes it was monitored on (i.e. accuracy, freshness, etc.).
However, data quality tools still face the same issues that plague data observability tools. Poor-quality and erroneous data can still find its way through, and when it does, it will lead to even bigger headaches as you dismantle your data pipelines trying to find the origin of the problem. When done incorrectly and without much consideration, data quality tools can cause further headaches: poorly defined tests result in false positives and false alerts that further aggravate already-stressed data teams. These poorly written data tests usually occur when a company buys a convenient off-the-shelf tool rather than tackling the harder problem of establishing clear communication between the people (like data scientists and data engineers) who interact with and actually understand the data. Such low-signal data tests can give organizations an inflated sense of confidence in the quality of their data.
Data Integrity - Specific
This leads to the final boss of data monitoring. Once you have checked the data from a bird’s-eye view (via data observability) and an aggregate view (via data quality), you check the data at the ground level through data integrity. Data integrity is like a surgical knife. You start to define specific tests on your data — things that stretch the limits of SQL and often require Python. These tests can include checking whether a URL value actually works, looking for bias in column values, using specific models to detect fraud, and so forth.
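As one hedged example of such a row-level integrity check, here is a Python sketch that validates a URL column. A production monitor would likely issue an HTTP request per URL to confirm it actually resolves; this version only validates structure so the example stays self-contained, and the column name `homepage_url` is hypothetical:

```python
from urllib.parse import urlparse

def url_is_wellformed(value):
    """Structural check: an http(s) scheme plus a network location."""
    parsed = urlparse(value or "")
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

def check_url_column(rows, column="homepage_url"):
    """Return the indices of rows whose URL value fails the check."""
    return [i for i, row in enumerate(rows) if not url_is_wellformed(row.get(column))]

rows = [
    {"homepage_url": "https://example.com"},
    {"homepage_url": "not a url"},
    {"homepage_url": None},
]
print(check_url_column(rows))  # → [1, 2]
```

Unlike an aggregate data quality check, a failure here points at the exact offending rows, which is the extra context the next paragraph describes.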
In contrast to data observability and data quality, data integrity cares a lot more about the actual values of the data. Furthermore, when there is an issue, the monitor gives you a lot more context about the specific data problem; in many instances, the failing monitor will tell you exactly how to go about resolving it. Data integrity monitors also begin to touch on more operational aspects of the data. Oftentimes you use the data to generate metrics and insights, and data integrity tests can monitor those metrics and insights to get ahead of problems before they flow down the data pipeline.
And so?
We’ve established that sooner or later — with more diverse and bigger data pipelines — people start running into monstrous data issues that they have to address. These issues are becoming more prevalent as companies incorporate more data into every component of their software, and that data is becoming more complicated and harder to handle (think unstructured data). Furthermore, these issues grow larger and more complex the longer they’re ignored. All this is to say that the data monster is here to stay, and it’s better to start sooner rather than later. I’ve outlined above the various categories of ways people can approach these issues. In the next post, I’ll give more specific and concrete ways to tackle them.
This was a relatively lengthy post and I want to thank you for getting this far. This is my first time writing a blog post to a wide audience. I hope you enjoyed the read and got some useful insights out of it. I do have one ask. If you have any feedback and/or insights about this subject, feel free to comment or email me at ivanzhangofficial@gmail.com. Would love to improve and have some fun debates about these topics. Thank you!
Standing offer: If you created something and would like me to review or share it with the data community — just email me by replying to the newsletter emails.
Guest posts: If you’re interested in writing a data-related post to show off work or share an experience, or if you need help coming up with a topic, please contact me. You don’t need any special credentials or credibility to do so.