[Quick note from Randy]: Here’s part two of Ivan’s guest post about data quality.
The data monster mentioned in the previous post is something that every data team is dealing with or will have to deal with eventually. It's an endemic problem, and one that will persist and even intensify as companies grow hungrier for data and for the insights that drive critical business decisions. This explosion in demand for data and quality insights has, in turn, fueled demand for better validation and quality checking of data.
Many offerings have popped up to meet this growing demand, spanning the three different and equally important tiers of data validation and quality described in the previous blog post. These tools provide features within each tier but predominantly target one specific category. For instance, Monte Carlo has offerings in both data observability and data quality but is predominantly focused on observability. The next few sections categorize each company according to its core offering and its approach to data validation and quality. These tools go a long way toward addressing the data problem, but not all the way. They fall short on the long-tail and nuanced problems, the ones that cost the most time and effort. Even with these tools, a huge gap persists, making it difficult for data teams to have complete confidence in their data. This post will discuss this dangerous gap and an intuitive, easy approach to bridging it.
Quick Tour of the Current Tools
The following sections will provide a quick and dirty overview of the current tools that some data teams use to tackle their data quality issues. This overview won’t exhaustively cover the features of each tool but should provide useful guidelines when evaluating which to use.
Data Observability
Data observability is the big daddy category of all the various data quality and validation-related tools out there. It is the most competitive category and has some of the most mature offerings. These offerings usually come in the form of one-size-fits-all tools that can be applied across many of an organization's data components (e.g. databases, data pipelines, etc.). With that generality come tradeoffs (see the next section). These tools are best suited for large organizations with lots of data and big budgets, since costs scale with data volume and can grow to tens of thousands of dollars, if not more.
Monte Carlo specializes in end-to-end data observability. This focus on end-to-end coverage means that Monte Carlo lets teams not just identify issues but pinpoint exactly where an issue came from and fix it. It also ships with many monitors (some AI-based) to catch issues.
Anomalo is similarly deployed across your entire data infrastructure. However Anomalo’s focus is on providing comprehensive anomaly detection to validate, document, and monitor for inconsistencies in data.
BigEye does many of the same things as the previous two companies but emphasizes their ease of use and integrations across many different data stacks.
Telmai uses AI-driven analytics for data monitoring in addition to features like anomaly detection.
Decube requires data teams to take a more proactive approach towards data quality management by defining many of their own tests on top of automated features and dashboards.
Data Quality
Many companies start in the data observability category and then slowly build features that tackle data quality according to their customers' needs. All of the companies in the previous section have done just that. However, there's a subset of companies that place a stronger focus on the data quality features within their offerings. Their tools are not as mature but are undoubtedly useful in tackling data quality issues within a company's data.
Datafold is one of the companies in this category and places a heavy emphasis on improving data quality through real-time data profiling and validation that occur right as code is changed. Features like data diff, column-level lineage, and anomaly detection ensure data accuracy and trustworthiness. Another company is Qualytics, which tackles data quality through contextual data checks, anomaly detection, and remediation processes for when an issue does occur.
Data Integrity
Many companies have few if any features that target data integrity. This may be surprising, given that data integrity is a crucial component of becoming 99.9% confident in your data. The truth is that data integrity is a lot of work: it often requires deep knowledge of what the data looks like as well as the technical capability to develop precise, value-additive monitors for that data. Still, there are some solutions that tackle this important area.
Great Expectations is one of the biggest and most well-known solutions for data integrity. It's an open-source tool for data validation, profiling, and documentation that lets teams create and share Python-based data quality tests and expectations. These expectations can be general or highly specific, ensuring data integrity throughout the data lifecycle. Pandera is another open-source library that tackles data integrity. Like Great Expectations, Pandera focuses on data testing and validation in the Python ecosystem and offers a flexible, expressive API for defining data schemas and validating data frames. However, it lacks the flexibility to encode highly specific tests in the way that Great Expectations allows.
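To give a concrete sense of this style of tooling, here is a minimal Pandera-style sketch; the column names and checks are purely illustrative and not tied to any particular pipeline.

import pandas as pd
import pandera as pa

# Declare the assumptions held about a table: column types plus value-level checks.
schema = pa.DataFrameSchema(
    {
        "id": pa.Column(int, pa.Check.gt(0), unique=True),          # positive, unique identifiers
        "value": pa.Column(float, pa.Check.ge(0), nullable=False),  # non-negative, never missing
    }
)

df = pd.DataFrame({"id": [1, 2, 3], "value": [100.0, 200.0, 0.0]})

# validate() raises a SchemaError describing the failing checks if any assumption is violated.
schema.validate(df)

Great Expectations expresses the same kinds of assertions as named expectations, with more room for highly specific, custom checks.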
Does this solve all my data issues?
As you can see, there are a lot of tools out there that help you tackle data-related issues. In theory, the more the merrier. At a glance that statement seems to hold up: we've covered the entire gamut of potential data issues. However, a deeper look at each of the offerings reveals that there are still major shortcomings for a developer wishing to have more confidence in their data.
That's not to say that the existing offerings offer no value at all. The tools noted above are good at tackling their respective issues. The data observability tools provide good coverage of the major observability-related issues one might run into. Likewise, the data quality and integrity tools do a good job of letting developers catch more nuanced problems.
But these offerings come with drawbacks. One big drawback of data observability tools is that they require a lot of access. To be effective, they need access to data-related systems across the entire organization, which means that getting started requires a lot of buy-in and that using the tool is expensive (given the typical SaaS sales approach of these vendors). Price becomes the ultimate distinguishing factor, especially once you compare the interfaces of the data observability and quality tools (aside from Datafold) and see that they are more or less the same. Finally, due to their top-down nature, these data quality and observability tools have to appeal to the people who lead an organization's data and, more importantly, to the business users and analysts who are generating value from them and forking over large sums of money for them.
These business users are a less technical crowd. A tool built for non-technical users works well for analysts and business users but leaves out more technical users like data engineers and data scientists. This limits the total value these tools provide, since data engineers and data scientists are the ones who truly understand the data. These technical users manage the data pipelines and are the ones who address pipeline issues; they want to catch data problems as soon as possible for faster and better resolution. Analysts and business users, on the other hand, interact with data at the end of the pipeline (i.e. dashboards, insights, analytics, etc.). They test the data at the end.
Testing data at the end has its own set of notable drawbacks. First, it is counterintuitive: by the time data reaches the end, it has already been transformed and has already had many downstream impacts, which exponentially complicates remediation. Second, these tests are ultimately limited by the capabilities of SQL, which is impressive but by no means exhaustive; the more complicated, more insightful data monitors that would give full confidence in the data simply cannot be implemented.
That's not to say that there are no offerings catering to a more technical audience. Data integrity tools like Great Expectations and Pandera address the need for a more technical, developer-friendly tool that monitors complicated data at every step. However, these tools have glaring gaps of their own. Most data teams already have custom data monitors, and to adopt these tools they would have to replace those existing tests and rewrite them in the tool's format. For example, a data scientist who uses a custom model to test the data would be unable to reuse that test in these frameworks. In short, these more technical tools are more flexible but still leave much to be desired.
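To illustrate the kind of existing custom monitor that doesn't translate cleanly, here is a hypothetical drift check a data scientist might already be running inside a pipeline; the column name and threshold are made up for the example.

import pandas as pd
from scipy.stats import ks_2samp

def values_have_not_drifted(current: pd.DataFrame, baseline: pd.DataFrame, column: str = "value") -> bool:
    """Hypothetical custom monitor: compare this batch's distribution against a
    trusted baseline using a two-sample Kolmogorov-Smirnov test."""
    statistic, p_value = ks_2samp(current[column], baseline[column])
    return p_value >= 0.05  # fail the check when the drift is statistically significant

To adopt an expectation framework wholesale, a test like this would have to be rewritten or wrapped in the framework's own format before it could sit alongside the framework's built-in checks.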
Shift-left
So what would an optimal solution look like? Is there one? It would have to account for all the unique characteristics of testing data, and existing solutions don't. First, existing solutions don't meet data teams where they are. Data teams already have pipeline setups and platforms, and potentially even data checks within those pipelines; existing tools require them to completely replace those checks with the framework's own, a process that is time-intensive and comes with its own drawbacks. Second, existing solutions are ill-equipped to handle unstructured data, since they work within the structured paradigm of rows and columns. Third, as mentioned in the previous post, monitoring data requires flexibility akin to infrastructure monitoring (because data is constantly changing) combined with the defined nature of unit tests (because there are clear assumptions that can be tested). Existing solutions do not manage this balance well; they're either too structured or too general to provide value. Finally, to get closer to 99% data confidence, an optimal solution should get everyone who manages the data, especially data engineers and scientists, involved in the process. By involving more people, an organization's hive-mind understanding of its data can be used to create better-informed and more useful data monitors.
All of these characteristics lend themselves to a more shift-left approach to data monitoring. If you're like me and have never heard the term before, shift-left (with respect to data monitoring) means integrating data monitoring into data pipelines and infrastructure at every step of the way, in turn shifting the monitors leftwards, towards the start of the pipelines. It means taking a proactive approach to data monitoring instead of treating it as an afterthought. This approach allows issues to be caught as soon as possible and reduces the impact radius of downstream issues.
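As a rough sketch of what that shift looks like in practice (plain pandas and assert statements with made-up column names, just to show the idea): each step validates its own assumptions before handing data downstream, instead of relying on one check against the final table.

import pandas as pd

def extract() -> pd.DataFrame:
    raw = pd.DataFrame({"id": [1, 2, 3], "value": [100, 200, 0]})
    # Shift-left: check assumptions the moment data enters the pipeline...
    assert raw["id"].is_unique, "duplicate ids in source extract"
    assert raw["value"].ge(0).all(), "negative values in source extract"
    return raw

def transform(raw: pd.DataFrame) -> pd.DataFrame:
    out = raw.assign(value_share=raw["value"] / raw["value"].sum())
    # ...and again after each transformation, not only on the final dashboard table.
    assert abs(out["value_share"].sum() - 1.0) < 1e-9, "value shares should sum to 1"
    return out

transform(extract())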
Introducing Panda Patrol
Over the past few months, I've been working on and iterating on a package that embraces these needs and takes a shift-left approach to data monitoring. Panda Patrol is an open-source, Python-based package that can be added to any existing Python-based data pipeline at any step. It requires just one function call. It provides the following features:
General Monitors: Data monitors for accuracy, completeness, duplicates, enums, freshness, and volume that are run according to some heuristics
Anomaly Detection: Pre-built models to catch anomalous data issues
Dashboards for Custom Monitors: Custom dashboards by wrapping existing custom data monitors
Data Profiles: Pre-built profiles to provide higher-level details of the data in the pipelines
AI-Generated Monitors: Automatically generate data monitors based off the provided data
Monitor an Entire Step: Catch any errors that arise within a step of the pipeline
All of this requires minimal setup and provides a flexible, more developer-friendly approach to writing data monitors. Consider this simple Airflow DAG:
...  # existing imports (pandas, Airflow's DAG and PythonOperator, datetime)
+ from panda_patrol.checks import check
+ from panda_patrol.profilers import basic_data_profile

def get_values():
    data = pd.DataFrame(
        [
            {"id": 1, "value": 100},
            {"id": 2, "value": 200},
            {"id": 3, "value": 0},
        ]
    )
    # The two Panda Patrol calls: run general monitors and a basic profile on this step's output
+     check(data, "Get Values")
+     basic_data_profile(data, "Get Values", "Report")

get_values_operator = PythonOperator(
    task_id="get_values_task",
    python_callable=get_values,
    dag=DAG(
        "get_values",
        description="Get Values DAG",
        start_date=datetime(2017, 3, 20),
        catchup=False,
    ),
)
By adding just two function calls right within your nodes, you can set up general data monitors and profiles within your pipelines. With Panda Patrol, data teams can quickly get started with data monitors — even if they don’t know what to monitor for — and scale confidence in their data with more comprehensive data monitoring. Check out this quick demo of some of the tools and the open-source package.
Why Now?
These data issues are endemic and have existed for as long as people have used data. Big data exacerbated them: people started collecting data from everywhere and using it in every way imaginable. In between collection and use, data teams added a plethora of transformations that further morphed, grew, and complicated the data. Tack on the downstream dependencies and consumers of that data, and you end up with an insurmountable pile of data issues ready to cause time-intensive trouble at a moment's notice. And the issues from big data are just the tip of the iceberg.
Data issues are occurring more frequently and with greater negative impact. This is especially true when you factor in AI, which requires more unstructured, hard-to-manage data. These tailwinds make getting ahead of data issues and taking a proactive approach to data monitoring more of a requirement than a nice-to-have; otherwise data issues can quickly explode and become insurmountable. Higher-quality data will also become increasingly important as AI practitioners seek to optimize the performance of their models, and data monitoring will need to support these quality demands. Data monitoring therefore needs to be flexible and accessible for all the technical stakeholders and managers of data. Panda Patrol provides a quick and scalable way for data teams to do just that.
Ok but …
Despite the urgency of these data challenges, there may still be understandable reservations. First, tackling these data issues may seem like a huge undertaking. This is true: data teams will have to invest a decent amount of time and resources. Good teams implement these data quality measures right from the start; less prepared teams are forced to add them after an embarrassing data incident brings down production and/or gets them reprimanded by an executive. However, the savings that result from a robust data monitoring setup greatly outweigh the investment. Robust data monitoring reduces the number of data fires that always seem to occur at the least opportune times. It is also something that needs to be done eventually. Companies that procrastinate eventually reach an impasse with their massive quantities of data: more and more data causes them to procrastinate further on their data issues, until a major data issue breaks everything. They're then forced to work on data quality, oftentimes settling (and paying a lot) for insufficient tools that fall short of comprehensively addressing all data-related issues. These tools provide a band-aid rather than a remedy. All this is to say that, like final papers and projects, it's better to get started sooner rather than later.
One might also argue that this is a communication and process issue rather than a technical challenge. That argument is valid: a lot of the required investment stems from the need for the people interacting with the data (scientists, analysts, business users, and engineers) to communicate about their needs and demands for the data. An emphasis on data monitoring, especially with a more targeted and specific approach like Panda Patrol's, supplements this communication. First, it forces people working with data to be cognizant of, and constantly thinking about, its potential issues, and to be in constant communication with the other people handling and managing the data. In contrast to approaches like data contracts, monitoring provides a more real-time and accurate portrayal of the relationships between data stakeholders. Second, monitoring provides a centralized hub where you can see all of the assumptions your organization has about its data and whether those assumptions hold (i.e. whether checks succeed or fail).
One final point: like hand sanitizer, even in the best-case scenario you'll only catch 99.9% of data issues. Data is constantly changing, and as a result you'll never know everything that might go wrong. That doesn't mean you should give up on catching these issues. Rather, it means taking an incremental approach, with a solution flexible enough to account for the long tail of data issues. The ultimate goal is to manage and minimize problems, not to achieve absolute data perfection.
The Elephant in the Room
However, there's still a big elephant in the room that hasn't been touched on. Data monitoring is part of a larger problem that people have been trying to address since they started using data: gaps in data knowledge and communication within an organization. Ensuring that data teams are aligned and on the same page is crucial, yet most data organizations lack this cohesive understanding of their data. Some people have a nuanced understanding of the data, others have a higher-level understanding, and still others have little to none. These differences cause discrepancies in how people handle and manage their data, and those discrepancies become increasingly pronounced when data breaks or an issue arises. Data monitoring is just one big and useful piece of this massive puzzle.
Standing offer: If you created something and would like me to review or share it w/ the data community — just email me by replying to the newsletter emails.
Guest posts: If you're interested in writing a data-related post to show off your work or share an experience, or if you need help coming up with a topic, please contact me. You don't need any special credentials or credibility to do so.
Data quality (DQ?) comes in flavors. There is structured data that is collected under a system of validation routines and controls, where it's possible to know what it should look like; a mature suite of statistical control tools addresses this. Some data is structured Big Data collected in the wild, where the objective is to measure the contributions to aggregate noise relative to random variation; again, well-developed tools, such as ROC analysis, exist. Unstructured Big Data is yielding to AI tools.
In contrast, there is the realm of data small enough that individual GIGObytes drive the confidence intervals of any statistical test to widths that make it worthless. Here we pass from the realm of applied statistics into plumbing the secrets of the human heart. Call it data psychology. It's the area where analysts in business units not organized around Big Data spend the most time.
The self-inflicted wounds covering the data that makes it into the analyst's hands range from simple Excel contamination to cognitive bias. But the biggest problem is the failure to ask, "What question is this data supposed to answer?" Posing that question more often than not yields the blank stare signifying "I don't know what I'm really looking for and haven't a clue how I can possibly expect to recognize it when I see it."
That is where a good data analyst has the opportunity to contribute something useful, by triggering the thinking to pose the question. It’s an art, not a science.