Balancing Who Handles Data Inconsistency
In Production, things inevitably get wonky, and that’s normal
If there’s one thing you learn over the years working with data, it’s that data quality “guarantees” are essentially well-intentioned fictions. Any time I get to touch an unfamiliar production database, I inevitably run some simple consistency checks: does this unique ID actually show up only once? Are there duplicated events? Are events that are supposed to co-occur exactly 1-to-1 really 1-to-1?
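To make that concrete, here’s a minimal sketch of the kind of spot checks I mean, in pandas. Every table, column, and event name here (events, event_id, checkout_started, and so on) is invented for illustration:

```python
import sqlite3
import pandas as pd

# Placeholder connection; in practice this points at a replica of the production DB.
conn = sqlite3.connect("production_replica.db")
events = pd.read_sql(
    "SELECT event_id, user_id, event_type, created_at FROM events", conn
)

# Does this "unique" ID actually show up only once?
print("duplicated event_ids:", events["event_id"].duplicated().sum())

# Are there fully duplicated events?
print("exact duplicate rows:",
      events.duplicated(subset=["user_id", "event_type", "created_at"]).sum())

# Are two events that should pair 1-to-1 actually 1-to-1?
counts = events.pivot_table(index="user_id", columns="event_type",
                            values="event_id", aggfunc="count").fillna(0)
print("users where the pairing breaks:",
      (counts["checkout_started"] != counts["checkout_completed"]).sum())
```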
When I check a system like this that’s been running for a year or more, there’s a very good chance at least some of those truths have been broken in a limited fashion. And in my opinion, within some reasonable bounds, this is okay, because the cost of preventing or dealing with these issues differs depending on the team.
Do the easy Eng stuff when you can, but eventually it gets too expensive
Why? All this stuff is supposed to be guaranteed by the database system! It’ll never happen!
The thing about contracts (both in life and in software) is that unless they’re enforced, unless there are actual teeth and consequences, they’re often ignored whenever it’s convenient for one or both parties. There’s also a cost to enforcement that might not be worth paying.
In the production systems I’ve worked with, the features that typically enforce consistency constraints, such as foreign key and check constraints, were explicitly not used for performance reasons, or simply didn’t exist (those dark NoSQL “ACID? that’s a drug, right?” days). It’s all just a social contract “enforced” within the application layer, not the DB layer. So the only reason there’s no duplicate row in the event log table is that the app knows not to write a duplicate one.
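A toy illustration of the difference, using sqlite purely because it fits in a snippet. The tables and the insert_event helper are made up; the point is where the “enforcement” actually lives:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DB-layer contract: the database itself refuses the duplicate.
conn.execute("CREATE TABLE event_log_strict (event_id TEXT PRIMARY KEY, payload TEXT)")
conn.execute("INSERT INTO event_log_strict VALUES ('evt-1', 'first write')")
try:
    conn.execute("INSERT INTO event_log_strict VALUES ('evt-1', 'retried write')")
except sqlite3.IntegrityError:
    print("DB rejected the duplicate")

# App-layer social contract: the table accepts anything, and uniqueness only
# holds as long as every code path remembers to check first.
conn.execute("CREATE TABLE event_log_loose (event_id TEXT, payload TEXT)")

def insert_event(event_id, payload):
    exists = conn.execute(
        "SELECT 1 FROM event_log_loose WHERE event_id = ?", (event_id,)
    ).fetchone()
    if exists is None:  # the "enforcement" lives here, in app code
        conn.execute("INSERT INTO event_log_loose VALUES (?, ?)", (event_id, payload))

insert_event("evt-1", "first write")
insert_event("evt-1", "retried write")  # skipped, until some code path forgets to use this helper
```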
Other times, there were built-in features such as auto-incrementing IDs that made sense in a single-host environment but eventually had complexity bolted on, like sharding, then multi-regional sharding. As complexity grew, that logic shifted into the application layer and became harder to enforce.
Most of the time, this all works surprisingly well thanks to the engineers who architect the systems and anticipate the obvious issues. The expected behavior is verified in tests and QA, and life generally proceeds.
Problems tend to creep in only after a system has been running in production for a while and starts accumulating unforeseen issues. Inevitably, the system goes down, or a bug is introduced into the app. Things get even more insane when you have to deal with a distributed system, which is much more likely to have unexpected and utterly bizarre failure modes. This is how wonky data gets around these soft “guarantees” and into your data set. Sometimes even “hard” guarantees get blown up because an unexpected failure mode uncovers a bug.
What? If you turn off consistency guarantees, you deserve what you get!
I’m fairly sure this isn’t Stockholm Syndrome
Sure? But here’s the thing: what happens if there really was a duplicated row (or a missing one) in your data? If you’re a bank and this is your transaction log, it’s a huge deal, even if it happens just once every 10 trillion transactions. If you’re just tracking how many users clicked a button, you’re not likely to even notice.
Pragmatism is important.
Meanwhile, the cost of maintaining consistency contracts can be significant. Database foreign key constraints might be okay, or not, depending on your exact hardware, DB workload, and schema. You need to run actual tests and benchmarks to figure out what FKs cost you. And if you need to guarantee read/write transaction consistency across the whole planet, you either wait on the speed of light or come up with crazy new algorithms like Spanner’s.
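The only way to know is to measure against something that resembles your workload. Here’s a toy sketch of that kind of benchmark; sqlite stands in for whatever database you actually run, and the schema is invented, so the numbers only illustrate the shape of the test:

```python
import sqlite3
import time

def time_inserts(with_fk, n=50_000):
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY)")
    conn.executemany("INSERT INTO users VALUES (?)", [(i,) for i in range(1000)])

    # Same child table, with or without the foreign key constraint.
    fk = "REFERENCES users(user_id)" if with_fk else ""
    conn.execute(f"CREATE TABLE events (event_id INTEGER PRIMARY KEY, user_id INTEGER {fk})")

    rows = [(i, i % 1000) for i in range(n)]
    start = time.perf_counter()
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
    conn.commit()
    return time.perf_counter() - start

print("inserts without FK:", time_inserts(with_fk=False))
print("inserts with FK:   ", time_inserts(with_fk=True))
```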
You could do after-the-fact checks to sound alarm bells if an issue is detected, but that’s a whole new suite of tests you need to write and maintain as the app evolves. Plus, it requires you to proactively anticipate the consistency issues to check for.
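Something like the sketch below, run on a schedule against a replica. The checks and table names are hypothetical, and the real cost is that this list has to grow and change in lockstep with the app:

```python
import sqlite3

# Hypothetical nightly job: every check is a query that should return zero rows.
CHECKS = {
    "duplicate event IDs":
        "SELECT event_id FROM events GROUP BY event_id HAVING COUNT(*) > 1",
    "orders pointing at missing users":
        "SELECT o.order_id FROM orders o "
        "LEFT JOIN users u ON u.user_id = o.user_id "
        "WHERE u.user_id IS NULL",
}

def run_consistency_checks(conn):
    failures = {}
    for name, query in CHECKS.items():
        bad_rows = conn.execute(query).fetchall()
        if bad_rows:
            failures[name] = len(bad_rows)
    if failures:
        # In a real setup this would page someone or file a ticket instead of raising.
        raise RuntimeError(f"consistency checks failed: {failures}")

# Wire this into whatever scheduler you already have, e.g.:
# run_consistency_checks(sqlite3.connect("production_replica.db"))
```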
For big NoSQL systems where you might only have eventual consistency, there might not be anything you can do about it short of switching to an entirely different system architecture (with all the costs that involves).
Relax, it’s probably fine
— What? But my data is messed up! I literally have a duplicate ID in a table where that field is supposed to be unique (true story). My world is all lies. How can this ever be “fine”? You’re crazy.
All that’s happened is you’ve found evidence of a bug. Take a breath and look at what was affected. Start the debugging process. Did you make a mistake? What are its main effects? Are critical systems affected? Is it an ongoing issue, or has it stopped already? Is there a signature you can use to identify problem areas?
Once you have a sense of what’s going on (and that it’s not a mistake), file a bug with an appropriate severity. If it’s a critical core business data source, it might warrant a P0.
Next, determine if you can work around the damage caused by the incident. For things with a strong human element, like ad-hoc and exploratory work, can you sidestep the issue by identifying the faulty rows somehow? Can you do your analysis in a degraded state, with less fine-grained metrics like unique counts and correlated proxy stats that are unaffected?
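As a sketch of what those workarounds can look like (the files, column names, and incident window below are all invented):

```python
import pandas as pd

# Hypothetical: clicks picked up duplicate rows during a known incident window,
# while sessions came from a separate, unaffected pipeline.
clicks = pd.read_parquet("clicks.parquet")
sessions = pd.read_parquet("sessions.parquet")

# Quarantine by signature: dedupe only inside the incident window.
bad = clicks["created_at"].between("2023-03-01", "2023-03-04")
clicks_fixed = pd.concat([clicks[~bad],
                          clicks[bad].drop_duplicates(subset="click_id")])

# Coarser metric that duplication can't inflate: daily unique clickers.
daily_unique_clickers = (clicks.assign(day=clicks["created_at"].str[:10])
                               .groupby("day")["user_id"].nunique())

# Correlated proxy from the unaffected source.
daily_sessions = (sessions.assign(day=sessions["started_at"].str[:10])
                          .groupby("day")["session_id"].nunique())
```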
Things are more complicated with automated data pipelines since those aren’t as flexible. But still, depending on what your models and systems are doing, it might be within acceptable error. You have to test to find out.
Core problem: we’re balancing the cost/benefit of guarding against data issues
Murphy’s Law applies to all things; every system will fail eventually. Perfectly clean data exists only in toy problem sets from school. The question is how many resources you’re willing to invest to guard against progressively rarer events.
As an organization, you’re balancing the ability to prevent or catch data issues with engineering before things happen against the time and risk of dealing with the aftermath when something does happen.
The engineering of consistency checks and constraints can be expensive, both in raw computational resources and in engineering hours to build and maintain. Plus, it’s another system that can potentially fail! Sure, there are a few relatively cheap things you can architect in from the start, but let’s admit that many of these efforts are typically… bolted on later.
On the other hand, if you don’t do a lot of consistency checking up front, when you find bad data you as the data scientist will be forced to deal with it.
So what can a data scientist do to deal with the mess?
Ad-hoc/exploratory processes
As I mentioned before, it depends on the type of DS work that’s done. Compared to a production system or pipeline, analysis and exploratory research is a very human and flexible thing. Since there’s lots of experimentation and exploration built into the process, it’s inherently more capable of handling bizarre data errors that would take a ton of engineering work to prevent or detect.
The power to use approximations is probably the strongest weapon we have when dealing with messed up data. Given what we know about how everything works and the nature of the bug creating the bad data, it’s often possible to derive useful approximations, or upper and lower bounds, that at least provide partial information.
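For example, with a duplicated-rows bug you can often bracket the true count even if you can’t recover it exactly. A tiny sketch, again with invented column names:

```python
import pandas as pd

clicks = pd.read_parquet("clicks.parquet")  # hypothetical table with some duplicated rows

# If the bug can only add spurious copies of real rows, the truth is bracketed:
upper = len(clicks)  # counts every row, duplicates included
lower = len(clicks.drop_duplicates(subset=["user_id", "item_id", "created_at"]))
# the lower bound also merges any legitimately identical events, hence "lower"

print(f"true click volume is somewhere in [{lower}, {upper}]")
```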
This ability to work with degraded data sets is why, if I were forced to choose a side, I’d devote fewer resources to ensuring up-front data consistency on data sets that aren’t mission critical. It’s more efficient to work around incidental bugs during occasional consumption.
You still need the data to be reasonably clean and reliable! But you shouldn’t be trying to plan ahead for every single black swan event you can imagine. Things will go wrong and you’ll deal with it when it happens.
Automated processes
Depending on the specific situation, data issues can have a huge effect on a model, or no effect at all. Only you as model designer will be able to know. But even with the huge range of possibilities there are a few common situations we can talk about.
If the data issue has been around a long time, you might not need to do anything immediately because the model was literally trained on the bad data and was launched after being judged to be giving reasonable output. Ironically, you need to test if your model can handle receiving clean and bugfixed data. This brings up all sorts of epistemological questions about the original validity of the model, but from a pure black-box output viewpoint it works.
Similarly, since many models prioritize recent data, old, bad data has a tendency to “age out” of the system. It might not even be worth fixing the historical data if a fix would take longer than the month or so it takes for things to self-correct.
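To put a rough number on “ages out”, here’s a toy recency-weighting curve. The 30-day half-life is invented, but it shows how quickly a window of bad data loses influence on its own:

```python
import numpy as np

# Hypothetical recency weighting with a 30-day half-life, roughly what a
# pipeline that "prioritizes recent data" amounts to.
half_life_days = 30
age_days = np.array([1, 7, 30, 90, 180])
weights = 0.5 ** (age_days / half_life_days)
print(dict(zip(age_days.tolist(), weights.round(3).tolist())))
# {1: 0.977, 7: 0.851, 30: 0.5, 90: 0.125, 180: 0.016}
```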
It’s also rare (but not unheard of) for a data issue to trigger a catastrophic re-analysis and rewrite of a model, because models tend to pull in many inputs together. A data bug would need to hit a really wide area to corrupt a bunch of model inputs at once, since almost all models have a feature reduction step that tries to minimize the correlation between variables.
One example of such a bug would be if important data points were dropped in a very biased manner, introducing a bunch of skew into distributions: something like a system that systematically fails to identify purchases from large customers, but only those in the state of California. Hopefully these sorts of bugs are big enough that they get found very quickly, because they should have big effects on other things in the business.
So we’re just supposed to accept that data quality takes a back seat?
NO! We’re not doormats.
What I’m saying is that the data science process has more natural shock absorbance. The marginal cost-of-improvement curves for data quality versus human-in-the-loop analytics processes are very different, and there is a point where it makes sense to say “okay, our data consistency safeguards are as good as we can afford to make them for now”.
When do we know we have to improve data consistency safeguards?
The responsible answer is when you realize that some process and piece of data has become more important in your business and therefore you should invest more in it. This means you need to do periodic reviews and be proactive. It is really hard to do well and I only really do it when something is in active development and I’m actively thinking about these issues.
The honest answer is: when you find a bug and are made to realize how important something is. These things can easily sneak up on even the best of us as systems change. We might not have visibility into all possible changes, and we never have 100% clarity over the side effects of every last aspect of a system.
It’s always a balanced dance. You’ll never be fully comfortable with what you have, and that discomfort is probably a good sign that you’re close to a manageable spot.