Common Data Science Trap: Getting Systems To Agree
Beware! They will not agree! The question is to what degree
It is almost impossible to work in data science these days without dealing with multiple data systems that overlap in what they cover.
Maybe you have a website that uses both Google Analytics and an internal custom tracking system to collect analytics. Maybe you have one system that tracks web pixel hits and another that uses JavaScript fingerprinting. Or you have your own analytics system but are comparing notes with a third party.
Whenever there are multiple systems, one of the most fundamental questions to ask is whether the numbers from one agree with the numbers from the other. You would think that reconciling the systems would be easy to do.
But it isn’t.
Systems almost never agree unless they’ve been very specifically engineered to, and even then they probably won’t. Before you get sucked into a three-week project that will leave you pulling your hair out, take a step back and understand what you’re up against.
Definition conflicts
Sometimes, two reports pulling data from the exact same underlying data set don’t agree. This usually boils down to a difference in the definition of something. Does your “active user” metric include bots? Does it exclude users who haven’t purchased, or count users over the past 28 days instead of 30?
Because there are many ways to analyze a data set, you have to be very clear about what a metric is measuring. Everyone also has to be careful not to use the same term for different definitions. This is especially true when working with people across different teams who may not be in close contact with one another and may have developed their own sets of definitions independently.
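To make this concrete, here’s a toy sketch in Python (the event table and its columns are invented for illustration) of how two teams can compute “active users” from the exact same data and get different numbers:

```python
import pandas as pd

# Hypothetical event log: one row per user action.
events = pd.DataFrame({
    "user_id":   [1, 1, 2, 3, 4, 5],
    "is_bot":    [False, False, False, True, False, False],
    "purchased": [True, False, False, False, False, True],
    "days_ago":  [2, 35, 10, 5, 29, 1],
})

# Team A: anyone with an event in the past 30 days, bots included.
active_a = events.loc[events["days_ago"] <= 30, "user_id"].nunique()

# Team B: past 28 days, bots excluded, purchasers only.
mask_b = (events["days_ago"] <= 28) & ~events["is_bot"] & events["purchased"]
active_b = events.loc[mask_b, "user_id"].nunique()

print(active_a, active_b)  # same data, same metric name, different numbers
```

Neither number is wrong; they’re answering different questions, which is exactly why the definitions have to be written down and agreed on.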
Reconciling these often means looking into the data processing pipelines and checking the logic used to pull the data. Sometimes there are bugs in those implementations that you need to fix, and other times once the definitions are reconciled things match up exactly.
While many people would say that this problem could be solved/avoided with a proper data dictionary, in practice unless you have a very strong process for keeping the dictionary updated, it will only prove effective for the most ubiquitous and stable metrics. Things that change quickly will often become stale without a lot of enforcement.
Implementation conflicts
Most systems don’t measure things in exactly the same way as one another. The specifics matter a huge amount, even when the conceptual definition is the same.
For example, a system that tracks hits by looking at server request logs will not agree with a system that requires the client to run JavaScript code, because the client might not be able to run JavaScript.
In such a situation, the server logs would tend to count more hits than the JavaScript tracker… unless you have a single-page design that updates the page without full page reloads but fires off a tracking event when logical reloads happen. Then the gap can swing the other way, depending on your users.
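If it helps to see the swing, here’s a minimal simulation, with completely made-up rates for JavaScript blocking and single-page navigations:

```python
import random

random.seed(0)

server_hits = 0  # counted from request logs
js_hits = 0      # counted from a JavaScript tracker

for _ in range(10_000):
    runs_js = random.random() > 0.10        # assume ~10% of clients never run JS
    spa_navigations = random.randint(0, 3)  # in-page "reloads" with no server request

    server_hits += 1                        # the initial page load always hits the server
    if runs_js:
        js_hits += 1 + spa_navigations      # tracker fires on load and on each SPA navigation

print(server_hits, js_hits)  # which count is bigger depends entirely on the two rates
```

Nudge those two rates and the direction of the discrepancy flips. Neither system is “right”; they’re measuring different slices of reality.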
This stuff gets complicated, and as you narrow down to smaller and smaller sets of “missing” counts, you have to dig into smaller and smaller implementation details. This is why you have to be intimately familiar with your tech stack if you want to work with the data. It’s never “clean” just because you’re the one collecting it. There is always something you’re failing, or unable, to collect.
Flat-out bugs
Bugs inevitably happen, and they’re not required to happen on two separate systems at once. Even when a bug affects both, it might be biased and not affect things in the same way. These sorts of issues should be transitory: there’s a big change in agreement when the bug gets introduced, and another when it gets fixed.
But sometimes these bugs aren’t found for years, and the only way it’s discovered is because you’re comparing metrics from the two systems and they just don’t agree no matter what you do.
The Two Generals’ Problem has it out for you
For systems that are distant from one another, in the sense that they interact via a potentially lossy channel, disagreement can arise simply from transmission loss between them. This could be due to network failure or system failure (a crash, a bug, etc.).
Two Generals’ Problem (Wikipedia): “In computing, the Two Generals Problem is a thought experiment meant to illustrate the pitfalls and design challenges…” (en.wikipedia.org)
The short explanation of the problem: two generals must attack a city between them, and they can only succeed if both attack at the same time. Each must send a messenger across enemy territory to the other general with a time to attack, but the messenger may be captured. Similarly, a “message received” reply going back can also be captured.
There’s no algorithm that guarantees a solution to the Two Generals’ Problem, even with infinite messengers being sent on both sides. The best you can do is adopt protocols that increase the likelihood that both sides will agree on something, while balancing the number of messengers being sent.
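A back-of-the-envelope sketch of that trade-off, assuming (idealistically) that each messenger is captured independently with probability p:

```python
# Chance that at least one of n messengers gets through, if each is
# independently captured with probability p (an idealized assumption).
def p_delivered(p: float, n: int) -> float:
    return 1 - p ** n

for n in (1, 3, 10):
    print(n, round(p_delivered(0.3, n), 6))
# 1 0.7
# 3 0.973
# 10 0.999994
# More messengers push the odds toward 1 but never reach it, and the
# sender still never *knows* whether this particular message arrived.
```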
But wait, you ask. Don’t we have TCP? Hasn’t this been solved? Well, not quite. The TCP protocol (and similarly, other communications protocols) does have mechanisms in place to mitigate the effects of a lossy channel, such as its ability to resend dropped packets. But it is STILL fundamentally impossible for both ends to have perfect knowledge of the state of the system at the same time. That TCP connection can break at any point.
I’m sure you think our various data systems have safeguards against such issues, consistency guarantees and the like. Well, read those details very carefully, because they pertain to behavior within the software and are very specific. They also can’t guard against things that happen outside the software, like a cat chewing through a network cable.
In more practical terms, imagine your site makes two function calls upon loading: track_hit_locally() and track_hit_remotely(). Both fire one after another, but one talks to a machine on the local network, while the other talks to a machine on the other side of the planet. Over time, with random internet outages and system errors piling up, the two counts will usually start to disagree unless specific mechanisms are put in place to keep them in sync. Even then, the sync mechanisms can fail.
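A quick sketch of that drift, reusing the function names from the example above (the 0.1% loss rate is invented):

```python
import random

random.seed(42)

local_count = 0
remote_count = 0

def track_hit_locally():
    global local_count
    local_count += 1  # same-network call; assume it effectively never fails

def track_hit_remotely():
    global remote_count
    if random.random() > 0.001:  # ~0.1% of fire-and-forget calls lost in transit
        remote_count += 1

for _ in range(1_000_000):
    track_hit_locally()
    track_hit_remotely()

print(local_count - remote_count)  # a gap that only grows with traffic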
To make things worse, when you’re comparing systems that you have no control over, you don’t have a mechanism to audit and insert the missing data. So the systems are just going to have to disagree. The question is just whether that disagreement is within acceptable limits.
Conflicts of interest and fraud
Data doesn’t lie, until a human is motivated to make it lie. Then it’s usually not very difficult. Sometimes systems don’t agree for nefarious reasons that have nothing to do with technology or math.
One famous recent example is Facebook inflating its video metrics, effectively causing marketers to dedicate resources (and money) to the platform.
Another example is from back in my dark days in ad-tech during the 2008 financial crisis. The third-rate shop I worked for operated in the darker corners of internet ads, where clicks are passed through multiple layers of ad networks before reaching a principal advertiser.
Clicks (which meant actual revenue in the CPC world) regularly disagreed by around 10–15% from one network to the next. Stated reasons included “click invalidated due to proprietary anti-fraud tech” and just random system errors. Sometimes this was genuine fraud detection, but some portion was just shaving a few extra points of margin to keep: essentially not paying the downstream publisher while pocketing the money from the upstream advertiser.
The problem was that everyone did variations of this, up and down the ad network chain. So it was just routine for stuff to disagree, and everyone just nodded and smiled so long as it wasn’t too egregious.
Luckily, I’ve sworn off working on ads since 2009, so this story is a decade old. I wouldn’t be surprised if some of this is still going on in the world, but I am thankfully ignorant of the whole situation.
So beware: sometimes numbers don’t add up because someone doesn’t want them to add up. Proving it can be downright impossible, given all the other sources of discrepancy and who controls the data sources.
Dealing with Conflicts
With all the ways counts between systems can differ, what can you do about it?
Be clear and consistent
Definitions and implementations of metrics will always be somewhat arbitrary constructs that people internalize through repeated use over time. The trick is to be clear and consistent in your usage. Don’t constantly redefine things, and don’t keep competing implementations around. Sometimes a slightly flawed metric with a long history of usage is easier to reason about than a more accurate one.
That’s not to say that you should never improve your measurement methodology, but that it must be done in a way that every user can follow along. Ideally, improvements build upon existing knowledge so that people don’t have to relearn material.
Don’t chase diminishing returns
The 80/20 rule applies to sources of data errors. Usually, a small handful of issues cause the majority of the discrepancy. So when working on a project to track down issues, remember that there are diminishing returns to your efforts.
How much is it worth to you to make sure your metrics are within 10% of each other? How much to get within 1%? 0.1%? How much is it worth to have exact agreement?
The closer you want things, the more time and resources you’re going to spend to track down issues and to correct/prevent them. For many applications, having a slight disagreement is probably fine. You’ll likely want things involving money to be closer together, but even with money the accounting team might be willing to accept a certain level of error if it’s small enough.
And then some systems absolutely need to agree; in those cases you engineer for it and pay the high costs to get there. That means keeping systems redundant and highly available, having the ability to log and replay transactions, running periodic consistency checks and audits, and so on.
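As a minimal sketch of the “periodic consistency check” piece (the function name and the 0.1% tolerance are placeholders, not a standard):

```python
def consistency_check(count_a: int, count_b: int, tolerance: float = 0.001) -> bool:
    """Return True if the two systems agree within `tolerance` (0.1% here),
    relative to the larger count."""
    if max(count_a, count_b) == 0:
        return True
    gap = abs(count_a - count_b) / max(count_a, count_b)
    return gap <= tolerance

# e.g. run daily over yesterday's totals from each system
if not consistency_check(1_000_000, 998_700):
    print("Discrepancy above threshold -- kick off a reconciliation/replay job")
```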
Be methodical
Hunting discrepancies takes a LOT of time. It is really easy to go in circles because there are so many details to keep in mind.
In my experience the best way to handle it is to do something like this:
1. Size up the discrepancy.
2. Scan the data and check whether an obvious segment is missing; sometimes patterns are readily identifiable.
3. Ask the system owners for their opinions; they might know something you don’t.
4. Start segmenting out chunks of the data. Can you get the systems to match for specific subsets, like all paying users, or everyone in December? If you get those to match, note it down (see the sketch after this list).
5. When you run out of ideas for sub-segments that match up, take a step back and see if there are any patterns or holes.
6. If that fails, start digging through the logic to understand what is going on, using your domain knowledge of where things could potentially go wrong.
7. When you find a source of discrepancy, go back and see how much of the discrepancy it accounts for. Sometimes it accounts for enough that you’re satisfied.
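For the segmenting step, something like this pandas sketch often works, assuming you can export counts from both systems keyed by the same dimensions (all the names here are hypothetical):

```python
import pandas as pd

# Hypothetical daily counts exported from each system.
sys_a = pd.DataFrame({"day": ["12-01", "12-02", "12-03"],
                      "country": ["US", "US", "DE"],
                      "hits": [100, 120, 80]})
sys_b = pd.DataFrame({"day": ["12-01", "12-02", "12-03"],
                      "country": ["US", "US", "DE"],
                      "hits": [100, 95, 80]})

merged = sys_a.merge(sys_b, on=["day", "country"], suffixes=("_a", "_b"))
merged["gap"] = merged["hits_a"] - merged["hits_b"]

# Segments with zero gap are reconciled; sort the rest by how much
# of the discrepancy they explain.
print(merged.sort_values("gap", ascending=False))
```

Segments with a gap of zero can be set aside; the rest, sorted by gap, tell you where to dig next.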