
In last week’s paid subscriber post, I went on a bit of a rant about how “single pane of glass” dashboard requests are completely unhinged from practical reality. This led some friends on Mastodon to taunt me into tackling another mythical thing that gets invoked as a “data thing that we should have” — the single source of truth for [a metric or type of data].
The gist of the notion is this — instead of having multiple conflicting ways to measure something, for example, how many people visited our homepage in a day, there is a single data source that derives and stores the count of users so that no one else has to calculate it on their own. Having that work centralized means that we stop having those very awkward conversations about why the same metric seems to take on a bunch of different values depending on who you talk to.
It’s easy to see benefits in having a centralized approach. Just getting rid of reporting conflicts makes for smoother management and issue spotting. Think of all the hours wasted trying to audit the behaviors of two or more conflicting metrics. Teams don’t have to worry about reinventing the wheel, and anyone in charge of maintaining the code can choose to make improvements to the model. Everything is great!
In fact, there are plenty of tools around now that help data systems maintain their source of truth, tools like dbt that preserve and transmit the definitions of truth throughout the data infrastructure, so that it often takes deliberate, willful work to create conflicting definitions of things.
But single truths are often fictitious constructs
Let’s take the classic use case of trying to count how many users visited your website yesterday. You could count the number of requests recorded in your web server’s log and treat unique IPs as users. Except multiple people often share an IP, or your server logs are buggy. Maybe you could record visits using JavaScript on the site, using browser cookies to identify unique users. Except every browser has its own cookies, and some people flat out disable JavaScript. Maybe you force users to log into the site and ignore all the other methods, but adoption stopped growing once that policy was implemented. Maybe you just give up and install Google Analytics.
Whichever way you choose, each method will give different counts, often differing by a 10–15% margin. They will always conflict because the methods, and the definitions of what actually counts as “a user,” differ. These numbers can never be truly reconciled. Out of all the different ways to answer the question of how many people visited the site yesterday, none of them stand out epistemologically as being more true than the rest. Depending on what question you’re choosing to answer, you will necessarily choose one method over all the others.
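A toy sketch of the disagreement (entirely hypothetical records — the field names and values are made up for illustration): the very same traffic produces different "user" counts depending on whether you define a user as a unique IP or a unique cookie.

```python
# Hypothetical access records: two coworkers behind one shared office IP,
# one person visiting from two devices (two cookies), and one visitor
# with cookies disabled entirely.
records = [
    {"ip": "203.0.113.5",  "cookie": "alice-chrome"},
    {"ip": "203.0.113.5",  "cookie": "bob-firefox"},   # shares Alice's office IP
    {"ip": "198.51.100.7", "cookie": "carol-chrome"},
    {"ip": "198.51.100.7", "cookie": "carol-phone"},   # same person, second device
    {"ip": "192.0.2.9",    "cookie": None},            # cookies disabled
]

# Definition 1: a "user" is a unique IP address
users_by_ip = len({r["ip"] for r in records})

# Definition 2: a "user" is a unique cookie (visitors without cookies vanish)
users_by_cookie = len({r["cookie"] for r in records if r["cookie"]})

print(users_by_ip)      # 3 unique IPs
print(users_by_cookie)  # 4 unique cookies -- same traffic, different "truth"
```

Neither number is wrong; they answer subtly different questions, and no amount of pipeline engineering will make them converge.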
So the best that you can do is think about which definition is going to be most useful across the board for your typical questions, and anoint that way as “the source of truth for user counts”. It is a completely arbitrary decision. Truth for the organization is literally set by decree. It doesn’t mean that all the other things are “false”, just… different and unused.
All the alternative definitions for user counts will sit in the bottom of the analytical toolbox until one day something comes up that justifies pulling them out again.
Just about every meaningful data point can be measured in different ways
Measurement error is everywhere. I really struggle to come up with some kind of measurement that is perfectly unambiguous and free from error. Even something as simple as counting the number of iterations of a 4-line for loop:

i = 0
for x in range(10):
    i = i + 1
print(i)
Even with a super simple loop like this, if the program crashes after any of the 4 lines of the loop, the answer to “how many times did this loop execute?” can vary depending on your definition. This is regardless of the state of the variable i, which is merely a convenient way of measuring the progression of the loop.
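To make the ambiguity concrete, here is a sketch with a hypothetical crash injected partway through the third iteration, after the iteration has begun but before the counter updates. Did the loop execute two times or three? Both answers are defensible.

```python
i = 0
crashed_on = None
try:
    for x in range(10):
        # Simulate the program dying during the third pass (x == 2),
        # after the loop body starts but before the counter increments.
        if x == 2:
            raise RuntimeError("crash mid-iteration")
        i = i + 1
except RuntimeError:
    crashed_on = x  # in Python, the loop variable survives the loop

print(i)          # 2 -- two increments completed...
print(crashed_on) # 2 -- ...but the third iteration (index 2) had already begun
```

If “executed” means “the body ran to completion,” the answer is 2; if it means “the body was entered,” the answer is 3. The counter i is a proxy for one definition, not the truth itself.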
There’s no single source of truth, only preferred definitions of truth
For most day-to-day operations, you’ll want everyone to use the same database and metric definitions for all the reporting. This keeps things smooth and minimizes conflicts between reports, which can cause people to doubt the quality of the reporting. For end users throughout a company who aren’t going to live and breathe the metrics foundations of the company, this is ideal. “Single source of truth” is the pretty, sleek, polished body panel on a shiny car — every piece fits neatly where it’s supposed to, and you can’t imagine how it could fit together any other way.
But someone’s gotta be the mechanic that knows how to remove the panels and manage the complicated mess underneath. That’s us, the data people. It’s dangerous for us to forget the existence of all the alternate definitions of metrics and truth. At some point, it’s necessary to pull an alternative definition out to fix a problem.
So we’re faced with the challenge of balancing a position where we internally know there are Many Competing Definitions, but pretend there’s One True Definition while talking to everyone else. We have to be careful not to get confused ourselves and use the wrong definition at the wrong moment.
Perhaps even more confusing, we might have to actually pitch the creation of the Single Source of Truth because there’s too much conflict and chaos going on — even though we know we’re going to immediately maintain a quiet stash of alternative truths on the side. After tons of work and careful thought, we’ll build and define the data warehouses and metric definitions that everyone else will use. And then we’ll keep a long list of caveats, quietly creating alternative metrics for a rainy day far in the future.
Everything we build is based on useful white lies. For better or worse, we’re the arbiters of truth, the eyes and ears of the company.
And sometimes… just sometimes… other people realize that this is part of what we do.
If you’re looking to (re)connect with Data Twitter
Please reference these crowdsourced spreadsheets and feel free to contribute to them.
A list of data hangouts - Mostly Slack and Discord servers where data folk hang out
A crowdsourced list of Mastodon accounts of Data Twitter folk - it’s a big list of accounts that people have contributed to of data folk who are now on Mastodon that you can import and auto-follow to reboot your timeline
Standing offer: If you created something and would like me to review or share it w/ the data community — my mailbox and Twitter DMs are open.
Guest posts: If you’re interested in writing a data-related post to either show off work, share an experience, or need help coming up with a topic, please contact me. You don’t need any special credentials or credibility to do so.
About this newsletter
I’m Randy Au, Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. Counting Stuff is a weekly newsletter about the less-than-sexy aspects of data science, UX research and tech. With some excursions into other fun topics.
All photos/drawings used are taken/created by Randy unless otherwise credited.
randyau.com — Curated archive of evergreen posts.
Approaching Significance Discord — where data folk hang out and can talk a bit about data, and a bit about everything else. Randy moderates the discord.
Support the newsletter:
This newsletter is free and will continue to stay that way every Tuesday, share it with your friends without guilt! But if you like the content and want to send some love, here’s some options:
Share posts with other people
Consider a paid Substack subscription or a small one-time Ko-fi donation
Tweet me with comments and questions
Get merch! If shirts and stickers are more your style — There’s a survivorship bias shirt!
A long time ago I was helping manage a seismic crew that was working in the Amazon. I dutifully generated burn plots of expenses vs. budget. My manager came to me one day rather upset: my burn rate was about $1 million lower than what the accounting department was telling him, and our average monthly burn was about a million. I dug into it, and that’s when I learned that accountants post costs in the month received — so the previous month had basically two months’ worth posted. Hence the discrepancy. Lesson learned.