Attention: As of January 2024, we have moved to counting-stuff.com. Subscribe there, not here on Substack, if you want to receive weekly posts.
I maaaaay have made a critical error in judgement impulse buying Baldur’s Gate 3 over the weekend.
One day at work, one of the other quant researchers posted a SQL query to the shared quant chat asking if one of us could take a look and just double check the logic behind it. The query, while somewhat complex, generally looked fine. It used the correct filters to exclude the usual things we would want to exclude and wasn’t doing anything very exotic. But almost everyone who looked at the query pointed to one place that made us nervous and wary — there was a section that took an array of string tuples being used as a key-value store, unnested them into a table-like subquery, used logic to match against specific keys and values, and then joined some of the struct values back to the rest of the data in order to make separate columns out of the values. The query produced correct-seeming results, but none of us were super confident about it at first sight.
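To make that less abstract, here’s roughly the shape of that kind of query, sketched with completely made-up table, column, and key names (BigQuery-style SQL, assuming the catch-all field is an array of key/value structs):

```sql
-- A rough sketch of the pattern, not the actual query. All names are invented.
-- `events.properties` is assumed to be ARRAY<STRUCT<key STRING, value STRING>>.
SELECT
  e.event_id,
  e.user_id,
  props.value AS plan_name   -- a struct value promoted into its own column
FROM events AS e
LEFT JOIN (
  -- Unnest the key-value array into rows, keep only the key we care about
  SELECT event_id, kv.key, kv.value
  FROM events, UNNEST(properties) AS kv
  WHERE kv.key = 'plan_name'
) AS props
USING (event_id);
```

Nothing exotic, but whether this stays 1-to-1 with events depends entirely on how many times 'plan_name' happens to show up in that array.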
Operating with such arrays and structs is always a bit awkward and usually depends on what your SQL dialect is, so it’s already a natural place to be wary of potential mistakes just from the relative unfamiliarity of the functions. But I think we were mostly worried about something much more mundane — depending on how exactly that struct array resolved into values that would become a “table” to join into the rest of the data, unwanted duplicate rows could be generated which would make a horrible mess out of most calculations downstream.
Important Note: the rows I’m describing as “duplicate” out of convenience here aren’t perfectly identical to one another. They usually differ by at least one field in an unexpected way, while all the other fields are identical in a way we don’t want. Like when we join our unique Users table to some facts and somehow User1234 comes up 4 times in the result set when we expected to see every user just once. I just don’t have a precise term for these types of rows.
The problem with such key-value arrays is that they’re intended to be catch-all fields. They’re where devs put miscellaneous one-off logging data when it’s not reasonable to modify the schema for it. Devs use them as arbitrary-length data stores and can shove as many key-value pairs in as they want, and even change the contents of existing fields without notice. Just because a query is well behaved and has a 1-to-1 mapping for joining today doesn’t guarantee that some creative future change won’t break things and turn it into a 1-to-many relationship later.
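Here’s a toy illustration of that failure mode, with made-up data. Nothing forces a key to appear only once in these arrays, so the day someone starts logging the same key twice, a filter-and-join like the sketch above quietly stops being 1-to-1:

```sql
-- Toy data, BigQuery-style. Event 2 has the same key logged twice.
WITH events AS (
  SELECT 1 AS event_id,
         [STRUCT('plan_name' AS key, 'pro' AS value)] AS properties
  UNION ALL
  SELECT 2 AS event_id,
         [STRUCT('plan_name' AS key, 'pro' AS value),
          STRUCT('plan_name' AS key, 'enterprise' AS value)] AS properties
)
SELECT e.event_id, kv.value AS plan_name
FROM events AS e, UNNEST(e.properties) AS kv
WHERE kv.key = 'plan_name';
-- event_id 1 comes back once, event_id 2 comes back twice.
-- Join this back to the rest of your data and every downstream count
-- that assumed one row per event is silently wrong.
```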
While we were having this serious discussion about exactly what this query was doing and whether there was a chance it could break, I was also laughing at how everyone had been worried about the same duplicate row situation. It was clear that this was a shared trauma point across this very wide spectrum of data practitioners.
Duplicate rows, easy to solve, easy to make
If you’ve ever run any query that involved joining one or more tables together and counting things, you’ve probably come across situations where you had unintentionally introduced a bug and rows that you thought were singletons became multiple rows with mostly duplicated data. While the worst case is a full cartesian product, it’s often a more limited set where some rows have unintentional duplicates while others don’t, due to some missing filter condition in the logic.
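A hypothetical example of that partial fan-out (none of these tables are real): joining a table you think is 1-to-1 but that is only mostly 1-to-1, because of a forgotten filter:

```sql
-- Orders with exactly one payment attempt join cleanly, 1-to-1.
-- Orders where the payment was retried come back once per attempt.
SELECT
  o.order_id,
  o.order_total,
  p.payment_id
FROM orders AS o
JOIN payments AS p
  ON p.order_id = o.order_id
  -- the forgotten condition, something like: AND p.status = 'captured'
;
-- SUM(o.order_total) over this result now double-counts every order
-- that happened to have more than one payment row.
```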
As far as I can tell, some variation of this mishap has happened to every single person who regularly uses SQL to report things. If it hasn’t happened to you yet, don’t worry — your turn will come. The more unlucky among us have had it happen during an important presentation of some sort.
This problem is super common because, while it is a simple concept, it’s extremely difficult to guard against regardless of how much experience you have. “All” you need to do is make sure that your join conditions are always specified so that you’re joining the exact number of rows you want, usually one, to your data. If you can do this without ever making a mistake, you’ll be safe.
That essentially means you need to clearly understand which relationships are 1-to-1, 1-to-many, and many-to-many across your data. Knowledge of this at the table level isn’t enough either; you also need to know whether any relevant subsets of the many-to-many relations might become 1-to-1 under certain conditions. For example, are ‘refunds’ always one single row, or can there be multiple refunds for a given order?
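One thing that helps is to ask the database instead of trusting your memory. Before leaning on a join key being unique, a quick check like this (hypothetical table name, obviously) tells you what the relationship actually is today:

```sql
-- Are refunds really 1-to-1 with orders? Look for orders with multiple refund rows.
SELECT
  order_id,
  COUNT(*) AS refund_rows
FROM refunds
GROUP BY order_id
HAVING COUNT(*) > 1
ORDER BY refund_rows DESC
LIMIT 20;
-- Zero rows back means you can, for now, treat refunds as at most one per order.
```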
In practice, it’s impossible to do all of this perfectly, forever. And mistakes usually happen when we’re rushing to complete some important task, like making a last-minute chart for a board meeting presentation. That’s why everyone I know has some traumatic memory of making a very visible mistake where they miscounted data due to these duplicates and had to hastily retract and re-analyze their data.
Databases are living creatures out to get us
You might think that for a static database schema, you’ll eventually be able to learn all the ins and outs of all the relationships and get to a point where you can write even complex join queries without making mistakes. Sure, it’s theoretically possible, but implausible in practice.
Databases are usually living, changing creatures. Tables and fields get added, and the business logic that generates entries changes over time. New features are created, data input bugs get introduced. Even assumptions like “this ID is unique” might be a lie — I’ve seen bugged production tables with duplicated “primary” keys. Tables that used to join 1-to-1 might have evolved into 1-to-many at some point unbeknownst to you. Queries that worked a year ago might not work in the exact same way today.
I’ve seen tables that used to track the completion of a single state, like, say, completed sales in a 1-to-1 relationship with orders, become used to store multiple states of an order. The devs just decided to add a new “status” field and suddenly it’s now a many-to-1 order relationship table. Queries inevitably broke on launch day, despite engineers warning all the data folk to migrate all the reports, because no one could find all the reports that assumed the old relationship.
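Sketched out, that kind of breakage looks something like this (all table and column names invented):

```sql
-- Old world: order_sales had exactly one row per completed order, so this was fine:
SELECT COUNT(*) AS completed_sales
FROM orders AS o
JOIN order_sales AS s
  ON s.order_id = o.order_id;

-- New world: order_sales grew a `status` column and stores one row per state
-- transition. The old query now counts each order once per status row, and every
-- report has to be found and updated to something like:
SELECT COUNT(*) AS completed_sales
FROM orders AS o
JOIN order_sales AS s
  ON s.order_id = o.order_id
WHERE s.status = 'completed';
```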
And this happened for a major database migration and feature launch. Engineering knew it was a breaking change and actively made sure people were prepared for it. My original example up top didn’t even get that sort of treatment because it’s a miscellaneous field used specifically for arbitrary extensibility. There usually is no formalized roll-out process for the use and change of such a field. There’s no guarantee that the values you see in the field today are the same ones you’ll see tomorrow, or the same ones that were there a month ago.
There’s never a warning when something breaks your query assumptions. There’s always the specter of a nasty surprise waiting for you. Yay!
Giving up and using DISTINCT
Usually, the solution to fixing a query that has unwanted duplication is to go back into the query conditions and get rid of the extraneous rows, a.k.a. “doing it right”. Despite it being the best solution, I’ve found myself in situations where there was no clean way to make the joins “right” due to quirks in the tables and the specific question I was trying to answer. A “correct” solution in such situations might require building complex temporary tables and aggregations before pulling everything together, which results in unacceptably slow performance.
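For what it’s worth, the “doing it right” version usually means collapsing the many side down to one row per key before joining, rather than joining raw rows and de-duplicating afterwards. A sketch, with made-up tables:

```sql
-- Pre-aggregate order_items so the join is guaranteed to be 1-to-1.
WITH items_per_order AS (
  SELECT
    order_id,
    COUNT(*)        AS item_count,
    SUM(item_price) AS items_total
  FROM order_items
  GROUP BY order_id            -- exactly one row per order_id, by construction
)
SELECT
  o.order_id,
  o.order_date,
  i.item_count,
  i.items_total
FROM orders AS o
LEFT JOIN items_per_order AS i
  ON i.order_id = o.order_id;
```

On a big enough pile of tables, stacking up intermediate aggregations like this is exactly what makes the “correct” version unacceptably slow.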
So the backup strategy is to just sidestep the duplicates by counting distinct items instead. For example, if I was doing some really complex sales report, I could always just count the unique number of ORDER_IDs in my result set to get a count of orders, even if I had duplicated rows from joining in every item purchased in every order. Oftentimes, I’ll compare a count(*) with a count(distinct id) just to check if there are unexpected duplicates in my data.
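That check is cheap enough that I do it almost reflexively. Roughly, with placeholder names:

```sql
-- If these two numbers differ, something in the joins is fanning rows out.
SELECT
  COUNT(*)                 AS row_count,
  COUNT(DISTINCT order_id) AS distinct_orders
FROM my_big_joined_result;
```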
Sometimes it’s simply more efficient to give up and not bother spending another couple of hours deciphering the join to figure out how to fix a complex query. If it gets the correct answer you need, it’s good enough. While working with distinct counts limits the number of downstream operations you can make — for example you can’t sum distinct counts up and arrive at a sensible result — you might not care about the limitations in the context of the current work.
Anyways, this happens to everyone
I wanted to write about this common issue because it’s one of those mistakes that everyone makes all the time. But every time it happens to me, I always start wondering why I didn’t catch such an obviously silly mistake earlier, when the truth is that no matter how careful you are, it’s probably going to happen once in a while. That’s the only explanation for why all the other data folk I know have similar alarm bells going off for the same kind of query. It’s NOT trivial nor obvious. So I really shouldn’t feel too bad about it. And neither should you.
Standing offer: If you created something and would like me to review or share it w/ the data community — just email me by replying to the newsletter emails.
Guest posts: If you’re interested in writing a data-related post to either show off work, share an experience, or need help coming up with a topic, please contact me. You don’t need any special credentials or credibility to do so.
About this newsletter
I’m Randy Au, Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. Counting Stuff is a weekly newsletter about the less-than-sexy aspects of data science, UX research and tech. With some excursions into other fun topics.
All photos/drawings used are taken/created by Randy unless otherwise credited.
randyau.com — Curated archive of evergreen posts.
Approaching Significance Discord — where data folk hang out and can talk a bit about data, and a bit about everything else. Randy moderates the discord.
Support the newsletter:
This newsletter is free and will continue to stay that way every Tuesday, share it with your friends without guilt! But if you like the content and want to send some love, here’s some options:
Share posts with other people
Consider a paid Substack subscription or a small one-time Ko-fi donation
Get merch! If shirts and stickers are more your style — There’s a survivorship bias shirt!
This great fear is with me today, so I appreciate the timeliness and solidarity. I am resigned to counting distinct to get the important task out the door and trying to fix it later...
I had a similar situation recently with a GIS join. I had a map of stuff by region, and another map of stuff by counties. I thought that the regions followed the county outlines, and they did - mostly. But tiny mismatches along the boundaries bit me and I created duplicate records when doing a geographic join. Sigh.