It’s been a while due to surgery and chaos, but we’re back on track. This is the bi-weekly(ish) post to paid subscribers where Randy writes about whatever’s interesting on his mind. This time it’s some semi-philosophical thoughts on just how much we choose not to know about our data and its connections to the world we are using numbers to represent.
Every day, we’re surrounded by all sorts of numbers. Many of them are familiar values that most people use every day. A cup of water here, a gallon of milk there, a pound of pasta, a few hundred dollars, a ten pound bag of flour. Then there are the numbers that are so big that no human can really grasp them without specialized mental aids and metaphors, like a googol, the actual distance of a light-year, a billion dollars, the number of grains of rice in a fifty pound bag.
But as I go through life, every so often, I’ll see a number that sits somewhere between the two extremes. They’re often numbers that are familiar in raw quantity, but they count things that I just don’t have the necessary understanding to grasp.
For example, take the headline of “Spain sending 10 Leopard tanks”. I can easily visualize the vehicles because counting to 10 is easy and I’ve a rough sense of how BIG a tank is from seeing individual examples in museums. But I have absolutely no understanding as to what the marginal utility of a tank is to an army. It must be meaningful, or otherwise countries wouldn’t send them, but the real implications are a complete mystery.
Amusingly, this lack of understanding is one reason I often lose at grand 4X strategy games like Stellaris. Since I don’t know what 1, 10, or 100 tanks or interstellar warships can do, and so don’t know how many to build, I get crushed by even easy CPU opponents.
This sense of numeric vertigo also stretches to larger scales of everyday things in specialized situations. For example, consider a supermarket. Imagine the sheer amount of STUFF that goes in and out of the building every day. Every piece of food bought by a customer, multiplied by hundreds, thousands of customers a day. And there is still stuff left on the shelves at closing each day. How much volume, weight, cost is coming in and leaving? Only the owners and managers really know the answer, and even they need to use the power of mathematical abstraction to work with it. The tangible reality of what it means to cart 5 boxes of lettuce into storage and how many people it could feed, how much goes to waste, is something that extremely few people have experience with and I am not one of them.
I’m sure you’re reading this and thinking: but the abstraction is the point! The whole point of even basic mathematics is that it allows us to work with numbers and manipulate the hundred thousand screws, the gigatons of carbon in the air, the exabytes of bits in The Cloud. It lets us find patterns and relationships, make inferences, and discover truths that aren’t obvious if we’re shuffling stacks of silver coins or riding on our one personal bicycle.
I’m sure that thousands, millions of data folk will go through their entire careers working on numbers in their domains and never worry that they don’t grasp the scope and scale of the numbers they manipulate. I don’t think that’s wrong either. It’s probably the most natural and correct thing to do. Civilized life requires the specialization of knowledge and labor. Most people don’t worry about how their car, their sewer, their network, or their windows work either and live perfectly meaningful lives.
But such worries are probably natural for an oddball who IS interested in knowing how all those things, and more, work, and who sits in the data world always pushing people to understand the actual things that underlie the numbers we manipulate every day.
I keep writing and talking about the importance of domain knowledge on here, but I’ve always had a lot of trouble explaining how one goes about learning domain knowledge.
Sure, you can get quite far by digging through the technical aspects of data collection and how systems work. Most of us learn the quirks of a system through trial and error, deeply engaging with the numbers of the system and observing the patterns and changes within. Orders move this way, money moves that way, some sales are counted the day the item is shipped while others are counted when money changes hands.
But we can only learn so much from looking at the mechanical workings of a system. It’s like how I can observe how my car works: some tubes that must carry fuel go in one way, air comes in through another set of tubes, and then a bunch of wheels and belts move. I can use that knowledge to do basic troubleshooting (like, “engine no worky”). I can even guess how to measure things like fuel consumption and efficiency based on that stuff. But it’s impossible for that data to say anything about the million design considerations that went into putting that specific engine into the car, with all the connections to everything else.
I always wonder, exactly how much more of the story am I not seeing because all I see are the numbers? I might not even fully grasp the implications of the numbers I do see. For example, I once worked at a children’s clothing e-commerce site. I saw hundreds of returns get processed through the system. It was perfectly normal and I’m sure that almost all of them had very mundane stories as to why the return was made. But out of the many thousands of rows hiding in the database, one might tell a story reminiscent of that infamous short story: “For sale, baby shoes, never worn”.
The question of “what does this number, this row, this pattern, mean” has no real end to it. We’d run out of time to do anything if we treated every data point as the unique thing that it is. But it is my belief that always having the question lurking in the back of my mind, making me really wonder about the context and humans that those data points represent, makes me a better data analyst. These doubts occasionally flash through my mind and prompt me to ask important questions like “what is being counted here, and why?”.