Attention: As of January 2024, we have moved to counting-stuff.com. Subscribe there, not here on Substack, if you want to receive weekly posts.
Early-bird pricing ($25 vs $35) for Quant UX Con 2023, happening June 14-15, ends on May 15th! If you’re interested in using quantitative/data science methods to understand users, this is your conference. I’m neither speaking nor organizing, but I do know a bunch of the organizers. Ticket sales support the conference, but if you can’t afford one for any reason, there are free tickets available for people who need them.
Every once in a while, I have to go and drop off the kiddo at school. It’s a fairly uneventful 20-minute drive, so I usually do two things. One is to let my mind wander a bit, often bouncing around ideas for what to write about in this newsletter (idle time, a.k.a. not playing games at the computer, really is important for creative work).
The other thing I find myself doing is paying attention to the fuel efficiency meter provided by the car’s information system. Since there’s not much else to do, I play with how I drive to try to maximize the efficiency score while still keeping up with the flow of traffic. While hyper-milers obsess over all sorts of efficient driving techniques, the lowest-effort, highest-return stuff seems to be: not driving far above the speed limit and driving behind large vehicles (because wind resistance is proportional to the square of velocity), using the accelerator in bursts so that modern engine-shutoff systems can save fuel while coasting at speed, and minimizing brake usage.
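To put rough numbers on that square-of-velocity claim, here’s a minimal Python sketch using the standard drag equation. The drag coefficient and frontal area are assumed values for a generic sedan, not anything measured from my car, so treat the output as illustrative only.

```python
# Minimal sketch: aerodynamic drag force grows with the square of speed,
# so the power spent pushing air (force * velocity) grows with the cube.
# CD and AREA are assumed values for a generic sedan, not my actual car.

RHO = 1.225   # air density at sea level, kg/m^3
CD = 0.30     # assumed drag coefficient
AREA = 2.2    # assumed frontal area, m^2

def drag_power_kw(speed_mph: float) -> float:
    """Power (kW) needed just to overcome air resistance at a steady speed."""
    v = speed_mph * 0.44704                 # mph -> m/s
    force = 0.5 * RHO * CD * AREA * v ** 2  # drag force, newtons
    return force * v / 1000                 # watts -> kilowatts

for mph in (55, 65, 75):
    print(f"{mph} mph: ~{drag_power_kw(mph):.1f} kW lost to drag")
```

Under these assumptions, cruising at 75 mph spends roughly two and a half times the drag power of 55 mph, which is why staying near the speed limit is such a cheap win.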
For a long time, one thing that nagged at me was how I’d only manage roughly 28mpg going to the kid’s school, and then get around 32mpg coming back. This held over many trips across a year, not counting abnormal traffic or rainy days, which both lower efficiency a ton. Even if you allow for variation in traffic patterns, a 12% difference between the two legs is just too large to explain easily.
I suspect that most readers have a pretty solid guess as to what was causing the discrepancy between the two legs of the journey — there’s an altitude difference between home and the school. It took more energy to drive to school uphill, but I’d reclaim most of that gravitational energy on the way home. Even so, it took me a couple of months of idly pondering the efficiency numbers before the cause occurred to me.
Luckily, Google Earth has a feature for providing the elevation profile of a path. So I traced out the route I normally take and this is what we get:
The route covers everything from sea level to about 215ft. There are a bunch of elevation spikes along the way that are almost certainly artifacts of my sloppy mouse tracing wandering off the road and onto a hilltop or something. I’m pretty sure road engineers don’t just insert 50ft jumps along major routes.
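As a sanity check that a ~215ft climb can plausibly move the needle, here’s a back-of-envelope sketch. The car mass, engine efficiency, and gasoline energy density are assumed round numbers, not measurements, so the result is an order-of-magnitude estimate at best.

```python
# Back-of-envelope: how much extra fuel does lifting the car ~215 ft cost?
# Car mass, engine efficiency, and gasoline energy density are assumed
# round numbers for illustration, not measured values.

G = 9.81                  # gravitational acceleration, m/s^2
CAR_MASS_KG = 1600        # assumed curb weight plus passengers
CLIMB_M = 215 * 0.3048    # ~215 ft converted to meters
GASOLINE_MJ_PER_L = 34.2  # typical energy density of gasoline
ENGINE_EFF = 0.25         # assumed tank-to-wheels efficiency

potential_energy_mj = CAR_MASS_KG * G * CLIMB_M / 1e6
extra_fuel_l = potential_energy_mj / (GASOLINE_MJ_PER_L * ENGINE_EFF)

print(f"Energy to climb: ~{potential_energy_mj:.2f} MJ")
print(f"Extra fuel burned: ~{extra_fuel_l:.2f} L (~{extra_fuel_l / 3.785:.3f} gal)")
```

A few hundredths of a gallon doesn’t sound like much, but a short trip only burns a fraction of a gallon in total, so it’s in the right ballpark to explain a few-mpg gap between the uphill and downhill legs.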
A good example of indirect measurement
What I find interesting about this silly game of reading a fuel efficiency meter is that it’s a pretty good practical example of how an underlying feature, elevation in this case, can have a biasing effect on our data. It’s the same as measuring the latency of a web app but failing to take into account details like physical network topology (e.g., the speed of light) or outages.
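To make the latency analogy concrete, here’s a minimal sketch of the physical floor that network topology imposes. The distances are rough great-circle figures I’m assuming for illustration; real fiber routes are longer and routers add delay on top, so actual latencies sit well above these numbers.

```python
# Minimal sketch: the speed of light puts a hard floor under round-trip latency.
# Distances are rough great-circle figures assumed for illustration; real fiber
# routes are longer and add switching and queuing delays on top of this floor.

SPEED_OF_LIGHT_KM_S = 299_792   # speed of light in vacuum
FIBER_FACTOR = 0.67             # light in optical fiber travels at roughly 2/3 c

def min_rtt_ms(distance_km: float) -> float:
    """Theoretical minimum round-trip time over fiber, in milliseconds."""
    one_way_s = distance_km / (SPEED_OF_LIGHT_KM_S * FIBER_FACTOR)
    return 2 * one_way_s * 1000

routes = {
    "NYC <-> London (~5,600 km)": 5_600,
    "NYC <-> Tokyo (~10,800 km)": 10_800,
}
for name, km in routes.items():
    print(f"{name}: RTT floor of ~{min_rtt_ms(km):.0f} ms")
```

If a dashboard ever claims a transatlantic request finished in 20 ms, the problem is almost certainly in the measurement, not the network.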
These issues are often only really known to the domain experts of a given measurement system, because they’re the only people who have looked at the data carefully enough to find oddities they couldn’t explain. For example, the LIGO team (a.k.a. the people who study gravitational waves with hyper-accurate lasers) wrote a paper about mitigating environmental noise for aLIGO. They mention all sorts of noise sources, including temperature and vibration from wind, truck traffic, and even cars driving over bumps and gravel. No one outside of that team would be able to come up with such a list.
People who regularly think about fuel efficiency would likely have picked up on elevation being a factor in my case immediately, while I had to slowly bumble my way to the answer. But we all have to start somewhere. What are some good ways to start engaging with a data set so you can get to the point of noticing these things?
Deciding what data is “supposed” to look like
If I hand you a sequence of numbers and tell you that it’s a single column of data, with no other context, the only thing you can really do is take the data, nod, and wonder if I’ve lost my mind. Sure, you could run basic descriptive statistics on the sequence, but there’s not much else you can do with it. You most certainly aren’t in any position to have expectations for what the numbers should look like.
Having a hypothesis for what a data set is supposed to look like, even a poorly thought out hypothesis, is the starting point for actually finding issues in it. For example, I had been asking myself, “Why aren’t the fuel efficiency numbers the same on both legs of the trip? Was I doing anything differently on each leg?” In hindsight that question was completely off the mark, but it provided a framework with which I could say “my data doesn’t look right.” You can’t really start anywhere until you form some opinions to bounce ideas off of.
Same goes for the folks at LIGO. They were obviously comparing the various curves on their charts against theoretical noise-free ideals, but the existence of spikes and wiggles on their instrument readouts indicated that there were issues that needed to be ironed out in the equipment and processing architecture.
The important part is to come up with a theory of how your data should look, even if it’s an incorrect theory. Ideally this is based on whatever domain knowledge you have available, or just a plain hunch. Maybe your data should follow the population distribution of a location, or perhaps the product is only used during certain parts of the day. Wherever that belief comes from, the next step is to think through its implications for the data. IF my gravitational-wave-detecting lasers were perfect at their job, then I’d expect to see this chart. IF my customers use my product once a day during their lunch break, then I’d expect to see this usage pattern. Given some assumptions about how the data is generated, collected, and analyzed, we expect to see certain results.
You can then see if your data actually fits those beliefs and implications. When they inevitably don’t align, you can then try to figure out where the problem lies. Is there a problem with the data, or with your beliefs and assumptions? There’s probably an answer out there somewhere. It’s our job to figure it out.
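As a toy illustration of that loop, here’s a Python sketch that writes the lunch-break belief down as an expected hourly shape and then flags the hours where some observed counts argue with it the most. Both the expected shape and the “observed” numbers are made up for the example.

```python
# Toy sketch of "write the expectation down, then compare": suppose I believe
# usage peaks around a lunch break. Which hours deviate most from that belief?
# Both the expected shape and the "observed" counts are made-up numbers.

import numpy as np

hours = np.arange(24)

# Belief: a single lunchtime bump centered at noon, nothing else.
expected_shape = np.exp(-0.5 * ((hours - 12) / 1.5) ** 2)
expected_shape /= expected_shape.sum()

# Hypothetical observed event counts per hour of day.
observed = np.array([  2,  1,  1,  0,  1,  3,  5,  8, 12, 15, 30, 80,
                     120, 70, 25, 14, 10,  9, 40, 55, 20,  8,  4,  2])

expected = expected_shape * observed.sum()   # scale the belief to observed volume
residual = observed - expected

# The biggest mismatches are where the data argues with the belief.
for h in np.argsort(-np.abs(residual))[:3]:
    print(f"hour {h:02d}: observed {observed[h]}, expected ~{expected[h]:.0f}")
```

In this made-up example, the unexplained evening bump is exactly where you’d start digging: maybe the belief is wrong (people also use the product after work), or maybe the logging is.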
If you’re looking to (re)connect with Data Twitter
Please reference these crowdsourced spreadsheets and feel free to contribute to them.
A list of data hangouts - Mostly Slack and Discord servers where data folk hang out
A crowdsourced list of Mastodon accounts of Data Twitter folk - it’s a big list of accounts that people have contributed to of data folk who are now on Mastodon that you can import and auto-follow to reboot your timeline
There’s a bunch of data people on Bluesky right now, but it’s a weird invite-only meme-fest so IF that starts becoming a thing in the future I’ll start mentioning it here.
Standing offer: If you created something and would like me to review or share it w/ the data community — my mailbox and Twitter DMs are open.
Guest posts: If you’re interested in writing a data-related post to show off work, share an experience, or you need help coming up with a topic, please contact me. You don’t need any special credentials or credibility to do so.
About this newsletter
I’m Randy Au, Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. Counting Stuff is a weekly newsletter about the less-than-sexy aspects of data science, UX research and tech. With some excursions into other fun topics.
All photos/drawings used are taken/created by Randy unless otherwise credited.
randyau.com — Curated archive of evergreen posts.
Approaching Significance Discord — where data folk hang out and can talk a bit about data, and a bit about everything else. Randy moderates the discord.
Support the newsletter:
This newsletter is free and will continue to stay that way every Tuesday, so share it with your friends without guilt! But if you like the content and want to send some love, here are some options:
Share posts with other people
Consider a paid Substack subscription or a small one-time Ko-fi donation
Tweet me with comments and questions
Get merch! If shirts and stickers are more your style, there’s a survivorship bias shirt!
As I read, I kept waiting for you to mention Bayesian Inference. The entire Bayesian approach is "attempt to describe the thing that generated your data... which we call a model." That is, Freq assumes some distributions that drive the data thanks to some big N patterns (pretend it's gaussian and it'll be ok), while Bayesian says "that's cool, but allow smaller N and focus on that which generates the data, rather than the data itself". Bayesian also emphasizes using what you know (or think you know) about the data to nudge the model in the "right" direction... and if it doesn't fit well, your nudge could be wrong. If you haven't spent much time looking at this stuff, you may enjoy it, as you already think in that way.
I am always frustrated when analysts don't stop to ask how their data was generated before doing a large analysis. So much wasted work can be avoided if one understands the basics of the process, what was intended, and what was actually tracked/logged. An analysis can be 100% correctly done, with appropriate checks for assumption violations and outliers... and yet be 100% wrong because the data was collected badly, misinterpreted, and non-reflective of the underlying process. And just because it's "digital" doesn't make the data any better on its own, it just makes it easier to collect bad data in volume.
I would venture to say that all data analysis has to be done with respect to a model, and the model should contain both signal (what we want to measure) and noise (what gets in the way). Of course, all models are wrong - it's just in what way and to what degree. Sometimes noise can become signal - depending on what we want to measure.
There is an oil well logging tool called a caliper. It simply uses a pair of arms to measure the width of the hole. Sometimes in a particular rock, like a soft shale, the rock will slough off creating an enlargement of the hole. If it gets big enough, the caliper will max out. Normally one would call those sections noise - all you know is that the hole diameter is larger than some value. But I used those sections once to pick out salt layers, since those all dissolved out. Noise became signal.