Attention: As of January 2024, we have moved to counting-stuff.com. Subscribe there, not here on Substack, if you want to receive weekly posts.
Last week, Benn was asking whether data folk “are the Jared Kushner”, referencing how that guy decided to “solve” the policy problems of the Middle East with the supreme confidence of someone who had no relevant background and no actual understanding. Meanwhile, everyone who knows remotely anything about the history of the region, policy experts especially, was rolling their eyes hard enough to torque their heads into the ground.
It’s a pretty apt description of how we data folk can appear to outside stakeholders as we come and deploy our tools and methods to “help” teams in some way. I’ve certainly been airdropped (and sometimes inserted myself) into teams where my relevant past experience was effectively zero, but was there to “help out (somehow) with data”. Benn and many others have noted that every time we do these sorts of engagements (which are constantly happening), we run the risk of bumbling in like Jared Kushner and generally making asses of ourselves by saying things that the stakeholders already know and probably dismissed long ago.
If we get really high on our own hubris, we might even risk sounding like a stereotypical physicist or economist trivializing other fields. Otherwise we might start acting like expensive external consultants who swoop in at huge cost, use cookie-cutter analyses done on overwrought spreadsheets to make recommendations that all sound about the same and are things the actual teams doing the work had already considered or suggested before.
Obviously (uh, I hope it’s obvious…), we don’t want to have such negative interactions with the people we work with. Work is a long-term relationship where precious trust is hard earned and easily lost. To be successful, we need to strike a balance: a respectful understanding that there is past context and work we don’t know about but should learn from, paired with enough confidence to know when our own viewpoints and methods are saying something worth sharing for discussion. Going too far to either side makes us ineffective in our work.
I’m not going to dwell on how to strike that balance today. It’s an evergreen topic that we can visit plenty of times when something reminds me of it.
Instead, when I was reading Benn’s post, one thought that kept spinning around in my head was how, despite all the valid criticisms of how and why data analysis can get thrown at other fields “to help” like so much lukewarm spaghetti, it’s a miracle that it works more often than a casual look at the facts would suggest. The chances of a Kushner-like character having ANY useful impact on a situation should be practically zero.
So that train of thought reminded me of an old classic paper in the philosophy of mathematics and science, “The Unreasonable Effectiveness of Mathematics in the Natural Sciences” by Eugene Wigner. The paper points out how math can be used to describe and even predict physical phenomena. It opens with a person being incredulous that pi appears in the formula for the Gaussian distribution, because why would the ratio of a circle’s circumference to its diameter have anything to do with statistics? Why is it that sines, cosines, pi, and even imaginary numbers find their way deep into physics at all, let alone in elegant formulae? Just what lies at the heart of this relationship between mathematical concepts built out of axioms and logic, and the real world? It’s a fun topic to mull over on a rainy afternoon, if not for 60+ years of philosophical exploration.
I’m not a philosopher, but I’d like to occasionally play at being one on a newsletter, so let’s restrict things down to data science and analysis as applied to work. A much better-constrained question: why is it that we can be thrown into the work of other people, in a field we have zero experience in, and have any expectation of making a useful impact at all? When stated plainly like that, it sounds utterly ridiculous. But in my experience, a data team can usually find something to improve, even if the impact is sometimes small.
What factors go into this?
The power of simplification
All data work starts with a fundamental act: counting stuff. I chose that phrase as the name of this newsletter specifically because it’s a really simple but profoundly nuanced act. The ability to count things is also the ability to very clearly define what should be counted, as well as what shouldn’t. That often requires a very good understanding of the underlying thing to be counted, as well as the planned future use of the numbers in analysis later on.
Put another way, it is a very nuanced form of simplification. We start with a wide array of objects that have infinite variation in their properties, and we pick out a specific set of criteria that summarizes it all down to a single number. Once something is counted, we can usually stop thinking about the details that were simplified away while we do mathematical operations on the numbers. We only have to recall the counting criteria later, when we come back to interpret our models.
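To make that concrete, here’s a tiny sketch of what “counting active users” might look like once the criteria are written down. The table, column names, and rules here are all made up for illustration; the point is that every filter is a deliberate decision about what counts.

```python
import pandas as pd

# Hypothetical events table; in practice this would come out of your warehouse.
events = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 4],
    "event_type": ["login", "purchase", "login", "page_view", "login", "test_ping"],
    "is_internal_tester": [False, False, False, False, False, True],
})

# "Count active users" forces three explicit decisions:
#   1. internal test accounts don't count,
#   2. only logins and purchases count as "activity",
#   3. each user counts once, no matter how active they are.
activity_events = {"login", "purchase"}
active_users = (
    events[~events["is_internal_tester"] & events["event_type"].isin(activity_events)]
    ["user_id"]
    .nunique()
)
print(active_users)  # 3 -- users 1, 2, and 3; user 4 is filtered out as a tester
```

Each of those three decisions is exactly the kind of thing a team has to argue about and settle before the number means anything.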
But one less often considered feature of the process of counting is that it forces people to stop and come to a much better understanding of the thing being counted in the first place. Forcing people to articulate what’s important to keep lets them figure out which noisy features can be ignored. Think of the times you’ve had an argument about whether something counted or not and came away with a better understanding of how everything worked.
That’s probably a win before we even bother manipulating the numbers in models. Sometimes just that knowledge alone can help point to things that can be improved.
The power of spotting isomorphism
Many fledgling data scientists hold some sort of belief that much of our job is modeling our data — that is, taking the data points and shoving them through models like linear regression. The story goes that much of the power of data science lies in the models that we can use and apply to the data to get useful results out, either predictions or inferences. AI, ML, predictive analytics, all that stuff is “things we can do to data to get value out”.
I’m not so sure about that.
Instead, I think the value that data folk bring isn’t the various models we have in our tool bag. It’s the ability to identify that the situation we see in front of us is isomorphic to (meaning it bears the same abstract form as) something else that we already have a model for. It can start with something as simple as selecting an appropriate model to describe and analyze a set of data, but the most powerful part is being able to recognize that if data is collected in a certain way, certain potentially useful models become available. That ties back in with the whole “counting things well” point above.
Even a model selection method as silly as “throw it into a linear regression first because that usually works” is a profoundly time-saving insight into the nature of the problems we encounter in work settings, because it’s true that many industry problems reduce to fairly mundane and linear relationships (to a first order of approximation). If all data tools are hammers, I might have fewer hammers than many people, but I’m very good at knowing exactly how to wield my hammers to good effect by recasting problems into familiar forms.
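As a sketch of what that “linear regression first” habit can look like in practice (the dataframe and column names are invented stand-ins for whatever the real problem involves; I’m using statsmodels here, but any regression tool works), the entire first pass is a handful of lines:

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical data: weekly marketing inputs and the outcome the team cares about.
df = pd.DataFrame({
    "ad_spend": [10, 20, 30, 40, 50, 60],
    "emails_sent": [1, 3, 2, 5, 4, 6],
    "signups": [12, 25, 31, 48, 52, 66],
})

# The lazy-but-useful first pass: fit ordinary least squares and read the summary.
X = sm.add_constant(df[["ad_spend", "emails_sent"]])  # add an intercept term
model = sm.OLS(df["signups"], X).fit()
print(model.summary())  # rough effect sizes, fit quality, which inputs matter at all
```

Even when the linear story obviously fails, the way it fails tells you something about the shape of the problem before you reach for anything fancier.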
However we do it, once we’ve done the work of deciding how to model things, work becomes significantly easier. Models come with a bunch of pre-made analyses and procedures that can be quickly brought up to generate value, so identifying a suitable one means being able to get to work without having to painstakingly reinvent things from scratch.
The power of simply having time
Think about your current data science team. Do you have someone who is dedicating their work day to looking at processes, analyzing them, and coming up with improvements? More than likely the answer is no. There is plenty of other work to do, and these self-reflective tasks are often left to a manager, or done occasionally on the side when process pain reaches some intolerable level. Even if there are known improvements to be made, who has the stamina to see them through when other things are on fire?
Imagine what kind of process improvements and experiments could be done if someone could put a bunch of dedicated hours into figuring stuff out as their primary job. That's what data teams dropped into other teams become in practice.
People often have plenty of ideas and hunches as to what can be improved, but they don't have permission to take the necessary time to fully flesh out the ideas or run tests. Data teams act as an enabler because they not only have explicit authorization to work on such problems to the exclusion of other work, but they can also create time and space for such work within the team they're helping, in the form of interviews and workshop sessions.
Sure, other people with different skill sets, like researchers, consultants, and cross-functional tiger teams, can all have this magical combination of time and mandate. But for better or worse, data people are very often voluntold for this sort of role because “everyone’s got data they need help with”. As practitioners of a discipline grounded in generalist tools, we’re often not the perfect people for the job, but good enough.
It does help for finding work when things get slow, so I suppose I’m not complaining.
If you’re looking to (re)connect with Data Twitter
Please reference these crowdsourced spreadsheets and feel free to contribute to them.
A list of data hangouts - Mostly Slack and Discord servers where data folk hang out
A crowdsourced list of Mastodon accounts of Data Twitter folk - it’s a big list of accounts that people have contributed to of data folk who are now on Mastodon that you can import and auto-follow to reboot your timeline
Standing offer: If you created something and would like me to review or share it w/ the data community — my mailbox and Twitter DMs are open.
Guest posts: If you’re interested in writing a data-related post, whether to show off work, share an experience, or because you need help coming up with a topic, please contact me. You don’t need any special credentials or credibility to do so.
About this newsletter
I’m Randy Au, Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. Counting Stuff is a weekly newsletter about the less-than-sexy aspects of data science, UX research and tech. With some excursions into other fun topics.
All photos/drawings used are taken/created by Randy unless otherwise credited.
randyau.com — Curated archive of evergreen posts.
Approaching Significance Discord — where data folk hang out and can talk a bit about data, and a bit about everything else. Randy moderates the Discord.
Support the newsletter:
This newsletter is free and will continue to stay that way every Tuesday. Share it with your friends without guilt! But if you like the content and want to send some love, here are some options:
Share posts with other people
Consider a paid Substack subscription or a small one-time Ko-fi donation
Tweet me with comments and questions
Get merch! If shirts and stickers are more your style — There’s a survivorship bias shirt!