Seeing the data science work all around us
I have a recurring thing I wonder about. What does an architect see when they look at a building or space? What does a chef think when they're served food from another chef? What goes through a craftsperson's mind when they see an example of their craft? It's like watching a sport like figure skating, where one of the commentators is typically a former competitor who can point out (and explain) why certain techniques are difficult, so the audience can better appreciate what's going on.
The reason I'm curious is that situations where the expert is either impressed or horrified tend to be extremely informative about what is considered difficult or easy, and I love getting a glimpse of that. I get a rush of endorphins when I'm reading someone's translation and they pull off an elegant rendering of a really hard turn of phrase while making it look effortless, or when I see a horrible mess of a dataset sliced cleanly into causal relationships with some skillfully picked features, or when I spot a particularly clever snippet of SQL. I want to live that vicariously for fields I'm not deeply familiar with.
But academic voyeurism aside, in the modern world we are literally surrounded by the artifacts of data science. We interact with multiple filter bubbles and recommendations systems when we listen to music, search for restaurants, shop for goods. We’ve got AI systems filtering our spam, helping us search the internet, and even helping write our code and emails. We’re bombarded with cookies, analytics, online survey popups, and every browser is part of some uncountable number of A/B tests.
For most people in the world, all that stuff is as invisible as air. Their products have always worked that way, and even when things change all the time, it's just another thing to complain about before moving on with life. It all appears as black boxes, so they rarely notice it, let alone think about the insides.
This isn’t true for us data scientists. We very often DO see what is going on in software, or are at least somewhat aware that it’s happening if we stop to think about it.
A question I have for everyone reading this is… What do you see? What are you impressed by? What do you mock as being shoddy? What horrifies you? What are you learning?
Here are some of the things that come to mind when I look at the world and see the work of other data scientists. The details have been carefully anonymized to protect the identities of those mentioned.
I see, and try to manipulate, recommender systems
Probably the most prominent data science thing that I see and think about is recommendation systems, because they're everywhere these days. While I don't even work in this space, and haven't worked on any part of a recsys in the past decade, they remain top of mind.
The biggest thing is simply being aware that these systems are very often slurping down tons of behavioral features and trying to cosine-similarity them against a population. It makes me very conscious of what I like, swipe, click, buy, or otherwise interact with on many sites, because I know that if I accidentally feed “the system” some spurious interest, it might follow me for ages. This affects how I use such systems, since I'll often avoid certain features if I suspect they'll generate signals I don't want (for example, video autoplay).
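To make that "cosine-similarity against a population" idea concrete, here's a toy sketch. Everything here is hypothetical: the feature categories, the counts, and the users are made up purely for illustration, and real systems use far richer features and models.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two interaction vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical per-user counts of clicks on [cooking, gaming, finance] content.
me     = [12, 1, 0]
user_a = [10, 2, 0]   # mostly cooking clicks, like me
user_b = [0, 1, 15]   # mostly finance clicks

# A naive recommender would surface whatever my nearest neighbor engaged with,
# so one stray click on finance content nudges my vector toward user_b.
print(cosine_similarity(me, user_a) > cosine_similarity(me, user_b))  # True
```

That last comment is the whole point of my wariness: every interaction is another component in the vector, whether it reflected genuine interest or not.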
What I find most interesting is how the sweet spot of “just right” recommendations seems nigh impossible to hit. Our lives are just disjoint enough from our devices that the systems can't pick the perfect content for a given time and place. So we get curious heuristics that fail in annoying ways.
Everyone knows about the phenomenon where if you order a toothbrush on Amazon, you'll be recommended more toothbrushes for days. It's been an issue for years, and you absolutely know that the people responsible for that system have heard about it and must have run countless tests to make it not happen. It's extremely telling that the problem has persisted for so long, likely because they can't get rid of the effect without making significantly less money.
Ads don't freak me out as much (though the industry itself still scares me)
Having a deeper knowledge of how adtech systems and techniques like re-targeting (where you visit Site A's page and Site A's ads appear to follow you across the internet for a few days) work has also stopped me from being surprised at how ads for products can follow me across the internet. This behavior is often considered creepy by people who aren't familiar with how our data is sold and resold across the major ad networks (and how a handful of networks cover almost everything you see). Things can get associated via all sorts of metadata signals and database features, such as browser cookies, IP, location, etc. There's always a chance that your info will overlap just enough with something else that ads start including you in a seemingly unrelated segment.
Sometimes seeing these things gives me a rough idea of what the retailer’s marketing strategy is, since what kinds of campaigns they run provides some hint as to where they believe they are getting their customers.
Surveys are *Fascinating*
Surveys are absolutely everywhere. These days it feels impossible to not trip over some popup box or email asking me for a couple of minutes of my time to fill out some survey or other.
I can't help but feel giddy whenever I have time to open up a survey and take a peek at what some organization is trying to understand. To be up front, writing good surveys is an extremely difficult task. Scientists spend years, over many iterations, refining their questionnaires for their particular research questions. Industry folk like myself have to do it with a guess and a prayer, because we only get a handful of attempts and little time for validation work.
Very often I just note the little technical bits, like “oh, they're using a 5-point (or 100-point) scale to measure my happiness today”. These get filed into memory as inspiration in case I need to measure something similar in the future. Other times, I use the questions to peek inside organizational processes: every time I'm asked to fill out a satisfaction survey after a customer service interaction, I know that management has likely implemented metrics around those scores that might be unreasonable.
Sometimes, from the series of questions involved, I'm left wondering whether the researcher was trying to prime me (e.g. get me to associate w/ a particular in-group like a political group or my age) as part of an experiment, or if they're asking me about 3 separate topics (for example, politics, financial savviness, then suddenly my health) because they're pooling research questions from a fellow researcher in the same department. There's never a way to tell, but I'm always wondering if someone's trying to get at something sneaky.
All the above are mostly educational, or just interesting trivia. But there's a whole separate group of silly stuff.
For example, take NPS questions, now notorious for being complete wastes of time and largely still used because executives continue to buy into the metric. My favorite NPS survey of all time was from a dev or analyst who clearly stopped caring. It allowed three possible responses instead of the traditional 0–10 scale: “1-6”, “7-8”, and “9-10”. Savvy NPS folk will recognize that those correspond to the score buckets in the weird NPS methodology, and it simplifies the calculation of the score to counting up the buckets and doing some arithmetic. It totally throws away any of the (supposed, largely debunked) validity of the measure for the pure convenience of the analyst. I honestly wish I could be so bold in my own work.
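For anyone unfamiliar with the arithmetic that survey was shortcutting, the NPS calculation really is just bucket counts and a subtraction. A minimal sketch (the example responses are made up):

```python
def nps(responses):
    """Net Promoter Score from 0-10 ratings.

    Promoters score 9-10, detractors 0-6, passives 7-8 (ignored).
    NPS = %promoters - %detractors, giving a score from -100 to 100.
    """
    promoters = sum(1 for r in responses if r >= 9)
    detractors = sum(1 for r in responses if r <= 6)
    return 100 * (promoters - detractors) / len(responses)

# 3 promoters and 2 detractors out of 7 responses: 100 * (3 - 2) / 7
print(round(nps([10, 9, 7, 8, 3, 6, 10]), 2))  # → 14.29
```

Which is exactly why the lazy three-bucket survey loses nothing computationally: counting picks of “9-10” and “1-6” is all the formula ever uses.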
Other times, it’s just fun to stand back and admire poorly designed surveys. I’ve had multiple instances where I’ve gotten a phone notification from an app asking me what I thought about a recommendation the app had supposedly shown me recently. It strongly implies that I should know what it’s talking about, but the two events are so utterly divorced in time and space that I have absolutely no idea what they’re asking about. It’s such a horrific user experience that it makes me wonder who approved this to start with. A lot of methodological safeguards broke down in that study design to put it in front of end users, and I’m deeply curious what motivated it and if they’re getting anything useful at all.
Every large web site has probably been A/B tested, but that doesn’t make it good
The whole A/B testing methodology has become ubiquitous over the past 20 years. There are plenty of frameworks and services that help people who don't have access to data practitioners run and evaluate their own A/B tests. The main cost of running an A/B test these days is mostly the design and engineering work needed to build out a second version to test.
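The statistical core those frameworks wrap is also pretty small, which is part of why they've spread so far. As an illustration only (real frameworks vary in their exact methods), a bare-bones two-proportion z-test for comparing conversion rates, with made-up numbers:

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Z-statistic comparing conversion rates of variants A and B.

    Uses the pooled-proportion standard error; |z| > 1.96 is the
    conventional threshold for significance at the 5% level.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical test: 200/10,000 conversions on A vs 260/10,000 on B.
print(round(two_proportion_z(200, 10_000, 260, 10_000), 2))  # → 2.83
```

The arithmetic being this easy is a double-edged sword: the hard parts of A/B testing (randomization, sample sizing, not peeking early) are exactly the parts a few lines of code can't enforce.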
My assumption is that for most web sites that generate money and aren't using a standardized e-commerce system, the owners have most likely run various tests in an attempt to make more money. The more custom the site, the more likely they have the resources and motivation to do so. So I consider what I see in front of me as the current “this works the best for us” state of the world.
But that doesn't mean those sites are particularly easy to use, or particularly better than an un-tested, straight-out-of-the-box e-commerce platform. It's like when Amazon changes their checkout buttons from yellow to white-with-yellow-outlines: it probably did somehow help them squeeze out an extra fraction of a percent in clicks (equating to many millions of dollars), but it doesn't fundamentally change my life as a user. I see these everywhere and get a sense that I'm watching a page get slowly hyper-optimized around a local maximum, and I wonder if it'll ever escape that part of the parameter space.
Please share some things you see
I'd really like to hear any stories you might have about interesting data science things that you've found in the wild that made you stop and think. The good and the bad, the clever and the horrendous. We all have different backgrounds, and so different things jump out at us. Ideally in the comments so that other people can see, but otherwise email or DM me and maybe I can anonymize and share them at a later date.
Update 2022-10-04: I won't name names, but one example I saw: when professional survey researchers see other researchers give patently bad “how to make a survey” advice… it can get spicy.
Also social scientists seeing economists discover stuff about people:
Update 2022-10-05: And a well-reasoned counterpoint to the above PNAS paper
Standing offer: If you created something and would like me to review or share it w/ the data community — my mailbox and Twitter DMs are open.
New thing: I'm planning to occasionally host guest posts written by other people. If you're interested in writing a data-related post to show off work or share an experience, or if you need help coming up with a topic, please contact me.
About this newsletter
I’m Randy Au, Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. Counting Stuff is a weekly newsletter about the less-than-sexy aspects of data science, UX research and tech. With some excursions into other fun topics.
All photos/drawings used are taken/created by Randy unless otherwise credited.
randyau.com — Curated archive of evergreen posts.
Approaching Significance Discord — where data folk hang out and can talk a bit about data, and a bit about everything else. Randy moderates the discord.
Support the newsletter:
This newsletter is free and will continue to stay that way, so share it with your friends without guilt! But if you like the content and want to send some love, here are some options: