Attention: As of January 2024, we have moved to counting-stuff.com. Subscribe there, not here on Substack, if you want to receive weekly posts.
Recently at work, I had been given a whole week of largely uninterrupted time to work on anything that I wanted (within scope of my actual work and not more Baldur’s Gate 3, obviously). After some thinking, I decided to spend my time putting together something that would help make a repetitive analysis request significantly less painful.
The problem the tool was solving is straightforward stuff. I’m regularly asked to analyze data coming off of various web pages. Each page has lots of event data tracking all sorts of stuff, much of which isn’t interesting to me. So the usual process of running the analysis is largely something like this:
Dump all the events from the page out, manually identify the ones I care about
Take the list of events I want, write a query to pull their associated counts
Analyze
Very simple stuff. But it’s also something that you can’t automate away because of the “pick out events I care about” step. While there are a lot of repetitive bits in the queries that could be copy/pasted from a template, no one ever really bothered to make such a template. It always felt faster to just slap down an ad-hoc query from scratch than try to search for a SQL code snippet buried in my files.
Well, I took about two days putting together a Colab notebook that would store a list of events I don’t care about to pre-filter for me, take a list of page URLs to dump out the remaining events, then accept my list of events to generate an analysis query for me. Again, despite some silliness around programmatically generating SQL, very simple stuff.
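To make that flow concrete, here is a minimal sketch of the three pieces, not the actual notebook. Everything named in it is an assumption for illustration: the `events` table, the `page_url` and `event_name` columns, the sample event names, and the helper functions themselves.

```python
# Hypothetical sketch: table, column, and event names are assumptions,
# not taken from the real notebook.
IGNORED_EVENTS = {"page_view", "heartbeat"}  # "Edit these!" events to pre-filter


def filter_events(all_events, ignored=IGNORED_EVENTS):
    """Return the events not on the ignore list, sorted and de-duplicated."""
    return sorted(set(all_events) - set(ignored))


def sql_literal(value):
    """Quote a string as a SQL literal, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_count_query(page_urls, event_names, table="events"):
    """Generate a query counting each chosen event on each chosen page."""
    if not page_urls or not event_names:
        raise ValueError("need at least one page URL and one event name")
    urls = ", ".join(sql_literal(u) for u in page_urls)
    names = ", ".join(sql_literal(n) for n in event_names)
    return (
        "SELECT page_url, event_name, COUNT(*) AS event_count\n"
        f"FROM {table}\n"
        f"WHERE page_url IN ({urls})\n"
        f"  AND event_name IN ({names})\n"
        "GROUP BY page_url, event_name\n"
        "ORDER BY page_url, event_name"
    )


# Step 1: dump events and drop the known-boring ones.
remaining = filter_events(["page_view", "signup_click", "heartbeat", "video_play"])
# Step 2: hand-pick the interesting events and generate the query.
query = build_count_query(["https://example.com/pricing"], ["signup_click"])
print(remaining)
print(query)
```

The manual "pick out events I care about" step still sits between the two calls; the code only removes the copy/paste work around it. Building SQL by string formatting is fine for a personal notebook with trusted inputs, but anything taking input from other people would be safer with the database client's own query parameters.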
In about two days’ time, I had cobbled together a very crude tool that largely accomplished what I set out to do. What did I do with the other three days that I had?
I spent most of the time making the tool usable by other people. You don’t spend a bunch of years studying users and helping build good products without learning a thing or two about stepping into the shoes of a user and, at the least, smoothing over the worst of the spaghetti code and out-of-order code blocks.
I spent the days massively refactoring and reorganizing the code, adding convenience features, tacking on extra lines of function-level documentation, markup, and notebook-wide instructions all in the name of better usability. By the end, the thing had clearly marked chunks of “Edit these!” blocks, a logical top-down execution order, and instructions for users to follow. It was a tool that didn’t need my supervision or prior knowledge to use. The interface, its inputs and outputs, were all still quite primitive, but it’s nevertheless a tool. It’s a thing that was built for a specific purpose, and now I and others can use it to achieve that purpose. It can’t do anything else.
But that’s good enough. While I could easily see ways I could expand the usefulness of the tool, I wasn’t going to somehow try to turn it into some kind of product.
It’s too easy to forget where the bar for “tool” is
In daily life, we are surrounded by products: objects and services that are created and designed to be sold to us for various reasons. The vast majority of the things we buy are finished goods. My hair blow dryer, electric drill, and computer mouse all have pretty, molded plastic interfaces hiding a lot of empty space along with a nest of circuit boards and wires inside. If you ever take anything apart, you’ll see just how much thought and polish usually went into everything to give us an appealing, comfortable interface despite the inner workings being anything but.
This depth of design is true even for simple things like hammers, of which there are very many. If we try to compare the crude tools we build in a handful of days to even the simplest things we can purchase, it’s not even a contest. Most consumer products reached this level of design complexity thanks to market competition and years of design evolution.
Our stuff is crude because we’ve only spent at most a couple of hours working on it. Very few of us have any reason to build physical tools for our own use. Most don’t even have interest in watching other people use tools, let alone making them. Even if we do have the skills and need to build tools for our own use, there’s a very stark contrast between what some amateur builder can accomplish in their home workshop and the manufacturing capabilities of a modern factory.
So I think that many of us, myself included, have lost (or never developed) a sense for what qualifies as a “tool” that we can share with others. We’re too used to seeing “feature complete” products with complex features, fancy integrations, and slick interfaces. All those big open source projects with thousands of man-hours behind them, all the multi-million dollar data startups, there’s no way my janky notebook is even in the same category.
But that’s entirely the point.
Tools don’t need to be anything other than able to accomplish their job. The reality is that even the floppy plastic fork from lunch that you re-used to help dig and repot an office plant is a tool. The scrap of cardboard put under an uneven table leg is also a tool. Same goes for my simplistic colab notebook with more support text than code.
Why is this recognition that something is a tool important? Because tools can be shared! Tools are often worth sharing! There are likely other people in your team, your organization, even the wider community, who would probably find what you built to be useful. Sure, there are a million “send an email from Python” examples out there, but the one you specifically wrote is concretely tested to work with your company’s email infrastructure. Anyone who reuses your work will save those hours of work you put in, and that’s valuable even though no one else outside the company will ever make use of it.
Back when data science was still an unknown phrase, most of us had to build our own tools to accomplish things because vendors didn’t really exist yet. Things have completely changed since then. We’re luckily not under pressure to invent those wheels any more. Lots of simple tasks have been absorbed by data products in their incessant need to scope creep their way into “the one central data platform that no one ever uses for that purpose”. But I think the pendulum has swung a bit too far in the “build or buy” analysis. We’re a bit overdue for a revival of looking at our own humble hacks and sharing them with other people. This is especially true in an economic climate where you can’t just buy your way out of every little problem.
But as a nice side effect of how better tools have become broadly available, we’re not building scoops and buckets anymore, we’re moving on to mortaring bricks together and sharing those designs out. And while it feels weird to share out stuff that “isn’t polished”, it’s probably more polished than is truly necessary.
It’s OK to not worry about “done”-ness
As with all projects, there’s always a compulsion to make things “ready” for other people to use. The documentation needs a rework. Those //TODO comments need to be cleared. Maybe the whole thing is spaghetti code that not even you can unravel any more. Every one of these sounds like a reason to not share your work with anyone else because you don’t want such a mess to be attributed to you.
But I’m here to encourage you to share stuff out much sooner than you feel comfortable with. So long as someone can understand how to use your tool without you having to explain it to them in person, it’s probably good enough to at least share with some trusted coworkers. At least those people can talk to you if they have trouble. In most instances, that is more than enough because it might just work well enough for them anyways.
And if they come across flaws and bugs, then at least you have a reason to go and fix them because you know someone is actually using it. That certainly beats doing it all before sharing, only to find out no one needs the tool to begin with.
Standing offer: If you created something and would like me to review or share it w/ the data community — just email me by replying to the newsletter emails.
Guest posts: If you’re interested in writing a data-related post to show off work or share an experience, or if you need help coming up with a topic, please contact me. You don’t need any special credentials or credibility to do so.
About this newsletter
I’m Randy Au, Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. Counting Stuff is a weekly newsletter about the less-than-sexy aspects of data science, UX research and tech. With some excursions into other fun topics.
All photos/drawings used are taken/created by Randy unless otherwise credited.
randyau.com — Curated archive of evergreen posts.
Approaching Significance Discord: where data folk hang out and can talk a bit about data, and a bit about everything else. Randy moderates the Discord.
Support the newsletter:
This newsletter is free and will continue to stay that way every Tuesday. Share it with your friends without guilt! But if you like the content and want to send some love, here are some options:
Share posts with other people
Consider a paid Substack subscription or a small one-time Ko-fi donation
Get merch! If shirts and stickers are more your style, there’s a survivorship bias shirt!