Data DIY in the age of industrial pickaxes and shovels
Talking about using our tools instead of the tools themselves
Attention: As of January 2024, we have moved to counting-stuff.com. Subscribe there, not here on Substack, if you want to receive weekly posts.
I know much of the world is off enjoying the new Zelda game. Luckily I don’t have a Switch so I dodged that bullet. I just bought Persona 5 Royal instead and writing this is taking me away from my 75hr save file and final dungeon.
Last week, I took a detour talking about how custom keyboards are assembled and programmed. But towards the end, I mentioned how it had sparked some thoughts about data tooling, and here we are this week with that.
Currently, DIY keyboards are experiencing a bit of a boom in popularity, riding the overall rise of mechanical keyboards. There’s a lot of activity around DIY kits that let you put together something to your desired specifications, as well as a vibrant community of people who design new and interesting things for fun. The hobby is a relatively inexpensive one as far as hobbies go; even something built with the fanciest parts typically only costs a few hundred bucks in total. That’s a far cry from most hobbies, where the sky is the limit in terms of ways to burn cash.
As is tradition with DIY, people built their own things because the market wasn’t providing what they wanted. Specific keyboard layouts and switch type combinations weren’t available, so people made their own, even if it cost them a premium in time and parts to do so. But over time, manufacturers have drawn inspiration from the community and absorbed many of the more common features and layouts people like. Nowadays, someone who doesn’t want to solder can usually get something that suits their needs, and the cost of a custom DIY build will often exceed that of a commercial offering. DIY can’t really compete on price due to economies of scale, but it offers cool customizability.
I’ve seen this trend happen with headphone audio equipment over the years too. In the mid 2000s, there was a super vibrant DIY community of people designing and building headphone amplifiers. I spent almost $300 in parts to build a hybrid tube amp for my nice headphones around 2009. It’s still great and I continue to use it 14 years later. But I recently bought a $100 solid state amp that sounds 99% as good. There’s literally no reason to DIY a headphone amp at this price point any more. The commercial market, with its dedicated R&D budgets and scale, has caught up to any boutique DIY-ing at that level. It’s good for consumers, but boring for tinkerers.
DIY still happens, but it’s for the niches that are too small for companies to build offerings around. So, sadly, much of the knowledge needed to create new DIY designs just gets lost over time, much like how the original automotive pioneers built cars in their garages, a practice that is now practically unheard of.
Data is far along this path
Lucky for me, just last week, SeattleDataGuy posted a retrospective on the past decade in data engineering. Well worth the read, partly for nostalgia for those of us who lived through that arc of time, but also for people unfamiliar with it to get the context for all that I’m talking about.
Back in the 2010s, being a data scientist often involved a lot of DIY tool-making, whether it was spending hours hacking together MapReduce jobs just to answer a simple count(*) on a few terabytes of log files, or sitting through endless meetings with engineering to build out, launch, and test a telemetry system to track events accurately and run A/B experiments. If we were honest with ourselves, a lot of our work back then was toil spent reinventing slightly different-sized wheels for very similar problems.
During this time, everyone had their own little set of scripts for converting JSON to CSV, log parsing scripts, email report sending scripts, and other similar utilities. There was a lot of gap-filling going on.
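For a flavor of what those gap-filling utilities looked like, here’s a minimal sketch of the classic JSON-to-CSV converter everyone carried around. The function names and file layout here are my own invention for illustration, and it assumes the input is a JSON array of flat objects; nested values would need more handling, which is exactly where these little scripts always grew hair:

```python
import csv
import json


def union_fieldnames(records):
    """Sorted union of all keys across records, so rows with
    missing fields still fit under one shared CSV header."""
    return sorted({key for rec in records for key in rec})


def json_to_csv(json_path, csv_path):
    """Flatten a file containing a JSON array of flat objects into a CSV.

    Records missing a field get an empty cell via `restval`.
    """
    with open(json_path) as f:
        records = json.load(f)
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(
            f, fieldnames=union_fieldnames(records), restval=""
        )
        writer.writeheader()
        writer.writerows(records)
```

Trivial as it looks, deciding details like "what header do you use when records disagree on their keys" was precisely the kind of judgment baked into everyone’s private copy.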
Only with time and experience did people start seeing the patterns across related problems and build tools to solve them. Some fraction of those tools got released into the wild as the products, open source or closed, that we know today — DAG tools, experimentation frameworks, analysis packages. All of them were just repeated problems that eventually got dealt with.
Most of us working during that period didn’t actually build or contribute to the development of the tools that we now use all the time. But we all individually faced similar problems, and discussed and shared gripes and solutions as part of the DIY community of data work.
But as the gold rush of data work progressed, companies figured out there was lots of money to be made selling us data practitioners our pickaxes and shovels — the old adage that the real profits of the California Gold Rush went not to the gold prospectors but to the people who sold tools to said prospectors. That business is infinitely more predictable than the actual business of using the tools to dig for business gold. So we got our Redshift, BigQuery, Snowflake, dbt, and all our other big tools. These familiar tools offered us the clear value of making a lot of our work possible, if not easy, and they gained market adoption based on the value they provided.
Nowadays, just about every relevant data store we want has a SQL interface. Most analytical features we want in a database are also included across all vendors. If we want an A/B testing framework, there’s very likely one available for whatever frontend technology we’re using, and the integration is (relatively) smooth. And while everyone still uses a haphazard mix of spreadsheets, R, Python, and notebooks, there’s not that much friction interoperating between tools and file formats these days.
DIY hacking is not nearly as necessary as it was 10 years ago. The places that once required a full-on custom translation layer can now be handled by the software itself. We’ve feature-requested away a lot of the disgusting grunt work needed to get our tools to interoperate. Work is seriously easier than it used to be thanks to these tools, and so a lot of the modern data science discussion online revolves around how our tools are evolving and what features they’re getting.
Readers might remember that I once wrote about how DS has a tool obsession. I still think that we are too obsessed with discussing our tools and not our methods. But it’s even worse in that we’re discussing tools-as-product-releases and rarely the tools that we’ve made for ourselves.
So where has much of that DIY tooling energy gone? I think it’s been hidden away in the shadows and gaps between all the products we use now. It’s the little scripts that glue together our custom in-house systems. It’s the very small function that fixes some weird data formatting quirk that isn’t being handled right. It’s the obscenely complex plotting code needed to get a “simple” line chart done programmatically, with just the right visual tweaks, for a dashboard. It’s the 500 lines of pandas code used to wrestle data into a usable analysis for a decision. It’s the *gasp* code we use for “cleaning” data.
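For a taste of what one of those "very small functions that fix a weird formatting quirk" looks like, here’s a hedged sketch. The quirk itself — numeric exports padded with whitespace, sprinkled with thousands separators, and using accounting-style parentheses for negatives — is an invented but typical example, not from any specific tool:

```python
def parse_messy_number(value):
    """Coerce an exported 'number' like ' 1,234.50 ' or '(42)' to a float.

    Handles quirks a hypothetical upstream export produces:
    padding, thousands separators, and parenthesized negatives.
    Unparseable values come back as None instead of crashing the run.
    """
    if value is None:
        return None
    text = str(value).strip().replace(",", "")
    # Accounting convention: parentheses mean negative.
    if text.startswith("(") and text.endswith(")"):
        text = "-" + text[1:-1]
    try:
        return float(text)
    except ValueError:
        return None
```

Ten lines, zero glamour, and yet exactly this kind of thing quietly keeps a dashboard from reporting garbage every Monday.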
I find it sad that we spend all our time talking about the tools that big data analytics companies/software packages release, and so little on how we’re gluing them together into actually useful things. If we stopped being blinded by the latest shovel design from Big Data Co and talked about the interesting holes we’ve dug using our own shovels, we’d be a lot better off as a community.
But what’s good to share?
The insidious thing about sharing work is that, for most of us who aren’t out to become “influencers”, our own work looks boring, drab, and mundane. Why would anyone want to see the bash script I wrote to move data from one database to another? The code is ugly and not presentable anyway. Even this post itself, if you take away the stories about other hobbies, could be considered a nostalgia trip pining for the days when it’d take half a day to wrangle a stupid CSV from one database to another.
But the surprising part is that many things worth sharing do look boring and unimpressive. It’s the exact same phenomenon that stops a lot of people from blogging or writing about their own experiences — the stuff only feels repetitive because you’ve been thinking about it a lot, probably more than anyone else has thought about it in the past month.
So yes, that janky bash script that sets up your notebook environment from scratch on a new machine, or the simple job-starting framework you glued together in bash, or that awesome SQL query that you use to turn event data into a pure-SQL funnel analysis are all worthy of talking about and sharing with other people. Maybe it’s through a blog post, or just a brown-bag talk. The code might be ugly, but someone in the world out there will appreciate knowing that they’re not the only person who had to do similar things to get the job done.
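The core logic of that funnel query — for each user, did they hit step 1, then step 2 strictly after it, and so on — is sketched below in plain Python rather than SQL, since the idea is the same either way. The event names, tuple shape, and function name are all invented for illustration:

```python
from collections import Counter


def funnel_counts(events, steps):
    """Count how many users completed each prefix of `steps` in order.

    `events` is an iterable of (user_id, timestamp, event_name).
    A user counts for step k only if they hit steps 0..k with
    strictly increasing timestamps.
    """
    by_user = {}
    for user, ts, name in events:
        by_user.setdefault(user, []).append((ts, name))

    counts = Counter()
    for evs in by_user.values():
        evs.sort()  # chronological order per user
        depth, last_ts = 0, float("-inf")
        for ts, name in evs:
            # Advance only when the next expected step appears later in time.
            if depth < len(steps) and name == steps[depth] and ts > last_ts:
                depth, last_ts = depth + 1, ts
        for k in range(depth):
            counts[steps[k]] += 1
    return [counts[s] for s in steps]
```

In a warehouse this usually becomes a chain of window functions or self-joins, but having the reference logic written out somewhere is exactly the kind of shareable artifact the post is arguing for.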
I know I’d love to hear more of this stuff, somewhere, somehow.
Oh, and if you want you can write a guest post for this newsletter too.
Standing offer: If you created something and would like me to review or share it w/ the data community — my mailbox and Twitter DMs are open.
Guest posts: If you’re interested in writing a data-related post to either show off work, share an experience, or get help coming up with a topic, please contact me. You don’t need any special credentials or credibility to do so.
About this newsletter
I’m Randy Au, Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. Counting Stuff is a weekly newsletter about the less-than-sexy aspects of data science, UX research and tech. With some excursions into other fun topics.
All photos/drawings used are taken/created by Randy unless otherwise credited.
randyau.com — Curated archive of evergreen posts.
Approaching Significance Discord — where data folk hang out and can talk a bit about data, and a bit about everything else. Randy moderates the discord.
Support the newsletter:
This newsletter is free and will continue to stay that way every Tuesday, so share it with your friends without guilt! But if you like the content and want to send some love, here are some options:
Share posts with other people
Consider a paid Substack subscription or a small one-time Ko-fi donation
Tweet me with comments and questions
Get merch! If shirts and stickers are more your style — There’s a survivorship bias shirt!
As someone really late to the party (I entered the professional workforce 3 years ago) - this post hits really close to home. I work as a data analyst, but I have ambitions of becoming a data engineer, and my strategy has always been slowly eking out more and more technical work for myself in my current position. These shadows and gaps between products are where I live and work.
I was lucky to work in a place where there are a lot of gaps that I can fill, but every time a new system is introduced, I get really anxious that this one is going to fill all the needs, and I will be instructed to go back to ad-hoc reports, Jupyter, Kibana dashboards, and talking to business people.
But that's not here yet. I spent at least 3 hours last week poking around in GitHub Actions to make sure that when a specific config for a specific system is changed, an email is shot out to a downstream-dependent team with a weird release cycle, where they rarely do e2e tests; if we change a lot, they won't find out until much later down the release train. It even seems like it functions.
Also, I once built a periodic reporting system (in more engineering lingo, "last-mile ETL") entirely on Jenkins, just because our team did not know Airflow existed and thought it would be too complex. Apparently, using Jenkins as a "glorified crontab" is widespread?
Tooling isn’t always a distraction. It can help rub your nose deeply in the data, engraving patterns and anomalies into memory as some read-transform-send process is designed, tested, and debugged. Before CAD emerged as a standard tool in the 80s, field geologists at the U.S. Geological Survey would spend much of the off season bent over drafting tables, wielding colored pencils, patiently coloring maps. This was not an obvious task for Ph.D. scientists.
What long experience had shown, however, is that those hours were among the most valuable in the report preparation process, because of all the details there was time to absorb and to mull over in the back of the mind. That time can yield rich opportunities to detect the kinds of contradictions between data and theory that lead to paradigm shifts.
Nobody, except maybe Brian Kernighan, remembers what purpose motivated Ken Thompson, Dennis Ritchie, and their buddies at Bell Labs to tool up UNIX and C. It wasn’t in their job descriptions, no committee approved it, and no project manager oversaw it.