For the past couple of weeks, I’ve been endlessly repairing, maintaining, painting, and otherwise fixing up the roughly 80-year-old stack of bricks that I call home. In the breaks between the physical labor, I found myself marveling at how architects and designers laid out the floor plans of my home, and of every other building in the world, so long ago, and how everyone today does their best to use or modify those structures.
While anything can be changed with enough money, most houses in the city are slowly modified bit by bit, upgrading electric circuits and pipes here, replacing heating systems there, moving some walls in one decade, while adding a closet in another. Different periods had different conceptions of how big a bedroom, kitchen, or living room should be, and you can often date a place by its layout. But despite all the changes that happen over time, much of the original structure remains because it’s too expensive to change.
I find all of that to have interesting parallels to the relationship that we data scientists have with our data infrastructure.
When it comes to data infrastructure, for many of us there are two states to live in.
For those starting from near-scratch, you get to survey the tool space and pick whatever architecture and tech seem most promising. These folks get to build a completely new house and make all the fun design decisions, incorporating the latest innovations and comforts. It’s a fun process, but you have to struggle with forming an opinion about how you’re going to be using data in the future. It’s just like deciding whether you want three bedrooms with en-suite bathrooms, or would rather use the space for five bedrooms instead.
For the rest of us, we largely make do with the infrastructure that we are given. We’ve inherited an existing, functioning house built by a previous generation, perhaps to specifications long out of date. Some bits of the infrastructure might be falling apart and need attention, but enough of it works that the business still runs. There’s that one room that doesn’t quite get enough heat in the winter, and no amount of messing with the furnace seems to help, so that room gets a nice space heater added to it.
I’m sure you can think of any number of data things you’ve had to do to “make the best of the infrastructure you’ve got”. If somehow you don’t have an A/B test framework but an incremental launch framework, maybe you’ve launched a test version at 50% for a couple of weeks and watched metrics. That list of ETL jobs running off of a single machine with cron is “good enough” for regular email reports. The data warehouse is a 10 petabyte mess of files on a massive Hadoop cluster, but the query processing nodes are running a modern SQL engine on top.
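For that makeshift A/B test scenario, the readout often boils down to comparing a conversion metric between the launched and held-back cohorts. Here’s a minimal sketch of what that might look like, using a two-proportion z-test built from the standard library; the cohort sizes and conversion counts are entirely made up for illustration:

```python
# Makeshift A/B readout for a 50% incremental launch: compare a conversion
# metric between the launched cohort and the held-back cohort.
# (Illustrative sketch only; all numbers below are hypothetical.)
from math import sqrt, erf

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Normal CDF via erf: Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical two-week cohorts at a 50% rollout
z, p = two_proportion_z(conv_a=1_230, n_a=24_000,   # got the new feature
                        conv_b=1_150, n_b=24_100)   # held back
print(f"z = {z:.2f}, p = {p:.3f}")
```

Of course, an incremental launch framework rarely gives you the randomization and exposure-logging guarantees of a real experiment platform, so a readout like this comes with caveats you’d want to spell out to stakeholders.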
“Making do” lives in the shadows
If you’re big into watching Data Twitter like I am, you’ll probably notice that we collectively love talking about new stuff — it’s part of our tech DNA to get excited about new things. Nothing gets us talking like a debate on what “data mesh” is, or the latest features of a hot new cloud database or DAG tool.
At the same time, we almost never hear the word “Hadoop” any more (I tried searching “Hadoop” in Google News this week and only saw garbage spam articles). We definitely don’t hear about people running analytics on a big Oracle database, though I’m sure someone must be doing it somewhere. I’ve heard customer stories about how petabytes of data are stashed in giant Hadoop clusters and powering analytics at giant companies.
At some point these companies might decide it’s worth whatever cost it takes to migrate their data to some different architecture and tech stack, but until then everyone working for them will go about their lives using the existing infrastructure. When they hit upon some tricky aspect where the existing architecture fails them, they’ll usually be able to figure out some way to handle it.
But since the spotlight has moved on to other things, the most common time you’ll hear about these “legacy technologies” is when companies migrate off of them, like Twitter moving a bunch of stuff from HDFS into BigQuery. All the hundreds or thousands of work-hours that went into using the old system, hitting bottlenecks, and trying workarounds before finally deciding that migrating to a new architecture was necessary get glossed over with a “well, this wasn’t working so we’re using the New Thing(tm)”.
I think that’s sorta sad. I understand that a team moving on to new tech wouldn’t want to dwell on describing something that they’ve just thrown out, but for tons of teams using the old system, that was the house they were given and dedicated themselves to learning. Only a privileged few were in a position to decide that the collective pain had hit a critical point and it was time to re-architect everything.
It’s like how everyone in an apartment building just dealt with the slightly wobbly, occasionally out-of-service elevator for years until the building management replaced the whole elevator system. There’s no glory in the toil of walking up 5 floors due to a broken elevator — you’ll never see a blog post about it — but the people who had to deal with it for years would have stronger leg muscles to show for it.
“Making do” is an inescapable skill
But regardless of whether you start working at a company that already has a big, unwieldy legacy data infrastructure footprint, or at some new startup using the hottest new tech, you’re eventually going to hit the rough edges of the infrastructure and need to make do.
No building is ever built perfectly because the needs of the occupants will change and the building itself will age with time. New tech will become old tech and be replaced by something else in the eternal hype cycle. Since there will never be enough resources to justify hopping to the next shiny thing in the hype cycle, the sheer inertia will force everyone to figure out a way to make their existing stuff do the things they need.
I find this process fascinating because, whether we like it or not, we all wind up learning this peculiar class of skill: navigating the quirks of our infrastructure. It’s not something that’s easily taught, because every architecture I’ve ever seen is a unique beast pieced together from multiple products and tech stacks. No two are ever alike, and you need a lot of creativity to handle problems in the space while juggling speed, cost, and efficiency. It’s a set of mostly thankless data engineering puzzles that falls into our laps at work every so often.
I also don’t really know how to teach other people how to work on these problems. They’re truly like puzzles in that you often have to see the problem, then see what resources you have, and take a stab at a solution. And while there are various design patterns that pop up repeatedly for handling certain architectural problems, there’s no magical way to solve everything.
Standing offer: If you created something and would like me to review or share it w/ the data community — my mailbox and Twitter DMs are open.
About this newsletter
I’m Randy Au, Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. Counting Stuff is a weekly newsletter about the less-than-sexy aspects of data science, UX research and tech. With excursions into other fun topics.
Curated archive of evergreen posts can be found at randyau.com.
Join the Approaching Significance Discord, where data folk hang out and can talk a bit about data, and a bit about everything else.
All photos/drawings used are taken/created by Randy unless otherwise noted.
Supporting this newsletter:
This newsletter is free, share it with your friends without guilt! But if you like the content and want to send some love, here’s some options:
Tweet me - Comments and questions are always welcome, they often inspire new posts
A small one-time donation at Ko-fi - Thanks to everyone who’s sent a small donation! I read every single note!
If shirts and swag are more your style there’s some here - There’s a plane w/ dots shirt available!