Normconf is this week! Please join us in watching (and chatting on the Slack) live on the 15th! Last week, all the lightning talks were made public. I mentioned my talk, ”Everything is on fire and you should contribute”, in last week’s post, but I’m going to spend this week expanding on the talk a bit since you can only cram so much into 5 minutes.
The core idea I wanted to express when submitting the talk proposal was “everyone’s code breaks and you shouldn’t feel bad when yours does”. Data science is a giant mishmash of people from many varied backgrounds — most of us are NOT from a software engineering background. For professional software engineers and the people who run and manage the servers, personally doing something that brings down the production environment is a rite of passage. For everyone who’s self-taught like me, or at best coming from a tech-adjacent role, there’s a lot of fear surrounding “but what if I mess things up?”
While there are always exceptions, in good engineering cultures, breaking something critically important is seen as an inevitable part of the job. It should be treated as an important learning opportunity for the individual and the organization as a whole, not something to be feared or ashamed of. Blameless post mortems are specifically designed so that engineers can safely learn how to improve processes and software to avoid repeating incidents.
Similarly, the aviation industry’s incident reports are written blamelessly, and we owe much of aviation’s safety record to that culture of learning from incidents. If you like hearing about incidents, I follow this YouTube channel that goes over many incident reports.
Given the varied backgrounds of data scientists, I’m positive that there are many people out there who haven’t really been exposed to this important cultural aspect of working with software and complicated systems. So, I gave a talk about it.
Everything is on fire, everywhere
The initial inspiration for my talk was my work email inbox. At a semi-regular cadence, I get an actual flood of emails, 20-40 at a time, all from the automated “your data warehouse pipeline has failed, here’s the stack trace” notifications that get sent to my team. This happens every day.
And it’s not like I’m working at a small tech startup like I used to, where I’m the only data person and there isn’t enough engineering support to help me run the data infrastructure. This is at Google Cloud, which has more resources and internal knowledge of how to build and run highly available systems than almost anywhere else on the planet. There’s a team that built and runs the data warehouse, there are other quant researchers who try to fix their pipelines when their stuff breaks, there’s even an on-call rotation that has fixing pipelines as part of its duties. It’s clear that having resources, knowledge, experience, and time isn’t enough to prevent a constant stream of data infrastructure breaking.
Maintaining code and systems is like fighting against the 2nd law of thermodynamics — chaos will happen and things will find a way to break.
I found that notion to be surprisingly freeing. Why worry so much about making sure I can think of ALL the edge cases that might break my code, from people changing the data tables to handling upstream processing delays and backfilling, when I can just handle the obvious ones that are part of standard local engineering practice and then fix the bug when it inevitably breaks? My code isn’t usually sitting on some critical path, and certainly no one’s life is ever in danger if my dashboards are broken for a couple of days. The risk of failure is so ridiculously low, so why does it loom so large in my mind all the time?!
Not letting the perfect get in the way of the good is also apparently very good for stress and imposter syndrome.
For much of my career, I was just some random data nerd who was self-taught in programming and technical things. I would start with writing queries, then eventually graduate to cron jobs that lived outside of the general engineering framework. Finally, I’d get to the point where my stuff became “load bearing” enough that I couldn’t just run it on a random laptop. But who was I to start pushing my sloppy, untested code into production?
Well… turns out that’s what everyone else effectively does anyway. The main differentiation between people who push production code constantly and “amateurs” like myself is merely one of experience and process. Their code is better because both the author and the code have survived more bugs than mine — because they’ve been exposed to more pathological situations.
We don't live in an age where our software comes with a mathematical proof guaranteeing that our computations will always be correct. I’m not sure that age ever really existed outside of theory (a perfectly implemented algorithm can still break due to hardware implementation details), but it certainly isn’t true now.
The many reasons that stuff breaks
Due to the length of the lightning talk, I had to very quickly gloss over the many different ways that things can fail. I only really devoted one slide to the “typical” things that cause stuff to fail in the day to day:
Someone making a change somewhere that ultimately affects your code
An assumption made in code that wasn’t actually true
A bug being introduced during a code change
I glossed over these because, to a certain extent, they are the most common causes of failure and the ones everyone’s most likely encountered. They are also failure conditions that we can do something to mitigate. To varying degrees of success, engineering cultures around the world have developed techniques to mitigate damage from these sources because they’re so common and thus foreseeable.
For example, all the talk about “data contracts” and “data meshes” that has been going on in the data world over the past couple of years is a direct attempt to mitigate the issue of “someone changes something somewhere (often an upstream data table or API) and breaks downstream data systems”. They offer differing viewpoints and solutions to the problem. Whether you have informal agreements to notify each other of changes, or code-level enforcement of certain data conventions complete with blocking unit and integration tests, it is all working toward that goal.
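As a minimal sketch of what that code-level enforcement can look like (the table and column names here are entirely hypothetical, and this is just one of many ways to do it), a blocking check might verify that an upstream table still matches the schema your pipeline expects before anything else runs:

```python
import pandas as pd

# Hypothetical "contract": the columns and dtypes our pipeline expects
# from an upstream events table. All names here are made up.
EXPECTED_SCHEMA = {
    "user_id": "int64",
    "event_name": "object",
    "event_ts": "datetime64[ns]",
}

def check_contract(df: pd.DataFrame) -> None:
    """Fail loudly (in CI or at pipeline start) if upstream has drifted."""
    missing = set(EXPECTED_SCHEMA) - set(df.columns)
    if missing:
        raise ValueError(f"Upstream table is missing columns: {sorted(missing)}")
    for col, expected in EXPECTED_SCHEMA.items():
        actual = str(df[col].dtype)
        if actual != expected:
            raise TypeError(f"Column {col!r} is {actual}, expected {expected}")
```

Run as a blocking test, a failure here stops things at the boundary where the agreement was broken, instead of somewhere far downstream.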
Similarly, you can adopt strategies along the lines of what’s called “defensive programming”. For example, even if you expect a field to never return NULL in any situation, you might still incorporate an explicit check and special handling for NULLs because they can pop up unexpectedly. This can make your code more resistant to certain foreseeable pathological cases. These practices often grow out of the engineering experience of putting out fires and implementing procedures to avoid future ones. We learn them either from bitter experience or by example.
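A minimal sketch of that idea in Python (the metric and field names are hypothetical): even if upstream swears the values are never null, check anyway and decide explicitly what happens when they are:

```python
def revenue_per_user(total_revenue, user_count):
    # Defensive handling of "can't happen" inputs, so a surprise NULL
    # becomes a known, handled condition instead of a crash downstream.
    if total_revenue is None or user_count is None:
        return None  # A NULL snuck in upstream; propagate it explicitly.
    if user_count == 0:
        return 0.0  # An empty cohort shouldn't raise ZeroDivisionError.
    return total_revenue / user_count
```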
All this stuff is a cultural phenomenon that is built up within individuals and the organization as a whole over time. There’s always new stuff to learn and innovate upon because we’re always finding new ways to break stuff.
As data scientists, part of the craft of doing the engineering side of this work is to engage with this social/technical process and be a part of it. Write code, see it catch on fire, put out the fire and learn for next time. Repeat. Indefinitely. There isn’t a way to improve otherwise.
It probably wasn’t clear in the lightning talk, but this is the “work” part of contributing to the general firestorm that is releasing code.
I was probably asking a bit too much of that one slide.
Then there’s stuff you can’t prevent
Instead of dwelling on all that stuff above about building a culture of good coding practices, I went with the more interesting line of “even if you did all that controllable stuff perfectly, the world will STILL find ways to trip you up”. Solving the human problem doesn’t solve all the problems!
Governments making arbitrary decisions that have deep implications for computer systems was a common theme. From constantly changing time zone rules to laws that micromanage, require, or ban certain computing practices, the whole area is rife with examples of how your code is completely at the mercy of outside human forces. Just imagine if Indiana had wound up passing that 1897 bill that would have declared pi to be equal to 3.
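Time zone rule changes are a good concrete case. The usual defense is to never hardcode UTC offsets and instead name the zone, so that rule changes arrive as data updates (the IANA tz database) rather than code changes. A minimal sketch in Python, with a made-up meeting time:

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+; rules come from the IANA tz database

# Hardcoding "UTC-5" for New York breaks twice a year (DST) and again
# whenever lawmakers change the rules. Naming the zone delegates the
# ever-shifting rules to tzdata updates.
meeting = datetime(2023, 3, 15, 9, 0, tzinfo=ZoneInfo("America/New_York"))
print(meeting.utcoffset())  # -1 day, 20:00:00 (i.e., UTC-4, since DST applies)
```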
Then there are natural phenomena that can cause chaos in data systems, like my favorite topic in the world — time and leap seconds. But there are other examples too, like how a microwave oven turned out to be the explanation for artifacts in radio astronomy data that had been baffling scientists. Another fun one is how LIGO, the super-sensitive instrument that can detect gravitational waves from merging black holes, is capable of sensing increased road traffic nearby during rush hour. Or you can just have a squirrel chew through your fiber optic line and bring down production in fun networking ways.
Fun fact: literally within days after I submitted my talk recording, the CGPM voted to pause the use of leap seconds on or before 2035 for 100 years. Part of it seemed to be driven by fear of a negative leap second causing trouble, on top of all the trouble a normal leap second already causes. This decision will largely render my last slide obsolete by 2035.
I’m sure that very few of these sorts of incidents were planned for when the instruments and systems were first designed. They just happened, surprised us, and now we know a bit better. There will always be more. No amount of creative “I’ll wrap everything in giant TRY blocks!” energy will save you.
I wanted the audience to appreciate the fact that nature and the world at large is much more creative than any of us can imagine. That’s part of the joy of living in this big complex world. It also means our code will constantly be on fire.
But that’s OK. We all know it’s going to happen.
Contribute your code, it’ll be fine!
If you’re looking to (re)connect with Data Twitter
Please reference these crowdsourced spreadsheets and feel free to contribute to them.
A list of data hangouts - Mostly Slack and Discord servers where data folk hang out
A crowdsourced list of Mastodon accounts of Data Twitter folk - a big, community-contributed list of data folk who are now on Mastodon, which you can import and auto-follow to reboot your timeline
Standing offer: If you created something and would like me to review or share it w/ the data community — my mailbox and Twitter DMs are open.
New thing: I’m also considering occasionally hosting guest posts written by other people. If you’re interested in writing a data-related post to show off work or share an experience, or if you need help coming up with a topic, please contact me.
About this newsletter
I’m Randy Au, Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. Counting Stuff is a weekly newsletter about the less-than-sexy aspects of data science, UX research and tech. With some excursions into other fun topics.
All photos/drawings used are taken/created by Randy unless otherwise credited.
randyau.com — Curated archive of evergreen posts.
Approaching Significance Discord — where data folk hang out and can talk a bit about data, and a bit about everything else. Randy moderates the Discord.
Support the newsletter:
This newsletter is free and will continue to stay that way every Tuesday, so share it with your friends without guilt! But if you like the content and want to send some love, here are some options:
Share posts with other people
Consider a paid Substack subscription or a small one-time Ko-fi donation
Tweet me with comments and questions
Get merch! If shirts and stickers are more your style — There’s a survivorship bias shirt!
Long, long ago in a universe far, far away, we had a data center with 7? 8? big Univac machines. We installed Data West array processor cards into each of them. Then our FFT code began crashing randomly. After much amazing detective work by some really brilliant people, it was discovered that RF interference between adjacent cards was flipping bits. Copper plates were installed between the cards and the problem was solved. I should note that these FFT jobs would run for many hours, so the code was set up with checkpoint/restart capability, because jobs crash. There was a whole team devoted to "bomb repair" that worked 24x7. Some things never change.
I figured out way too late in life that there is no substitute for trying things, making mistakes, and adjusting. There are no shortcuts. Often, we use this as a buffer and it prevents us from doing any hard work. It's not "doing the right thing" that gets us there, but training ourselves to perform better.