Attention: As of January 2024, we have moved to counting-stuff.com. Subscribe there, not here on Substack, if you want to receive weekly posts.
Last week, a reader who is an early career data analyst emailed me a question. Paraphrased a bit: how do you deal with the pressure and anxiety of potentially making mistakes? To make things more complicated, this person works at a consultancy, which means they switch clients and projects regularly and have to learn completely different things each time. That makes it all the more difficult to become a true expert in a domain without a much longer career history.
The consultant part puts a bit of a different spin on things, so I’ll cover that in the latter half of this post.
But as for the core of the question, how do I handle the pressure of not making mistakes… well, I’m not super scared of making a giant visible mistake because I’ve got a decent set of risk mitigation habits in place. I also work in an environment where honest mistakes aren’t a problem. The worst that can usually happen is I have to face some uncomfortable questions while I promise to make updates to analyses.
Failure happens within a system
In a non-consultant environment, the most healthy response to failure is to use it to learn how to make things better at a system level. It’s important to recognize that failures tend to happen because of process and policy issues, not merely acts of an individual. While an individual shouldn’t delete the production database, it’s a failure of the system to have allowed individuals to delete that database without safeguards to begin with.
All professional work involves complex processes with multiple points of potential failure, many of which are outside the control of any given individual. Given that systems can fail for all sorts of reasons, the only reasonable way to minimize failure is to tackle the problem at that system level. We can’t expect humans to be perfect or even aware of everything enough to not make mistakes.
We can see how this manifests in engineering culture, where blameless post-mortems are often the ideal. Everyone is expected to accidentally break production at some point in their career. It’s a rite of passage. In good engineering cultures, it’s somewhat celebrated. Senior engineers will stand up and admit when they blow up a system, sometimes costing the company a lot of money. They’ll show how such situations are handled, and how systems are made more resilient against similar failures in the future.
The only reason any single engineer could have brought down a production system is that their code managed to hit some weird edge case nobody knew about. The flaw wasn’t caught in code review and the damage wasn’t contained by other systems meant to catch such things. The solution isn’t to punish the engineer, but to improve processes so that similar events won’t happen again. Good engineering culture encourages people to step up and point out issues instead of hiding them because it leads to the greater good.
Luckily, data science took a lot of its engineering processes from software engineering teams, and so, most of the places I’ve worked at or otherwise observed have healthy cultures that embrace this philosophy. If we blow up production, or create a model that has data leakage that makes it useless on live data, we stop, we correct the issue, we learn from it. If we’re lucky, we give a talk about it at a Data Mishaps Night in the future.
Learning to avoid failures
But just because we’re given a bit of space for honest mistakes and failures doesn’t mean we can keep making the same mistakes all the time. Failing to learn from mistakes is never looked upon well.
So in the interest of avoiding mistakes, most of the more senior data people in industry develop very similar basic processes for working with data. A lot of these are “obvious” once you’ve been burned by common mistakes enough times, but they’re not immediately obvious to people just starting out.
Knowing how to pull/work with the data
First, we need to be confident in pulling data and working with it. Ideally you’ve got domain knowledge about the systems and data-generating processes involved. You’ve hopefully learned the weird gotchas that lurk in the data, like how there are duplicates of certain rows, or the timestamps are in a particular time zone. Whenever you come across a new system, getting up to speed on its quirks is a top priority.
For people who are just starting out, or switching domains, this is a challenge. It takes time. It means learning and asking other experts a lot of very basic questions. The worst part is you often don’t know what questions to ask and the people you’re asking won’t be able to think of everything on the fly. But the important part is to take the time to do this necessary work and not rush past it.
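As an illustration, here’s a minimal first-pass sanity check, sketched in pandas with made-up file and column names (swap in whatever your system actually uses), that surfaces a few of those classic gotchas before they bite:

```python
import pandas as pd

# Hypothetical example: an events table with an "event_id" key and a
# "created_at" timestamp. File and column names are made up for illustration.
df = pd.read_csv("events.csv", parse_dates=["created_at"])

# Gotcha 1: duplicate rows hiding behind a supposedly unique key.
dupes = df[df.duplicated(subset=["event_id"], keep=False)]
print(f"{len(dupes)} rows share an event_id with another row")

# Gotcha 2: timestamps with no time zone attached -- UTC? Local? Ask the owning team.
tz = df["created_at"].dt.tz
print("timestamp time zone:", tz if tz is not None else "naive / unknown")

# Gotcha 3: nulls and suspicious ranges in the key columns.
print(df[["event_id", "created_at"]].isna().sum())
print("date range:", df["created_at"].min(), "to", df["created_at"].max())
```

None of these checks prove the data is right, but each one tends to surface exactly the kind of “oh, nobody mentioned that” quirk you’d otherwise discover mid-presentation.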
Getting feedback early
One thing analysts starting out often have trouble with is not showing anyone their work until it’s “ready”. Usually, by the time they judge their work ready to show someone else, they’ve already spent a bunch of days on an analysis, and any mistake would mean throwing out days of work. Very often they spend extra time making the work “presentable”, sometimes going as far as building a formalized slide deck with a complete story.
My advice here is to start preparing to show people your work the moment you actually have something in your head you want to say. That is, when you first start forming that initial analysis that feels like it’s going somewhere. Then, get ready to actually show it to someone once you have the initial charts and plots that make some kind of sense. This is going to be uncomfortably early in the work process.
In my experience, I catch a lot of my mistakes simply by preparing to show someone else. That is, if I slap a series of basic plots and charts into a plain slide deck that acts as an image container and try to write out the context and reasoning for the story I’m trying to tell, I’ll start catching things I missed. Simple things like talking out (to myself) the various steps of the analysis and its conclusions usually bring up questions like: how do I know this? How is this justified? What about alternative explanations?
Just like with rubber duck programming, forcing myself to stop making the many leaps of intuition I use and write everything down so that others can follow helps me catch errors.
Once you get something coherent enough to show someone, show the work to someone who could be considered an expert on the thing you’re analyzing. What you want to know is whether your conclusions and reasoning actually align with that expert’s understanding. You want them to ask you all the questions that help them become confident in your analysis, as well as bring up all the objections and caveats they can think of. Those questions are usually going to be the same ones that important stakeholders are going to want answered.
Be very aware of confirmation bias
If you ask any experienced data analyst what is a sign that they might be making a mistake, I’m absolutely convinced everyone will mention a variation of this. When an analysis goes “too well”, when you quickly find the data you want and it’s confirming a hypothesis you have, that is the exact moment to be on high alert that you’re making a mistake. Confirmation bias is a hell of a drug, and everyone has a story or three where they made a glaring mistake because they stopped being critical about their work too early in the process.
Many of us eventually learn to double- and triple-check our work when things just feel like they’re going too smoothly. It’s a good habit to get into.
Gracefully failing
Okay, having safeguards to prevent mistakes is important work — creating and updating those processes is one of the responsibilities of more senior folk. But just like how engineering teams have playbooks written beforehand to handle outages, we all also need to know what to do when things actually go wrong.
Probably the scariest thing to experience is presenting something to a very important stakeholder, like the executive suite, and having someone bring up a significant caveat or issue that you had not encountered. For example, maybe there’s a very large customer with a sweetheart contract that completely messes up your revenue projections. It’s certainly happened to me multiple times throughout my career, and I’m sure it’ll happen again despite all the preparation I do beforehand.
While everyone’s strategies for handling such situations will vary, probably the one thing you shouldn’t do is get into an argument about whether the issue is valid. By definition, if it’s something you haven’t encountered or thought about before, you’re going to be in a poor position to put up a decent argument to begin with. Getting into a fight without being prepared is just going to burn precious trust and credibility.
My personal strategy is to just openly admit that the issue wasn’t something I had considered or known about before and I’ll follow up with an updated analysis. If I have a good understanding of the issue and the underlying data, I might propose some speculation as to how such an issue may or may not affect the overall analysis. Sometimes, such caveats will materially affect the story being told, other times they won’t.
But what about “can’t fail” situations?
There are just some situations where failure is much less acceptable. For example, things involving the health and safety of people, or sending things out to the media/press where it’s hard to “undo”.
The problem is that no risk can ever really be eliminated, just mitigated. Every mitigation can fail for a myriad of reasons. No matter how important something is, the best you can do is spend resources to set up increasingly elaborate defenses to bring the chances down to whatever level is deemed acceptable.
Maybe you require that two separate teams using different methods do an analysis and see if they come to comparable conclusions. Maybe you bring in outside 3rd party auditors to act as a cross check. Maybe you do an experiment multiple times to make sure an effect is reproducible. Maybe you do all of those and add some more. At some point the cost of mitigation is going to outstrip the actual risk.
But then… consultants
So all of the above presupposes a typical industry job with a healthy work environment where there’s a good amount of psychological safety. It doesn’t include toxic workplaces where failure isn’t tolerated (and I’ve been in my share of those). It also may only selectively apply to people who work as consultants.
Consultants are in a weird spot because they hop from job to job. It means they have no chance to master any particular data system before they have to move on. They also typically market themselves as experts in a particular service or field, but they trade mastery over a given context (like your company’s weird database) for expertise over a broader problem space.
This means that for every given job, they’re not going to know all the little caveats that lurk within a given client’s data systems. Sure, they might have domain knowledge of the space and know roughly what the data should look like, but every client’s system is going to have unique bugs and quirks that will have to be figured out on a very short timeline.
To make things more complicated, consultants have an image of expertise to maintain. While no one is going to blame them for having to ask questions about how a client’s specific data system works, it usually doesn’t look good to present obvious mistakes to clients. Failure is just less acceptable in many contexts. That means the whole calculus about being safe to make mistakes gets weird.
So very often, junior consultants have to learn those dynamics from more senior ones. Since it’s less acceptable to make mistakes externally, a lot of the mitigation methods have to happen internally. You get peers to cross-check, you find people to review presentations beforehand, you find people to safely poke holes in analyses before they become visible externally.
Whatever happens, you’ll learn and move on
Probably the scariest thing about failure is that, until you’ve been through a few, it seems like the end of the world. But life usually finds a way to carry on.
Even if you actually do get fired because of some critical mistake you’ve made (which is extremely rare), there’s usually a path forward. Life doesn’t have a fade to black and end credits sequence.
Tomorrow will come, along with new challenges and opportunities.