Doing better with Excel
Shooting slightly less of your foot off
A couple of days ago, I got pinged on Twitter about a topic that has been engraved into my soul over my many years as a data analyst — Excel Best Practices. Heck, back in 2019 I wrote a small post about how it’s okay to use spreadsheets in data science (tl;dr: in certain use cases, like Exploratory Data Analysis and quick throwaway work, they’re great).
Given the massive importance of Excel in just about every industry you can imagine, “doing Excel correctly” is of monumental importance. There are many well-documented cases of Excel screwups leading to various forms of disaster. I won’t bore you with links to such disasters, readily found at your favorite search engine, but will re-link to the European Spreadsheet Risks Interest Group, a group of folks who come together to hold conferences (!) and review papers on ways to avoid screwing up your organization via spreadsheet mishaps.
So what’s some good prior art in this area?
Given how there are plenty of brilliant minds thinking very hard about the problem of avoiding spreadsheet screwups, what are some good references to look at?
Probably the most referenced (that I’ve come across) is a paper from The American Statistician in 2018 by Broman and Woo, “Data Organization in Spreadsheets”. It covers a lot of good rules to follow for using spreadsheets to collect, store, and analyze data.
Also in the Twitter thread above, there was this link:
The presentation linked within the tweet is also great at summarizing what good data within a spreadsheet should look like.
Some common themes
While I highly recommend checking out the two references above for nuance, I’ll summarize some themes I’ve noticed:
NEVER use color/text effects to indicate meaningful data - while there are tons of reasons for this, probably the biggest is that you can’t (easily) write a formula that expresses “if green, then 1”
One header row for every table of data, and please use good (understandable, unique, etc) names for fields
One piece of data per cell, no “Brooklyn, NY 11223” for “City State Zip” in a single cell
One row per data point, with no blank cells in the table
Use join keys to tie multiple tables together
ISO 8601 for dates
Be consistent with data entry (spaces, spelling, capitalization, etc), use the validation tools to enforce such rules
Formulas should apply consistently to the entire column/row without change, no one-off modified formula for row 46
Keep raw data around, never change it
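To make a couple of these rules concrete — one piece of data per cell, and ISO 8601 for dates — here’s a small Python sketch. The address string, field names, and date are all made up for illustration; the point is that a crammed “City State Zip” cell has to be picked apart with a pattern before you can compute on it, which is exactly the work you avoid by entering the pieces separately in the first place.

```python
import re
from datetime import date

# A hypothetical raw entry that crams city, state, and zip into one cell
raw_cell = "Brooklyn, NY 11223"

# Recover one piece of data per field -- work that's only necessary
# because the original cell violated the one-piece-per-cell rule
match = re.match(r"^(?P<city>.+),\s*(?P<state>[A-Z]{2})\s+(?P<zip>\d{5})$", raw_cell)
record = match.groupdict()

# Store dates as ISO 8601 (YYYY-MM-DD) so they sort correctly even as text
record["entered_on"] = date(2021, 3, 14).isoformat()

print(record)
# {'city': 'Brooklyn', 'state': 'NY', 'zip': '11223', 'entered_on': '2021-03-14'}
```

The same un-cramming is doable in Excel with TEXTSPLIT or a pile of LEFT/MID/FIND formulas, but it’s brittle either way; entering clean fields up front is cheaper.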
So what do you have to add?
I was going to start classifying the different themes under the broad classes of what they’re trying to address: quirks in Excel (like the date thing), avoiding human error, avoiding anti-patterns, etc. But on reflection, such a classification would be a bit too abstract to help a newer Excel user avoid issues they haven’t yet encountered.
Luckily, if you’re interested in that, there are entire papers, like this one from 2008 by Powell, Baker and Lawson, that review and classify many different types of spreadsheet errors.
Instead, I think there’s at least one other lens to look at spreadsheets — the rules to avoid errors differ based on what you’re doing. Spreadsheets are used to accomplish tons of data tasks — entry, storage, ad-hoc analysis, “packaged” analysis tool, and even sharing/collaboration. Not every rule is relevant to everything. The overall theme of “prevent humans from making mistakes” remains, but the emphasis shifts.
When you’re doing data entry and storage, you’re most interested in guarding against a particular set of issues. Human error in data entry can be prevented with data validation tools. Making sure your data is “tidy” (in the R sense of tidy data), in neat one-row-per-datapoint form, with join keys across tables, is important for storing and maintaining data for future use.
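As a toy illustration of what join keys buy you (the tables, key, and field names here are invented), a shared ID lets two tidy tables be stitched together on demand instead of duplicating customer details into every order row:

```python
# Two hypothetical "tidy" tables, one row per data point,
# tied together by a shared join key (customer_id)
customers = [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 2, "name": "Grace"},
]
orders = [
    {"order_id": 101, "customer_id": 1, "total": 25.00},
    {"order_id": 102, "customer_id": 2, "total": 40.00},
    {"order_id": 103, "customer_id": 1, "total": 15.00},
]

# Join on the key -- conceptually what VLOOKUP/XLOOKUP in Excel
# or a JOIN in a database does
by_id = {c["customer_id"]: c for c in customers}
joined = [{**order, "name": by_id[order["customer_id"]]["name"]} for order in orders]

for row in joined:
    print(row)
```

Because the name lives in exactly one place, fixing a typo in the customers table fixes every joined row at once.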
I think the majority of the guidance on avoiding spreadsheet issues focuses on this data entry/storage part. It stands to reason: it’s the most common aspect of spreadsheet usage, every project will do it to some extent, so it will naturally be more prominent in errors.
When sharing and collaboration becomes a greater concern, you start having to deal with the issue of “make sure other humans will see where the machinery is”. Spreadsheets tend to hide the underlying machinery (the formulas and references) by default. It’s important to make sure that everyone looks in similar places, and can identify what can/can’t be changed. This is often where you start seeing conventions about where to put things, building data dictionaries, and using colors to highlight things.
Use cases shift a bit when you move into doing ad-hoc analysis. Now it’s very important that you leave your raw data alone and leave an analysis paper trail that clearly documents how you go from raw data to your final analysis. There’s also a much greater need to stay organized, and to know when to fork calculations off into separate worksheets. You also need to start thinking about storing formula parameters in easily visible spots (often highlighted w/ colors for emphasis!). You’ll also have to put artifacts in easy-to-spot locations: parameters should be near the top and left, formulas need to be consistent going down a data table column, etc.
Also the tidy data part starts to break down in the analysis phase because eventually you need to start grouping and filtering and summarizing data into formatted plots and tables. Things become progressively more one-off-y to suit the particular need of the analysis and presentation layer.
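Here’s a minimal sketch of that tidy-to-summary transition (the rows and column names are invented): a tidy one-observation-per-row table gets collapsed into a grouped totals table, the kind of presentation-shaped output a pivot table produces, and the point where one-off formatting starts creeping in.

```python
from collections import defaultdict

# Hypothetical tidy data: one observation per row
rows = [
    {"region": "East", "sales": 100},
    {"region": "West", "sales": 80},
    {"region": "East", "sales": 50},
]

# Summarizing collapses tidy rows into a presentation-shaped table,
# like a pivot table grouping sales by region
totals = defaultdict(int)
for row in rows:
    totals[row["region"]] += row["sales"]

print(dict(totals))
# {'East': 150, 'West': 80}
```

The summary table is a dead end for further analysis — you can’t easily un-pivot it — which is one more reason to keep the tidy raw data around untouched.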
Next, if you’re venturing into the realm of “spreadsheet as packaged analysis product”, which is something that happens pretty often in a consulting context, you have even more concerns about clients accidentally changing/breaking a spreadsheet. Now you need to use features like locking sheets and cells to prevent mishaps. VBA and scripting become more common in this space, and now you have to make sure that stuff is properly documented in a way that anyone trying to maintain the sheet even knows to look for code that’s not immediately obvious.
I honestly don’t see too much discussion about this phase of spreadsheet work because the skills needed to reach this point are less common. If you’re writing VBA code to front-end your spreadsheet, you’re already in a new dimension of crazy. As with most things, entering a new dimension just allows for even more degrees of freedom to cause trouble. Now you can introduce straight-up programming bugs since you have access to a Turing complete language. (Okay, as of 2021, Excel formulas themselves have become Turing complete so you are also free to break things that way too. See tweet below.)
Microsoft Excel (@msexcel): “Excel keeps evolving to give users even more. Now, with the power of LAMBDA, you can write your own reusable functions with the Excel formula language. See how we're transforming Excel: https://t.co/QaurxsvtHA”
Finally, past the whole “packaged analysis product” stage of advanced spreadsheet shenanigans (and oftentimes before that point), you’ll hit a point of complexity where the answer to “how to minimize spreadsheet issues” becomes “stop using a spreadsheet”.
Often the decision is made because everyone is better off using a Real Database instead of a spreadsheet. Sometimes it’s because you really want to stop recreating quirky business logic using Excel formulas and it’d be better to have production code handle that stuff. Other times it’s because you need to share the data, or have it be readily available to multiple consumers. Very often, it’s because something tragic has happened to the sacred Excel file, data’s been lost, and there’s motivation to “do it right” now that it’s obvious how important the file was.
Ideally you’d want to make the decision before tragedy strikes but humans tend to like maintaining the status_quo.final.v54.seriously.xlsx
About this newsletter
I’m Randy Au, currently a Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. The Counting Stuff newsletter is a weekly data/tech blog about the less-than-sexy aspects of data science, UX research, and tech, with occasional excursions into other fun topics.
All photos/drawings used are taken/created by Randy unless otherwise noted.
Supporting this newsletter:
This newsletter is free, share it with your friends without guilt! But if you like the content and want to send some love, here are some options:
Tweet me - Comments and questions are always welcome, they often inspire new posts
A small one-time donation at Ko-fi - Thanks to the folks who occasionally send a donation! I see the comments and read each one. I haven’t figured out a polite way of responding yet because distributed systems are hard. But it’s very appreciated!!!