Caching makes much of the world go faster.
We live in an age of massive, distributed, interconnected computing systems. Whether you’re using “the cloud” or a small app running on a single machine, there is, without a doubt, a lot of caching going on at all levels. On most days these caches are invisible to us end users, and our lives would be significantly worse without them.
In general, we like caching systems because they're very efficient ways of boosting the overall responsiveness and performance of a system. How do you keep your web site running while a million users constantly hit refresh? How do you display the results of a 1-hour query when everyone will be loading the chart at the same time for the big meeting? The answer is very often “put (a variant of) a cache in there”. We might call it something else like a CDN, or a memo, or a buffer, or any number of related concepts, but zoomed out the strategies all resemble each other. Whatever the specific technique, temporarily storing a thing between source and requester comes with lots of useful properties like lowering latency for the requester and saving work at the source. The pattern is so ubiquitous that almost everything we do these days goes through multiple layers of caches.
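The core trick of putting a store between requester and source can be sketched in a few lines. This is a minimal, hypothetical example (the function names and the fake “1-hour query” are my own stand-ins, not any particular system), using Python’s built-in memoization to show the latency win on a repeat request:

```python
import time
from functools import lru_cache

# Hypothetical "expensive" source: a stand-in for the 1-hour query.
def slow_query(report_id):
    time.sleep(0.1)  # pretend this is real work at the source
    return f"results for {report_id}"

# The cache sits between requester and source: repeat requests
# skip the expensive work entirely.
@lru_cache(maxsize=128)
def cached_query(report_id):
    return slow_query(report_id)

start = time.perf_counter()
cached_query("big_meeting_chart")  # cache miss: pays the full cost
first = time.perf_counter() - start

start = time.perf_counter()
cached_query("big_meeting_chart")  # cache hit: near-instant
second = time.perf_counter() - start

print(f"first call: {first:.3f}s, second call: {second:.6f}s")
```

The second call returns almost instantly because the source is never touched again, which is exactly the “lower latency for the requester, save work at the source” property described above.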
But all the convenience these techniques bring can also create situations that lead to weird or occasionally bad user experiences. And for those of us interested in understanding users, those experiences become weird, near-impossible measurement puzzles.
For example, while testing out a product with my team, we tried to be good administrators and did our best to give our tester account the minimal permissions necessary for the task. This meant starting with just a few permissions and adding only what we needed. We had to guess which permissions we needed based on skimming the docs. When we guessed wrong and the product failed to work, we had to try again until things worked. Trial and error is already a poor user experience, but it was amplified a thousand times over once we learned that a permission change could take a couple of minutes to “go through”. Somewhere in the global permissions/access systems, in how everything checks permissions, a caching mechanism was utterly ruining our day. Make too many bad permission guesses and we’d lose our whole work session. In the end, just for testing expediency, we gave our account admin-level permissions so we could get on with the test in reasonable time. We literally had to abandon our security goal for speed, which is not something we want anyone to have to do.
As a quantitative UX researcher, I'm super interested in figuring out how many people hit a similar experience… except it’s a convoluted mess. Should I stalk a user's permission-related click stream, watching them add and remove permissions, with multi-minute waits in between, until they stop? Without any indication of their intent to separate them from other users? At scale? It is far from easy to get to any answer, let alone a clear one.
Other times, caching messes with benchmarks by speeding things up when you don't want it to. Benchmarking procedures have to run code multiple times just to average out the effects of the cache hiding in your disk drive, or your CPU shaving critical milliseconds off a load time. Page load data can be skewed by a user's local browser cache eliminating network latency… unless they force a refresh or you invalidate the cache somehow.
The more you get into the business of measuring things about systems, the more you realize that caches are waiting to trip you up everywhere. You'll get errors that appear intermittently and then suddenly clear up, or find bugs that make no sense unless a cache is involved.
Just the other day, I had a DAG pipeline that broke due to a typo. Fixing the typo and rerunning the pipeline didn't fix the issue. It was bewildering. Eventually I realized that hitting “retry” on my failed pipeline execution kept failing, but launching a brand-new run of the DAG worked. Why? Apparently all the queries in my DAG were saved at the initial run, so my typo was also preserved in that run for eternity. This type of caching exists to prevent a DAG from being edited in the middle of execution and crashing in a mixed state; it's generally a good defense against a very likely error condition, since you really don't want to be debugging that class of bug. But in my situation it looked like a bizarre error.
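The retry-versus-new-run behavior can be sketched in a few lines. This is a hypothetical illustration, not any specific orchestrator's API: many pipeline tools snapshot task definitions at run creation, so a retry replays the snapshot, typo and all, while a fresh run re-reads the current DAG.

```python
import copy

# Hypothetical sketch: a run snapshots the DAG's queries at creation.
dag = {"extract": "SELECT * FROM evnts"}    # typo in the table name

failed_run = {"tasks": copy.deepcopy(dag)}  # snapshot frozen with the run

dag["extract"] = "SELECT * FROM events"     # fix the typo in the DAG source

retry_sql = failed_run["tasks"]["extract"]  # retry still sees the old SQL
fresh_run_sql = dag["extract"]              # a new run picks up the fix

print(retry_sql)      # SELECT * FROM evnts
print(fresh_run_sql)  # SELECT * FROM events
```

The snapshot is the "cache" here: it protects a running pipeline from mid-flight edits, at the cost of making a retry immune to your fix.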
Caches and their eventual expiration are often why we can run the same command on the same computer and magically get a different result. Conventional wisdom says that deterministic machines like computers should always give the same results for the same inputs (deliberate randomness aside).
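A minimal time-to-live (TTL) cache makes the effect concrete. This sketch is my own toy implementation (the class and helper names are assumptions, not a real library): the same call returns different values depending only on whether the cached entry has expired.

```python
import time

# A minimal TTL cache sketch: entries expire after ttl_seconds.
class TTLCache:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._store = {}

    def get(self, key, compute):
        now = time.monotonic()
        entry = self._store.get(key)
        if entry is not None and now - entry[1] < self.ttl:
            return entry[0]               # still fresh: serve cached value
        value = compute()                 # expired or missing: recompute
        self._store[key] = (value, now)
        return value

counter = {"n": 0}
def fetch():                              # each real fetch gives a new value
    counter["n"] += 1
    return counter["n"]

cache = TTLCache(ttl_seconds=0.1)
a = cache.get("x", fetch)   # computes: 1
b = cache.get("x", fetch)   # cached: still 1
time.sleep(0.15)            # let the entry expire
c = cache.get("x", fetch)   # recomputes: 2
print(a, b, c)              # 1 1 2
```

Same key, same code, different answers, and nothing "random" happened; the only variable was the clock.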
I’m sure you’ve already been solving caching issues in your life. Think of all the times you’ve had to ctrl+refresh a web page to clear up issues, or “just try it again, maybe it’ll work”. The next step is to start paying attention and identifying when you’re actually interacting with a cached version of something and it’s causing you trouble, either as an end user or as a data scientist. Handling the edge cases when caches misbehave, or when they get in the way of what you’re trying to do, presents fascinating challenges that can stretch your understanding of how stuff works, in a good way.
Standing offer: If you created something and would like me to review or share it w/ the data community — my mailbox and Twitter DMs are open.
About this newsletter
I’m Randy Au, Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. Counting Stuff is a weekly newsletter about the less-than-sexy aspects of data science, UX research and tech. With excursions into other fun topics.
Curated archive of evergreen posts can be found at randyau.com.
Join the Approaching Significance Discord, where data folk hang out and can talk a bit about data, and a bit about everything else.
All photos/drawings used are taken/created by Randy unless otherwise noted.
Supporting this newsletter:
This newsletter is free, share it with your friends without guilt! But if you like the content and want to send some love, here’s some options:
Tweet me - Comments and questions are always welcome, they often inspire new posts
A small one-time donation at Ko-fi - Thanks to everyone who’s sent a small donation! I read every single note!
If shirts and swag are more your style there’s some here - There’s a plane w/ dots shirt available!