Attention: As of January 2024, we have moved to counting-stuff.com. Subscribe there, not here on Substack, if you want to receive weekly posts.
PyData NYC 2023 is over! I got to give my talk, people laughed and seemed to have a good time, which is normally my goal for a talk. Oddly enough, though I've done lots of internal company presentations over the years, as well as a few short Zoom talks at virtual events, this was the first time I'd spoken to a live, in-person audience for a full 30 minutes. It was also fun to walk around and bump into people, and everyone was very friendly thanks to the event being at the intersection of the Python and data communities.
The talk itself was recorded and the staff will eventually upload it to PyData's YouTube channel. But it sounds like it might take a month or two before the recordings get edited and uploaded, by which time I will have completely forgotten many of the details of the talk I'd want to write about.
So while it's still fresh in memory, this week's post is a re-enactment of my typical “slides with lots of ad-lib cheeky commentary”. It’s not really a transcript, because probably half of what I said was made up on the spot, so this version will differ a bit from the video that’ll appear in the future.
“Solving the problems in front of you” (Talk slides)
[Note: it’s not necessary to read the slides in parallel to this.]
So, as many of you know, I currently work for Google as a Quantitative UX Researcher. It's an uncommon job title, but the tl;dr of it is that we use data science and social science tools to understand how and why users are using a product, and then take that understanding to engineering and product teams to help them make a better product. This is relevant today because if we need to collect data for research in a specific way but can't sweet talk engineering into building a system for us, we go ahead and build and maintain systems ourselves.
Prior to Google, I worked for 10 years at various NYC startups sized between 25 and 120 people, as a data analyst and engineer. So I've seen things happen across the spectrum of org sizes.
I’d like you to take a moment and think about some “Big Problems” you've worked on. Building a “single source of truth” for data. Creating a data warehouse. Building a reporting system everyone will rely on.
How would you describe how that went?
Speaking personally, I’d use the word “painful”. Other phrases that come to mind include: slow, conflicting requirements, not sure what I'm doing, too much choice. Maybe the nicest thing I can say about it is that mistakes were made, but eventually overcome.
Big projects are hard. We probably make them harder for ourselves.
There's a famous adage in computer science, “premature optimization is the root of all evil”, written by Donald Knuth in 1974. In the paragraph that line comes from, he says that we should forget about small efficiencies maybe 97% of the time, but not pass up our opportunities in the critical 3%. In the quote itself, he doesn't elaborate on how we can tell which situation we're in. (In fairness, in the rest of the paper he seems to imply we should be using what we would now call code profilers for this.)
While there are technical definitions for it, in common usage the word “optimize” implies making something better. It is something that is done “later”. We make existing code faster. We remove inefficiencies from existing code.
But I'm here saying that optimization starts at the very beginning, with design. We're going from a non-working state with infinite runtime to, magically, a working state with finite runtime. That sounds like a hell of an optimization of reality.
We can use our imaginations to envision and reject inefficient solutions. We know that picking the right algorithm or architecture can have profound effects on everything downstream. We know how bad Bubble Sort is as an algorithm. It is an act of optimization when — while we're designing things — we don't even bother trying to use Bubble Sort for anything in our code.
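To put a toy example behind that (this is my own illustration, not a slide from the talk, and the function names are made up): rejecting the hand-rolled O(n²) option at design time, before any profiler is involved, is already optimization.

```python
# A toy illustration (not from the talk): rejecting an O(n^2) bubble sort
# at design time in favor of the built-in O(n log n) sort is itself an
# act of optimization, done before any profiler gets involved.

def bubble_sort(items):
    """The design we reject without ever benchmarking it."""
    items = list(items)
    for i in range(len(items)):
        for j in range(len(items) - 1 - i):
            if items[j] > items[j + 1]:
                items[j], items[j + 1] = items[j + 1], items[j]
    return items

def sort_events(events):
    """The design we pick by default: Timsort via the built-in sorted()."""
    return sorted(events)

print(sort_events([3, 1, 2]))  # [1, 2, 3]
```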
So optimization is this weird thing where we're simultaneously warned against doing it too early, but we actually have to do it. It's not as simple as “just don't do it”.
The art of optimization is in knowing when to stop.
But that's a hard thing to do, because there are many compelling reasons to optimize up front.
We've all been burned from bad systems before. The backend system that's just five different databases in a trenchcoat. Database fields that have been “repurposed” without the name changing. A “catchall” text/JSON field that's overloaded with 8MB of data. The person who says “we use Excel for that”.
We deeply know the pain of making use of these systems and rightfully don't want to build the next system that someone will ridicule. We can, and should, do better than that.
We're supposed to stand on the shoulders of giants and learn from the experiences of other people. Every problem that we’re solving has endless numbers of blog posts, academic papers, and books on the topic that we should be reading. Working at Google, there's a non-zero chance that someone who wrote a whole textbook on the problem might be in the building and I should maybe talk to them. That is, if I could find them… and be brave enough to bother them with my hilariously basic version of the problem.
Fixing our mistakes also seems difficult, so we want to get it right the first time. We'd have to publicly admit to failure. We'd admit that much of the time and resources that went into the project had been wasted. All the downstream work that relied on our stuff was also wasted. Then we'd also have to build a fix… and the fix isn’t guaranteed to work either. Especially when we just obviously failed at it the first time. Sometimes we lose so much credibility we might not get a chance.
But perhaps most importantly, it feels really good when I anticipate problems ahead of time and have a solution in place. Past-me usually just gives present-me problems to solve. It's really rare to get something positive from the past, so I remember the few times it happens.
So as an example, imagine we are asked to build a data warehouse. The business wants that infamous “single source of truth” for analysis. We've got a production PostgreSQL database. We've got a 3rd party vendor that sends us nightly CSVs. We have a ton of server logs in an HDFS cluster, the formats vary for… “historical reasons”. There's a few people with some “very important Excel files”.
I'm sure with just that prompt, you have a ton of options and opinions spinning around your mind right now. Do we keep using Hadoop? Maybe switch to one of those cloud-based data warehouse solutions? Or maybe buy a really big database?
You're probably thinking about other concerns like how to move data around, will there be DAGs and ETLs? What performance do we need? What kind of availability is expected? Every potential solution just explodes with more questions and constraints.
What would you build?
The answer you might find on the internet, or at talks, is likely to be “it depends”. It's a great, safe answer for the internet: you get to be right without committing to anything. It's also completely useless for you, a person who needs to actually launch something.
So you do your best. Maybe you look up how people design systems and stumble upon diagrams that explain the “system design process”. You get really confused by how there are hundreds of diagrams: some are linear, some are circular, some are giant flow charts with looping branches.
Instead of figuring those diagrams out, you just try to be rational about things. Gather requirements, evaluate options, eliminate what won't work, make a prototype to prove your solution works, then build the actual thing.
Brief sidebar, but all of this can happen at an organization of any size. Being in a large enterprise just means doing the same things in hard mode. More people means more stakeholders, which brings with it more meetings, more requirements, more approvals, more bike sheds that need painting, more politics. Nothing about the process gets better from being bigger.
Even having more money and resources at an enterprise is really a curse. A startup doesn't have to consider a $10 million database vendor solution, but a giant megacorp might actually consider it. A giant enterprise might consider spending tons of engineering salary to build a solution in-house instead of using an existing solution. Either way you’re going to have to spend extra hours at least considering these ideas.
Back to designing our data warehouse. None of what you've been doing seems wrong. It's all very rational. You're considering your options, doing research to find the best fit. None of this looks like you're optimizing too early or over-designing.
But three weeks have passed…
Weekly standup is starting to get awkward…
But despite that, you're still worried about what-if scenarios. What if we go “webscale”? What if we suddenly need a lot more storage or CPU? What will future users of the system want to do with it?
You're also worried about the schema. How will you be handling join keys? Are you going to precompute any metrics? How are you going to handle schema changes?
At this point, if you stop for a moment, there are some signs that you're optimizing too much, too early. You've narrowed things down to a few options but no amount of new research seems to help you know which is better. You're accounting for situations that are increasingly unlikely to occur. The core design of the solution hasn't changed much and you're now just bolting on various extras.
Essentially, all this is starting to look a lot like a waterfall design project. Which we know is a bad model for software development.
But wait, you know of another extreme way of doing things that’s the opposite of what’s happening here — hackathons.
Fair warning, do NOT build your production data warehouse at a hackathon. What I'm saying is that we can learn a lot from hackathons.
They have fixed deadlines. The time pressure means you cannot spend all your time researching because you also need to budget time to execute. The deadline is also completely independent of the difficulty of the task. You can't cop out by saying a problem is harder and thus must need more time.
There are zero expectations. Everyone expects corners to be cut. No real consideration is given to performance, scaling, or elegant design. All of that can be left for a later time.
The code can be thrown away at the end. It's expected the quality will be poor to start with so you don't spend energy there. But more importantly, you will get to keep all the things that you've learned while writing that code. All the hidden assumptions and unforeseen requirements that you discovered will be useful to know in the future.
Finally, it's OK to fail at a hackathon. In some ways it's expected and the next version will be better. All that worry that comes with taking on a big risky project doesn’t apply.
So that's why I'm telling you to solve the problems in front of you. Hackathons force you into that mode, but you can actively choose to do this any time you want.
Time box all your exploration according to the risk. If you're launching a web site of cat pictures, a naive solution is probably fine. If you're launching a $10 billion one-of-a-kind space telescope to the far side of the moon, take more time to think and plan.
Handle just the clear requirements. Those provide your minimum specs to hit. They narrow down your choices for you. Remember that greedy algorithms are surprisingly good at searching solution spaces. And if you fail at the task, it just means you found newer requirements that you weren't aware of before.
Next, handle the obvious follow-ups. If your data warehouse gets built, you know certain teams want to do a specific analysis because they've been specifically asking about it already. Those use cases are worth doing because they're concrete. Don't entertain hypotheticals.
Design processes like functions. Don't build everything as one big interconnected monolith. If your processes have the same separation of concerns you'd aim for when writing good functions, then when a piece like the database engine doesn't work out, you can swap in something with a similar interface. A rough sketch of the idea is below.
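Here's a minimal sketch of what that could look like in Python (my own illustration, not from the talk; the `WarehouseStore` interface and `CsvDumpStore` backend are made-up names):

```python
# A hypothetical sketch of "design processes like functions": each stage
# talks to the rest of the pipeline through a small interface, so a piece
# like the storage engine can be swapped without rewriting everything else.
from typing import Iterable, Protocol


class WarehouseStore(Protocol):
    """Made-up interface: the only surface the rest of the pipeline sees."""

    def write_rows(self, table: str, rows: Iterable[dict]) -> None: ...


class CsvDumpStore:
    """Naive first version: just append rows to a local CSV file."""

    def write_rows(self, table: str, rows: Iterable[dict]) -> None:
        import csv
        rows = list(rows)
        if not rows:
            return
        with open(f"{table}.csv", "a", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=rows[0].keys())
            writer.writerows(rows)


def load_vendor_feed(store: WarehouseStore, rows: Iterable[dict]) -> None:
    """The loading step only depends on the interface, not the engine."""
    store.write_rows("vendor_feed", rows)


# If flat files stop being enough, swap in a Postgres- or warehouse-backed
# store with the same write_rows() signature; this call site doesn't change.
load_vendor_feed(CsvDumpStore(), [{"id": 1, "value": "a"}])
```

The point of the sketch is that the loading step only ever calls `write_rows()`, so replacing the naive CSV backend later is a local change rather than a rewrite of the whole pipeline.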
We should expect to rebuild things. Most writers, myself included, feel strongly that rewriting is the most important part of writing. The first draft is never good. But it forces us to put everything down so that we can make it better the second time around.
Also remember that no robust system was designed that way from the start. I like to pick on how the major cloud providers give examples of how to host a web site on their platform. We all know the bare minimum you need is just a webserver with an internet connection, but somehow every provider puts tons of extra features like CDNs, load balancers, databases, caches, and logging infrastructure into the diagrams. Those pieces got added for specific reasons as simpler designs failed. If we don't care about things like caching, we can eliminate them. What's important is understanding why something is there and whether we want it.
Finally, I’d like to remind you to have faith in your future self and others. Think of all the times you found a very clever solution to squeeze insight out of a database design that didn't obviously let you do that analysis. You can, and will, do this in the future with whatever system you build too. So don't worry about things too much.
Let things be a problem for future-you. They can handle it.
Standing offer: If you created something and would like me to review or share it w/ the data community — just email me by replying to the newsletter emails.
Guest posts: If you’re interested in writing a data-related post to show off work or share an experience, or if you need help coming up with a topic, please contact me. You don’t need any special credentials or credibility to do so.
About this newsletter
I’m Randy Au, Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. Counting Stuff is a weekly newsletter about the less-than-sexy aspects of data science, UX research and tech. With some excursions into other fun topics.
All photos/drawings used are taken/created by Randy unless otherwise credited.
randyau.com — Curated archive of evergreen posts.
Approaching Significance Discord — where data folk hang out and can talk a bit about data, and a bit about everything else. Randy moderates the discord. We keep a chill vibe.
Support the newsletter:
This newsletter is free and will continue to stay that way every Tuesday, share it with your friends without guilt! But if you like the content and want to send some love, here’s some options:
Share posts with other people
Consider a paid Substack subscription or a small one-time Ko-fi donation
Get merch! If shirts and stickers are more your style — There’s a survivorship bias shirt!
Makes me recall The Mythical Man Month, on the one hand, and the UNIX foundational lore on the other. I think of UNIX vs Windows as bookends for system design: bottom-up vs top-down. Organic vs rationalized. Democracy vs autocracy. You get the idea. The fundamental problem with solving information environments for large organizations is not software, not hardware, but brainware. Management’s explicit power to say yes seldom takes into account the implicit power of subordinates to think “yes, but, the way we do it now is an awful mashup and too much of the documentation consists of sticky notes, but at least everyone KNOWS how to use it to get their work done.” It’s the IT Lament all over again: the System would work perfectly if only we could get rid of the users.