I have a silly confession to make.
When I’m exploring a large, complicated table, I often have to refer to the output of some variation of a “select * limit 5” query to see what sort of data is lurking within the fields, or even to see what fields exist. I feel no guilt or shame in doing this, since I firmly believe that the only way to work with data is to inspect samples of it to see what’s really going on in the underlying business logic. The part I feel guilty about is that I’ll run the same sampling query multiple times in the same work session, because it’s easier to just run the query whenever I need to check something than it is to click 3 times to go back to a previous run of the exact same query.
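To be concrete, the peek I keep re-running looks something like this minimal sketch (the table name and local connection are made-up stand-ins for illustration, not any real system):

```python
# A throwaway "what's actually in this table?" peek.
# sqlite3 is just a convenient stand-in here; the same pattern works
# through any DB-API connection to a real warehouse. "orders" is a
# made-up table name.
import sqlite3

conn = sqlite3.connect("local_copy.db")  # hypothetical database file
cur = conn.execute("SELECT * FROM orders LIMIT 5")

print([col[0] for col in cur.description])  # what fields even exist?
for row in cur.fetchall():                  # what's lurking in the data?
    print(row)
```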
For some bizarre reason, I appear to feel that re-running a small query is “easier” than using a built-in “see past results” function. It’s not even a question of me being too lazy to save the query results away, because the frontend did all that work for me. Somehow, I’m more willing to move my cursor, highlight the query, and hit ctrl+enter than I am to lift my hand and click the mouse a couple of times.
The ramifications of what can only be called my own laziness are relatively minor. At worst, it spends electricity to send a message to a cluster of computers, which finds a list of recent data files and tells one worker machine to open and read the first few lines of the data. In the grand scheme of things, it’d cost a fraction of a penny in cloud computing spend, plus a tiny amount of electricity and the associated carbon dioxide emissions. It’s a rounding error and thus, unimportant. I’ve literally used more electricity keeping my screen on in the time it took to type out this introduction story.
But this pattern of behavior is just a trivial example of a much more significant one — we are constantly trading efficiency of execution (in terms of money, execution time, resource usage, and so on) for convenience. It’s easier for me to be lazy and do my work smoothly that way than it is to worry about whether my repeat querying is wasting resources somewhere in the broader cloud. The system doesn’t tell me how many resources I’m consuming, and my only feedback often comes at the end of the month when I get the cloud bill and see that I spent $xyz. I’m also likely to not even care about any additional pennies my silly habit generates (if I can even see them on the bill at all).
Thus, in my typical day of work, I have absolutely no incentive to even think about efficiency until I work with actually large-scale data that takes multiple minutes or hours to run. I’m willing to bet that this is also true for you, reader.
Being “wasteful” in this way isn’t all that bad
It’s a tired old cliché these days to say that we have more computing power (in terms of both CPU and storage) in our phones, or even our normal, non-self-driving cars, than the computers that landed humans on the moon or flew on the US Space Shuttle. Meanwhile, we all just use that computing power to make increasingly pretty graphics, doomscroll on Twitter, and otherwise entertain ourselves.
This commentary is usually couched as an allusion to “we’re so wasteful now, and programmers back then were Real Programmers”. After all, if those programmers in the ‘60s could land people on the moon with those ancient computers, imagine what they could have done with exponentially more computing power!
That whole line of reasoning is a bunch of nonsense. The convenience of modern programming languages, and the increased computing power needed to enable that convenience, made programming so much more accessible that our whole modern civilization now depends on it. Imagine if you had to hand-code assembly to post a simple web page on the internet — how many hours would it take you, if you could even do it at all? How long would it take to debug and launch a new version of any software if we had to write it in something like BASIC? If it takes me 15 minutes of concentrated thought to turn my 15-second one-off analysis query into a 5-second one, I’ve actually wasted 14 minutes and 50 seconds of my time.
The ease of access and convenience has more than paid for any real or imagined cost of “wasteful” programming practices over the years. The whole system has improved thanks to the exponential increase in raw computing capacity through Moore’s Law, as well as continuous innovation that converts that extra computing power into better developer usability. The difference between those two curves, power and convenience, is what gives us our complex technology landscape today, where it’s super inexpensive, both in terms of resources and time, to develop new software.
But efficiency gains are slowing
There was a short-ish period of recent history, somewhere roughly around 2015-2020, when lots of people were writing “Moore’s Law is Dead” type articles. What happened was that chip makers like Intel were having increasing problems getting down to the next smaller transistor size (known as the technology node in silicon semiconductor manufacturing).
People were low-key getting worked up over how Moore’s Law, the observation that the number of transistors in an integrated circuit doubles roughly every two years, which had held true since 1975, was expected to break down within the decade because we were starting to bump into constraints imposed by the laws of physics.
If you look around now in 2022, you barely hear anything about how Moore’s Law might be dead, and no one seems to be in a panic over it. This appears to be because we’ve been getting massive computing capacity gains in ways other than just raw CPU transistor counts. GPUs have become monster computing platforms in their own right. Apple, cell phone makers, and cloud platforms have all adopted, to various degrees, ARM’s RISC architecture over x86 to get efficiency gains. People are starting to talk more about “performance per watt” (a reference to the Moore’s Law-like Koomey’s law) instead of transistor counts, much like how we eventually stopped talking about CPU clock speeds once we stopped making much headway there in the mid-2000s.
Against this backdrop of “the pace of silicon improvement is slowing”, the many clever folk further up the tech stack, in chip design, hardware architecture, operating systems, and programming languages and frameworks, are “picking up the slack” to give us continued performance boosts over time. In practice, we’re still seeing impressive gains in our overall ability to compute.
Data work, being very high up in the technology stack, has always benefitted greatly from all the optimizations that happen before we even touch a device or data set. We get to play with fancy new GPUs, enjoy blazing-fast interconnects between our cloud computing clusters, and we can speed up our queries simply by choosing to store our data on SSDs instead of spinning hard disks. We even get to rely on our database vendors to write better SQL engines that run our queries faster across massive distributed datasets without us having to change the queries at all.
The combined benefits of all these layers are great enough that our preferred tools tend to be “inefficient” languages like Python and R, which prioritize ease of use for the developer over raw performance and efficiency (ref: “Energy Efficiency across Programming Languages” by Pereira, Couto, et al.). Our primary nod to optimization tends to come only when we finally write something that takes 30 minutes (or 30 hours) to run, or otherwise costs $15k to execute.
But leaner times are coming
As of this writing in August 2022, the tech sector is bracing itself for a possible economic downturn. No one really knows just how bad the next recession will be, but it seems like everyone is slowing hiring and trimming down unnecessary expenses to avoid the awkward situation of hiring too much now only to be forced to hold disruptive layoffs in the next year or so.
At the same time, with computing power gains slowing, there’s going to come a day in the far-ish future when throwing more hardware at our data simply won’t work anymore. The AI/ML community has already hit this point: they’re in a bizarre arms race to build ever-larger gajillion-parameter large language models, and at some point that whole operation hits diminishing returns and the costs start coming into the foreground.
Meanwhile, in more practical terms, bean counters might start questioning why this month’s BigQuery bill is $50k. Do we REALLY need to spend $250k maintaining that Hadoop cluster? Who’s actually using all those Tableau seats, and can we just replace a bunch of them with an automated email?
So I expect that as things get tighter, we’re all going to be brushing up on our system optimization skills a bit more. (Incidentally, I wrote a super long post about optimizing SQL 3 years ago.) Our lazy data pipelines are going to get some of the fat trimmed. And hopefully, we as data scientists will collectively build up more of the tooling necessary to do efficient work without sacrificing the convenience and usability that let us solve tricky data problems creatively.
Constraints will bring creativity
For the past 15 years, much of data science has operated in a relative land of plenty. Unless you were at the forefront of data tech, pushing at the boundaries of the state of the art, you probably operated under very few constraints. Resources were relatively plentiful, and a hardware or software upgrade could bring easy performance gains. I remember the joy of seeing my query times drop by almost half back in the 2010s, simply because the analytics server got upgraded to SSDs.
But not having constraints comes at a cost: we’re never forced to exercise our creative muscles by overcoming problems in a cost-efficient manner. It’s similar to how the limitations of musical instruments, art media, or poetry forms can actually inspire people to do much more impressive things than they would otherwise.
People are likely going to have to abandon (if they haven’t already) the “log it all and figure it out in the warehouse” methodology of collecting telemetry. They’re going to need more clever algorithms to isolate interesting signal from noise. We’re all going to need better tools that help us avoid running expensive queries BEFORE we hit the “Go” button, like the sketch below.
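As one sketch of what such a guard rail could look like: BigQuery’s API already supports “dry run” queries that estimate bytes scanned without executing anything, so a pre-flight cost check might look roughly like this (the query, table, and dollar threshold are all illustrative assumptions, not any particular team’s tooling):

```python
# A minimal pre-flight cost check: dry-run a BigQuery query to estimate
# bytes scanned before actually running it. The SQL and the $1 threshold
# below are made-up examples for illustration.
from google.cloud import bigquery

client = bigquery.Client()
sql = "SELECT * FROM `my-project.my_dataset.big_table`"  # hypothetical table

# dry_run=True asks BigQuery to plan the query without executing it
dry = client.query(
    sql,
    job_config=bigquery.QueryJobConfig(dry_run=True, use_query_cache=False),
)
tb_scanned = dry.total_bytes_processed / 1e12
est_cost = tb_scanned * 5.00  # ~$5 per TB scanned, 2022 on-demand pricing

if est_cost > 1.00:  # arbitrary "are you sure?" threshold
    print(f"Would scan {tb_scanned:.3f} TB (~${est_cost:.2f}); confirm first!")
else:
    rows = client.query(sql).result()  # cheap enough, just run it
```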
It’ll be more of a struggle, but I think it’ll also be pretty darn interesting.
Standing offer: If you created something and would like me to review or share it w/ the data community — my mailbox and Twitter DMs are open.
About this newsletter
I’m Randy Au, Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. Counting Stuff is a weekly newsletter about the less-than-sexy aspects of data science, UX research and tech. With excursions into other fun topics.
Curated archive of evergreen posts can be found at randyau.com.
Join the Approaching Significance Discord, where data folk hang out and can talk a bit about data, and a bit about everything else.
All photos/drawings used are taken/created by Randy unless otherwise noted.
Supporting this newsletter:
This newsletter is free, share it with your friends without guilt! But if you like the content and want to send some love, here’s some options:
Tweet me - Comments and questions are always welcome, they often inspire new posts
A small one-time donation at Ko-fi - Thanks to everyone who’s sent a small donation! I read every single note!
If shirts and swag are more your style there’s some here - There’s a plane w/ dots shirt available!