This week, I started out wanting to write about some of the tools I use in my day job as a quantitative UX researcher (spoilers: it's mostly SQL), but the more I worked on it, the more context I needed to add, and things kept snowballing. At the end of the day, the work dictates the tools used. And once context gets added… well, it turns into a whole different kind of post.
Background context
Since the work dictates the tools we use, I need to explain why my work is the way it is. I happen to work in the Cloud part of the megacorp, and within that huge organization of many thousands of people, I happen to work on a bunch of storage-related products. If you were pushing bytes into the cloud and used the web UI in the past few years, you've very likely touched something that I provided research for at some point.
Cloud is a great home for me because despite the ridiculous amounts of money sloshing around the sector, it still feels very much like the small 80-150 person startups where I spent the first 10 years of my career. The competition is fierce, with new entrants constantly popping up. Innovation and change are constant, as are the feature requests from users. Priorities can shift in an eyeblink thanks to market trends and strategic decisions. It's a place where my broad generalist skills, not just in data science but in infrastructure and tech in general, have lots of places to find applications.
The teams working on products in this space are, with some exceptions, looking for big, step-function improvements. If a brand new product or feature only has 500 users, we’re looking for 10x, 50x, 100x type changes in behavior where possible. A mere 10% increase is effectively a rounding error when everyone assumes that there are much bigger effects to be found out there.
The name of the game here is speed and lowering the inherent uncertainty of making decisions in a rapidly shifting market. We want tight loops of hypothesis generation from looking at data and talking to stakeholders, pulling in patterns and domain knowledge from all over the place. Then we decide what is worth following up on based on what we assume the potential payoffs are. Finally, we take a chance on experiments and look for big, giant effect sizes where advanced statistics isn't strictly necessary to know that something is happening. If things don't work out and look unpromising, we move on.
It's a pretty hectic situation where some projects can become "prioritized" in one quarter but then lose that priority standing over the course of a year as new information comes in. Teams can literally blink into and out of existence along with the changing priorities. You can either learn to love this sort of chaos, like me, or learn to avoid it.
As an aside, we do have longer-term projects that have been stable for over a year, so it’s not all in a constant state of churn, but the chaos level can bounce between “moderate” and “overwhelming” with little warning.
This stands in opposition to a lot of other industry sectors. For example, you don't get away with this sort of nonsense as an insurance actuary calculating the prices of premiums. Nor would anyone in highly mature fields like search engines and ads expect +50% gains on a launch; they already did all the low-hanging, high-impact work years ago. I'm pretty sure all those people live very different professional lives.
I should also note that I’m a decent ways into my career. That means there’s not as much of the “do this analysis, give that presentation” work and much more developing processes, convincing other teams to do things, and talking to people.
A typical day
So against this hectic background startup-y life at a megacorp, what's a typical day look like for me?
First, there are always lots of meetings. Large organizations of disparate teams trying to cooperate and move in the same direction just demands lots of overhead spent making sure everyone is in agreement about what needs to be done. This is definitely not how a 100-person startup does things. The overhead makes everything take longer. But you ought to see the utter chaos and dysfunction that happens if people try to undercut doing this work.
Depending on the specific day of the week and what my partner teams are doing, I have meetings ranging from 1hr to 5hrs out of an 8hr nominal workday. My local group is lucky in that our director has declared no-meeting Fridays for our group, which gives us a way to push back on other teams scheduling meetings that day.
Meetings tend to break down into a couple of categories:
Organizational meetings: Status updates, meeting with my peers, the occasional team all-hands. I don't have too many of these, but they probably add up to 5-8 hours a month altogether. (So maybe 5% of a 20-day x 8-hour = 160-hour work month.)
Project-centric meetings: Talking to teams, understanding what they need, presenting findings, work sessions with others. These take up a significant chunk of time and are probably the most common type. I’d estimate 4-6 hours a week is typical, so maybe 15% of a work month at the upper end.
People meetings: I’ve got a bunch of 1-on-1 meetings with various people: my manager, the lead of the qualitative research team I work with, 4 people that I’m mentoring in various ways, a couple of quant research peers… at around 2.5 hours a week, 6% of a work month. More junior people would have significantly less of this stuff.
Add all those up and maybe a quarter of my month is eaten up by meetings as a high estimate. It could be worse: if I were a manager, that number would rapidly approach 100%, since meetings are where most of the important work of organization and negotiation happens.
Still, the giant meeting number actually varies significantly in practice based on how much other teams need me (or not) for urgent projects. Every week comes with new surprises. Every so often, everyone is heads down on their own stuff and I get a week moooooostly to myself! It can happen!
The remaining time
For the sake of argument, let's say another 18% of my time, 8ish hours a week, goes to lunch, making the tea that fuels my life, snack breaks, etc. That leaves a bit over half the month's hours, roughly 90 of them.
Finally, some heads down time to do actual work that doesn’t involve talking to people!
Work itself breaks down into two giant buckets, and the specific balance between the two is completely unpredictable:
“Document” work
This has become an increasingly large chunk of my work. It includes stuff like writing documents about process, research plans, and methodology documentation.
These docs are usually used by teams to learn from stuff that I worked on. For example, I might be writing a document about analyzing a specific class of problem the team has been facing so that other teams don't have to reinvent the wheel. Maybe it's something like "How to make sure your metrics dashboard using this data warehouse calculates quickly."
Other docs are process docs like “we should be doing X, Y, Z for this research process because it helps us screw up less often!” Much of that function involves getting other teams to agree and go along (or getting an executive willing to help mandate it down). This is where I get to leverage past experiences of things going wrong to build things that won’t go as wrong as often.
I also lump “making a durable research report” in this bucket because usually I can slap together a research summary and present it to the requesting team easily, but it takes EXTRA effort to transform that raw analysis that requires a lot of voice-over into a piece of research that can be read and understood without that context.
At some point it becomes more important to make sure that other people that are referencing your work aren’t going to get tripped up by any difficulties you hit along the way. This is especially true in UX research where research findings have a pretty long shelf life. The fact that users prefer one design of purchasing flow over another can often be used in other contexts, so it’s very common to have someone else refer to past work and build off it.
Data work
Finally! Actual data work! At this point, it’s a tiny fraction of time any given month.
This work is where I actually reach into the vast seas of data files to attempt to answer the questions that teams bring to me. This is the familiar data science work. It involves examining data, cleaning it up, making sure I’m counting the correct numerators and denominators while not creating freaky duplicated rows of data.
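To make that concrete, here's a minimal sketch of the numerator/denominator care I mean, written in SQL for reasons that will become clear in a moment. The `users` and `events` tables and their columns are invented for illustration; the point is that a naive join fans out rows and quietly inflates counts.

```sql
-- Hypothetical schema: `users` has one row per user, `events` has one
-- row per user action. The LEFT JOIN can duplicate user rows, so the
-- DISTINCT counts protect both the numerator and the denominator.
SELECT
  u.signup_cohort,
  COUNT(DISTINCT e.user_id) AS users_who_uploaded,  -- numerator
  COUNT(DISTINCT u.user_id) AS users_in_cohort,     -- denominator
  SAFE_DIVIDE(COUNT(DISTINCT e.user_id),
              COUNT(DISTINCT u.user_id)) AS upload_rate
FROM users AS u
LEFT JOIN events AS e
  ON e.user_id = u.user_id
  AND e.event_type = 'upload'
GROUP BY u.signup_cohort;
```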
But here’s a little secret about my data work. I try to do almost ALL of it within SQL, and only do the last bit of visualization and manipulation in a spreadsheet or just a slide deck.
The main reason for this behavior is that I have to work on awkwardly large amounts of data. Looking over 30 days of past event logs takes a lot of time and resources without the aid of giant clusters. SQL running on things like BigQuery is actually a hyper-accessible distributed compute interface. I once ran a bunch of regex on a couple of trillion strings using SQL. It's likely possible to do this with an efficient regex parser built with something like golang. The problem is actually executing that code against the dataset. With the right SQL, I can usually shove a lot of the operations I need out onto a massive cluster of machines without having to worry about a lot of the ugly constraints inherent in writing custom code that depends on parallel execution on a cluster.
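To give a sense of what that looks like, here's a hypothetical sketch of that kind of query. The table name and regex pattern are invented, but structurally it's just an ordinary aggregation that BigQuery happily fans out across the cluster:

```sql
-- Hypothetical: classify a very large table of raw strings with a
-- regex. Nothing in the query itself hints at the scale involved.
SELECT
  REGEXP_EXTRACT(request_path, r'^/api/(v\d+)/') AS api_version,
  COUNT(*) AS requests
FROM `my_project.logs.http_requests`  -- placeholder table name
WHERE REGEXP_CONTAINS(request_path, r'^/api/')
GROUP BY api_version
ORDER BY requests DESC;
```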
Sure, if I wanted to I could use a notebook like Colab, or do a “pull dataset from db, analyze it locally in Python” flow, but all that only runs on a single machine. If I were crazy enough to try to analyze the raw data files by streaming those bytes to a single machine to analyze, I’d be waiting an eternity just from network latency. The better solution would be to farm out the hard work to the cluster of machines that reads the data in. SQL is the fastest way to achieve that goal. And if I’m already going through the trouble of writing an efficient SQL query that distributes to a cluster well, I might as well just push things and do as much work in SQL as reasonable.
Then, remember how much of my work is in the startup-y style of moving fast and looking for big effect sizes? It means I’d never have to figure out how to run a specific statistical package on a giant mass of raw data on a giant parallel system. Since you can largely get away with very basic things like t-tests and ANOVA in SQL, I’ve avoided a lot of headache.
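For the curious, here's a minimal sketch of what that can look like: Welch's two-sample t statistic computed entirely in SQL. The `experiments` table and its columns are made up for illustration.

```sql
-- Hypothetical: `experiments` has one row per user with an assigned
-- group and a metric value. Compute per-group summary stats, then
-- the Welch t statistic from them.
WITH stats AS (
  SELECT
    experiment_group,             -- 'control' or 'treatment'
    AVG(metric) AS mean_val,
    VAR_SAMP(metric) AS var_val,  -- sample variance
    COUNT(*) AS n
  FROM experiments
  GROUP BY experiment_group
)
SELECT
  t.mean_val - c.mean_val AS mean_diff,
  (t.mean_val - c.mean_val)
    / SQRT(t.var_val / t.n + c.var_val / c.n) AS t_statistic
FROM stats AS t
CROSS JOIN stats AS c
WHERE t.experiment_group = 'treatment'
  AND c.experiment_group = 'control';
```

When the effect sizes are as big as the ones we hunt for, the t value that comes back is usually so large that fussing over degrees of freedom and critical values doesn't change the conclusion.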
Once in a long while, I actually do have to drop down into a single computer to finish an analysis. For those rare instances, the bulk of the original data has already been simplified down via SQL into something that is workable in a single machine.
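That hand-off step usually looks something like the following (table and column names invented): collapse a month of raw events into one row per user per day, which is small enough to pull into a spreadsheet or a local analysis session.

```sql
-- Hypothetical: reduce billions of raw events to a per-user, per-day
-- summary before exporting it to a single machine.
SELECT
  user_id,
  DATE(event_timestamp) AS event_date,
  COUNT(*) AS events,
  COUNTIF(event_type = 'upload') AS uploads,
  SUM(bytes_transferred) AS total_bytes
FROM `my_project.logs.storage_events`  -- placeholder table name
WHERE event_timestamp >=
      TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY user_id, event_date;
```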
I think this is probably the main reason to learn SQL nowadays. The fact that its use cases have expanded over the years to the point where it's become the de facto interface for large datasets is pretty impressive. Very few people now use SQL in its more traditional "pull stuff out of a small relational database" role.