Attention: As of January 2024, We have moved to counting-stuff.com. Subscribe there, not here on Substack, if you want to receive weekly posts.
Heads up! Quant UX Con 2022 is June 8-9 and (free) registration closes 10 days before the conference, making this week the last week you can sign up to attend virtually. The talk list is available to browse now, so if you’re interested in quantitative UX research, please join us!
Over the years, I've given, and taken, a large number of SQL related interviews. I now think that, outside of very specialized use cases that involve needing to work with very complex SQL queries from day 1, that the era of the SQL test needs to go away for something better.
Instead, for all but the most specialized roles that strictly need SQL (like a DBA), I’d like to replace it with some sort of “working with relational data” instead. I have a couple of reasons for this
The basic, most used parts of SQL is very easy to learn quickly
There are now other tools that are are SQL-like enough with relational concepts that you do similar operations without ever touching a relational database
The dialects for SQL are so different anyways testing can’t go too far
Basic SQL is very easy to learn
With some help and guidance.
Very early in my career in the mid2000s, I joined an ad-tech company as an analyst. On my first day, one of the senior analysts walked up and said “Hey, Randy. You’re good with Excel right? Well, come here and type what I tell you to type — select bid from clicks where date > ‘2022-05-01’. It’s just like excel formulas, vlookup, and pivot tables.”
That was the first time I had ever touched a database. From there, I learned on a live reporting database how to pull data quickly. I also learned to optimize my queries because if I wrote something very slow, it’d lock up reporting and there’ll be angry shouts from the executive offices 25 feet away about how things were broken. Toxic work environment aside, I got very good at writing basic queries in a couple of days, and writing fast queries within a few weeks.
At the time, the hardest thing to learn was making the connection of how the JOIN ON conditions worked, and how it was reminiscent of a complex VLOOKUP setups. I also had to learn a bit about using indexes because MySQL of the era required constant hand tuning of index selection.
It seems that so long as people have some experience working with tables of data, whether in Excel, Python, or R, AND they have someone looking over their shoulder and giving advice for a couple of hours, they can learn SQL on the job. A lot of the difficulty of learning SQL on your own has more to do with needing to have access to a functional database with real data inside — take away that hurdle and most people can do it.
There are other tools that teach SQL-like semantics
Last week, I saw a tweet scroll by that I can’t find right now where someone said that pandas and tidyverse/dplyr are actually equivalents to SQL — they’re all tools for manipulating data frames. At a high level, I agree with this sentiment.
Historically, it seems that pandas and dplyr based much of their syntax and overall design on SQL since it was the more established technology. That’s where we get commands like group_by()
and aggregate()
. There’s also facilities for filtering down dataframes based on conditions, joining multiple dataframes together based on keys, etc.
So the SQL-like concepts are baked into the frameworks from the start. Any power user of the frameworks would be indirectly absorbing the concepts needed to understand how to use SQL Only syntax really stands in the way of jumping to SQL, and syntax isn’t particularly difficult, just quirky.
The dialects of SQL are too different to make great tests
I’ve told this story multiple times, but once I interviewed at a place that used PostgresSQL, and I had primarily been using MySQL 5.4 in production. During the SQL test portion of the interview, someone gave me a problem that they intended me to solve using window functions.
The problem was that MySQL 5.4 does not have window functions, they were introduced in MySQL 8. So I went about solving the problem using an extremely arcane way with complex subqueries to generate row numbers so that I could join them back in to simulate a “1st row” concept. The interviewer was utterly horrified at the massive query spawning onto the whiteboard. When we were discussing things, I did mention that MySQL didn’t have window functions at all so I wasn’t familiar with them.
So, for that and other reasons, I didn’t get an offer from that company.
But even if we fast forward to 2022 where most major databases, even MySQL, have window functions now, there’s still a huge differences between dialects. Adherence to the ANSI SQI standards is uneven at best, and almost every vendor has a disagreement over how various date/time functions are named or work. Lots of little idioms don’t directly translate. For example, I always have to look up the specific notation for denoting intervals (e.g. now() - interval 1 day
) because things are always very subtly different.
So unless you absolutely require experience with a particular dialect of SQL, you inevitably need to constrain the sorts of questions you can ask or risk rejecting people who just happen to be masters of a completely different dialect. For example, it’s very difficult to ask a query optimization question that isn’t completely trivial, while also avoiding deep implementation details of particular databases.
Trying too hard to equate “knows the SQL dialect we use” with “can do data analysis that touches SQL” is going to wind up trapping you with a smaller field of candidates. It’s made even worse if you happen to use more esoteric databases like Oracle that are well represented in some situations but aren’t nearly as popular in the broader data community.
If not SQL, then what?
The skill that I think we’re actually looking for when testing with SQL is “the ability to work with relational data”. We want to see data sets joined together, rows filtered and aggregated. We’d like to see some understanding about designing tables and normalizing data to reduce (unnecessary) duplication of information. Maybe if you need to check for deeper understanding, you’d like to see more complex things like self joins, joins using inequalities.
The SQL language is pretty much orthogonal to most of these concerns.
So I think we should just have candidates manipulate and design data sets in whatever way is most comfortable — even if it is via spreadsheet. With just a handful of example rows of some CSV files, you can ask people to load data into their favorite tool (or assume it is loaded into a database if they’re familiar with SQL) and just show how they would go about answering various questions about those tables. They’ll be able to show that they have an understanding of relational data just by doing the joining, filtering, aggregating operations needed to complete the task.
If you find a candidate that can do the relevant operations you need, then it should be pretty easy to teach them the SQL they’ll be using on the job.
Standing offer: If you created something and would like me to review or share it w/ the data community — my mailbox and Twitter DMs are open.
About this newsletter
I’m Randy Au, Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. Counting Stuff is a weekly newsletter about the less-than-sexy aspects of data science, UX research and tech. With excursions into other fun topics.
Curated archive of evergreen posts can be found at randyau.com.
Join the Approaching Significance Discord, where data folk hang out and can talk a bit about data, and a bit about everything else.
All photos/drawings used are taken/created by Randy unless otherwise noted.
Supporting this newsletter:
This newsletter is free, share it with your friends without guilt! But if you like the content and want to send some love, here’s some options:
Tweet me - Comments and questions are always welcome, they often inspire new posts
A small one-time donation at Ko-fi - Thanks to everyone who’s sent a small donation! I read every single note!
If shirts and swag are more your style there’s some here - There’s a plane w/ dots shirt available!
I stopped interviewing for SQL skill about 15 years ago. It is an easy skill to learn -- at least for the most useful topics. If one were to take a course in SQL it would be a ~1 week course full time (another couple weeks to include the basics of RDBMS). If i had a candidate with all the other qualifications (smarts, edu, domain knowledge, passion, curiosity, other skills, etc) but did not know SQL I would still hire her. The rationale being that I could send her to take the SQL class and in one week have the complete package.
> It’s made even worse if you happen to use more esoteric databases like Oracle that are well represented in some situations but aren’t nearly as popular in the broader data community.
Yikes, I have some bad news for you about what big companies in big industries use overwhelmingly. Startups use Postgres or NoSQL (or fancier, trendy stuff), enterprises use Oracle or SQL Server. Yes, we've all seen exceptions to that, on both sides, but in general it's still true: the core transactional systems that run the mission-critical applications at large enterprises almost invariably use Oracle, or SQL Server... or they never got off their mainframe.
Benn Stancil made an excellent post a few weeks ago about an analogous situation, about how people making cutting-edge analytics tools don't just underrate Microsoft, they tend to consider it laughable that they should learn anything from PowerBI or think about integration with the existing tools that dominate at enterprises - and fail to recognize just how big the install base is, and how ingrained market leaders like that are. Maybe I'm overreacting (because it's a sidenote on a larger point about interviewing that I agree with), but I think I'm sensing the same sort of thinking here, and it jumped out at me.
https://benn.substack.com/p/case-for-consolidation