The kid brought home some form of non-COVID plague over the weekend and the adults of the family are flat on their backs. So this week’s post is a bit lighter than normal and maybe half of it is inspired by fever dreams…
One thing that I always found amusing about my career was that as the data analyst at various companies over the years, I’d inevitably wind up knowing the databases better than anyone else at the company. This includes the developers who would be making the software that reads and writes to the database, as well as the devops and administrator folk who handled the actual running and backups of the databases. Obviously, it’s that way because I was directly querying database tables all day every day, and would have large chunks of the tables memorized after a couple of months of work.
Later, I would find out that it wasn’t just my direct querying that played a part in this expertise — it was also because I had no abstraction layer between the database and myself. Developers very often use ORM’s, Object Relational Mappers, to allow their object-oriented programming objects to be able to interact with a database. These ORM tools also somewhat translate the declarative syntax of SQL into OO functions that can be more familiar to developers. (Incidentally R’s dplyr fills a similar niche, allowing R code to interface with a database “naturally”)
#An example of the SQLAlchemy ORM specifying a select statement with a join
>>> stmt = (
select(Address)
.join(Address.user)
.where(User.name == "sandy")
.where(Address.email_address == "sandy@sqlalchemy.org")
)
>>> sandy_address = session.scalars(stmt).one()
As people who write SQL as naturally as breathing, the ORM syntax seems downright alien since it’s not immediately clear why someone can just endlessly chain function calls for all the clauses.
In my experience, ORMs are largely fine for simple tasks, like fetching a handful of rows, updating a small number of fields. They’re also very useful in a programming context because you don’t have to specifically write the SQL, send the query to a DB cursor, and then do the work of fetching the rows out.
But I’ve seen ORM also cause a fair share of interesting problems.
Probably the most frustrating problem is the SQL that is generated can sometimes be very peculiar. Now, you’d think that the internal query optimizers in the database can largely work around oddly-written SQL and they can manage to pull off some miracles, but even those optimizers can fail if pushed too hard. I’ve actually had to help debug production issues where generated SQL was slowing a site to a crawl because the generated SQL somehow confused the optimizer enough that it decided to table scan instead of using an existing index. Eventually we had to put in an index specifically for that query just so the site could function.
The other problem I’ve found is that they can sometimes promote bad programming habits. Since they obfuscate the database querying process, it can lead developers to doing inefficient things. For example, instead of pulling all the fields they need in a single query, a developer might decide to pull things in as needed, which can result in multiple queries being sent. This of course incurs the overhead of executing a query multiple times. Do it enough times and it can be an issue.
ORMs were supposed to represent a way for developers to hide the frankly quite messy SQL layer involved in database work. What’s sad is that they sorta underdeliver on the promise quite a ways. Nowadays, there’s even more indirect methods of interfacing with a database via adding a GraphQL layer between the databases and other APIs.
Either way, all these abstraction methods are largely helpful to developers and turning them into a more familiar syntax. What would be interesting is if the analytics space, which STARTs with SQL, to try to do a similar thing for analytics. That is, bring more analytical functionality into the database via SQL-esque extensions. We techncially do have access to tools like Stored Procedures and User Defined Functions in many database systems, but they’re oddly quirky and aren’t used very often because you can’t version control them and they act like black boxes. The syntax is also very different between vendors, which makes it a pain to learn.
But since databases are moving towards having storage and compute being decoupled and distributed among huge clusters, its significantly easier now than ever before to put more computational load onto an analytical database. It used to be you didn’t want to do that since the DB was a shared resource running on one big machine, but that’s not the case now. I think it feels natural to want to distribute certain operations closer to where the data is being read and transform it before it even has to go over a network.
Even in the ancient days of manually writing MapReduce jobs by hand and all the inconvenience of not having an easy way to express unequal joins, it was still really powerful to offload a massively parallelizable operation like a simple grep over 50TB of data onto a whole cluster via the Hadoop ecosystem. If somehow we could leverage that infrastructure for embarrassingly parallel operations in the same way via the SQL toolpath to keep the network usage down, I think we’d be better off.
I don’t know if anyone’s actively working on this problem space, but a guy can dream.
Standing offer: If you created something and would like me to review or share it w/ the data community — my mailbox and Twitter DMs are open.
New thing: I’m also considering occasionally hosting guests posts written by other people. If you’re interested in writing something a data-related post to either show off work, share an experience, or need help coming up with a topic, please contact me.
About this newsletter
I’m Randy Au, Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. Counting Stuff is a weekly newsletter about the less-than-sexy aspects of data science, UX research and tech. With some excursions into other fun topics.
All photos/drawings used are taken/created by Randy unless otherwise credited.
randyau.com — Curated archive of evergreen posts.
Approaching Significance Discord —where data folk hang out and can talk a bit about data, and a bit about everything else. Randy moderates the discord.
Support the newsletter:
This newsletter is free and will continue to stay that way, share it with your friends without guilt! But if you like the content and want to send some love, here’s some options:
Share posts with other people
Consider a paid Substack subscription or a small one-time Ko-fi donation
Tweet me with comments and questions
Get merch! If shirts and stickers are more your style — There’s a survivorship bias shirt!
Some brief musings on SQL abstrations
I'd be interested in collaboration on a post