We all have our ways of picking up data to play with

And it's all uniquely niche in weird ways

Nov 17, 2020

A clementine wearing the top of an orange juice drink as a crown because we’re 170 years into November 2020 already.

A week ago I tweeted this:

As an artifact of how I learned and trained to be an analyst primarily via SQL+Excel... I still can't break the habit of doing initial data exploration/analysis by using eyeballs on large chunks of raw table output... significantly harder to do with programmatic tools

I don't quite remember the exact context, but I think I was pondering workflows and how I have a very specific one grounded in my days as an analyst, one that relies heavily upon SQL and spreadsheets.

Effectively, I often “pick up and inspect” the data manually, as if I were picking up a rock to examine. Despite the work mostly being done via keyboard and mouse, through sorts, pivot tables, quick charts, and SQL queries, it very much feels like I’m picking up a database table by hand and inspecting it. The decade+ of muscle memory probably adds to this feeling since it’s very easy to do things quickly.

Obviously, everyone does this in their own unique little way, and everyone eventually winds up in similar places (a complete data analysis), so the specifics of how we got there probably don’t matter too much in the grand scheme of things.

But there is one method that I have always wished I was better at. It’s the exploratory workflow where the data is loaded directly into something like R, pandas, SPSS, etc. and the user does the initial examination and exploration of the data set from there. It’s effectively the polar opposite of my own style.

To me it feels the most alien, and frankly it disturbs me at a deep level. But after thinking about the situations where that works, I think I’ve started to see some patterns.

For argument’s sake, let’s split the universe

For the purposes of this discussion, I’m going to split the universe into 2 camps. This is obviously not true. At the minimum it’s a linear spectrum, and most likely it’s a multi-dimensional space. But I don’t want to get bogged down in edge cases, so I’m going to break things into two camps.

I’m also going to caricature both sides to highlight their differences. I don’t think anyone would do exactly what I’m writing here to such an extreme, so I don’t exactly recommend just blindly imitating the ideas presented.

Camp one: The select * from data cluster

TJ Murphy @teej_m

@Randy_Au “How do you EDA?” “SELECT * LIMIT 500”

This is what I do, as do, apparently, a bunch of people. Everything starts with looking at the raw data. Eyeballs on the maximally raw data, whether from a SELECT * FROM table, common Unix tools like head, tail, grep, etc., or popping the file into a spreadsheet to see everything.
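For anyone who hasn’t worked this way, that first look might be sketched in Python roughly like this. It’s only an illustration: the `events` table, its columns and values, and the `peek` helper are all made up for the example, and an in-memory SQLite database stands in for whatever database you actually have.

```python
import sqlite3

def peek(conn, table, limit=500):
    """Return the column names and the first `limit` raw rows of `table`."""
    # Table names can't be bound as query parameters, so quote the identifier by hand.
    quoted = '"' + table.replace('"', '""') + '"'
    cur = conn.execute(f"SELECT * FROM {quoted} LIMIT ?", (limit,))
    columns = [d[0] for d in cur.description]
    return columns, cur.fetchall()

# Hypothetical data, standing in for a table you've just been handed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, country TEXT, amount TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(1, "US", "10.50"), (2, None, "7"), (3, "us ", "N/A")],
)

columns, rows = peek(conn, "events")
print(columns)
for row in rows:
    print(row)  # eyeballs go here: the NULL, the "us ", the "N/A"
```

Nothing here is clever, which is sort of the point. The code only dumps rows, and the actual checking happens in the analyst’s head.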

Very often, people who develop these habits have been burned before by sketchy data. NULLs in unexpected places, weird formatting issues, screwy text encodings, etc. I don’t have any stats on it, but I suspect that many folk in this camp come from a data engineering/analytics background.

Since I’m bad at naming things, I’m going to call this the Select* Camp.

Lots of people on my Twitter feed, to my surprise, actually use these methods. It surprised me because it’s such a squishy, messy way of doing things. You go in pretty much blind, stare at tables of numbers and strings, and move from there. It depends entirely on the skill of the analyst to know how to identify issues.

Camp two: The straight into Analytics tooling cluster

Since I’m not part of this camp, I can only speak to what it APPEARS to be from the outside. So if anyone would like to set me straight, email, comment, or otherwise contact me and let me know!

But these folk experience data first through analytics tooling: R, pandas, SPSS, or a SaaS Business Intelligence/Analytics platform of some sort. These people will of course look at the raw data at some point if something goes wrong, but descriptive statistics and plots are more likely to come first.

Essentially, if you were to follow a textbook for Exploratory Data Analysis, you would more closely go down this route. Like, I seriously can’t find any examples of EDA that don’t sound like this… they ALL effectively dive immediately into “Univariate non-graphical EDA” which is big words for descriptive stats/analysis of single variables. Maybe someone out there has an example they can point me to.
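As a rough sketch of what “univariate non-graphical EDA” boils down to, here’s a summary of a single numeric column. The `describe` helper and the session-length numbers are hypothetical, and I’m using only Python’s standard `statistics` module rather than pandas or R so the example runs anywhere.

```python
import statistics

def describe(values):
    """Basic univariate summary of a list of numbers."""
    # quantiles(n=4) returns the three quartile cut points.
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }

# Hypothetical column of session lengths in minutes.
session_minutes = [3, 5, 4, 6, 5, 4, 120, 5, 6, 4]
summary = describe(session_minutes)
for name, value in summary.items():
    print(f"{name}: {value}")
```

In this made-up column the single 120 drags the mean well above the median, which is the kind of signal this style of EDA surfaces without anyone reading individual rows.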

I’m not sure why it is that EDA is described this way. It seems to assume a base quality level of the data. The whole concept seems to trace back to John Tukey’s Exploratory Data Analysis (1977), so maybe they followed the pattern set in that book. I haven’t read it so I’m not sure. It also could be because it’s very hard to write formalized methodology descriptions about “look at your data like it’s out to get you”. I’m not sure.

Again, I don’t quite have proof or stats on this, but I suspect that this camp might have a bit more academics in it.

I’m just going to call this group the EDA-Classic Camp.

Familiarity with the data set seems to be the difference

Thinking about the two camps a bit, I think the big determining factor between the two is familiarity with the underlying data.

The Select* Camp are more likely to be handed data sets that they’ve never seen before. Once you get handed a bunch of sketchy log files that were supposed to be clean but actually weren’t, you very quickly learn to be very skeptical of every piece of data that comes in.

Meanwhile, EDA-Classic, in a more academic setting, very likely collected the data themselves. Even if they didn’t collect the data, they are often familiar with how it was collected. While even those academic data sets can be pretty dirty and require some cleaning, it probably isn’t a pile of irrelevant and noisy rows.

If I had put in the work of specifying all the data collection, or personally writing the data generation code, I would definitely be more confident that there aren’t nasty surprises. In which case, I, too, would feel more comfortable diving into the data immediately with descriptive stats and charts. Because the hardest part of the data prep work, getting consistency, has already been done.

Both sides of course have blind spots

Nothing is free.

Manual inspection of raw data depends on chancing into the weirdness by random sampling. You also have to hone that skill because there’s no such thing as a standard checklist of issues to look out for. Everyone essentially checks for all the issues that have hurt them in the past. It’s also somewhat dubious how much data you can actually process by eye.
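To put a rough number on “chancing into the weirdness”, here’s a back-of-the-envelope sketch. The `chance_of_spotting` function and the figures plugged into it are hypothetical, and it assumes the rows you look at are an independent, uniform random sample of the table.

```python
def chance_of_spotting(bad_fraction, rows_viewed):
    """Probability that a random sample of rows contains at least one bad row.

    Assumes rows are drawn independently and uniformly, which is close
    enough when the table is much larger than the sample.
    """
    return 1 - (1 - bad_fraction) ** rows_viewed

# Hypothetical numbers: 1 in 1,000 rows is malformed, and you read 500 rows.
p = chance_of_spotting(0.001, 500)
print(f"{p:.0%}")
```

Under those assumptions the chance of even seeing one bad row comes out a bit under 40%, and that’s before asking whether you’d recognize it when it scrolls past. A plain LIMIT 500 also isn’t a random sample, since it usually returns whatever rows the database reaches first, so the real odds can be better or worse than this.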

Plus, I personally notice that I don’t do all the plots and analysis that might be prudent to try when checking a data set out. It’s extremely fast to make a handful of charts with a spreadsheet, but it’s impossible to scale that out to the dozens or hundreds of charts that stats packages can produce.

Meanwhile, if the data fails to import directly into your analytics software, you’re going to have to figure out why and correct it. That itself would very likely require manual inspection of the data. But of course, if unexpected data shows up in a dataset you’re presumably familiar with… just what else has gone wrong? It raises the very scary possibility that something went wrong in the data collection.

Our methods are shaped by our experiences, they’re not universal

Since I firmly believe that people aren’t deliberately adopting suboptimal methods, the only real explanation is that the methods we pick worked for the situations we found ourselves in.

But this spectrum of methods also highlights something important. Our methods and skills don’t perfectly generalize. People likely adopted their unique way of working because it, well, works. Presumably, other methods might have hit upon issues and snags, and this workflow evolved over time. These habitual methods might not be “the absolute best solution” for a problem, but provide a reasonable guess as to what works quite well in a given situation.

A very common conceit of the data analysis world is that the methods are largely universal. Algorithms like linear regression, random forests, and t-tests are robust and can work almost anywhere you choose to apply them. Similarly, if everything is just a table of numbers and strings, then any analyst should be able to work with them.

Obviously reality doesn’t work this way. We need to be very careful when shifting to an unfamiliar domain because blindly plugging and chugging a formula leads to disaster. That pesky caveat about “subject matter expertise” being critical for analysis applies even to the most fundamental “how do I look at the data to ascertain that it’s even usable” part of the job.

About this newsletter

I’m Randy Au, currently a quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. The Counting Stuff newsletter is a weekly data/tech blog about the less-than-sexy aspects of data science, UX research and tech. With occasional excursions into other fun topics.

Comments and questions are always welcome, they often give me inspiration for new posts. Tweet me. Always feel free to share these free newsletter posts with others.

All photos/drawings used are taken/created by Randy unless otherwise noted.