Being a broad-spectrum data scientist

Is something of a niche. Heh.

Apr 06, 2021

Ways to support the newsletter are here!

I’ve debated for a long time whether I would ever take my substack into a paid subscription setup like many other substacks out there. The answer, for at least the next couple of years, is going to be Nope. This newsletter will continue to be free to all.

Between day job, family, a rampaging tornado of a toddler, and all my other projects, I don’t have bandwidth available to write more than one long form post a week. It’s also very hard for me to tier my writing into paywalled vs free since I write about whatever comes to mind.

So instead, I’ve set up some alternative, non-required ways to support this newsletter. There’s a Ko-fi tip jar available. There’s even a subscription option if you’d prefer. If someone takes up a subscription I’ll figure out something nice for those folk.

I’ve also started to list items in the Counting Stuff Ko-fi shop. First up are prints of my best photos taken over the years. Those will drop ship straight from my professional print lab, which even has canvas options available. Stickers will come next.

In the future, there’ll be other things that I manage to produce. Expect some fun cut gemstones to start. Since building the shop is a side project of a side project, new items will trickle in over time.

A macro shot of learning gem #4 finished this weekend, a custom designed Legend of Zelda rupee in synthetic spinel. Macro magnification and harsh lighting are cruel mistresses for minor scratches.

As usual for the bird hellsite, a bunch of data science memeing has been going on.

Randy Au 🙃 @Randy_Au

Look who's just giving away my entire career strategy in a single tweet...

Demetri @PhDemetri

There is some serious money to be made just cleaning data. If you're interested in data science, I'm rooting for you, but you could become very very useful to just about everybody by diving into data no one wants to touch.

Demetri here shone a bit of a light on my little niche of the data science space. While the flashy Machine Learning-powered aspects of data science tend to get the most attention, there also exists a lot of less flashy data work that’s highly important, and thus highly compensated.

Everywhere in the world, there’s data. A huge amount is currently collected and held by people who don’t have the skills to put it into use. It’s stored in various software systems, spreadsheets, stacks of paper files, and sometimes in human brains. It’s stuff that those people use in their everyday work, and there’s a lot of money to be made helping those people put it to use.

Sometimes the problems are considered “uninteresting”, because they deal with tedious topics like handling inventory. Other times, the people who are collecting and using the data haven’t sat down to think about how it could be used for other interesting things. They might not even care, since it might not make their lives easier.

This is an opportunity for people who love working with data and want to explore a high breadth approach to their career, as opposed to going deep into a given field.

Personally, as noted in the tweet, I’ve found that I thrive in environments where I apply my skills broadly, as opposed to going deep into specific regions. I largely attribute this to quirks in my work habits.

What’s high-breadth data science?

It means applying data skills to lots and lots and lots of different fields. I’ve worn lots of hats over my career, and pretty much every single person in a company had access to some data that I could use to help them, or could be helped with data that was being collected by a third party. I’d then do the legwork to help people out with very basic data science methods.

As I’ve written about many times, I strongly disapprove of people jumping into fields they have no experience in and just declaring that they’re going to help with “algorithms”, so I’m most definitely not advocating for that approach. Instead, what I’m advocating for is taking the time to listen and understand what people are trying to accomplish, how their processes work, and then use basic data science methods to amplify those efforts.

Similarly, the consultant style of data science work, where highly paid teams are dropped in for limited engagements, is not (typically) what I’m talking about here either. The nature of expensive consulting work is that they also operate under certain domains of expertise. That domain is likely broader in scope than you’d find for a typical individual, but even within the consulting teams, there’s still quite a bit of specialization as they build out their toolbox and clients.

The work of a broad data scientist tends towards simple stuff — talk to stakeholders, understand the problem, take data, clean it up, summary stats and simple analysis, answer some questions, iterate. The subject matter experts I partner with can take that information alone and run pretty damn far with it. To take things to the next level, you can toss in some automation and reporting, maybe even a dashboard or a processing pipeline.

The majority of the work is wrangling the data into a usable format and developing a solid research question based on what you know about the problem space. It’s stuff that just about any fresh graduate from a data science bootcamp would be able to do at a technical level (assuming they have the skills to handle the non-technical challenges of the work).
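To make that clean-up-then-summarize loop concrete, here’s a minimal sketch using only Python’s standard library. The rows, field names, and cleaning rules are all made up for illustration; real data will have its own quirks, and you’d want to ask the data’s owner before silently dropping anything.

```python
import statistics

# Hypothetical raw rows, as they might come out of a hand-maintained spreadsheet.
RAW_ROWS = [
    {"sku": " A-100 ", "units": "12"},
    {"sku": "a-100", "units": "8"},
    {"sku": "B-200", "units": "n/a"},
    {"sku": "B-200", "units": "30"},
    {"sku": "", "units": "5"},
]


def clean(rows):
    """Normalize SKUs and drop rows whose fields cannot be parsed."""
    cleaned = []
    for row in rows:
        sku = row["sku"].strip().upper()
        try:
            units = int(row["units"])
        except ValueError:
            continue  # unparseable count; in real work, log it and ask the owner
        if sku:  # skip rows with a blank SKU
            cleaned.append({"sku": sku, "units": units})
    return cleaned


def summarize(rows):
    """Return per-SKU row count, total units, and mean units."""
    by_sku = {}
    for row in rows:
        by_sku.setdefault(row["sku"], []).append(row["units"])
    return {
        sku: {"n": len(values), "total": sum(values), "mean": statistics.mean(values)}
        for sku, values in by_sku.items()
    }


summary = summarize(clean(RAW_ROWS))
```

Nothing fancy is happening here, and that’s the point: most of the lines go to deciding what counts as a valid row, and the “analysis” is a count, a sum, and a mean.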

It’s often not possible to use complicated methods because they require increasingly deep domain knowledge to make sure things work correctly. That knowledge touches everything from how to collect and clean the data, to identifying and correcting issues and biases, to interpreting and using results. Sure, you could just throw black box magical AI fairy dust at the problem and see what happens, but it’s usually not a good idea.

But this work is very rarely done because, just like going super deep into a topic, going super wide is very difficult too.

Being broad means you deal with everyone’s weird data quirks

The hardest part about working with lots of fields is… working with lots of fields. Each one has its own language, expectations, conventions and norms. It takes a lot of effort to understand where people come from. The worst part is that if you don’t make that effort to understand, at best you’re going to fail to create a useful solution, and at worst you’ll burn social capital and damage your reputation.

It’s easy to grow frustrated when an inventory manager tells you they’re using some wonky 4-4-5 calendar for what feel like unfathomable reasons, or the warehouse only sends data in a sloppy Excel file twice a day (unless they somehow randomly don’t). But you’ve no choice but to work with what’s available.
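For anyone who hasn’t run into one, a 4-4-5 calendar splits the fiscal year into four 13-week quarters, and each quarter into “months” of 4, 4, and 5 weeks. Here’s a rough sketch of mapping a date onto one. The function name and the fiscal year start date are my own assumptions for illustration, and it deliberately ignores the occasional 53rd week that real calendars have to handle.

```python
from datetime import date

# Weeks per period within one 13-week quarter of a 4-4-5 calendar.
WEEKS_PER_PERIOD = (4, 4, 5)


def fiscal_position(day, fiscal_year_start):
    """Return (quarter, period, week) for `day`, all 1-indexed.

    Assumes a plain 52-week fiscal year beginning on `fiscal_year_start`.
    The 53-week years that real 4-4-5 calendars need are not handled.
    """
    days_in = (day - fiscal_year_start).days
    if not 0 <= days_in < 52 * 7:
        raise ValueError("date falls outside this 52-week fiscal year")
    week_index = days_in // 7  # 0..51
    quarter_index, week_in_quarter = divmod(week_index, 13)
    period_in_quarter = 0
    for length in WEEKS_PER_PERIOD:
        if week_in_quarter < length:
            break
        week_in_quarter -= length
        period_in_quarter += 1
    period = quarter_index * 3 + period_in_quarter + 1  # 1..12
    return quarter_index + 1, period, week_index + 1


# Hypothetical fiscal year starting on Sunday, January 31, 2021.
FISCAL_START = date(2021, 1, 31)
position = fiscal_position(date(2021, 2, 28), FISCAL_START)
```

The payoff of the weirdness is that every period has a whole number of weeks, so week-over-week and period-over-period comparisons line up in a way that calendar months never do.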

You also have to learn how and why different fields collect data. One team uses a unique piece of software from a vendor, and the only output mechanism is PDF. Another team might rely on data that comes from hand-entered forms or receipt scans. Others might do everything in a single spreadsheet from hell. Maybe things really are just that nuts, like how atomic clocks and leap seconds work, and everything has to approximate around it.

As evidenced by the things I write about in this newsletter, I tend to find a lot of things fascinating, so absorbing all the bits and pieces of those other areas isn’t much of a burden.

Thankfully, one of the nice things about breadth is that once you start snowballing fields, it gets easier to use that knowledge to draw parallels to other fields. It’s just like how the learning curve between Java and Python isn’t too painful compared to learning either from no programming background whatsoever. While you always have to be wary of over-generalizing knowledge, many things can transfer and things get easier with time.

Focus on the lowest hanging fruit

As a non-expert being dropped into a different field, you can’t expect to solve everyone’s problems with a few strokes of the keyboard. If that were possible, someone else would’ve done that already. Instead, your main goal is to identify where you can have big wins, without getting lost in minutiae that you don’t fully understand.

My strategy for identifying those is to ask people what they spend the most time doing during the day. Since data science is often about making better decisions more quickly, whether it’s through automation or decision support, there’s usually something that can be made more efficient with the help of data.

Speeding up decision making is very often a case of understanding what information people need, and giving it to them in a better way. Dashboards, often maligned for being rarely used and over-requested, can actually be very useful in some of these situations. Other times, giving people context like “similar requests came up X times this week, this is normal but this one here isn’t normal” can lighten cognitive load.
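That “is this normal?” context can come from something very simple. Below is a sketch that compares this week’s count against past weekly counts using a plain z-score rule. The function name, the threshold of 2 standard deviations, and the sample numbers are all assumptions I picked for illustration, not a recommendation for any particular dataset.

```python
import statistics


def describe_count(current, history, threshold=2.0):
    """Label `current` as "normal" or "unusual" relative to past weekly counts.

    `history` needs at least two values so a standard deviation exists.
    """
    mean = statistics.mean(history)
    spread = statistics.stdev(history)
    if spread == 0:
        # Every past week was identical, so any difference stands out.
        return "normal" if current == mean else "unusual"
    z_score = (current - mean) / spread
    return "unusual" if abs(z_score) > threshold else "normal"


# Hypothetical counts of similar requests over the past seven weeks.
past_weeks = [10, 12, 11, 9, 13, 10, 12]
quiet_week = describe_count(12, past_weeks)
spike_week = describe_count(25, past_weeks)
```

A rule this crude will miss seasonality and trends, but it’s often enough to tell someone whether the thing in front of them deserves a second look.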

When you can’t do it alone, bridge

Very often, even those low hanging fruit problems aren’t completely solvable by a lone data scientist. Charts won’t change the world on their own, but they can change how people think, and those people then go on to change the world.

For example, imagine you had a subscription service where every new subscription was checked over by a human to reduce spammy/scammy folk from joining a forum. You’d love to block those out quickly, but you also don’t want a high false positive rate because it’d hurt the service and anger a lot of paying customers.

There are definitely tech and data science methods you can use for this problem. This is old hat: fraud detection has a long history, is fairly well understood, and there are often off-the-shelf parts. But no one’s implemented those systems for this team yet because it just wasn’t deemed high enough priority to throw engineering effort behind it. Then you come around and find that it’s a problem after looking at what’s happening and talking to folk.

You instantly realize that you can’t just implement a spam control system yourself — it’s too big of an engineering task with lots of integration into existing systems. But you do know what a spam control system would look like, and your technical skills let you sketch out how a system might wire into existing infrastructure. You also have the data skills to size up the problem, and even do an analysis to show potential impact. You also have the visibility and credibility amongst all the teams to propose this project, and hopefully everyone agrees this would be a good investment of resources.
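Sizing up the problem can be as basic as taking a sample of past signups that humans already reviewed and checking how a candidate rule would have done against their verdicts. Here’s a sketch of that scoring step. The record layout and the sample counts are hypothetical, and the “rule” itself is left out on purpose since it would depend entirely on the service.

```python
def evaluate_rule(records):
    """Score a candidate rule against human review decisions.

    `records` is an iterable of (flagged_by_rule, actually_spam) booleans.
    """
    tp = fp = tn = fn = 0
    for flagged, spam in records:
        if flagged and spam:
            tp += 1
        elif flagged and not spam:
            fp += 1
        elif not flagged and spam:
            fn += 1
        else:
            tn += 1
    legit_total = fp + tn
    spam_total = tp + fn
    return {
        # Share of legitimate signups the rule would have wrongly blocked.
        "false_positive_rate": fp / legit_total if legit_total else 0.0,
        # Share of real spam the rule would have caught.
        "spam_caught": tp / spam_total if spam_total else 0.0,
    }


# Hypothetical sample of 100 reviewed signups.
sample = (
    [(True, True)] * 8       # spam, flagged
    + [(False, True)] * 2    # spam, missed
    + [(True, False)] * 1    # legitimate, wrongly flagged
    + [(False, False)] * 89  # legitimate, left alone
)
result = evaluate_rule(sample)
```

Two numbers like these, plus a rough count of how many signups reviewers handle each week, are usually enough to start the conversation about whether the engineering work is worth it.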

By sitting in the intersection of all these groups of people and being able to communicate with all of them, you’ve found a way to make people’s lives easier and everyone wins.

Your biggest impact is often through saving people’s time

If it weren’t for computers, data work would be very tedious. Manually entering data, checking data, looking up and using data to make decisions: it’s all time intensive. That’s part of the reason why data science requires programming skills. There’s no way to do the work without a certain level of automation and scaling.

Teams who incidentally generate data during their day-to-day activities often don’t have those skills. So when you’re helping out a team that doesn’t have access to programming resources, saving time through automation and simpler decision making jumps to the forefront. It becomes an area where you can have a big impact. This is especially true when the teams get larger. Saving everybody an hour of their time, multiplied by the team size, can quickly scale into a huge amount of time (and human frustration) saved.
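The back-of-the-envelope math is worth writing down, because it’s what you’ll show people when arguing for the work. Every number below is made up for illustration: a team of 50, one hour saved per person per week, 48 working weeks a year, and a loaded cost of $60 an hour.

```python
def annual_savings(team_size, hours_saved_per_week, hourly_cost, weeks_per_year=48):
    """Return (hours saved per year, dollar value of those hours)."""
    hours = team_size * hours_saved_per_week * weeks_per_year
    return hours, hours * hourly_cost


# Hypothetical team: 50 people, 1 hour saved each per week, $60/hour.
hours_saved, dollars_saved = annual_savings(50, 1, 60)
```

With those assumed inputs, that works out to 2,400 hours a year, or $144,000. Swap in your own team size and rates and the number moves fast.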

Sure, it’s not quiiiite the same feeling as directly finding a few million dollars of extra revenue between the sofa cushions from running some random A/B experiment, but it can math out to being a similar amount of money.

Build and leverage connections

One final benefit for working broadly is that you’ll build connections all over the place. There’s all the individual and team connections, which you can leverage to get stuff done. There’s value in being able to bridge context between teams that might not be communicating closely.

There’s also all the disparate data sets that you touch and could possibly combine to make newer, even more exciting things. Networks get stronger and more interesting every time you add a new node in, so there’s no telling where things can lead you.

About this newsletter

I’m Randy Au, currently a quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. The Counting Stuff newsletter is a weekly data/tech blog about the less-than-sexy aspects of data science, UX research and tech, with occasional excursions into other fun topics.

Comments and questions are always welcome, they often give me inspiration for new posts. Tweet me. Always feel free to share these free newsletter posts with others.

All photos/drawings used are taken/created by Randy unless otherwise noted.

Some Counting Stuff swag from RedBubble is available for print-on-demand here.

Support this free newsletter at Counting Stuff’s Ko-fi shop