Real Workflows in Data Science

Some of them, at least

Live view of the mess where this newsletter gets cobbled together

This week, someone asked a very interesting question. How do data scientists keep track of an analysis?

I have absolutely no idea.

The topic is not something that many (any?) of us talk about. It lives in a weird scope gap between the actual tools we use (which we discuss incessantly) and the big “plan how your projects will go” type discussions. It also feels like a quirky, private aspect of doing research work. So long as the work actually gets done, we don’t usually talk about the smoking mess of files used to generate the result. I think we all believe our own methods are so natural that everyone will figure it out for themselves.

This is all interesting because there’s a lot of discussion about specific tools (like Jupyter Notebooks, R, Python) or certain task workflows (like creating an ML pipeline, or handling inbound data requests). But there’s nothing at the higher level of “being a good practitioner”, the gluing of various tools and processes together into a coherent research and analysis process.

What’s this? An unsexy data topic? Sounds like the perfect topic for this newsletter!

The only thing that I really know about workflows is that my own is a hopeless sea of chaos, much like my desk. Tons of half-finished projects waiting to be completed, some half-forgotten until something shakes them loose, all held together with a certain amount of spatial and temporal memory with occasional notes attached to critical parts.

My (not very good) workflow

The general workflow that I’ve developed over time is admittedly not very good. It’s also changed multiple times in order to adapt to the completely different environments I found myself in. Tech stacks radically changed from under me over the years, so it’s an abomination of what’s managed to survive a decade of upheaval.

Handling inbound requests

I keep a surprising number of inbound requests in memory, like, actual wetware human memory. This is not a good thing and I compensate for the mental load by leaving notes to myself. I usually have 1-2 “Large Projects” that are in flight under my control, then people come with small ad-hoc requests that I bang out within a day or three according to my priorities. These I mostly juggle in my mind and in a set of weekly notes I keep to track what I’ve done. There is no ticketing system because I don’t work very well with tickets when I’m effectively solo — there isn’t the bureaucratic need yet on my team.

Then there’s a whole battery of mid-sized or large projects where I’m contributing, but I’m not the main driver. For those, I often offload a lot of the tracking/prioritization to others. If things are important, they find their way to me and get popped on top of the attention queue.

But for all of these projects, big, small, or otherwise, I treat them as self-contained units of work. That means that they each must carry their own context near themselves all the time in case I ever have to look back at the work.

Current tools

For better or worse, everything at work is built to more or less function with Google Drive in mind. It’s great that stuff is all in one place, from spreadsheets, to slides, to Colab notebook files. There’s search functionality to sift through it all, including files shared with me by others. But at the same time, it’s something of a nightmare to organize things by folders, so it’s effectively one infinite list of files.

Then I have separate tools to execute queries and create dashboards, and tools to create sites if I need to. That covers probably 99% of my daily work tooling. I don’t even really use version control because as a Quant UX Researcher, I do a lot more one-off type work for product teams instead of long-running pipeline engineering, so I don’t really need to write durable code. All this stuff sits in separate universes that largely don’t intersect.

There’s also a ticketing system that I largely ignore because my UX team lives outside that system. Which is good, because I’m really lax about ticket usage.

Keeping an analysis paper trail

A lot of the quicker bits of work are just SQL query -> Analysis spreadsheet or Colab code -> findings reported in an email or a short deck. It’s fairly easy to make sure all the pieces stay together by just including the SQL query in my analysis file, and making sure the findings either link back to, or keep a reference (filename or a date) to the original analysis. That way I can always go and find all the pieces needed to recreate the work.
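As a rough illustration of the keep-the-query-with-the-analysis habit, here is a minimal, hypothetical sketch. The real work would hit a warehouse from a Colab notebook; here `sqlite3` stands in for it, and the table and column names are invented:

```python
# Sketch: keep the SQL that produced an analysis inside the analysis
# file itself, so findings can always be traced back to the query.
# (sqlite3 stands in for the real warehouse; names are illustrative.)
import sqlite3

# The exact query travels verbatim with the analysis code -- this is
# the "paper trail" piece that lets the work be recreated later.
QUERY = """
SELECT region, COUNT(*) AS n_orders
FROM orders
GROUP BY region
ORDER BY n_orders DESC
"""

def run_analysis(conn):
    """Run the saved query and return rows; the query string lives
    next to this function, so the two can never drift apart."""
    return conn.execute(QUERY).fetchall()

# Tiny in-memory dataset so the sketch runs end to end.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT)")
conn.executemany("INSERT INTO orders VALUES (?)",
                 [("west",), ("west",), ("east",)])
rows = run_analysis(conn)
print(rows)  # -> [('west', 2), ('east', 1)]
```

The point isn’t the query itself; it’s that the file is self-contained, so a deck or email only needs one link back to it.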

For bigger, more complex projects, very often there are tons of queries and explorations, multiple analysis files, multiple presentation decks for milestone work or small finding share-outs. It becomes progressively harder to keep everything glued together. So instead, I make it so that the files on the “critical path” of the analysis, the stuff that goes into the final presentations (and this sort of work always gets archived as a deck), are prominently linked to in an appendix in the presentation.

Side journeys are named/dated/packed in a way that they’re near the critical path files. Inside the side journeys would be any interesting notes and findings that didn’t pan out — it’s okay if the side stuff gets lost in the shuffle over time.

Sharing out and archival

As mentioned, my primary vehicle for sharing work out tends to be slide decks. Plenty of people hate slide decks, and some with good reason (there’s even that book on “How Powerpoint Makes You Stupid” [unaffiliated amazon link] which focuses on such arguments). But if you’ve been reading my newsletters, you’ve probably realized that I’m not really capable of oversimplification, so they come with appendix slides and detailed notes.

Thanks to that habit I tend to leverage these durable slide decks as my archival solution. They contain the context of the initial request, the problem we’re solving, a non-technical discussion of the methods used, the results, and finally links to all the important files. For me, that’s the complete package.

I can usually go through my files and hunt down an analysis presentation to get my full context and links to necessary data files. If I somehow forget the links (which happens sometimes if I’m moving quickly), I habitually put dates in my filenames so I can narrow a search to files created in the right timeframe.

Work also has a “Research Archive” site that’s specifically made for putting research up for long-term sharing. Occasionally, I put work that’s of sufficient quality on there too. It’s definitely a nice resource to have but not one that I take full advantage of.

Previous iterations

In previous positions, under different tech stacks, I’ve squirreled away records in weird places. I’ve put stacks of commented-out queries and links into often-used dashboards to justify metrics. I’d put all the Excel files for the year within a single folder and archive them away. I’d even check code into a work Github repo for data science work that was divorced from the data and then I'd lose track of analysis flows because we didn’t want to shove 10gb data files into Github (don’t do that, this was before they added blob support) and the code documentation was dodgy (my fault).

Clearly, I don’t do this super well

Stuff can get lost. I have trouble naming things in a way that I can scan my work and find the correct file over time. 2020-10-02_requestor_some_descriptive_words.csv doesn’t work nearly as well as it seems 6 months after I save the file. Search functionality can be hit or miss depending on whether the contents are indexed along with the filenames, or whether I can remember roughly what I decided to name something at the time. It takes a fair amount of detective work to track down things that go far back.
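For what it’s worth, the date-prefix habit works because ISO-style dates sort correctly as plain strings, which makes the “narrow a search to the right timeframe” trick mechanical. A small sketch with invented filenames:

```python
# Sketch: YYYY-MM-DD filename prefixes let you filter files by
# timeframe with nothing fancier than string comparison.
# (Filenames here are invented for illustration.)
from datetime import date

files = [
    "2020-10-02_maria_churn_segments.csv",
    "2020-11-15_self_pricing_followup.xlsx",
    "2021-01-08_maria_churn_rerun.csv",
]

def in_window(name, start, end):
    """ISO date prefixes sort lexicographically, so comparing the
    first 10 characters against ISO date strings is enough."""
    stamp = name[:10]  # the YYYY-MM-DD prefix
    return str(start) <= stamp <= str(end)

hits = [f for f in files if in_window(f, date(2020, 10, 1), date(2020, 12, 31))]
print(hits)
```

It only helps, of course, if the prefix convention is applied consistently, which is exactly the part that slips when moving quickly.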

But luckily! A good number of data scientists from all around chimed in with their own workflows. They really run the gamut, and it’s interesting to scan across the individual responses to see what people are using. Due to shoddy thread management on my part, somehow the discussion bifurcated into two threads =\. My excuse is that I did most of this on my phone while outside.

Common Strategies I noticed

Almost everything in a single file — Large analysis notebooks that do all the work of providing historic context, exploratory analysis and code, as well as narration for sharing. Usually seen for smaller projects.

Hub-and-spoke — everything is referenced off a single point of reference so you only need to keep track of one object in order to find your way to everything else. This appears to be the more common pattern since it’s very simple. It pops up quite often when a ticketing system is involved because a ticket is a natural place to link everything. It’s also possible to use another object, like the final report, as the reference point.

Network of files — A kind of expanded version of hub-and-spoke where files provide references to other relevant files, but there’s no one giant view of everything. Some files have further links to other stuff. Maybe there's some cross-linking involved. It scales better for large projects where you might not care to link to an analysis done 6 months back because things have changed.

Grouped together in the same place — there are many files and other artifacts, but everything is located near each other, often within a folder of some kind of filesystem. This seems to be most common for repository-centric workflows, but can also apply to various filesystem-based setups.

Using dates as a backup strategy — Sometimes, things get lost in the shuffle, but most projects are time-boxed enough that if you know roughly when to look, you can find missing files.
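To make the hub-and-spoke idea concrete, here is a toy sketch where a single hub record (a ticket, say) links out to every artifact, so remembering one identifier recovers the whole project. All of the names here are invented:

```python
# Sketch: hub-and-spoke as data. One hub object (a ticket or a final
# deck) carries links to every other artifact in the project.
# (Ticket IDs and paths are invented for illustration.)
hub = {
    "ticket": "DS-1234",
    "links": {
        "query": "queries/2020-10-02_churn.sql",
        "notebook": "notebooks/churn_exploration.ipynb",
        "deck": "decks/churn_findings.pdf",
    },
}

def find(hub, kind):
    """Every artifact is exactly one hop away from the hub, so
    lookup never requires remembering more than the hub itself."""
    return hub["links"][kind]

print(find(hub, "deck"))  # -> decks/churn_findings.pdf
```

The network-of-files pattern is the same idea minus the guarantee that everything is one hop away.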

Breaking things down

Looking over the responses, a few high level patterns sorta emerged around the multitude of concerns that need to be balanced in any data scientist workflow:

  1. Retaining the initial research question context

  2. Maintain the analysis paper trail

  3. Split analysis into a “main branch” and “side explorations”

  4. Pass context between very different tools and environments

  5. Share code between analyses if necessary

  6. Share results with, sometimes non-technical, stakeholders

  7. Keep track of things outside of a project/request context

Not everyone needed to handle all aspects in all situations, but on average people had to consider them when building out a generalized workflow. Let’s start going into the pieces.

Retaining the initial research question context

This is the important information of “Who’s asking for this analysis? What were they trying to do? Why does this work even exist?”. We need it because the context helps everyone understand why certain work was done, and why certain work was NOT done. It sets the scope and the stage.

Often, people used ticketing systems like JIRA to do a lot of this work. People requesting work can fill it out and the ticket becomes the central reference point for all the work (much like how my result decks are).

For people who don’t leverage ticketing systems this way, everyone seems to have a way of keeping this context around, whether in decks, personal notetaking software, README files, comments, or wikis. Everyone pretty much said it was important to keep track of.

Maintain the analysis paper trail

What I mean by “analysis paper trail” is the whole record of every step taken to go from raw data to results. It’s all the transformation receipts. Just about everyone that responded knew how important it was to make sure the steps in an analysis were preserved, in case questions came up or an analysis needed to be run again.

But the actual method for doing this varied wildly and depended a lot on the specific work environment and tools available. Jupyter notebooks and Colab were often mentioned, since those code blocks are often used to note steps within an analysis. Their exploratory nature also lets users test out ideas without paying a huge time cost.

People often said they’d comment their code, or add in narration of steps. Sometimes they’d name their files for different phases of an analysis.

Split analysis into a “main branch” and “side explorations”

Analysis leads to many strange hypotheses that don’t pan out in the end. Or you find something interesting, but it’s not particularly useful for the question at hand. This creates quite a bit of throw-away work that may never see the light of day. That is, until that work inspires something in the future.

So data scientists tend to want to hang on to those side journeys, just in case. Hey, we’re data people, we don’t throw out potentially useful data without a good reason.

To stay somewhat organized, it seems common to split this work off from the “main work” in some way. The most common is to just have separate files for different things. For smaller explorations, people may just keep them within the Jupyter notebook, commented appropriately.

Still, having lots of analysis files floating around can be tricky to handle. After a few weeks away, it’s possible to forget which files are important for what. So people seem to use a mix of filename conventions, comments, READMEs, and links within other artifacts (tickets, decks, etc) to make sure it’s clear what is what.

Pass context between very different tools and environments

End-to-end, data science uses a huge number of tools. There’s our databases and the queries used to access them, there’s the code we write and potentially have to check in, there’s ticketing systems, wikis, presentations, and dashboards. Sometimes there’s email involved. We often have our own personal notes, in software, a doc, or on paper. We work in all these things at the same time and somehow have to pull things together. It’s hard.

I think this is where we all wind up finding things that work for our situations and just stick with it. Some people rely on a central hub like a ticket or a slide deck in my case that links out to all the threads and provides the context and organization. Other people arrange things more linearly so that one thing links to another in a big chain.

I sometimes will merge a bunch of slapdash files into a single larger file to add a bit of organization. Other times, I’ve ported operations that used to be done on a spreadsheet pivot table into Python code to get rid of a manual step.

Share code between analyses if necessary

Code reuse is cool, especially if you’re writing a lot of R or Python stuff. You’ll often have small helper functions and things that are useful across multiple data projects, so there’s a need to decide whether you’d like to have a separate centralized repository for this shared code, or whether it’s better to just copy/paste it into another project so that modifications can happen.

SQL can sometimes suffer from this, but typically it’s “Safe” in that it’s, uh… rarely put into version control to begin with [link]. So a lot of work winds up being copy/paste of existing patterns and making changes.

Whatever you decide tends to add a whole new dimension of complexity to everything: either juggle two repositories of code and possibly maintain backwards compatibility with that code’s API, or face having lots of slightly-modified copied snippets of code floating around in various places.

I think the answer depends a lot on your own situation.

Share results with, sometimes non-technical, stakeholders

Results are often the whole point of the work we do, and that means that lots of people are interested. These people also come from all sorts of backgrounds. In industry, the usual medium of choice for communicating to others is, for better or worse, slide decks, and so a lot of work is often put into the format.

The problem is that the pretty, clean, very tight narratives of a typical slide deck, tuned for executives and decision-makers, specifically gloss over a bunch of the methodological details. Decks are also often shared around in different circles, sometimes even external to the company.

It becomes a challenge to link out to internal, private research materials and processes when results travel that broadly. So while you might be able to include a few links and descriptions in an appendix, it may be impossible to tie everything tightly together.

The only strategy I have for dealing with such situations is to be extremely meticulous in file-naming and process. There’s also a risk of accidentally sending data (perhaps with PII!!!) that you don’t want to send, so you need to develop safeguards for that.

Keep track of things outside of a project/request context

Not all our work fits cleanly into project boxes. There’s always extra ideas, hypotheses, things that need to sit for a while to simmer. We need places to file these thoughts, ideas, side explorations for later.

I personally keep a paper notepad, notebook, sometimes a stack of sticky notes on my desk to scribble ideas into. Most of them wind up being useless, but occasionally they become something more. Other people seem to maintain notebooks of various kinds for these ideas.

Another idea, which Michael Kaminsky wrote about, is to maintain a data science lab book, which keeps records of what work you’ve been doing on a daily basis. The typical format looks like:

Date: 2020-01-01
Hypothesis:
We think cheese sandwiches are best after 5pm
Conditions:
Bought 50 cheeses and bread [list] at 4pm, grilling commences. Will also test at 8am tomorrow.
Results:
We don’t need dinner any more…. crumbs everywhere
Git commit:
ba14a2ec

Such a system would span across all projects, and side excursions.
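If you wanted to automate entries in that format, a small sketch might look like the following. The field names mirror the template above; everything else (the function, the example values) is illustrative:

```python
# Sketch: render one dated, append-only lab-book entry in the format
# shown above. Appending each rendered entry to a running text file
# gives a cross-project journal. (Helper names are invented.)
from datetime import date

TEMPLATE = """Date: {day}
Hypothesis:
{hypothesis}
Conditions:
{conditions}
Results:
{results}
Git commit:
{commit}
"""

def lab_entry(hypothesis, conditions, results, commit, day=None):
    """Fill the lab-book template; defaults to today's date."""
    return TEMPLATE.format(day=day or date.today().isoformat(),
                           hypothesis=hypothesis, conditions=conditions,
                           results=results, commit=commit)

entry = lab_entry("Cheese sandwiches are best after 5pm",
                  "Grilling commenced at 4pm",
                  "Crumbs everywhere",
                  "ba14a2ec", day="2020-01-01")
print(entry.splitlines()[0])  # -> Date: 2020-01-01
```

The git commit field is what ties the prose record back to the exact state of the code, which is the part most ad-hoc note systems lose.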

What about tools?

Most of the tooling people mentioned was just the individual tools they were using for various things, the usual suspects we don’t need to go into like Email, Jupyter Notebooks and its close cousin Colab, RStudio, git, Google Docs/Sheets, Word, Excel, PowerPoint/Slides.

Jira and similar ticketing systems came up; many teams use them to handle request ingestion. A subgroup of people, myself included, also get PTSD flashbacks about Jira because the software is very complicated and heavy-handed while the UX can be confusing. Just search for “Jira hate” on the internet to see plenty of rants.

Wikis are often used, either packaged with the ticketing system (Confluence + Jira for example) or running separately. Either way, they provide an easy way for creating pages of linked material for documentation/archival purposes.

Notion came up multiple times from many people in terms of notetaking and sharing. It’s got a lot of functionality for team sharing, including wikis and other tools.

One person mentioned Knowledge Repo out of Airbnb, a massive system for sharing knowledge amongst a team of data scientists and nontech stakeholders. It works with source control repositories: there’s facilities for code, document metadata, slides, docs, etc. I’ve never used it and it seems a bit too much for my personal solo-worker use case, but it looks really cool if you can convince your entire team and stakeholders to adopt it consistently.

One person used a novel method of keeping a work journal: Slack. Using a channel all to themselves, they’d use the threading feature to take notes. I thought that was a very clever use of an existing tool. Exporting the chat history might actually allow archival for the future too.

Okay, but what are other fields doing?

While we don’t seem to talk about our workflow very much in data science, possibly because the field is too new for the dust to settle, other fields (essentially all of science) have been doing similar work for centuries.

The problem is that lots of “workflows” are described at a very high level. There’s a lot of writing that tries to discuss idealized scholarly workflows; this one happens to overview a number of diagrams from different sources. The paper itself mentions that most examples are idealized workflows used by other stakeholders to understand the research process. The actual details (which are what we’re interested in) aren’t shown.

Out of the little reading I’ve managed to do, the site Innovations in Scholarly Communication by Bianca Kramer and Jeroen Bosman at the Utrecht University Library is perhaps the most useful for our interests. They’ve been asking (for years now) what workflows researchers actually have, and the dizzying array of tools and tool-usage clusters shows just how complicated the “what are people’s workflows” question is. The data is openly available for anyone to poke at.

There’s a ton of resources on the site, like the most popular tools for specific activities, and this big list of (idealized) workflows based on Discovery->Analysis->Writing->Publication->Outreach->Assessment steps. The academic workflow differs a bit from our industry one because we have significantly smaller (or, nonexistent) peer review/outreach requirements and the “discovery” phase doesn’t usually involve a giant literature review.

Another thing that we don’t particularly think very hard about (even though we should) is the concept of data provenance and reproducibility. Other scientific fields have started building out tooling to handle such issues, while industry is largely “eh, it’s our database, we do whatever the heck we want when we want.” And that’s not even taking 3rd party data sources into account.

Even this complex view of workflow is still a bit too high level to get at “how do people actually do science”. That still seems to be left either as an exercise to the researcher, or perhaps to the PhD program. Since doctoral programs have a certain apprenticeship aspect to them, we can assume that some of the craft of “how to do research” is transferred informally there.

Many data scientists come from mixed backgrounds. Those with prior research or PhD experience will carry over the specific research styles and working habits they picked up there. Data scientists who come from non-research/academic backgrounds will put their own habits together. Both wind up working in the end.

So as data scientists keep working and growing our toolsets and domains, I’d expect that we’re going to wind up exactly like that big complicated research tool map. I’m not sure if it’s a good thing or not. I suppose it’ll be a good problem to have since it means we’d be crazy successful in a giant array of domains.