The utterly gorgeous Library of Parliament in Ottawa, Canada.
It’s both a fun blessing and a vexing puzzle when readers challenge me to write about certain topics. I do encourage people to toss requests at me, since it gets me out of my own headspace. This week’s post began with this monster of a request:
Being someone who works in that particular field, Jenna here obviously knows this is a MASSIVE, continent-sized topic. Sadly, I’m not informed enough to tackle the entire continent and I’ve only got a few thousand words to use in a post. To cope, I’m gonna employ the time-honored move of harried data scientists everywhere — reduce scope by fiat and attempt to dazzle the audience with “directional” findings until I can form a strong case later on.
Preface and disclaimers
Before I run my mouth/keyboard on the open internet, let’s get some disclaimers out of the way. I do NOT think I’m particularly good at writing documentation; it’s a different, more thoughtful process than writing snarky blog posts. At best, I have a very healthy respect for how difficult it is, so when I’m forced to write docs, I have to put a lot of effort into them and they still come out fairly “meh”.
I work next to some extremely talented UX writers who spend a lot of time and energy thinking about text within products holistically, from terms, labels, tooltips, explanatory panels, error messages, all the way to “the docs”. (If any of you are reading this, y’all rock!) I can only see parts of their process and the end results, but it’s something I can only aspire to crudely imitate.
Writing documentation is “hard”, but recognized as valuable
Very few people actually enjoy writing documentation. It often feels like “extra” work on top of actually building things. It’s also a skill that needs to be developed and refined with practice, and many people realize that writing is a craft to be honed instead of a magical innate skill.
At the same time, despite all the pain and griping surrounding documentation, everyone recognizes that it has SOME kind of value. Think back to all the instances where we looked back at our own code and analyses to answer a question or make a change, and got frustrated that some idiot had put down gibberish with no explanation.
Speaking more broadly, how else can people learn how to use complex tools unless some kind of manual exists? Every developer relies on detailed reference documentation of APIs and libraries to do their work. We all rely on the vast community of blog posts, tech talks, code comments, Stack Overflow, YouTube videos, and podcasts to absorb knowledge of the technology and learn how to do stuff. The alternative learning methods, painful trial and error and directly reading source code, are so inefficient that they’re almost unthinkable.
So people do instinctively know that they should be doing some form of documentation, but they aren’t motivated to do it because it takes too much time, is too much work, and isn’t important enough.
“Documentation” is too overloaded a term
While doing (insufficient) research on documentation, I found it glaringly obvious that we humans use the term to cover way too many use cases.
I saw a bunch of articles that broke docs out into four broad classes, one example (of many similar ones) being this article. These posts usually break the topic into 4 big domains: tutorials, how-to guides, explanation, and reference, based on whether the docs are learning-oriented, problem-oriented, understanding-oriented, or information-oriented.
Those descriptions were a bit too abstract for me to understand, likely because I haven’t thought sufficiently deeply about the topic. Instead, I started looking at what people wrote on the topic of “why is writing documentation so hard?” and came across this article I really liked: “Why Internal Documentation is Hard” by John Teasdale. He sat down and wrote out that the ultimate purpose of technical documentation falls under the umbrella of “transferring technical knowledge”. It exists within the giant constellation of artifacts that transfer knowledge on how to use/maintain/operate/do-whatever-to a piece of technology. This means that documentation is asked to fulfill a huge number of tasks.
John’s article lays various technical docs use cases out in more detail, but just cherry-picking a few of the big ones for their relevance to data science work, we get:
Onboarding new people - “How does this thing work?”
Listing out the entire breadth of an API - “What’s everything I can do?”
Explaining why to use something and what it’s for - “When should I use this?”
Show example usage and patterns
How to troubleshoot, fix something
How to tune/optimize performance
And much more other stuff…
Even with this abbreviated list of major use cases, it’s clear that asking any single document to cover all of that is REALLY difficult. Very often, the use cases are handled by different documents in different styles and formats — tutorials, blog posts, API references, books, written by different people in different contexts.
Off the top of my head, only very mature or well-funded projects have documentation that covers a bunch of those use cases in a unified location, like the FreeBSD Handbook or a lot of the documentation that comes out of Microsoft. Essentially, it takes a lot of time, energy, and resources to make good docs that cover all the bases.
That means that it’s very likely NOT going to happen for your 3-week long data science project, nor is it likely to exist for a v1.0 release of a product from a startup. The up-front cost is too high.
So, instead of beating ourselves up about how we should, but fail to, write documentation for our stuff, let’s be intentional about what little we actually do so that it’s useful and relatively painless for us.
Local terminology: Recordkeeping vs Documenting
In this post, I’m going to make a terminology distinction for clarity: “recordkeeping” versus “documenting”. I’m sure the concepts exist out there, but I’m not 100% sure the distinction exists in the rest of the world under these exact terms, so I’m just going to say I’m making it up here.
Recordkeeping is keeping records, notes, and artifacts around to help in creating documentation in the future. It’s done by the people involved in the actual work, saving things and making clarifications that only they really understand, with no expectation that someone else needs to understand everything. The material collected is primarily intended for use in writing the actual documentation. Think of it as preserving raw data for future use.
Documenting is the act of writing documentation — material that’s intended for consumption by people who weren’t involved in the work, or have since forgotten what the work was like.
I’m making this distinction now because I’m going to advocate for a lot more recordkeeping and a lot less documenting, and the two acts are qualitatively very different.
Being intentional: Figure out what documentation use cases you need to fulfill, do those, let the rest go
For a large amount of data science work, there’s a relatively limited scope of documentation needed. Much of it is tied to the question “will I ever look at this work again?”. The lower that probability, the more we can get away with documenting LESS (but not nothing). The requirements only start increasing as the expected audience expands and the need to look back stretches further and further.
Think of it like a woodworking project. You want to spend the most energy on all the surfaces that people can see and touch, making them presentable and perfect. But for areas that are hidden away and never to be seen, it’s OK to not be perfect so long as it’s done effectively. Even better, unlike a wood table, we can always go back and improve the unseen areas if suddenly people start looking at them.
Preparation FOR documentation is key
The most important part about writing any documentation is realizing and accepting that we’re going to have to do this work, so it’s best to plan for doing it early. If you know ahead of time the sorts of documentation you’re going to need before work starts, it lets you pay attention to your recordkeeping while the work is happening. Those records can then be refined to create parts of the documentation and save you time in the long run.
For example, when designing a system, a lot of diagrams and notes are usually created, modified, and then discarded. Since you’d need the logic flow diagram later to explain to other people, you’d save the most recent one for that purpose instead of just erasing the whiteboard at the end of the meeting.
Similarly, during design and implementation, lots of little decisions are made about details. Those small details are often forgotten soon after and a third party would be unlikely to know why something was done in a certain way. Noting this stuff down, even in a quick comment, helps with recall if that bit of information becomes necessary in the future to create docs.
You’ll have to find the right balance for note-taking as you work. I learned quickly in my dogfooding project that it’s EXTREMELY distracting to make detailed journal notes while trying to work out a programming issue. I tried to write down every step and thought experiment, but it wound up being too much and I settled for putting down summaries of the thought process. Obviously that level of detail is not required to write documentation (I needed it for the research project), but it made the formal presentation/documentation of the project so much easier afterwards.
You have to figure out a balance that works for you.
Good prep makes backfilling docs easier
Since we’re only planning to write the docs that we know we’ll need in the future, there will be gaps. The most common time we realize there’s a gap is when someone leaves, or joins, the group. There’s a giant moment of “OMG, Alice is leaving in 10 days and knows too much stuff no one else has touched!” or “Bob is joining and we have no onboarding/training/ANYTHING so what will their first week look like?”
This sequence of events probably sounds familiar to you, and sounds a bit unsatisfyingly slap-dash. Why didn’t we have this stuff ready to go beforehand?
Well, we didn’t have it because we were too busy building and launching new features instead of documenting. We accepted that debt (hopefully) willingly.
I don’t think it’s that big of a deal. I’m fine with big events like those two being the forcing function that’s needed to get certain types of documentation written. It’s all stuff that the organization can live without (because it has been doing so). It’d be a waste of effort to constantly maintain onboarding docs if no one is being hired for long periods of time.
So long as records are kept close together within easy reach, like say… attached/linked to the project files, it should be possible to use them to build out more complete documentation when the need arises.
Examples of common tasks
Many analyses have a one-off nature to them; they answer a pressing question now. Once a decision has been made, the analysis may never be relevant again. So in terms of docs requirements, things start off fairly simple:
Give anyone reading the analysis blind (including your future self) the context needed to understand it.
That pretty much just calls for a handful of needs:
Carry the context of the analysis — why was it done, what’s the research question it’s trying to answer
The findings and conclusions from the analysis
How the analysis was done — the mechanical data transformations applied to come to the result, ideally packaged together with the results
The context and results are often put together in the formal results presentation decks; the research question and context need to be shown to people along with the results there. Since we usually need to make these anyways, we get them “for free” in terms of effort.
Meanwhile, the mechanical data transformations to do the analysis, the SQL queries, data cleaning transformations, the filters, custom bucketing logic, pivot tables, etc. are what comprise #3. It’s “the work” we do. This is the part that takes some effort and planning because it requires good recordkeeping habits.
We need to remember to save queries and steps along the way, then add in appropriate comments to explain WHY something is being done. Note down what’s exploratory (and could be ignored) and what’s on that final critical path to the results. Then we need to make sure to pack everything together and link it with the results so that anyone looking at one can find their way to everything else.
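To make the shape of this concrete, here’s a minimal sketch of what that recordkeeping could look like for a small analysis script, with the critical path and exploratory work labeled and the WHY comments attached where the decisions were made. Every name, number, and decision date here is invented for illustration:

```python
# Recordkeeping sketch for a one-off analysis (all data/dates invented).
import csv
import io

# -- CRITICAL PATH: these steps produce the final numbers --------------

# Stand-in for the warehouse export we'd normally save alongside the query.
RAW = """region,revenue,is_test
north,100,0
south,250,0
qa,999,1
"""

def load_rows(text):
    """WHY: the warehouse export arrives as CSV, so we parse it as such."""
    return list(csv.DictReader(io.StringIO(text)))

def drop_test_traffic(rows):
    """WHY: is_test=1 rows are internal QA traffic and would inflate
    revenue. Decided with the PM on 2023-05-01 (hypothetical date)."""
    return [r for r in rows if r["is_test"] == "0"]

def total_revenue(rows):
    return sum(int(r["revenue"]) for r in rows)

# -- EXPLORATORY: kept for the record, safe to ignore ------------------
# We also looked at revenue per region; it didn't change the conclusion,
# so it never made it into the final deck.

result = total_revenue(drop_test_traffic(load_rows(RAW)))
print(result)  # 350
```

The comments here are aimed at future-us, not a polished audience; the point is that the WHY and the exploratory/critical distinction are cheap to capture in the moment and painful to reconstruct later.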
Throughout this, it’s very tempting to just “get things done” and forget to keep intermediate records. Pausing to record things while working can break your flow, which sucks and is why people don't like writing docs. So instead of fighting this natural tendency, I'd recommend just tacking on a break at the end of a work session to lightly append quick notes while it's fresh in our minds.
Experiments are a step up in complexity: they’re sorta one-off-ish, but involve more people and longer timeframes. Experimental results also (potentially) can have longer-term effects and can teach lessons way out in the future, so the value of documentation can be higher… if you want to invest in it. I admit I don’t do this well.
Experiments are like one-off analyses, but must allow non-technical/analytical stakeholders to follow along and feel confident things are correct. Ideally, docs should also let future research work refer to and build off of it.
Experiments need to be followed by tons of people of every background imaginable — engineers, product, executives, researchers, other analysts. This varied audience might not care about every detail, but collectively they need to know a ton of things. It’s a challenge to make sure everyone’s information needs are met:
Why is the experiment run, what’s the key treatment
How is the experiment run, what’s measured, what’s not
The end result, what decisions were made, what was learned in the process
All those choices need to be recorded when they’re made so that they can be incorporated into more formal things in the future. You’re definitely going to be asked about those details, so there’s no point putting it off and making it harder for yourself.
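One low-effort way to capture those choices as they happen is a running record that lives next to the experiment and gets appended to rather than rewritten, so the trail survives. A sketch, with every field and value invented for illustration:

```python
# Lightweight experiment record, appended to as decisions are made.
# All names, fields, and numbers are hypothetical.
experiment_record = {
    "name": "cart_button_icon_test",
    "why": "Does replacing the 'Add to Cart' text with an icon hurt conversion?",
    "treatment": "icon-only button shown to 50% of web traffic",
    "measured": ["add-to-cart rate", "checkout completion"],
    "not_measured": ["long-term retention"],  # noted so no one over-claims later
    "decision": None,   # filled in when the experiment ends
    "lessons": [],
}

# At decision time, append rather than overwrite, so history is preserved:
experiment_record["decision"] = "ship text button; icon was -2% on add-to-cart"
experiment_record["lessons"].append("icon recognition was weaker than assumed")

print(experiment_record["decision"])
```

Something this crude is enough to answer most “wait, why did we do that?” questions months later, and it’s the raw material for any formal write-up at the end.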
But when it comes to making “formal docs”, between people preferring to be briefed verbally at meetings and different people preferring different formats (slides vs documents vs dashboards), it seems impossible to keep everyone happy. Also, people usually don’t like to read if they can get the information verbally, which creates a disincentive to document anything in that way.
So instead of doing a lot of work that might get thrown away, I’d recommend that brief updates and explanations be noted down in an artifact on a regular basis, just so there’s an actual trace of what is happening (it also allows people who aren’t physically present to keep up). But then don't do too much more until the very end when it's time to end the experiment and finalize decisions.
Oddly enough, this is where large organizations, with all their lumbering bureaucracy, have an edge over nimble startups. All that bureaucratic overhead forces this work to be done, while a nimble startup would just gloss over everything in a quick verbal meeting and move on.
Then, for the long term, experiments should somehow be archived for future reference. Because, in theory, a well-designed experiment is testing an idea, like “do users care if we have “Add to Cart” text vs just an icon”. It’s a tiny sliver of knowledge. That knowledge might be robust across a bunch of conditions, so you can apply it to other things in the future. People can only do that if this sort of research is archived in a way that people can find it later.
It’s REALLY DIFFICULT to maintain such experimental archives. This is the stuff that library and information science folk work hard at figuring out. There’s metadata collection involved, organization, information retrieval, discovery issues, usability. Researchers also have to integrate it into their work process.
It takes a lot of organizational stamina to make all that work. There’s tons of ways where these systems fail and history is littered with the corpses of such projects. But if you ever manage to get one, it can be an amazing resource.
Setting metrics is important work. It defines where whole groups of people will focus their energy and resources. As data scientists, we often do a ton of work to make sure that important metrics are actually important. Thanks to the importance of this work, it’s almost required we have some documentation prepared.
Metrics work needs to be clear, easily understood, and remembered by everyone. They should be backed up by research work that interested people can leverage.
This splits the minimal needs of documentation down into two tiers:
Very clear, memorable explanation of what the metric is. Ideally a single sentence repeated often. Bonus points if it also hints at things people could do to move the needle.
The collection of one-off analysis used to justify and define the metrics. What factors matter, why these segments, over what timeframe, etc.
#2 is pretty easy, assuming you’ve been documenting the one-off work already. It all just needs to be made accessible and clear to a broad readership. Plan for doing this work ahead of time.
The tricky part is #1. It can be very difficult to find a clear, simple statement to describe a metric (though you could argue that a metric that can’t be described easily isn’t a good one…). It’s especially hard to find one that’s transparent enough that people can intuit ways to move the needle.
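One trick worth considering is pinning that one memorable sentence directly to the code that computes the metric, so the definition and the implementation can’t drift apart. A sketch, using a made-up “weekly active teams” metric (the name, the 7-day window, and the data are all assumptions for illustration):

```python
# Hypothetical metric definition: docstring IS the one-sentence definition.
from datetime import date, timedelta

def weekly_active_teams(events, as_of):
    """Weekly Active Teams: the number of distinct teams that did at
    least one action in the last 7 days."""
    cutoff = as_of - timedelta(days=7)
    return len({team for team, day in events if day > cutoff})

# Toy event log: (team, day-of-action) pairs.
events = [
    ("alpha", date(2023, 6, 10)),
    ("alpha", date(2023, 6, 12)),
    ("beta",  date(2023, 6, 12)),
    ("gamma", date(2023, 5, 1)),   # too old to count
]
print(weekly_active_teams(events, date(2023, 6, 14)))  # 2
```

The sentence also hints at the levers: get more teams to do at least one action per week, and the needle moves.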
Yes, these needs aren’t “documentation” in the traditional sense. It’s essentially forcing a very specific piece of knowledge into organizational memory instead of merely documents.
Data pipelines and live models
Now let’s look at stuff that’s implemented in actual code, intended to live a long life in production. Things look a bit different here.
Now it’s less about justifying the analysis decisions and more about making sure our future selves and colleagues can understand how things work together and how to maintain/update things. The audience is mostly technical folk, with some high-level stuff for non-technical folk.
Since we're mostly interested in keeping the pipeline operational and potentially extending it in the future, our needs are very different. It looks more like programming documentation than the analysis documentation we’ve been doing.
Explain what the mental model is. How should a reader think about the system to understand it quickly.
What are the moving parts and their associated functions
Places that are likely to fail, what to do when they fail
How do you maintain the thing? Update/add features? Scale it out? Where can pieces be completely replaced?
Now, the initial reaction would be to try to solve all this with in-code comments and “self-documenting code”, because pipelines ARE code. This seems like the least amount of work to get the most of the job done. To a point, that’s true. You can certainly explain most of the moving parts and a fair amount of the mental model within the code using comments, assuming you keep them updated.
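As an illustration of how far comments alone can carry the mental model, here’s a toy pipeline sketch; the stages and their rationale are entirely invented:

```python
# Sketch of carrying the "mental model" inside the code itself.
#
# Mental model: this pipeline is a straight line -- extract rows,
# clean them, aggregate, hand the summary to the dashboard. Each stage
# takes and returns plain lists/dicts, so any stage can be swapped out
# or tested in isolation.

def extract():
    """Stage 1: pretend warehouse pull (stubbed with fixed data here)."""
    return [{"user": "a", "clicks": 3}, {"user": "b", "clicks": -1}]

def clean(rows):
    """Stage 2: negative clicks are logging glitches; drop them rather
    than zeroing them so they don't dilute averages downstream."""
    return [r for r in rows if r["clicks"] >= 0]

def aggregate(rows):
    """Stage 3: the dashboard only needs the total."""
    return sum(r["clicks"] for r in rows)

print(aggregate(clean(extract())))  # 3
```

This covers the “moving parts and their functions” reasonably well, but notice what it can’t cover: how to redeploy it, what to do at 3am when it breaks, or why the whole thing exists at all.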
But if you expand the scope of knowledge transfer even just a bit, you eventually hit a point where there’s more explanation than code: covering topics like maintenance, upgrades, troubleshooting, and transmitting the mental model to people unfamiliar with the code. None of that belongs directly in the codebase.
The exact form all this extra documentation takes varies quite a bit: design documents, wiki pages, playbooks, architecture diagrams, a whole host of things to answer all the questions people may have. All of it needs to be maintained, which is part of the cost of running the pipeline.
So ideally, we want to minimize all this work. Keep the initial design artifacts around since they can give a lot of the historical design context. Be good about commenting the code. Add in all the stuff necessary for the playbook in case something goes down. This is the bare minimum. Then fill in the rest as needed, like when someone joins the team, or a handoff happens.
But if you intend on backfilling documentation as it's needed, you're going to need to plan for it.
There will always be docs-debt
I think that, in a world where there’s always incentive to launch a new feature over documenting an old one, these trade-offs will always have to be made. A new feature that has “sorta okay” docs can still be usable enough to bring in some customers.
Thorough docs upon launch are only required if you’re working on a very specific product that caters to a specific audience, like if you’re selling a complex product that engineers are going to be using. Most products in the world get away with so much less, even “enterprise” ones.
Since we’re going to be saddled with debt anyway, I feel we really should be very deliberate about where we use our credit card. Some docs-debt may never need to be paid, or it gets taken up by someone else in the community.
About this newsletter
I’m Randy Au, currently a quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. The Counting Stuff newsletter is a weekly data/tech blog about the less-than-sexy aspects of data science, UX research, and tech. With occasional excursions into other fun topics.
Comments and questions are always welcome, they often give me inspiration for new posts. Tweet me. Always feel free to share these free newsletter posts with others.
All photos/drawings used are taken/created by Randy unless otherwise noted.