Counting is hard, 2019-nCoV edition
It’s not every day we can watch an expert explaining how hard counting is
Today, amidst the growing global buzz of the 2019 Novel Coronavirus outbreak that is currently centered in China, I spotted this wonderful Twitter thread about a number: R0, “R naught”, the basic reproduction number. This term comes out of epidemiology and I’m most certainly not an epidemiologist, so I’ll try not to mangle the concept too badly.
Before we get to the thread, the simple gist of R0, from my cursory reading, is thus:
R0 < 1, every existing infection causes fewer than 1 new infection on average, so the disease will eventually die out
R0 = 1, every existing infection causes 1 more infection on average, so the disease keeps going at a steady level
R0 > 1, every existing infection causes more than 1 new infection on average, so the infected count will grow and there’s risk of an epidemic
R0 is primarily a concept used in mathematical models to understand and describe the spread of disease. Depending on the specific model used, it includes various assumptions that will limit its real-world usefulness.
R0 is hard (if not impossible) to measure.
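To make the three R0 cases above concrete, here is a tiny Python sketch. It assumes the crudest possible picture: every case causes exactly R0 new cases in the next "generation" and nothing else ever changes. The function name and the example values are mine, not from the thread.

```python
def infections_by_generation(r0, generations, initial_cases=1.0):
    """New infections per generation, assuming each case causes exactly r0 new cases."""
    cases = [float(initial_cases)]
    for _ in range(generations):
        # Each generation is just the previous one multiplied by r0.
        cases.append(cases[-1] * r0)
    return cases


# Illustrative values only: one below 1, one at 1, one above 1.
for r0 in (0.5, 1.0, 2.0):
    print(r0, infections_by_generation(r0, 5))
```

Below 1 the counts shrink toward zero, at 1 they stay flat, and above 1 they grow geometrically. Real outbreaks are messier, which is the whole point of the thread.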
In a situation like with 2019-nCoV, where interventions need to be considered and acted upon quickly by government officials, healthcare workers, and even an information-hungry populace, it’s easy to think: “What is the value of R0? How hard can it be to calculate R0?”
Well, as the title of this post gives away, calculating R0 is hard. It’s actually a super illustrative case for people who don’t stand at the interface of data collection and modeling, and for people who are looking to learn the data sciences.
What’s nice about all this is that epidemiology is a “hard science” that long predates the whole “data science” thing. It has had time for smart people to think hard about the problem, and there’s a strong motivation (in cost to human life and suffering) to attack the problem.
Diving in
The thread summarizes 2 broad models for R0. The first is a very simple one: R0 = 1 + growth rate × time between one infection and the next.
Caroline notes that it is a simple model with a tendency to underestimate R0, because unreliable reporting and asymptomatic cases mean you’re guaranteed to be missing data points. The model is also only descriptive and doesn’t offer any mathematical tools to analyze what-if scenarios for intervention.
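The simple formula is easy enough to write down as code. This is only a sketch: the function names are mine, the doubling-time helper assumes plain exponential growth (which the thread doesn't spell out), and the numbers in the example are invented rather than estimates for 2019-nCoV.

```python
import math


def growth_rate_from_doubling_time(doubling_time_days):
    """Per-day exponential growth rate implied by a doubling time (assumes pure exponential growth)."""
    return math.log(2) / doubling_time_days


def r0_simple(growth_rate_per_day, serial_interval_days):
    """R0 = 1 + growth rate * time between one infection and the next."""
    return 1.0 + growth_rate_per_day * serial_interval_days


# Hypothetical inputs: cases doubling every 7 days, 8 days between successive infections.
rate = growth_rate_from_doubling_time(7.0)
print(f"growth rate per day: {rate:.3f}")
print(f"simple R0 estimate: {r0_simple(rate, 8.0):.2f}")
```

Note that the growth rate comes from reported case counts, so any trouble with reporting flows straight into the estimate.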
So Caroline switches focus to a simple SEIR model, which models the flow of people through four states: Susceptible, Exposed, Infected, Resistant. (If you want to see more kinds of models, even the Wikipedia page has gazillions of them.)
In the very simple SEIR model presented, R0 = bkd, with the parameters as stated below.
SEIR models are more useful because you can put in parameters that encode interventions like vaccinations (removing susceptible people) or quarantines (lower the contact rates as well as susceptible population). You can play what-if, which is essential for policy-makers.
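Here is a minimal sketch of that kind of what-if in Python. Several things are my assumptions, not the thread's: I'm reading b as the chance of transmission per contact, k as contacts per day, and d as days spent infectious; the latent period, the population size, and every number are made up; and the equations are stepped forward with a plain Euler loop for readability rather than accuracy.

```python
def simulate_seir(b, k, d, latent_days, population,
                  initial_infected=1.0, days=365, dt=0.1):
    """Run a basic SEIR model and return R0 = b*k*d plus two summary numbers."""
    beta = b * k                # transmissions per infectious person per day
    sigma = 1.0 / latent_days   # rate of leaving Exposed
    gamma = 1.0 / d             # rate of leaving Infected
    s = population - initial_infected
    e, i, r = 0.0, float(initial_infected), 0.0
    peak = i
    for _ in range(round(days / dt)):
        new_exposed = beta * s * i / population * dt
        new_infected = sigma * e * dt
        new_resistant = gamma * i * dt
        s -= new_exposed
        e += new_exposed - new_infected
        i += new_infected - new_resistant
        r += new_resistant
        peak = max(peak, i)
    return {"r0": b * k * d,
            "peak_infected": peak,
            "total_infected": population - s}


# Hypothetical baseline versus a quarantine that halves the contact rate k.
baseline = simulate_seir(b=0.05, k=10, d=5, latent_days=5, population=1_000_000)
quarantine = simulate_seir(b=0.05, k=5, d=5, latent_days=5, population=1_000_000)
for name, result in (("baseline", baseline), ("quarantine", quarantine)):
    print(name, {key: round(value, 2) for key, value in result.items()})
```

The interesting part is not either run on its own but the comparison: changing one parameter that stands for an intervention and seeing how the outcome moves.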
But in order to estimate R0, we need to… estimate 3 parameters: b, k, and d. The suggestion is that, lacking detailed information, you can start with parameter estimates from a similar past disease (SARS) and tweak them to fit your new data. That should get you reasonably close when you’re pressed for time and don’t have the data to measure things directly, like in the super early days of the outbreak.
But at the end of the day, it’s still a guess because maybe 2019-nCoV doesn’t REALLY behave like SARS, because nature is unpredictable. We’re only settling for the estimate due to time constraints. So we’d like to actually measure the real parameters as data comes in.
Then the thread goes on into the many, many hurdles you’ll encounter as you set off to really measure the parameters. If you want to estimate k, the contact rate… you need to guess how many people a typical person has been in contact with. How many did you contact today? It’s also been demonstrated for past diseases that it’s not the same for every person, so it’s a distribution. The number of people susceptible to the disease is also a changing parameter as people potentially get sick and recover.
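One way to see how much that uncertainty matters is to draw the parameters from ranges instead of fixing them, then look at the spread of R0 = bkd that comes out. The sketch below does that in Python. The ranges, the log-normal shape for the contact rate, and the reading of b and d (transmission chance per contact, days infectious) are all my own illustrative assumptions, not values from the thread or from any real outbreak.

```python
import math
import random


def sample_r0(n_samples, seed=0):
    """Draw b, k, d from made-up ranges and return the sorted R0 = b*k*d values."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        b = rng.uniform(0.02, 0.08)                # hypothetical range for b
        k = rng.lognormvariate(math.log(10), 0.5)  # skewed contacts per day, median 10
        d = rng.uniform(3.0, 7.0)                  # hypothetical range for d
        samples.append(b * k * d)
    return sorted(samples)


def percentile(sorted_values, fraction):
    """Nearest-rank style percentile on an already sorted list."""
    index = min(len(sorted_values) - 1, int(fraction * len(sorted_values)))
    return sorted_values[index]


samples = sample_r0(10_000)
low, mid, high = (percentile(samples, f) for f in (0.05, 0.5, 0.95))
print(f"R0 5th percentile {low:.2f}, median {mid:.2f}, 95th percentile {high:.2f}")
```

Even with tidy made-up ranges, the answer comes out as a range rather than a single number, and that is before worrying about whether the ranges themselves are any good.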
It’s noted that SEIR generally assumes the whole population has the same infection risk, so you can consider the unit of study to be a single city. You could then expand the model into an epidemic model between cities… but then you’d have to model travel between cities and even countries (yay, more parameters to estimate!), and those places are likely to have different parameters for b, k, and d!
Ultimately you read all this and it dawns on you: the absolute estimate of R0 has so many caveats attached to it, it’s sorta silly to take it as a hard fact.
Models are just the framework for communication
At the end of the day, R0 is just a number. People aren’t calculating R0 because the number itself tells a lot about the disease, because it doesn’t. The actual useful understanding comes from the work done to calculate R0.
People are using it to lay out the assumptions, to show where we should be collecting data, where our error sources are, and finally what might happen if we make certain changes.
It’s also good to note that the model itself is simple enough that a simplified version can be sensibly described in a handful of tweets. The hard (and potentially dangerous) work is in the data collection parts. We should remember that in our much safer, tamer, daily data science work.