Counting is hard, 2019-nCoV edition

It’s not every day we can watch an expert explaining how hard counting is

Jan 28, 2020

Today, amid the growing global buzz about the 2019 Novel Coronavirus outbreak currently centered in China, I spotted this wonderful Twitter thread about a number: R0, “R naught”, the basic reproduction number. The term comes out of epidemiology and I’m most certainly not an epidemiologist, so I’ll try not to mangle the concept too badly.

Before we get to the thread, the simple gist of R0, from my cursory reading, is this:

R0 < 1, every existing infection causes, on average, fewer than 1 new infection, so the disease will eventually die out
R0 = 1, every existing infection causes, on average, 1 new infection, so the disease keeps going at a steady level
R0 > 1, every existing infection causes, on average, more than 1 new infection, so the infected count will grow and there’s risk of an epidemic
R0 is primarily a concept used in mathematical models to understand and describe the spread of disease. Depending on the specific model used, it includes various assumptions that will limit its real-world usefulness.
R0 is hard (if not impossible) to measure.
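To make the three cases above concrete, here is a minimal sketch that assumes every case causes exactly R0 new cases per generation. That is a deliberate oversimplification (real outbreaks are noisy and R0 is only an average), and the function name is made up for illustration.

```python
def expected_cases_by_generation(r0, generations, initial_cases=1.0):
    """Expected new cases in each generation if every case causes r0 new ones."""
    cases = [initial_cases]
    for _ in range(generations):
        cases.append(cases[-1] * r0)
    return cases


# Illustrative R0 values only, not estimates for any real disease.
for r0 in (0.5, 1.0, 2.0):
    print(r0, expected_cases_by_generation(r0, 5))
```

Below 1 the counts shrink toward zero, at exactly 1 they stay flat, and above 1 they grow geometrically.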

In a situation like with 2019-nCoV, where interventions need to be considered and acted upon quickly by government officials, healthcare workers, and even an information-hungry populace, it’s easy to think: “What is the value of R0? How hard can it be to calculate R0?”

CarolineOB @Caroline_OF_B

Lots of discussion of R0 (basic reproduction number) for #nCov2019, but maybe useful to explain how to calculate it and where uncertainties come from. R0 is context-specific, and can vary in different populations. It is not a measure of virulence (which is disease severity). 1/x

As the title of this post gives away, calculating R0 is hard. It’s actually a super illustrative case for people who don’t stand at the interface of data collection and modeling, and for people who are looking to learn data science.

What’s nice about all this is that epidemiology is a “hard science” that long predates the whole “data science” thing. It has had time for smart people to think hard about the problem, and there’s a strong motivation (in cost to human life and suffering) to attack the problem.

Diving in

The thread summarizes 2 broad models for R0. The first is very simple: R0 = 1 + growth rate * time between one infection and the next.

CarolineOB @Caroline_OF_B

2/x At the start of an epidemic, reported cases grow exponentially. We can estimate R0 statistically using this growth rate, as: R0 = 1+ growth rate * serial interval Serial interval is the time between one infection and the next in a transmission chain.

Caroline notes that it is a simple model with a tendency to underestimate R0, because unreliable reporting and asymptomatic cases are essentially missing data points, and you’re guaranteed to have some. The model is also only descriptive and doesn’t offer any mathematical tools to analyze what-if scenarios for intervention.
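As a rough sketch of the growth-rate formula from the tweet: one common way to get the growth rate is to fit a straight line to the log of the daily case counts. The case counts below are synthetic and the 7.5-day serial interval is a placeholder I picked for the example, not a measured value for 2019-nCoV.

```python
import math


def exponential_growth_rate(daily_cases):
    """Least-squares slope of log(cases) against day index (cases must be > 0)."""
    days = list(range(len(daily_cases)))
    logs = [math.log(c) for c in daily_cases]
    n = len(logs)
    mean_x = sum(days) / n
    mean_y = sum(logs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, logs))
    sxx = sum((x - mean_x) ** 2 for x in days)
    return sxy / sxx


def r0_from_growth(growth_rate, serial_interval_days):
    """The thread's approximation: R0 = 1 + growth rate * serial interval."""
    return 1 + growth_rate * serial_interval_days


# Synthetic data: 10 cases on day 0, growing by a factor of exp(0.2) per day.
synthetic_cases = [10 * math.exp(0.2 * day) for day in range(10)]
rate = exponential_growth_rate(synthetic_cases)
print(rate, r0_from_growth(rate, serial_interval_days=7.5))
```

Real reported counts are far messier than this clean curve, which is exactly where the uncertainty in the estimate comes from.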

So Caroline switches focus to a simple SEIR model, which models the flow of people through four compartments: Susceptible, Exposed, Infectious, and Recovered (or Resistant). (If you want to see more kinds of models, even the Wikipedia page has gazillions of them.)

In the very simple SEIR model presented, R0 = bkd, with the parameters as stated below.

CarolineOB @Caroline_OF_B

5/x In simple SEIR models R0 is: - probability of infection given contact w infectious person (b) multiplied by - contact rate (k) multiplied by - infectious duration (d) Can assume nCoV2019 params like SARS, or estimate from model fit to epi curve, & seeing which fit best
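Since R0 here is just a product of three numbers, uncertainty in each one compounds. A small sketch of that, with low and high guesses for b, k, and d that are entirely made up for illustration:

```python
from itertools import product


def basic_reproduction_number(b, k, d):
    """R0 = infection probability per contact * contacts per day * infectious days."""
    return b * k * d


def r0_range(b_values, k_values, d_values):
    """Smallest and largest R0 over every combination of candidate parameter values."""
    estimates = [
        basic_reproduction_number(b, k, d)
        for b, k, d in product(b_values, k_values, d_values)
    ]
    return min(estimates), max(estimates)


# Hypothetical low/high guesses for each parameter.
low, high = r0_range(b_values=(0.02, 0.05), k_values=(5, 10), d_values=(4, 8))
print(low, high)
```

With these invented ranges, a modest factor of two or so of doubt in each parameter is enough to put R0 anywhere from well below 1 to well above it.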

SEIR models are more useful because you can put in parameters that encode interventions like vaccinations (removing susceptible people) or quarantines (lower the contact rates as well as susceptible population). You can play what-if, which is essential for policy-makers.
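Here is a minimal what-if sketch of that idea, stepping a basic SEIR model forward with simple Euler steps. All parameter values are hypothetical, the incubation period is an extra parameter the tweet’s formula doesn’t mention, and I’m assuming an intervention can be represented as nothing more than a lower contact rate k.

```python
def simulate_seir(b, k, d, incubation_days, population,
                  initial_infectious=10, days=1000, dt=0.1):
    """Forward-Euler SEIR run; returns R0, peak infectious count, total ever infected."""
    s = float(population - initial_infectious)
    e, i, r = 0.0, float(initial_infectious), 0.0
    peak = i
    for _ in range(int(days / dt)):
        new_exposed = b * k * s * i / population * dt  # Susceptible -> Exposed
        new_infectious = e / incubation_days * dt      # Exposed -> Infectious
        new_removed = i / d * dt                       # Infectious -> Recovered
        s -= new_exposed
        e += new_exposed - new_infectious
        i += new_infectious - new_removed
        r += new_removed
        peak = max(peak, i)
    return {
        "r0": b * k * d,
        "peak_infectious": peak,
        "total_infected": population - s,
        "recovered": r,
    }


# Hypothetical scenario: same disease, but the intervention halves daily contacts.
baseline = simulate_seir(b=0.05, k=10, d=5, incubation_days=5, population=100_000)
distancing = simulate_seir(b=0.05, k=5, d=5, incubation_days=5, population=100_000)
print(baseline)
print(distancing)
```

Halving k halves R0, and in this toy run it lowers both the peak and the total number of people ever infected, which is the kind of comparison a policy-maker would want to see.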

But in order to estimate R0, we need to… estimate 3 parameters: b, k, and d. The suggestion is that, lacking detailed information, you can start from the parameters of a similar past disease (SARS) and tweak them to fit your new data. That should get you reasonably close when you’re pressed for time and don’t have the data to measure things directly, like in the super early days of an outbreak.
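A toy version of “tweak the params to fit your new data” might look like the sketch below: hold b, d, and the incubation period fixed at borrowed values, then pick the contact rate k whose simulated curve is closest to the observed one. Everything here is an assumption for illustration. The “observed” data is generated by the same model, the parameter values are invented, and a real fit would use a proper optimizer and noisy case reports.

```python
def simulate_cumulative_cases(b, k, d, incubation_days, population,
                              initial_infectious, days, steps_per_day=10):
    """Cumulative number of people ever infected at the end of each day (Euler SEIR)."""
    dt = 1 / steps_per_day
    s = float(population - initial_infectious)
    e, i = 0.0, float(initial_infectious)
    cumulative = []
    for _ in range(days):
        for _ in range(steps_per_day):
            new_exposed = b * k * s * i / population * dt
            new_infectious = e / incubation_days * dt
            new_removed = i / d * dt
            s -= new_exposed
            e += new_exposed - new_infectious
            i += new_infectious - new_removed
        cumulative.append(population - s)
    return cumulative


def fit_contact_rate(observed, candidate_ks, **fixed_params):
    """Grid search: the candidate k with the smallest sum of squared errors."""
    def squared_error(k):
        simulated = simulate_cumulative_cases(k=k, days=len(observed), **fixed_params)
        return sum((sim - obs) ** 2 for sim, obs in zip(simulated, observed))
    return min(candidate_ks, key=squared_error)


# Hypothetical values standing in for parameters borrowed from a past disease.
fixed = dict(b=0.05, d=5, incubation_days=5, population=100_000, initial_infectious=10)
observed = simulate_cumulative_cases(k=10, days=30, **fixed)  # pretend case reports
best_k = fit_contact_rate(observed, candidate_ks=[6, 8, 10, 12, 14], **fixed)
print(best_k, fixed["b"] * best_k * fixed["d"])
```

The fit recovers k only because the fake data was generated by the very model being fitted. With real data, several different combinations of b, k, and d can fit an early curve about equally well.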

But at the end of the day, it’s still a guess, because maybe 2019-nCoV doesn’t REALLY behave like SARS, and nature is unpredictable. We’re only settling for the estimate due to time constraints, so we’d like to actually measure the real parameters as data comes in.

Then the thread goes into the many, many hurdles you’ll encounter as you set off to really measure the parameters. If you want to estimate k, the contact rate… you need to guess how many people a typical person has been in contact with. How many did you contact today? It’s also been demonstrated for past diseases that it’s not the same for every person, so it’s a distribution rather than a single number. The number of people susceptible to the disease also changes over time as people get sick and recover.

It’s noted that SEIR generally assumes the whole population has the same infection risk, so you can consider the unit of study to be a single city. You could then expand the model into an epidemic model between cities… but then you’d have to model travel between cities and even countries (yay, more parameters to estimate!), and those places are likely to have different values for b, k, and d!

Ultimately you read all this and it dawns on you: the absolute estimate of R0 has so many caveats attached to it that it’s sorta silly to take it as a hard fact.

Models are just the framework for communication

CarolineOB @Caroline_OF_B

16/x Overall, because so many unknowns, these simple models make sense during an outbreak vs very complex ones. They are easily reproducible and transparent in their assumptions, creating a framework for collaborative oversight and open discussion of results.

At the end of the day, R0 is just a number. People aren’t calculating R0 because the number itself tells a lot about the disease, because it doesn’t. The actual useful understanding comes from the work done to calculate R0.

People are using it to lay out the assumptions, to show where we should be collecting data, where our error sources are, and finally what might happen if we make certain changes.

It’s also good to note that the model itself is simple enough that a simplified version can be sensibly described in a handful of tweets. The hard (and potentially dangerous) work is in the data collection. We should remember that in our much safer, tamer, daily data science work.

Counting Stuff
