It’s been a crazy busy week for me, what with finishing a gem to send to someone and building and setting up a shiny new desktop, all while a 2yo zooms around demanding mom and dad accompany them everywhere. So I’m slacking a little by lightly covering an article in Nature that caught my eye this week: “Unrepresentative big surveys significantly overestimated US vaccine uptake” by Bradley, Kuriwaki, Isakov, et al.
The TL;DR of the article is that the researchers compared a number of surveys run during the pandemic that were all trying to estimate the same thing: how many people got a first dose of the COVID-19 vaccine. They compared three online surveys (Axios-Ipsos, Census Household Pulse, and Delphi-Facebook) and, most critically, compared each survey’s results against the official CDC vaccination statistics as a source of ground truth. With that setup, they can see how well the surveys track against the CDC data.
The interesting finding is that the Axios-Ipsos (AI) survey, which surveys only about 1,000 people per wave, was much more accurate (accuracy here meaning agreement with the official CDC vaccine statistics) than the roughly 75k-respondent Census Household Pulse (CHP) and 250k-respondent Delphi-Facebook (DF) surveys, both of which consistently overestimated vaccine uptake by large margins.
In all but one of the 11 AI survey waves, the CDC statistic fell within the survey’s 95% error bars. Meanwhile, the huge sample sizes of the two other surveys let their confidence intervals shrink to effectively nothing, yet the estimates were still way off base. The authors call this phenomenon an example of the Big Data Paradox, citing Meng (2018).
Despite the huge amount of data collected by the CHP and DF surveys yielding very tight confidence intervals, bias in how the data was collected and analyzed dominated the final results.
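To make the paradox concrete, here’s a quick simulation sketch of my own (not from the paper; the 60% uptake rate and the bias strength are invented numbers for illustration). The small unbiased sample produces a wide interval that covers the truth, while a sample 250x larger but tilted toward vaccinated respondents produces a razor-thin interval around the wrong answer:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical true first-dose uptake. 60% is a made-up number.
TRUE_RATE = 0.60

def survey(n, bias=0.0):
    """Simulate surveying n people; return (estimate, 95% CI half-width).

    `bias` tilts who responds toward vaccinated people, mimicking an
    unrepresentative sampling frame. bias=0 is a perfect random sample.
    """
    # Share of respondents who are vaccinated once selection bias is applied.
    p = TRUE_RATE * (1 + bias) / (TRUE_RATE * (1 + bias) + (1 - TRUE_RATE))
    sample = rng.random(n) < p
    p_hat = sample.mean()
    se = np.sqrt(p_hat * (1 - p_hat) / n)  # normal-approximation standard error
    return p_hat, 1.96 * se

small_est, small_ci = survey(n=1_000)          # tiny but representative
big_est, big_ci = survey(n=250_000, bias=0.5)  # huge but tilted

print(f"true rate: {TRUE_RATE:.3f}")
print(f"n=1k,   unbiased: {small_est:.3f} +/- {small_ci:.3f}")  # wide CI, covers truth
print(f"n=250k, biased:   {big_est:.3f} +/- {big_ci:.3f}")      # tiny CI, misses truth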
I find this to be a very stark reminder to all of us data practitioners out there with access to thousands, millions, perhaps even billions of potential study subjects. Bigger samples aren’t necessarily better samples, and we typically do not have an objective ground truth statistic to compare our experimental data against to measure how much error exists.
If most of us are honest with ourselves, we already have an intuition that this is a fundamental truth of our work (there are often jokes about having infinite statistical power and all), but it is still very surprising to see a study design that specifically highlights just what the magnitude of that error can look like.
Going deeper into what happened
On the surface, if you examine the survey design tables, the surveys are very similar. They ask questions that are worded quite similarly, were sent during similar timeframes, and all try to get at the same research question (vaccine uptake). The primary differences are in the sampling population and methodology.
Inside the rather brief article (the majority of its 20 pages is actually extended tables and supporting materials), the authors lay out a mathematical framework from Meng (2018) for decomposing survey error:
Total error = Data quality defect × Data scarcity × Inherent problem difficulty
In this breakdown of error, Data scarcity is the term we’re familiar with, related to the sample size (but relative to the population, not the absolute value of n). Data quality defect essentially characterizes the bias of the experimental setup. Finally, Inherent problem difficulty is related to the standard deviation of the population (the more heterogeneous the population, the wider your error bars will have to be, even in ideal conditions). The methods section of the paper has a more technical treatment with formulae for those interested.
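For the curious, Meng’s identity as I remember it from the 2018 paper (do check the original before relying on this) expresses the error of a sample mean as exactly that product of three terms:

```latex
\underbrace{\bar{Y}_n - \bar{Y}_N}_{\text{total error}}
  = \underbrace{\hat{\rho}_{R,Y}}_{\text{data quality defect}}
    \times \underbrace{\sqrt{\frac{N-n}{n}}}_{\text{data scarcity}}
    \times \underbrace{\sigma_Y}_{\text{problem difficulty}}
```

Here \hat{\rho}_{R,Y} is the correlation between being recorded in the sample and the outcome itself, n and N are the sample and population sizes, and \sigma_Y is the population standard deviation. Notice that scarcity depends on n relative to N, which is why a quarter-million responses barely helps once the defect correlation is nonzero.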
A typical data scientist with a relatively shallow understanding of statistics (I’ll fall under that label) might have learned about error in the context of basic inferential statistics, and thus primarily frames error in terms of sample size. After all, confidence interval calculations essentially depend on sample size, because the standard deviation is fixed for the population in question. The other critical assumptions needed to construct confidence intervals as intended are at best an afterthought.
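Concretely, the textbook interval that builds that sample-size-centric intuition is just the standard normal-approximation form:

```latex
\bar{x} \pm z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}}
```

The only lever in sight is n; the formula is silent about whether the sample was representative in the first place. That part is smuggled in through the assumptions.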
The treatment of bias is either taught in more advanced methods classes, which fewer people take, or learned on the job through painful experience and apprenticeship under more experienced practitioners. Even when we’re aware of biases in our experimental designs and measurement systems, they’re often not given enough thought. Moreover, it’s very rare to find a way to directly measure the bias of our work against a ground truth, so many will never learn just how important it is.
So what went wrong, and how was the issue avoided?
The short answer as to why the big surveys had more error than the small survey is that their sampling was unrepresentative of the US population at large. That’s really all there is to it. No magic bullets involved.
“… mathematically, when a survey is found to be biased with respect to one variable, it implies that the entire survey fails to be statistically representative. The theory of survey sampling relies on statistical representativeness for all variables achieved through probabilistic sampling.”
While all the surveys did do some reweighting of responses (as is good practice), big gaps remained between the big surveys’ samples and the population.
For example, Delphi-Facebook surveyed Facebook active users but did not weight responses by education. The result was that people without a college degree (also a group less likely to get vaccinated) were underrepresented by nearly 20 percentage points. Race/ethnicity was also not weighted for, so whites were overrepresented by 8pp and the proportions of Black and Asian respondents were almost half what they should have been.
And before you think that merely adding those weights would fix the issue: the Census Household Pulse survey did reweight on race and education, but still overestimated vaccine uptake. It performed better than DF, but its sampling still wasn’t representative enough.
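For anyone who hasn’t done reweighting before, here’s a minimal post-stratification sketch of my own (a toy example with made-up population shares, not the surveys’ actual weighting schemes). Each respondent gets weighted by the ratio of their group’s population share to its sample share:

```python
import pandas as pd

# Toy survey: education level and whether each respondent is vaccinated.
survey = pd.DataFrame({
    "education":  ["college"] * 70 + ["no_college"] * 30,
    "vaccinated": [1] * 60 + [0] * 10 + [1] * 12 + [0] * 18,
})

# Made-up population shares (real figures would come from a source like the ACS).
population_share = {"college": 0.40, "no_college": 0.60}

# Weight = population share / sample share for the respondent's group.
sample_share = survey["education"].value_counts(normalize=True)
survey["weight"] = survey["education"].map(
    lambda g: population_share[g] / sample_share[g]
)

raw = survey["vaccinated"].mean()
weighted = (survey["vaccinated"] * survey["weight"]).sum() / survey["weight"].sum()

print(f"raw estimate:      {raw:.3f}")       # college grads overrepresented
print(f"weighted estimate: {weighted:.3f}")  # pulled back toward the population
```

The catch, as CHP demonstrates, is that weighting can only correct along dimensions you actually measure and weight on; bias along every unweighted dimension sails right through.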
Meanwhile, the Axios-Ipsos survey took pains to get a representative set of panelists, even going so far as to give the 21 (1%) of their panelists who did not have internet access computers and internet connectivity. Such a program was no doubt costly, but it paid off in data quality.
Inevitably these panels are assembled as part of a long-term survey pool that gets reused for multiple studies over time, to justify the setup and maintenance costs. A few years ago I was part of a Nielsen TV ratings panel, and it was super fascinating as a data person to participate in their process. There were lots of checks and crosschecks to collect statistics and demographics, and to make sure I hadn’t been unduly influenced. A story for another day if there’s interest.
Hold up, why are the CDC vaccine statistics the ground truth? Can’t THAT be biased?
Sure, and the authors respond thusly:
The accuracy of our analysis does rely on the accuracy of the CDC’s estimates of COVID vaccine uptake. However, if the selection bias in the CDC’s benchmark is significant enough to alter our results, then that itself would be another example of the Big Data Paradox.
Data scientists, pay attention to what you’re measuring!
For the majority of data scientists, our work isn’t trying to make statements about the population of the US, so it’s tempting to brush this article off with a “well, I’m not doing that”. For example, the population in question for my day-to-day product work is usually “people who use cloud tech”, which is definitely a different beast. Other people might work with ultra-convenient populations where the population in question is exactly their registered userbase.
But guess what: even with those different populations, sampling issues can still screw us all over! I can easily run a survey that pulls a non-representative cross-section of cloud users, and someone else could just as easily sample their whole userbase with a weird bias. There is absolutely no free lunch here.
About this newsletter
I’m Randy Au, currently a quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. The Counting Stuff newsletter is a weekly data/tech blog about the less-than-sexy aspects of data science, UX research, and tech, with occasional excursions into other fun topics.
All photos/drawings used are taken/created by Randy unless otherwise noted.
Curated archive of evergreen posts can be found at randyau.com
Supporting this newsletter:
This newsletter is free; share it with your friends without guilt! But if you like the content and want to send some love, here are some options:
Tweet me - Comments and questions are always welcome, they often inspire new posts
A small one-time donation at Ko-fi - Thanks to everyone who’s supported me!!! <3
If shirts and swag are more your style there’s some here - There’s a plane w/ dots shirt available!