Attention: As of January 2024, we have moved to counting-stuff.com. Subscribe there, not here on Substack, if you want to receive weekly posts.
Growing up in NYC, I've largely developed a vague apathy towards air quality. It’s a given that our air isn't pristine — we have far too many people and cars and polluting industry nearby for that ever to be true. My house number sign uses magnets to attach the numbers and faint shadows appeared on the numbers over the magnets after a year because I assume car brake dust blew by and stuck to the magnets.
So I mostly anchor the air I breathe as being “not great, but not infamously dangerous LA smog or wildfire bad”. And then of course things changed last week when giant wildfires in Canada blew south and turned the air a distinct shade of barbeque yellow for a day.
And in the grand tradition of New York based media writing about stuff as if it's some new thing even though whole other parts of the world are long familiar with it, I was curious just what went into the Air Quality Index (AQI) that is published by the US government.
The AQI is clearly the result of a question that data scientists (and many other scientists in general) get asked very often — take this very complex and nuanced concept, like “air quality”, and reduce it to a single number that laypeople can use in their daily lives to make important decisions with.
Air quality is obviously a complex concept. There's of course the amount of particulate matter in the air, both large particles like household lint and dust and microscopic particles like soot. Then there's various pollutants like ozone and nitrogen oxide. And what about straight up deadly poisons like chlorine and cyanide? All of these things, in varying proportions, need to somehow get represented in a concept of “air quality”, and you only get one number to express it.
The job of the AQI is simplified somewhat by the fact that the resolution to air quality problems tends to be very similar: if the air is bad, take measures not to breathe it without appropriate protection. Ideally, just don't breathe the bad stuff at all and find clean air. I suspect that if the different pollutants required significantly different actions, having a single number would make no sense.
The basic AQI concept
So the AQI is a single score that encompasses 6 pollutants. The specific pollutants considered are as follows: ozone, PM 2.5 (particles with size under 2.5 microns), and PM 10 (the same idea for particles under 10 microns). Then there’s the chemicals: carbon monoxide, nitrogen dioxide, and sulphur dioxide.
The scores range from 0 (effectively no pollutants at the monitored unit of concentration) to 100 (the point where it is considered unhealthy for sensitive groups) and cap out at 500 (considered hazardous). If things get worse than the level for AQI 500, it’s still possible to report it as being “Beyond the AQI”. At that point it’s out of scope of the scale and all you really know is that it’s hazardous to health. An AQI value can still be computed to indicate relative magnitude using the same linear relationship of concentration to classification that is used for the Hazardous category. This lets people say that the air is 2, 3, or 50% worse than the highest AQI value.
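To make that “Beyond the AQI” extrapolation concrete, here is a minimal sketch. The concentration endpoints are made-up numbers for illustration, not real EPA breakpoints, and `beyond_aqi` is a hypothetical helper name. The only thing taken from the scale itself is that the Hazardous category tops out at an index of 500.

```python
# Hypothetical top segment of a breakpoint table. The concentration
# values are invented for illustration and are NOT real EPA breakpoints.
HAZ_C_LO, HAZ_C_HI = 300.0, 500.0  # concentration range (arbitrary units)
HAZ_I_LO, HAZ_I_HI = 301.0, 500.0  # index range of the Hazardous category


def beyond_aqi(concentration: float) -> float:
    """Extend the Hazardous line segment past its upper breakpoint."""
    slope = (HAZ_I_HI - HAZ_I_LO) / (HAZ_C_HI - HAZ_C_LO)
    return HAZ_I_LO + slope * (concentration - HAZ_C_LO)


print(beyond_aqi(500.0))  # the top of the official scale
print(beyond_aqi(700.0))  # past the top, same slope carried forward
```

The value for the second call is only a relative magnitude: it says how far past the top of the scale the measurement sits, not that the health effects scale the same way.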
Time frames are important
The Air Quality Index is based on daily air quality summaries, specifically daily maximums or daily averages. It takes a full 24 hours to obtain an AQI value (that’s 24 hourly values for PM or the max 1-hour or 8-hour value in a 24-hour period for other pollutants). The EPA says that it is not valid to use shorter-term (e.g. hourly) data to calculate an AQI value. That means there’s as much as a whole day of lag between the official AQI and whatever is floating around in the atmosphere right this moment.
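As a rough illustration of what those daily summaries look like in practice, here is a small sketch that turns 24 hourly readings into a daily average and a max-over-window value. The function names and the readings are made up, and it simplifies by only looking at windows that fit inside one day; the official method has its own rules about which hours count.

```python
from statistics import mean


def daily_average(hourly: list[float]) -> float:
    """24-hour average, the kind of summary mentioned for PM."""
    if len(hourly) != 24:
        raise ValueError("expected 24 hourly values")
    return mean(hourly)


def max_rolling_average(hourly: list[float], window: int = 8) -> float:
    """Largest average over any `window` consecutive hours in the list."""
    if len(hourly) < window:
        raise ValueError("not enough hourly values for one window")
    return max(
        mean(hourly[i:i + window])
        for i in range(len(hourly) - window + 1)
    )


# Made-up hourly readings: a quiet day with an 8-hour afternoon spike.
readings = [10.0] * 12 + [40.0] * 8 + [10.0] * 4
print(daily_average(readings))                   # whole-day average
print(max_rolling_average(readings, window=8))   # worst 8-hour stretch
print(max_rolling_average(readings, window=1))   # worst single hour
```

The same day can look mild or rough depending on the window, which is why the averaging period is part of each pollutant's definition.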
Having a metric that tells you what the air was like yesterday isn’t particularly useful unless you’re willing to assume that tomorrow will look much like today. So to help alleviate this issue somewhat, the EPA does actually encourage people to make AQI forecasts. Such forecasts will let people actually plan their days. The EPA published guidelines for developing a forecasting program and, just glancing at the document, it is intense. It’s packed with information about the individual pollutants and how they’re generated and their properties. It’s probably worth looking into some day in the future because it seems very similar to (and possibly better thought out than) typical data science work.
The alternative to forecasting or using yesterday’s AQI involves a methodology that the EPA now calls “NowCast”. NowCast makes use of short term data to make more timely reports about air quality. I don’t have the energy to dig into NowCast today, but here’s what the EPA has to say about it:
EPA uses the NowCast to approximate the complete daily AQI during any given hour. Even on days when the AQI forecast predicts unhealthy conditions, pollution levels may be lower and better for outdoor activities during some parts of the day. Providing current conditions gives people the power to take action to reduce outdoor activities and exposure when necessary and protect their health.
The NowCast calculation uses longer averages during periods of stable air quality and shorter averages when air quality is changing rapidly, such as during a fire. The NowCast allows current conditions maps to align more closely with what people are actually seeing or experiencing.
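The “longer averages when stable, shorter averages when changing” idea can be sketched with a weighted average whose weights decay faster when recent readings vary a lot. What follows is my understanding of how the particle-pollution NowCast is commonly described, so treat the 12-hour window, the min/max weight rule, and the 0.5 floor as assumptions to check against the EPA's own documentation. `nowcast` is a hypothetical function name.

```python
def nowcast(hourly_newest_first: list[float]) -> float:
    """Weighted average of recent hours; the newest reading comes first."""
    c = hourly_newest_first[:12]  # assumed 12-hour window
    if not c:
        raise ValueError("need at least one hourly value")
    c_max, c_min = max(c), min(c)
    if c_max == 0:
        return 0.0
    # Stable air: min/max is near 1, so all hours count about equally.
    # Rapidly changing air: the weight floors at 0.5 (assumed), so each
    # older hour counts half as much as the one after it.
    w = max(c_min / c_max, 0.5)
    weights = [w ** i for i in range(len(c))]
    return sum(wi * ci for wi, ci in zip(weights, c)) / sum(weights)


stable = [20.0] * 12
spike = [100.0, 100.0] + [10.0] * 10  # smoke arrived two hours ago
print(nowcast(stable))
print(nowcast(spike), sum(spike) / len(spike))  # weighted vs plain average
```

With the spike, the weighted value sits much closer to the recent readings than the plain 12-hour average does, which is the behavior the EPA text describes.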
Calibration to 100 and scaling
So the AQI is an arbitrary scale where 100 is set to be (from what I understand) the level of a pollutant where the air becomes “unhealthy for sensitive groups”. The definition of sensitive group differs depending on the pollutant. Most include people with heart or lung disease, but people active/working outdoors are specifically mentioned for ozone, while people with asthma are called out for nitrogen and sulphur dioxide.
For each pollutant, there’s a table defined that says “for concentrations between X and Y levels, it is this AQI value”. These X and Y “breakpoints” form the basis for the AQI system.
Every pollutant gets a series of breakpoints, let’s say 0, 5, and 10. Those breakpoints are then mapped onto a range of AQI values, in our example, 0 to 5 concentration units represents AQI 0 to 50. Breakpoints 5 to 10 concentration units will represent AQI 50 to 100. You can see the various breakpoints as follows.
Between breakpoints, the AQI values map linearly with the measurements. So in our example, a measured value of 2.5 concentration units would get an AQI of 25.
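Here is the toy example as a small sketch. The breakpoints are the illustrative 0, 5, and 10 from above, not real EPA values, and `sub_index` is a hypothetical name. Any rounding or truncation rules in the official method are left out.

```python
# Toy breakpoint table from the example: (conc_lo, conc_hi, aqi_lo, aqi_hi).
# Illustrative numbers only, not real EPA breakpoints.
TOY_BREAKPOINTS = [
    (0.0, 5.0, 0.0, 50.0),
    (5.0, 10.0, 50.0, 100.0),
]


def sub_index(concentration: float, breakpoints=TOY_BREAKPOINTS) -> float:
    """Linearly interpolate an index value within the matching segment."""
    for c_lo, c_hi, i_lo, i_hi in breakpoints:
        if c_lo <= concentration <= c_hi:
            slope = (i_hi - i_lo) / (c_hi - c_lo)
            return i_lo + slope * (concentration - c_lo)
    raise ValueError("concentration is outside the breakpoint table")


print(sub_index(2.5))  # 25.0, matching the worked example
print(sub_index(7.5))  # 75.0, halfway through the second segment
```

Each pollutant would get its own table in the same shape, which is what lets very different measurements land on one shared scale.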
Note: the methodology also includes specific instructions on over what time period the measurements need to be done, for example highest in 24 hours, average over the past 8 hour window, etc. Each reflects the unique nature and properties of each pollutant.
Hey, data people! Linear relationships are very nice and easy to work with, but surely we don’t believe that things like ‘concentrations of airborne pollution’ follow a strictly linear relationship starting from nothing to ‘dangerous to health’. I’m sure we’re especially skeptical that a linear relationship could possibly exist for 6 different metrics at once. So I took the lower ends of all the thresholds and quickly threw them into a plot.
And so you can see that, to some degree, different pollutants may curve differently. This is because of two factors. The first is that the AQI categories start off at 50 units per category (0-50, 50-100 etc) but starting with “Very Unhealthy”, each category covers 100 units. So the slope of the lines would double at that exact point if nothing actually changed. You can see some evidence of this for many, but not all, of the charts.
Instead of being simple curves that double in slope halfway through, the curves all bend very uniquely, some more, some less, some sooner, some later. The breakpoints for the AQI have been “tuned” in a way to account for at least some of the non-linear relationship between pollution and human body responses.
This is why the AQI caps out at 500 and then all reports are “Beyond the AQI”. Because while the people who designed the scale tried to account for non-linear effects to some extent within the breakpoints by squishing and stretching the ranges to line up with health outcomes, those little piecewise linear chunks are not going to extrapolate very well against what is likely an exponential system.
Boiling it down to just one number
Thus far, we’re still dealing with potentially 6 numbers of largely unrelated pollutants that can do whatever they want. How can we summarize it into a single number for people to use?
Now, the typical move that a data scientist would make is to create some kind of function that takes all 6 pollutants as inputs and generates some kind of score. There are infinite ways to generate and tune functions to “do what we want” here. The developers of AQI just used….
max()
Yes, just take the AQI value of the highest, nastiest pollutant and that’s the AQI.
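In code, the whole aggregation step is about this small. The per-pollutant values below are invented, and `overall_aqi` is a hypothetical name; the only real idea here is the max.

```python
def overall_aqi(sub_indices: dict[str, float]) -> tuple[str, float]:
    """Return the worst pollutant and its index value."""
    pollutant = max(sub_indices, key=sub_indices.get)
    return pollutant, sub_indices[pollutant]


# Made-up per-pollutant index values, already on the shared 0-500 scale.
example = {
    "ozone": 42.0,
    "pm2.5": 155.0,
    "pm10": 60.0,
    "co": 12.0,
    "no2": 30.0,
    "so2": 8.0,
}
print(overall_aqi(example))  # ('pm2.5', 155.0)
```

Returning the pollutant name alongside the number is a small convenience, since knowing which pollutant is driving the score can still matter for sensitive groups.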
As I mentioned early on, this clean and simple way just works because no matter which pollutant winds up being the worst, the remedy is essentially the same — stop breathing the bad stuff. Once you’ve put all the pollutant concentrations on the same scale normalized to 100, all that matters is something is bad enough to avoid. All other details are effectively trivia.
Considering how many times I’ve had to argue and wrangle with scoring functions in frustrating attempts at juggling conflicting signals, weird edge cases, and pure bullheaded argumentation from stakeholders with axes to grind… I don’t think I’ve ever managed to elegantly convert a set of inputs into a single, clear, effective, usable score. That’s why I’m just a random guy sending blog posts to people’s email boxes while those folk are setting the air quality report standards for a country of over 320 million.
Completely unrelated but I happened to enjoy listening to this lecture on writing while I was writing this… because somehow having more distractions is… good? I dunno.
Standing offer: If you created something and would like me to review or share it w/ the data community — my mailbox and Twitter DMs are open.
Guest posts: If you’re interested in writing a data-related post, whether to show off work, share an experience, or get help coming up with a topic, please contact me. You don’t need any special credentials or credibility to do so.
About this newsletter
I’m Randy Au, Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. Counting Stuff is a weekly newsletter about the less-than-sexy aspects of data science, UX research and tech. With some excursions into other fun topics.
All photos/drawings used are taken/created by Randy unless otherwise credited.
randyau.com — Curated archive of evergreen posts.
Approaching Significance Discord —where data folk hang out and can talk a bit about data, and a bit about everything else. Randy moderates the discord.
Wow. I had no idea. But by taking the max, they ignore all synergistic effects. That seems bad. Of course, they also ignore "natural pollutants", i.e., pollens and spores, which also have synergistic effects with pollutants like ozone. It's an impressive piece of work, but I'd say there is still a lot of scope for improvement.
Cool read! Like you mentioned, I always wondered how AQI can be represented by just a single number. But it was interesting to know how the simplicity perfectly suited the use case here as to how it suffices to just know the degree of how good or bad the air quality is and that we don’t need to go all crazy on coming up with a fancy methodology for super accurate point estimates.