Want a DS project? There's health insurance data out there
And no (public) resources for working with the data yet
Last week, I saw a tweet about how insurance companies, compelled by law in July to publish their price data, were dumping tons of data in a flood of information that was nigh unusable. Terabytes of data, in shoddy formats, in what's likely a case of malicious compliance? From multiple vendors? Sounds like something Data Twitter would see as an interesting challenge.
I shared that sentiment on Twitter, and by the looks of it a decent number of people agreed.
The tweet links to a blog post by Alec Stein from last week that briefly goes into the situation, but more importantly links to a GitHub repo with some web scraper code that can go to a number of major health insurance company sites and sift through the metadata of the files themselves.
Incidentally, the tweet sorta implies that the volume and near incomprehensibility are the result of malicious compliance on the part of insurers. While there might be some of that going on in the background, given how the data structures were specified, I honestly think it's primarily a mix of "eh, this is sufficiently compliant, onto my next ticket" and shoddy data structure design rather than actual malice. But more on that later.
Plenty of people looking on from the sidelines think this can be handled easily. It can't.
As with anything on the internet, there are many viewpoints on everything, and one such group thinks the problem would be "easy" to solve by a collective group of programmers. Sometimes it's dismissing the unwieldy size of the data ("Oh, just 100TB of JSON? Pssht"). Sometimes it's dismissing the analysis of procedure costs as somehow trivial. We can safely ignore these folk because this is most definitely a non-trivial problem that will take a lot of effort to tackle.
Luckily, most people in my replies were much more sober. The general sentiment is that the data storage and analysis technology is the relatively easy part. There's plenty of cloud infrastructure available that can handle petabytes of data if your wallet is deep enough, and plenty of data warehouse/database options for datasets of that size. If people can find a way to put this data into a query-able state in some data warehouse, data scientists everywhere can go to town on it and start pulling useful information out for people.
But it's the assumption of "getting this data into the central database" that Alec's blog post highlights as the real issue. The laws that forced insurance companies to publish left more than enough leeway for them to export their data in a mishmash of formats. CMS (the Centers for Medicare & Medicaid Services) is the government agency in charge of all this and has published a very lengthy technical clarification of the law's details in addition to an official GitHub repo with the technical specification. Just looking at those docs alone should convince you this is a monster in its own right.
Also, what counts as an acceptable format? Whatever the insurance companies want, so long as it's an open format. JSON was mentioned, but you could use XML or presumably even YAML. Luckily most insurers seem to be publishing big JSON files instead of XML. Every single company will be unique in how it specifically formats everything.
Question #20: What are the acceptable file formats for producing the machine-readable files?
Answer #20: All machine-readable files must conform to a non-proprietary, open-standards format that is platform independent and made available to the public without restrictions that would impede the re-use of the information. Please refer to 85 FR 72242. Currently, the schemas on the GitHub are xml and json files, which are acceptable file formats. Microsoft Word, Microsoft Excel, and PDF files are proprietary and are not acceptable.
Then ON TOP of the raw data ingestion issues, there’s still one bigger mess that needs to be dealt with. Healthcare codes and data are an incomprehensible mess. There are specialized jobs just for billing insurance because healthcare providers (hospitals, doctor’s offices, etc.) recognize that it takes specialized knowledge to bill procedures to insurance companies in the best way to maximize revenue.
Plenty of data folk who do have actual healthcare data/billing experience can attest to just how messed up everything is. People who haven’t touched this data (myself included) are quite unprepared for the reality.
We need some subject matter expert guidance
Right now, there are plenty of people with the raw technical skill to move and handle massive datasets who would love to help out. The problem is that most of them don't have any domain knowledge about working with healthcare data. We don't know what questions need asking and answering, so it's hard to know how to clean up the data to be useful to people.
What we really need right now is someone (or several someones) to provide an overview guide of what's important in such data so that people have an idea of what to build towards. Relatively low-level stuff would be very helpful: explanations of the most common abbreviations found in these systems, the typical operating procedures that generate this data, even just links to resources like a map of the various codes. It's fundamental stuff that every data scientist would need some passing familiarity with before they can play with the data, and right now there's no place to quickly get up to speed besides reading a ton of dry specifications.
Even better is if someone provided an analytical roadmap that people could contribute pieces to.
No, I am not that person, because I don’t have any past experience and have no idea how to deal with this mess.
Once people figure out how the ingestion of data can happen, with scrapers and custom parsers for every unique file structure each insurance provider gives us, the data can be piped into a data warehouse.
The data engineering bits can be solved with some help and/or money
A couple hundred terabytes of (uncompressed) data is "enough to be a pain in the butt". Simply storing 100TB in standard S3, with no transfers or operations, would cost just a hair over $2,300 a month. The text data involved would naturally be stored in gzip or xz format for big compression ratios, so the actual amount of data to be parsed once decompressed is anywhere from several times to well over an order of magnitude larger. Marshaling the computing resources to query and analyze all of it would of course cost significantly more than plain storage.
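For the curious, that storage figure works out roughly like this, using S3 Standard's tiered us-east-1 pricing at the time (about $0.023/GB for the first 50TB, $0.022/GB beyond that) and ignoring request and transfer charges:

# Rough monthly S3 Standard storage cost for 100TB (us-east-1 tiered pricing,
# requests and data transfer not included)
total_gb = 100 * 1024        # 100TB, with 1TB = 1024GB the way AWS bills it
first_tier_gb = 50 * 1024    # first 50TB tier
cost = first_tier_gb * 0.023 + (total_gb - first_tier_gb) * 0.022
print(f"${cost:,.2f}/month")   # ~$2,304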
But all told, the engineering infrastructure needed to handle such data could easily be covered by a grant to a nonprofit, or flat-out donated by some big cloud provider as a public dataset.
Some of the people responding on Twitter pinged some folks at Kaggle, and individuals there seemed at least interested in taking a look at the data. I have no idea if they have any plans to do something with the dataset in the future.
If the data does find its way into a public dataset server somewhere, that would solve a majority of the data engineering problems. From there it’d be readily accessible to data warehouse tools, data science notebooks, and whatever.
If we get this far, then we can finally do analysis. But first we need to get started. So let’s take some tentative steps towards understanding this journey ahead.
Let’s see what we’re dealing with — what’s this data look like?
First, here is the official data dictionary for in-network files (which are more interesting to us than out-of-network rates). It lays out all the various objects that can be represented in the data. All the output files are supposed to conform to this spec.
Up until now, I’ve been mostly speaking about this work at a theoretical level. The only real way to get an understanding of what’s going on is to take a look directly. So I’m going to walk through a couple of example files from four major insurance companies to get a taste for things.
Buckle up. It gets a bit rocky from here on out. You can skip to the end for an overview.
Humana
Humana has decided to provide their data as 106MB gzipped CSV files, with 455,130 files released for the month of September. So, assuming all the files are the same size, that'd be 46TB. Surprisingly, their format doesn't compress all that well in gzip: the one file I downloaded extracted to a mere 397MB, so we can expect roughly 4x expansion, putting us at 170-ish TB of uncompressed CSV if everything were dumped out. (Incidentally, I compressed the same CSV file using the modern xz compressor and it shrank to 34MB.)
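For reference, the back-of-the-envelope math (in binary units, which is how I'm getting the figures above):

# Back-of-the-envelope sizing of Humana's dump, in binary units (MiB/TiB)
files = 455_130
compressed_mib = 106      # size of each gzipped CSV
uncompressed_mib = 397    # size of the one file I actually extracted
to_tib = lambda mib: mib / 1024**2
print(f"compressed:   {to_tib(files * compressed_mib):.0f} TiB")    # ~46 TiB
print(f"uncompressed: {to_tib(files * uncompressed_mib):.0f} TiB")  # ~172 TiB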
Let’s see what’s inside this CSV file…
#Extra newlines inserted for legibility
$ head -3 2022-08-25_1_in-network-rates_000000000000.csv
REPORTING_ENTITY_NAME~REPORTING_ENTITY_TYPE~LAST_UPDATED_ON~VERSION~NPI~TIN~TYPE~NEGOTIATION_ARRANGEMENT~NAME~BILLING_CODE_TYPE~BILLING_CODE_TYPE_VERSION~BILLING_CODE~DESCRIPTION~NEGOTIATED_TYPE~NEGOTIATED_RATE~EXPIRATION_DATE~SERVICE_CODE~BILLING_CLASS~BILLING_CODE_MODIFIER~ADDITIONAL_INFO~BUNDLED_BILLING_CODE_TYPE~BUNDLED_BILLING_CODE_VERSION~BUNDLED_BILLING_CODE~BUNDLED_DESCRIPTION
Humana Inc~Health Insurance Issuer~2022-08-24~1.0.0~1427041656~541399329~ein~ffs~All Possible Codes~CSTM-ALL~2022~CSTM-00~All codes possible~percentage~80~9999-12-31~~professional~~~~~~
Humana Inc~Health Insurance Issuer~2022-08-24~1.0.0~1902918147~570521191~ein~ffs~All Possible Codes~CSTM-ALL~2022~CSTM-00~All codes possible~percentage~75~9999-12-31~~professional~~MODIFIER: QK,P5,G9,AA,P1,P6,23,32,P3,47,QX,QZ,P4,P2,G8,QY~~~~
Um… it's… TILDE DELIMITED?! CSV wasn't explicitly stated as a valid format (granted, it's not proprietary either), but I'm positive tilde-delimited is not normal, and this "tilde-separated values" dialect is definitely not mentioned in any of the official specs.
I can only assume some engineer figured that they never use tildes in their actual data, whereas commas definitely show up in the values themselves (see the modifier list in the third row above). I guess if you have 170TB, you can save a couple of percent here and there by skipping the quote characters a more typical comma-separated CSV dialect would need?
Otherwise, the file itself appears to "read" fine for computers, but it's nigh gibberish to a human. What's CSTM-00? What does "All codes possible" mean in this context: all the codes between the insurance company and some entity marked by the TIN number? The "percentage" presumably means the insurer simply pays a percentage of whatever bill comes down. And what the heck is that long list of modifiers?
Further down the file, you can find entries for specific codes; each billing code and its respective negotiated rate gets its own line:
Humana Inc~Health Insurance Issuer~2022-08-24~1.0.0~1841203361~007466914~npi~ffs~Diabetes Outpatient Self-management Training Services Group Session (two Or More) Per 30 Minutes~HCPCS~2022~G0109~Diabetes Outpatient Self-management Training Services Group Session (two Or More) Per 30 Minutes~negotiated~8.38~9999-12-31~~professional~~~~~~
Since I don’t know how to read these data points, I’d have to download everything just so I can sift through and check if “All codes possible” means that the particular TIN only appears once in the whole thing.
Despite the utter madness of this tilde-separated file, one thing it has going for it is that you can directly load it into a database via a well-crafted load-from-CSV command! There's no "parse a multi-GB JSON object into memory and pray nothing explodes" step. Which we are going to run into next.
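For what it's worth, here's a minimal sketch of reading one of these files with pandas; a database's CSV loader would take a similar delimiter option. The filename and column names are the ones from the sample above:

# A minimal sketch: read Humana's tilde-delimited CSV with pandas
import pandas as pd

df = pd.read_csv(
    "2022-08-25_1_in-network-rates_000000000000.csv",
    sep="~",      # tilde-delimited, as seen above
    dtype=str,    # keep NPI/TIN/billing codes as strings, not numbers
)
print(df[["BILLING_CODE", "NEGOTIATED_TYPE", "NEGOTIATED_RATE"]].head())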
Aetna
Let’s see what Aetna is up to.
They apparently have two types of files: table-of-contents files (a handful of bytes) and data files. So far, so in-spec.
The problem is that Aetna has decided to be confusing right from the start by listing their files in a big table under ridiculously similar names (which probably resemble the confusing names of their plans). They list 12 entries in the big table of data to download. I struggled to tell the lines apart so much that I eventually dumped the table into Excel and ran UNIQUE() on the entries. Once I did that, it became apparent that only TWO unique data files cover everything; the remaining lines are duplicates.
The files are roughly 3.5GB each of gzipped JSON. Uncompressed, one of the files balloons to 137GB, a significantly better compression ratio than Humana achieved.
#the list of "plan names" in the table, de-duped
FRONTLINE TECHNOLOGIES GROUP, LLC DBA FRONTLINE EDUCATIONAetna Choice POS IIA
FRONTLINE TECHNOLOGIES GROUP, LLC DBA FRONTLINE EDUCATIONHSA Aetna Choice POS IIAA
FRONTLINE TECHNOLOGIES GROUP, LLC DBA FRONTLINE EDUCATIONHSA Aetna Choice POS IIAB
FRONTLINE TECHNOLOGIES GROUP, LLC DBA FRONTLINE EDUCATIONManaged Behavioral HealthA
FRONTLINE TECHNOLOGIES GROUP, LLC DBA FRONTLINE EDUCATIONManaged Behavioral HealthAA
FRONTLINE TECHNOLOGIES GROUP, LLC DBA FRONTLINE EDUCATIONManaged Behavioral HealthAB
If you dare to try to open the JSON files with a simple head command, your terminal will scroll endlessly because they did not put any meaningful newline characters in the file. That's a pain in the butt for casual browsing, forcing me to just grab the first 750 characters.
#Grab the first 750 characters of whatever's in this file
$ cat 2022-09-05_81afc4f2-b318-4455-889f-cd774d65f23e_Aetna-Life-Insurance-Company.json.gz |gunzip|head -c750|less
{"reporting_entity_name": "Aetna Life Insurance Company", "reporting_entity_type": "Third Party Administrator","last_updated_on":"2022-09-05","version":"1.0.0","provider_references":[{"provider_group_id":104607,"provider_groups":[{"npi":[1982060935],"tin":{"type":"ein","value":"471636309"}}]},{"provider_group_id":595083,"provider_groups":[{"npi":[1245394527,1407976152,1295724193],"tin":{"type":"ein","value":"231396795"}}]},{"provider_group_id":481245,"provider_groups":[{"npi":[1063844108],"tin":{"type":"ein","value":"311568203"}}]},{"provider_group_id":46259,"provider_groups":[{"npi":[0],"tin":{"type":"ein","value":"223401654"}}]},{"provider_group_id":576315,"provider_groups":[{"npi":[1023091691,1467435024,1114900776,1194709675],"tin":{"typ
Just entries in a big array of healthcare provider ID codes at the top. Eventually if you manage to get far enough parsing this heavily nested data structure, you should be able to uncover actual billing codes and negotiated prices…
The data inside looks relatively well behaved and organized. Presumably they verified it against the official JSON schema that was provided. The problem is that I definitely do NOT have 137GB worth of RAM to load the complete file into memory, let alone parse it out into an actual JSON object to inspect or export to a database-friendly format like flattened CSV.
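This is exactly where a streaming parser helps. As a sketch (not a claim that this is the "right" pipeline), the ijson library can walk the gzipped file incrementally and yield one provider_references entry at a time, using the same keys visible in the snippet above, without ever holding the whole 137GB in RAM:

# A sketch of streaming through a huge gzipped JSON file with ijson,
# yielding one provider_references entry at a time instead of loading it all
import gzip
import ijson

path = "2022-09-05_81afc4f2-b318-4455-889f-cd774d65f23e_Aetna-Life-Insurance-Company.json.gz"
count = 0
with gzip.open(path, "rb") as f:
    for ref in ijson.items(f, "provider_references.item"):
        count += 1
        if count <= 3:   # peek at the first few entries
            print(ref["provider_group_id"], ref["provider_groups"][0]["tin"]["value"])
print("provider_references entries:", count)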
Cigna
Cigna has a very simple setup: there's a single URL that they… literally want you to copy-paste into your browser. They refuse to make it a link for some reason. I guess they want to make it easier for a dev to copy and paste it?
The Table of Contents file itself is 18MB of uncompressed JSON. That's it. It's supposed to be used when data files might mix the in-network and allowed-amount files. It easily parses into Python with a quick import json for inspection.
In [26]: j['reporting_structure'][2500]
Out[26]:
{'reporting_plans': [{'plan_name': 'SAINT LOUIS MO CONNECT NETWORK CIGNA Health and Life Insurance Company',
'plan_id_type': 'hios',
'plan_id': '74483MO0040048',
'plan_market_type': 'individual'}],
'in_network_files': [{'description': 'in-network file',
'location': 'https://d25kgz5rikkq4n.cloudfront.net/cost_transparency/mrf/in-network-rates/reporting_month=2022-09/2022-09-01_cigna-health-life-insurance-company_saint-louis-mo-connect-network_in-network-rates.json.gz?Expires=1664526766&Signature=PloHBtyPHBGmP2aViV7JOUKAvnesS2pMgebddbR8FrWyn6JCPp9W9U-LEIR878Ku4674z-xGzBWYtWfIz8M8J6HtiF8xrqmpmZh1NqSLmOiWR2I3Lp61cVxbRoBnDm2pzM1Ysp1aoMV4mALNZZJokHcFM-bxV28vmYg18aR1Mt24HQay9~D1645YRO7NISNwskDPywMxdCl6rDFLHlb2zry4erfi0y8SVEBn-0e4tA9WSDr4~cNhKKqaTFfcK7Z010TH~CxAx~IyiU07-w9q8wCfiGTMbpou7rcsM~WHsJlPSyL3Vfd3kS3Vh0KKTgFl9bueeJB1zt4jQIKa5epAdQ__&Key-Pair-Id=K1NVBEPVH9LWJP'}],
'allowed_amount_file': {'description': 'allowed amount file',
'location': 'https://d25kgz5rikkq4n.cloudfront.net/cost_transparency/mrf/allowed_amounts_empty/reporting_month=2022-09/2022-09-01_cigna-health-life-insurance-company_empty_allowed-amounts.json?Expires=1664527943&Signature=RMC5GVBJr2K9r2~OSmkntatvY1oGsNxCscCb41DO0Crnm0LZ8wpJjW3dUk851yUh3Q5WijUJiygpmeaqy7WVRDc7l-QmMV8HPHPHSB-2cXwvLf2qEd2I8XZNp8DMokJ~zg8-3XmYIDGQK524Ep5xXYUSbWfTJiTPfNOJ5cY0nfWKsQ7XrdvuRIsj8jGKreMUPaKg782~c3DwE4vkowmttR6xaFJkS4effXu2LCB~Zs3U2DHG3WzWGn6UqlQnlYzoPcEpZeOIFb9j4p3yqhuh4p1I2UOnLWx9Kpv2G-57hsps0QJlJJAVeIkbjYy9-2AR9r0t-jgpxGn7WlOvT3vmLg__&Key-Pair-Id=K1NVBEPVH9LWJP'}}
Within the ToC file, there are 18,156 entries, each with a list of files to download. A bunch had just a single “allowed amount” file for out-of-network entries that we’re not really interested in.
If you sift through enough entries, you’ll finally find ones that have the in-network files we want. The first one I found summoned a 4.6GB file from the server, named “2022-09-01_cigna-health-life-insurance-company_saint-louis-mo-connect-network_in-network-rates.json.gz”. That file would decompress to a paltry 11.7GB. Big enough to be annoying to work with.
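For reference, pulling those in-network URLs out of the ToC programmatically is straightforward once it's parsed; the key names come from the entry shown above, and "cigna_toc.json" is just a stand-in for wherever you saved the file:

# A sketch of collecting in-network file URLs from Cigna's ToC JSON
import json

with open("cigna_toc.json") as f:        # placeholder name for the downloaded ToC
    toc = json.load(f)

urls = [
    file_info["location"]
    for entry in toc["reporting_structure"]
    for file_info in entry.get("in_network_files", [])
]
print(len(toc["reporting_structure"]), "entries,", len(urls), "in-network file URLs")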
Peeking into the file we’d find the data we’re expecting, just endless streams of numbers, entity identifiers, and occasionally billing codes.
{"reporting_entity_name": "Cigna Health Life Insurance Company","reporting_entity_type": "Health Insurance Issuer","last_updated_on": "2022-09-01","version": "1.0.0","in_network":[
{"negotiation_arrangement":"ffs","name":"MIDDLE CEREBRAL ARTERY ECHO","billing_code_type":"CPT","billing_code_type_version":"2022","billing_code":"76821","description":"MIDDLE CEREBRAL ARTERY ECHO","negotiated_rates":[{"provider_groups":[{"tin":{"type":"ein","value":"431459495"},"npi":[1215132949,1396816880,1740247378,1871604561,1861570004,1003109828,1083019020,1295939858,1871192781,1649708264,1295288132,1366642704,1699141747,1407058084,1558729905,1487940961,1477034437,1255320578,1457896664,1669418216,1669018602,1235372244,1194112946,1659577047,1467569954,110
Alec's GitHub repo mentioned that Cigna's files were corrupted at the time of his writing, but things might have been updated since? I can't be 100% sure because that 11.7GB file crashed my computer when I tried loading it into Python due to excessive memory usage.
I passed the file into jsonlint-php, one of many tools available for checking whether a JSON file will parse, and it spent hours churning on the file before running into memory issues again. As I was writing/editing this post, I let it run for over 6 hours and 43GB of RAM, with no end in sight. I eventually had to abort because the WSL2 VM instance was about to cap out the 45GB of RAM I'd allocated to it. Most other JSON validators I looked at also seem to parse the whole thing into memory, so I'm honestly not sure what tool is good for this situation.
Update: I've since learned that CMS released a validator tool to check that these gargantuan JSON files conform to the actual reference schema. It implements a streaming method, so it doesn't require the whole file to be loaded into memory.
UnitedHealthcare
Moving along, UnitedHealthcare has chosen to list out 56,470 individual files on their site for download. They are helpfully(?) named after the entities involved. The official filename schema is <YYYY-MM-DD>_<payer or issuer name>_<plan name>_<file type name>.<file extension>, so they seem to name their plans according to certain entity names? They don't follow the schema exactly, though.
7267. 2022-09-01_Bind-Benefits--Inc-_TPA_DOCS-ON-DEM-MEDEXP-THE-WELL-REAL-APPL_UNITEDHEALTHCARE-CHOICE-PLUS_-DX_UCQ_in-network-rates.json.gz
56466. 2022-09-01_style-craft-Cabinetry_index.json
56467. 2022-09-01_threeArch-LLC_United-Healthcare-POS_allowed-amounts.json.gz
Luckily there only seem to be about 3,759 in-network files (plus a bunch of index files I didn't browse into that might link to more). I got lucky and grabbed one in-network file that was only 161KB of compressed JSON (2022-09-01_UnitedHealthcare-Insurance-Company-of-New-York_Insurer_Empire-MPN_CSP-1-A350_in-network-rates.json.gz), which was small and easily parsed for browsing. I also downloaded a 4.1GB file that expanded to 122GB. The JSON within looks about on par with most of the others.
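If you wanted to sift that 56,470-file listing programmatically rather than by eyeball, filtering on the filename suffixes is enough of a start; filenames here is a hypothetical list scraped from their index page:

# A sketch of filtering UnitedHealthcare's file listing down to in-network files.
# `filenames` is a hypothetical list scraped from their transparency page.
filenames = [
    "2022-09-01_Bind-Benefits--Inc-_TPA_DOCS-ON-DEM-MEDEXP-THE-WELL-REAL-APPL_UNITEDHEALTHCARE-CHOICE-PLUS_-DX_UCQ_in-network-rates.json.gz",
    "2022-09-01_style-craft-Cabinetry_index.json",
    "2022-09-01_threeArch-LLC_United-Healthcare-POS_allowed-amounts.json.gz",
]
in_network = [f for f in filenames if f.endswith("_in-network-rates.json.gz")]
index_files = [f for f in filenames if f.endswith("_index.json")]
print(len(in_network), "in-network files,", len(index_files), "index files")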
Summary of the Data
From the looks of it, with one notable exception, all the files involve huge blobs of JSON due to the interaction between the technical specification and how insurance companies operate and organize their rates. Much as people instinctively assume that insurance companies are just exercising malicious compliance, I think at least part of the fault lies in the specification itself.
Every in-network file has a root object for each plan (either one or multiple plans per file). That plan has an in_network field holding an array of in-network objects; each in-network object is a billing code, and each one also carries an array of negotiated-rate objects. Each negotiated-rate object then specifies the rate plus the long list of providers that get said rate. There's also a bunch of metadata involved, like provider descriptions, modifier codes, etc.
The problem is that insurance companies don't have that many plans, but they have A LOT of providers (think back to those giant telephone-book-sized directories they sometimes still send out… wait, do y'all still remember phone books?) who might get a given rate. Then multiply that by the fact that every provider will be listed under a vast multitude of codes based on the services they provide.
And it all gets forced into a single plan object that your poor computer has to parse into memory for us to work with. An alternate arrangement of the data might have yielded a far less bloated format.
In practice, we don't want to work with JSON, or XML, or any other transmission format. All we really want is to pull out specific fields and rearrange everything into a flat structure for database use.
I honestly think it would be easier and more memory efficient to write a custom parser that understands just enough JSON to walk through the file and emit the specific fields we care about in a denormalized fashion. I suspect a state machine could do this while reading the input character by character, but I've never written a JSON parser, so I can't say for sure.
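As a rough sketch of that idea: ijson already does the character-level state-machine work, so a "flattener" only needs to walk the stream and write rows. The field names here follow the CMS in-network schema (in_network, negotiated_rates, negotiated_prices); treat this as an outline under that assumption, not a battle-tested tool, and note that some insurers (like Aetna above) point at provider_references IDs instead of inline provider_groups:

# A rough sketch: stream a gzipped in-network file with ijson and emit one
# CSV row per (billing code, provider TIN, negotiated price). Field names
# follow the CMS in-network schema; every insurer has its own quirks.
import csv
import gzip
import sys

import ijson

def flatten(path, out):
    writer = csv.writer(out)
    writer.writerow(["billing_code_type", "billing_code", "tin", "negotiated_rate"])
    with gzip.open(path, "rb") as f:
        for item in ijson.items(f, "in_network.item"):   # one billing code at a time
            for rate in item.get("negotiated_rates", []):
                # Some insurers list provider_references IDs here instead of
                # inline provider_groups; this sketch only handles the latter.
                tins = [g["tin"]["value"] for g in rate.get("provider_groups", [])]
                for price in rate.get("negotiated_prices", []):
                    for tin in tins:
                        writer.writerow([
                            item.get("billing_code_type"),
                            item.get("billing_code"),
                            tin,
                            price.get("negotiated_rate"),
                        ])

if __name__ == "__main__":
    flatten(sys.argv[1], sys.stdout)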
Another post-publish update: I've been informed that the jq tool (particularly in its --stream mode) works as a state-machine parser and can handle massive JSON blobs like this, so it might be a good way forward in wrangling these files.
Be warned if you want to work with these files
Bring a beefy machine: If you're going to wrestle with the JSON as-is, handling these files will require a beefy machine, especially in the RAM department, because parsed JSON structures eat up a ton of memory. My desktop has 64GB, 45GB of which had been given to the hypervisor, and Python exploded loading a mere 12GB file. Expect to use lean tooling and methods to just barely squeeze things out.
You're gonna need bandwidth: Even if you don't store these files locally and instead stream-process them, you still have to download the things. My experience is that they transfer fairly quickly, but you're still limited by your ISP.
Get super familiar with the spec: There’s a lot of information there that’ll help you figure out what is probably irrelevant.
Get help from people who understand healthcare: There are 16 types of official billing code standards, plus one "Custom". Someone's going to have to explain to us whether these standards overlap in ways that require different handling, and what codes can or can't be grouped together in certain circumstances.
Share! Share! Share! With multiple insurers each applying their own interpretation of the specification, tons of data, and lots of different people who want to ask different questions of the data, there's no way a single person can cover all the angles. If we want a public conversation around this work, the only way forward involves sharing what you're doing with other people who care.
Good luck!
I’ve got too many things on my plate to be of any serious help in an effort to wrangle this data. I also don’t have (personal) access to the compute and storage resources to process all this and it’s not something even tangentially related to anything I do for work. So I’m going to end my exploration of this data with this post.
But if there are people who are definitely interested in writing code, or have the domain knowledge to understand and help make sense of the data, or have access to computing resources that can be used, I'll be more than happy to act as a temporary point of contact to connect folks together.
Standing offer: If you created something and would like me to review or share it w/ the data community — my mailbox and Twitter DMs are open.