File formats are things we don’t generally want to think about. Ideally, we just magically package our data with a single command, and it can magically be read out again after transmission to another system, without our having to worry about it.
And thankfully, to a limited extent in 2020, this is kinda true. I can pick up a file on my Windows machine and upload it through my browser, through an email client written in JavaScript, which transmits it as some data in some sort of MIME format, where it hops across a bunch of servers running arbitrary OSes before it lands in a separate email account that gets downloaded onto a MacBook. Despite the actual bytes being decoded and re-encoded into all sorts of formats across countless systems that know nothing about each other, the file arrives losslessly at the end for me to download. All of this is made possible because the systems talk to each other via standardized protocols. Since all the software (and hardware) along the path follows the rules, things work out in the end.
In the world of data, there’s less end-to-end standardization of data formats, owing to how many data systems, like databases, take advantage of custom proprietary data formats to optimize the storage and access of data. Instead, there’s a host of different formats that are standardized for data interchange.
For example, there’s XML, first published in 1998, which is a very verbose, questionably human-readable markup format for transmitting… just about anything. You can use schema definitions to explicitly define all your data types within XML and transmit data accordingly. In industry, it’s got a reputation for being a pain in the butt to work with, so it’s not super common outside of big enterprise situations, but it’s there for when you absolutely need to force two arcane systems to talk to one another.
More recently, text files full of JSON entries have been a popular portable data format, which tries to strike a balance with a vague semblance of data typing, while being much easier to work with than XML. In many situations, it is the “good enough” solution.
Then comes CSV, comma separated values, and its close cousins TSV and all the other character-delimited file formats. In some ways, they’re the great ancestors of data transmission.
CSV is old
One amusing thing about CSV is that it has been around, in various forms, for a very long time. Most references I can find online about the first instances of comma-separated files (Wikipedia summarizes it nicely) point to FORTRAN as where the format was first used: “List-Directed Input” from IBM’s Fortran manual for VM/370 (published 1972, page 17):
List-Directed Input/Output
…
Input entries are separated by blanks or commas, with successive commas indicating values to be omitted. The input list may be cut short with a slash…
Fortran 77 (published 1978, page B-12) also mentions:
List-directed input/output allows data editing according to the type of the list item instead of by a format specifier. It also allows data to be free-field, that is, separated by commas or blanks.
This puts CSV well into the era of punched-card computing, an input/output paradigm that apparently lasted into the 1980s. My parents had, in the house, an instruction booklet for punching cards for an IBM computer of some kind, from when they had to take computer classes back in the late 70s. I didn’t even realize what it was until I was much older.
Meanwhile, if you search for the term “CSV” in Google’s Ngram Viewer, you see a hilariously weird bump around the 1920s. If you click on the book search in the viewer UI for that date range, it shows a ton of references to texts with people (notably, reverends) with “c.s.v.” after their names. I suspect it’s a reference to the Clerics of Saint Viator, but I’m honestly not sure.
Despite being older than computing dirt, CSV wouldn’t actually be formalized as a standard until 2005, in RFC4180, which noted that “[w]hile there are various specifications and implementations for the CSV format… there is no formal specification in existence, which allows for a wide variety of interpretations of CSV files.” The RFC exists mainly so that the text/csv MIME type could be registered. Essentially, we collectively got around to standardizing CSV so that we could email it.
So what’s in RFC4180?
The text/csv MIME type defined in this standard is fairly simple, with 7 rules:
1. Records are on their own line. Lines are terminated by CRLF (aka \r\n, DOS-style line terminators, not Unix-style LF (\n) characters). CSVs are just plain text files, so this is important.
2. The last record can, but does not need to, end in CRLF.
3. The first line is an optional header line with text labels, containing the same number of fields as the records.
4. Headers and records have one or more fields separated by commas. Spaces are part of the fields and can’t be ignored. The last field can’t end with a comma.
5. Fields may be enclosed in double quotes ( " ). If fields are not enclosed in double quotes, then double quotes may not appear inside the fields.
6. Fields that are double-quoted can have newlines (CRLF), commas, and double quotes within them.
7. A double quote appearing within a double-quoted field must be escaped by a preceding double quote. For example, "aaa ""q",bbb,ccc yields data that looks like aaa "q | bbb | ccc.
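As a quick sanity check of that last rule, Python’s csv module (which follows RFC 4180-style quoting by default) parses that example exactly as described; a minimal sketch:

import csv
import io

# One RFC 4180 record: an escaped double quote inside a quoted field.
line = '"aaa ""q",bbb,ccc\r\n'
row = next(csv.reader(io.StringIO(line)))
print(row)  # ['aaa "q', 'bbb', 'ccc']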
The RFC also specifies that the text encoding be US-ASCII, or any other valid IANA charset (which covers practically every text encoding around). It also leaves out little details, like whether there should be a Byte Order Mark (BOM) if UTF-16 is used. As usual, the standard leaves a bit of ambiguous wiggle room.
But there are various CSV dialects
The problem with CSV is that it’s been around for so long, and is so simple that you can slap together an implementation in a few minutes, that there are plenty of variations that have been in use so long they’re not going anywhere.
For example, if you’re just dumping tables of numbers from a program you’ve written, it’s almost trivial to just open a file handle, and dump numbers separated by commas out into a file. It would very likely be 100% compliant with RFC4180 too, except you’re more likely to be using \n instead of \r\n to separate your lines.
Things only start getting complicated when you need to have fields that allow you to have newlines and commas — now you need some kind of quoting character, and that implies a way to escape the quoting character.
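To make that concrete, here’s a minimal sketch of the failure mode (the filename is invented for illustration): naively joining on commas works fine for numbers, but breaks the moment a field contains a comma:

# Naive CSV writing: fine for plain numbers...
rows = [[1, 2, 3], [4, 5, 6]]
with open('numbers.csv', 'w') as f:
    for row in rows:
        f.write(','.join(str(v) for v in row) + '\n')

# ...but a comma inside a field silently corrupts the record:
fields = ['Doe, Jane', '42']
line = ','.join(fields)   # 'Doe, Jane,42'
print(line.split(','))    # ['Doe', ' Jane', '42'] -- three fields, not two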
I like how Python’s csv module allows you to set a bunch of parameters to cover the big ways different CSV formats can differ. You can set the delimiter to be something other than a comma (tabs are popular, and my favorite, because they’re less likely to pop up in collected text than a comma). You can set the boolean doublequote, the quotechar, and the escapechar to handle the various forms of quoting and escaping out of quotes. You can set quoting to the constants csv.QUOTE_ALL, QUOTE_MINIMAL, QUOTE_NONNUMERIC, and QUOTE_NONE to control how aggressively it quotes things. Finally, you can set the boolean skipinitialspace to skip the first space after a comma in a field (which would violate RFC4180, but humans often do it for readability reasons).
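A sketch of what wiring those parameters together looks like, assuming a hypothetical pipe-delimited file named data.psv:

import csv

with open('data.psv', newline='') as f:
    reader = csv.reader(
        f,
        delimiter='|',                 # something other than a comma
        quotechar='"',
        doublequote=True,              # "" escapes a quote inside a field
        skipinitialspace=True,         # tolerate a space after the delimiter
        quoting=csv.QUOTE_MINIMAL,
    )
    for row in reader:
        print(row)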
Other systems that handle CSV data regularly often have facilities for setting similar flags. While doing research on CSVs, I stumbled upon this Google Sheet, linked from this page, which compares the CSV parsing options of a couple of languages (Python and Ruby) and of PostgreSQL, MySQL, and SQLite, screenshotted below.
And then, there’s Excel
Excel is everywhere. Inevitably, you’re going to have to deal with either sending data to Excel via CSV, or reading CSV that was written by Excel. As far as Excel’s CSV format goes, it’s essentially RFC4180 compliant. I suspect that since Excel was already a dominant generator of CSV files when the RFC was published in 2005, compatibility was definitely a consideration. In fact, Python’s csv module ships with Excel dialects built right in:
class excel(Dialect):
    """Describe the usual properties of Excel-generated CSV files."""
    delimiter = ','
    quotechar = '"'
    doublequote = True
    skipinitialspace = False
    lineterminator = '\r\n'
    quoting = QUOTE_MINIMAL

class excel_tab(excel):
    """Describe the usual properties of Excel-generated TAB-delimited files."""
    delimiter = '\t'

class unix_dialect(Dialect):
    """Describe the usual properties of Unix-generated CSV files."""
    delimiter = ','
    quotechar = '"'
    doublequote = True
    skipinitialspace = False
    lineterminator = '\n'
    quoting = QUOTE_ALL
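Those classes are registered under the names 'excel', 'excel-tab', and 'unix', so writing a file that Excel will happily open takes one keyword argument; a quick sketch:

import csv

rows = [['name', 'notes'], ['widget', 'has, commas and "quotes"']]
with open('report.csv', 'w', newline='') as f:
    writer = csv.writer(f, dialect='excel')  # commas, CRLF, minimal quoting
    writer.writerows(rows)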
HOWEVER!
Let’s face it. Nothing is this easy, especially post-hoc defined plain text data formats. There’s always a gotcha.
And there is!
This post from 2018 is a user complaining that Excel-generated CSVs are not RFC4180 compliant. The reason: the locale settings on their Windows machine use the semicolon as the delimiter, in regions where the decimal separator is a comma. The only workarounds are essentially to manually change the locale settings as needed.
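On the receiving end, if you can’t be sure which locale produced a file, Python’s csv.Sniffer can usually guess the delimiter from a sample; a hedged sketch (filename invented):

import csv

with open('maybe_semicolons.csv', newline='') as f:
    sample = f.read(4096)
    f.seek(0)
    # Sniffer raises csv.Error if it can't settle on a delimiter.
    dialect = csv.Sniffer().sniff(sample, delimiters=',;')
    for row in csv.reader(f, dialect):
        print(row)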
So why do you like this file format anyways?
Because, when you control both the output side and the input side, it’s so damn simple a way to transport data between a bunch of systems without much trouble. There’s no worry about mismatches in dialect definitions, etc.
It’s super easy to spit out data, especially if you use a module like Python’s csv, which does all the quoting and escaping of special characters for you. It only really takes a handful of lines of code to slap together.
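For example, a complete, correctly quoted writer really is just a handful of lines; a minimal sketch with invented field names:

import csv

rows = [
    {'id': 1, 'comment': 'plain'},
    {'id': 2, 'comment': 'tricky, with a comma and "quotes"'},
]
with open('out.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['id', 'comment'])
    writer.writeheader()
    writer.writerows(rows)  # quoting and escaping handled for you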
Finally, it’s one of the densest text-only data formats available. The only real overhead you incur is the n-1 delimiters for n fields. Even something like a JSON-based format contains significantly more repetition, since every JSON record repeats the keys. Obviously a binary format would be more efficient, but compressing CSVs is trivial too.
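A rough way to see that repetition, serializing the same records as CSV and as JSON lines (exact byte counts will vary with your data; this is an illustration, not a benchmark):

import csv
import io
import json

records = [{'a': i, 'b': i * 2, 'c': f'row {i}'} for i in range(1000)]

# CSV: the keys appear once, in the header line.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=['a', 'b', 'c'])
writer.writeheader()
writer.writerows(records)
csv_size = len(buf.getvalue())

# JSON lines: the keys repeat in every single record.
jsonl_size = sum(len(json.dumps(r)) + 1 for r in records)

print(csv_size, jsonl_size)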
The biggest problem with CSV is when you find one that’s badly thought out. For example, I used to have to work with Apache server log files directly, and for whatever reason the servers were set to use the pipe character ( | ) as the delimiter. That worked fine probably 99% of the time. But guess what: random clients on the internet will occasionally put pipe characters, or \n newline characters, into their user agents. That’s nonstandard and unexpected behavior, and I had plenty of analyses and pipelines crash from those characters.
Once in a very long while, you come across a dialect of CSV that is incompatible with your parser. For example, you’ve got a piece of software that cannot read tab-separated values; it MUST have commas as the delimiter. You’ll have to write a throwaway translator script, reading in the file in one dialect and writing it back out in another.
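The translator itself is usually just a read-one-dialect, write-the-other loop; a minimal sketch with invented filenames:

import csv

# Read tab-separated input, write comma-separated output.
with open('input.tsv', newline='') as src, \
     open('output.csv', 'w', newline='') as dst:
    reader = csv.reader(src, delimiter='\t')
    writer = csv.writer(dst)  # default 'excel' dialect: commas, CRLF
    writer.writerows(reader)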
While doing research on CSV for this, I came across a bunch of interesting little tools and things.
CSVKit (github) - I didn’t know this existed, it’s a bunch of useful little tools for working with CSV, with things like csvcut, csvjoin, csvgrep, etc. All designed to work quickly with CSVs.
json2csv - a tool that I used a ton in my days at bitly (it was written at bitly). It takes log files full of JSON entries; you tell it which fields you want, and it spits out a CSV of them.
An old classic post from 2016, “Falsehoods programmers believe about CSVs”
I never knew this until today, but if the first line of a CSV file has sep={character}, Excel will use that character as the delimiter… granted, this is completely an Excel-ism, so such files would pretty much make everything else in the universe choke and die.
In a grand example of xkcd’s “Standards”, I found this thing: CSV Dialect Description Format. It’s a JSON file that describes a CSV’s dialect, essentially the combination of all the parameters in the dialect image I have above. I honestly have no idea if this is used by anything of note.
A couple of newsletters ago, I briefly mentioned that the Arecibo telescope had been declared too dangerous to operate and would be decommissioned (that happened on November 19th, which feels like decades ago), but then less than two weeks later, on December 1st, the whole central structure failed catastrophically before demolition could even begin.
Scott Manley’s YouTube channel (normally a fun channel about space, rockets, and Kerbal Space Program) has a 12-minute discussion of the footage that was released by the NSF on December 3rd. Luckily, no one was hurt, because staff had been keeping to the safe areas around the telescope for fear of exactly this kind of structural failure. A drone happened to be inspecting the cables just as the collapse occurred, and the footage is amazing to view in slow motion.
It’s quite sad to see a great piece of scientific history like this come to such an end.