Email Analytics: More than you ever need to know

Black boxes are hot again

Feb 18, 2020

Hats off to Vicki for sparking the motivation to write this. I’m always open to ideas and suggestions for topics to write about *hinthint*

Vicki Boykis @vboykis

This is worth a discussion and it's not unimportant. But I also think it pales in comparison to the other egregious and much more harmful privacy violations we experience on an everyday basis. As a similar example, I care much more about FB tracking me than Netflix.

🧑🏻‍💻☕️ @hunterwalk

As media companies & blogger/influencers "shift to newsletters," some privacy journalist should cover how much data these platforms provide abt individual subscriber behavior Because the answer is A TON..... & it's all tied to a real email address, not just some general cookie

If you came to me in 2010 and told me that email tracking analytics would be the hot stuff again in 2020, I would’ve thought you were crazy. But here I am, writing to you in an email newsletter, talking about email tech and analytics.

Thanks to how primitive email is, it’s a surprisingly short topic. I can dump practically everything I know, down to tiny technical details, into this one long post.

So what does email tracking look like on Substack?

We get some interesting stats. For the most recent post you have sends, open rates, and click-through rates. The views and signups (conversions) appear to be for the website page and not the email itself.

Then you also get more detailed email stats at the email address level. Opens, Clicks and device counts.

We’re now going to see how every last one of these is collected. Woohoo!

Modern web analytics are practically continuous connections

Modern web analytics involve a huge array of tools, from server-side logs, user-agent analysis, cookies on your browser, to very sophisticated measurement tools using Javascript and websockets to collect and transmit data back in real-time. Every click, scroll, and hover could potentially be reported back to the mothership for analysis.

Analytics platforms like Google Analytics, Amplitude, Chartbeat, Mixpanel, Flurry, etc. all work in this space and use a bunch of various methods to collect data on what users are doing. At the core, they collect information in the form of “events” that happen (page view, click, scroll, hover, etc), and report the information back to the analytics service on a regular basis. It’s a fairly complicated space.

This is not even restricted to just web pages. Recently there have been articles discussing how every single click, touch, and interaction on a Kindle is tracked and stored for use by Amazon.

Despite how creepy and invasive this may sound, this sort of ultra fine-grained data collection and transmission is almost a given in the modern analytics age. I’ve had to rely on similar sorts of data to help teams make product decisions. Many sorts of UX and business questions can’t be answered without at least some data at this level of granularity. The tools to do this are also readily available now, so everyone can do this, it’s not just megacorps.

Meanwhile, email (and to an extent, SMS) has practically none of these advanced tracking features. That presents challenges to us.

Email is (largely) one way communication

Email is very old tech. There’s an RFC, #196, dating from 1971 titled “A Mail Box Protocol”, that describes a way to send messages to a system that is connected to a line printer and the messages are delivered, by hand, to physical mail boxes.

Obviously there are no line printers and hand delivery today, but the fundamental concept is still intact: email is a one-way communication channel. It takes a message, a mere collection of bytes (it defaults to 7-bit ASCII encoding!), and then sends it through a chain of systems that attempt to deliver it to the intended recipient.

From the very beginning, the entire email infrastructure was designed in a way that mimics physical snail mail. A user submits mail to a Mail Submission Agent (MSA) via their mail client. It then gets routed via DNS lookups on MX records to one or more Mail Transfer Agents (MTAs) that transmit mail over the Simple Mail Transfer Protocol (SMTP). Finally, after one or more hops through MTAs, the bytes arrive at the Mail Delivery Agent (MDA), which puts them into a data structure that is stored and read by the recipient's mail client. The mail is literally stored in one or more files at the destination.

Aside: I’m going to avoid going down an mbox format rabbit hole for length reasons, but very briefly, it’s amazing. Emails are all in one file and delimited with the word “From “! This is the CSV of communication storage!
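To make the “From ” delimiter concrete, here is a minimal sketch using Python's standard-library mailbox module. The addresses, subjects, and the write_and_count helper are made up for illustration; the only point is that every message in the single file starts with a line beginning with “From ”.

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage


def write_and_count(subjects):
    """Write one message per subject into a temporary mbox file.

    Returns (number_of_messages, raw_file_text). Hypothetical helper.
    """
    path = os.path.join(tempfile.mkdtemp(), "demo.mbox")
    box = mailbox.mbox(path)  # creates the file if it does not exist
    try:
        for subject in subjects:
            msg = EmailMessage()
            msg["From"] = "alice@example.org"  # illustrative addresses
            msg["To"] = "bob@example.org"
            msg["Subject"] = subject
            msg.set_content("Hello Bob")
            box.add(msg)
        box.flush()
        count = len(box)
    finally:
        box.close()
    with open(path, "r", encoding="utf-8") as handle:
        raw = handle.read()
    return count, raw


count, raw = write_and_count(["first", "second"])
# Each message is introduced by a separator line that begins with "From "
# (note the space; the "From:" header does not match).
delimiters = [line for line in raw.splitlines() if line.startswith("From ")]
print(count, len(delimiters))
```

If a message body itself contained a line starting with “From ”, the writer would have to escape it, which is exactly the kind of quirk that earns the CSV comparison.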

For the entire length of time from when Alice hits [Send] on the email client, until her email is delivered to the Mail Delivery Agent (the program that your mail client will access via POP3/IMAP4/etc to get the mail in front of your eyeballs), email is transmitted under a PUSH paradigm. The SMTP command list essentially only has the verbs to say “Hi server, I have mail for you, take this and send it along”. There’s definitely no way to articulate “Hi server, give me all my email” within SMTP. MTAs don’t support that functionality.

In fact, RFC 5321 explicitly states that when one MTA gives a 2xx response for mail transferred to it, it accepts responsibility for making sure the message goes to where it needs to go, or to report the failure to deliver.

Only at the final leg of the journey, when Bob the recipient is loading up his mail client to check email, does it finally become a pull action. The recipient's client can use the Post Office Protocol version 3 (POP3), intended for pulling the mail off the MDA for local consumption and usually removing the messages from the server, or the Internet Message Access Protocol version 4 (IMAP4), intended for reading mail on the server and leaving it there (e.g. webmail). The client can also use some proprietary protocol if it wants to.

What all this means is that once the MTAs have successfully passed the message along, they’re allowed to forget your message (if they want). The only time the sender gets any feedback about the process is usually if delivery fails due to some issue (incorrect address, etc).

But email isn’t completely one way (wait what?)

There are two (!!) ways where the email specifications themselves allow for communication to flow backwards from a receiver to the sender. With caveats. Big ones.

Delivery Status Notifications

First is RFC 3461, an extension to SMTP to add Delivery Status Notifications (DSNs). This is the functionality that notifies you if you send an email to a nonexistent email address and it can’t be delivered. It’s actually an optional parameter (NOTIFY, given on the SMTP RCPT command rather than in the message headers) with which the sender can request to be notified if a message failed to deliver, or succeeded. Normally it just defaults to only reporting back if there is a failure or delay to deliver.

The problem is this is an optional feature, especially for Success DSNs. If an MTA or an MDA chooses to restrict the usage of that feature, it’s not going to work. In practice, successful delivery notifications are almost never used and you should never count on them to work. Please don’t start turning them on.

Also, a successful DSN only tells you that the computers have delivered it to an inbox; it says nothing about whether a human has seen it. It’s like the postman leaving a package on the porch and marking it delivered: at any moment something can make the package disappear. The message could also be completely ignored, filtered, or deleted.
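For the curious, here is roughly what requesting a DSN looks like from Python's smtplib. This is a sketch under assumptions: dsn_rcpt_options and send_with_dsn are hypothetical helper names, the host is whatever SMTP server you are allowed to use, and nothing here guarantees the receiving side will honor the request. The NOTIFY keyword and its values (NEVER, SUCCESS, FAILURE, DELAY) come from RFC 3461.

```python
import smtplib


def dsn_rcpt_options(success=False, failure=True, delay=True):
    """Build the RCPT extension parameter defined by RFC 3461."""
    wanted = [
        name
        for name, enabled in (("SUCCESS", success), ("FAILURE", failure), ("DELAY", delay))
        if enabled
    ]
    # NEVER stands alone; it cannot be combined with the other values.
    return ["NOTIFY=" + (",".join(wanted) if wanted else "NEVER")]


def send_with_dsn(host, sender, recipient, message_text, **notify):
    """Hypothetical helper: send one message and request DSNs.

    Needs a reachable SMTP server, so it is not called in this sketch.
    """
    with smtplib.SMTP(host) as smtp:
        smtp.ehlo()
        if not smtp.has_extn("dsn"):
            raise RuntimeError("server does not advertise the DSN extension")
        smtp.sendmail(
            sender,
            [recipient],
            message_text,
            rcpt_options=dsn_rcpt_options(**notify),
        )


print(dsn_rcpt_options())  # the failure/delay behavior described above
```

Success notifications are off by default in this sketch, in keeping with the plea above not to turn them on.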

Message Disposition Notifications

The second way is what’s commonly called “read receipts”. Depending on what email service/client you have (especially if you use MS Outlook or Thunderbird), you may recognize the term. These things are actually standardized under RFC 8098, “Message Disposition Notification”, though there are various proprietary protocols for this functionality. It’s mostly seen in corporate environments where everyone uses the same software.

Unlike the DSN mentioned above, these are triggered when a user actually opens and reads the email. These MDNs can report that the email was “displayed” to a user, “dispatched” somewhere else, “processed” without being seen, or even “deleted” without being seen. Wow! So much information!

EXCEPT, it requires the recipient client to support the feature (many don’t or have restrictions), and there is often an aspect of opt-in involved. Oftentimes there is a prompt, like “Alice would like a read-receipt, send? Yes/No”. On top of those issues, there are lots of edge cases, for example what happens if a user simply marks an email as read without opening it?

So, again, don’t count on using this data.
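If you want to see what the request side of a read receipt looks like, it is a single header. A minimal sketch with Python's email package follows; the with_read_receipt_request helper and the addresses are invented for illustration, while the Disposition-Notification-To header name is the one RFC 8098 defines.

```python
from email.message import EmailMessage


def with_read_receipt_request(sender, recipient, subject, body):
    """Build a message that asks the recipient's client for an MDN."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    # The request itself. Clients are free to ignore it or to prompt the user.
    msg["Disposition-Notification-To"] = sender
    msg.set_content(body)
    return msg


msg = with_read_receipt_request(
    "alice@example.org", "bob@example.org", "Lunch?", "Are you free at noon?"
)
print(msg["Disposition-Notification-To"])
```

Whether anything ever comes back depends entirely on Bob's client and on Bob.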

So what CAN we count that involves email?

Not too much, but some things. Let’s go through the big ones:

Emails Sent

This one is easy, because you, as the sender, should know how many emails you’ve sent out. This at least gives you your universal denominator for everything else. This is probably the only number that is certain in ALL of this.

Bounces/failed deliveries

Thanks to the DSN feature of SMTP, you’ll generally (but not all the time) be informed when an email you sent out is undeliverable. That’s important to know to adjust your email list. It’s also common to just deduct these emails from your “Sent” totals because they didn’t really go anywhere.

Oh yes, if you’re sending email with your own servers, from your own domain, you might actually get MORE bounces than emails you sent out. This is called backscatter, where a spammer pretends to be sending email from your servers by forging headers on their emails. So you have to be pretty careful about how you handle bounces.
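One careful way to handle this is to only count bounces for messages you know you sent. The sketch below assumes you keep some kind of ID for every outgoing message and can extract the same ID from each bounce; the delivery_summary name and the IDs are hypothetical.

```python
def delivery_summary(sent_ids, bounced_ids):
    """Deduct real bounces from the sent total and set backscatter aside."""
    sent = set(sent_ids)
    bounced = set(bounced_ids)
    real_bounces = sent & bounced  # bounces for mail we actually sent
    backscatter = bounced - sent   # bounces for mail we never sent
    return {
        "sent": len(sent),
        "bounced": len(real_bounces),
        "delivered_estimate": len(sent) - len(real_bounces),
        "ignored_backscatter": len(backscatter),
    }


summary = delivery_summary(
    sent_ids=["m1", "m2", "m3", "m4"],
    bounced_ids=["m2", "forged-1", "forged-2"],
)
print(summary)
```

The “delivered” number is still only an estimate, since not every failure is reported back.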

HTML email gives you more options

If you’re sending plain text emails, the above stats are pretty much all you’re going to get. But if you’re using HTML emails (which is what most people send and receive nowadays), you’ll have access to much more.

Opens via tracking pixels

Modern email clients are actually miniature web browsers. At the least, they can render HTML and that includes the ability to fetch resources like images off the internet. Note, there’s also a way to attach images into the email so no external fetch is necessary, but that doesn’t help with analytics.

One common way this HTML feature can be used for analytics is by embedding a tiny 1x1 pixel transparent GIF file (hence the name) that is hosted on an external server somewhere (like, example.org/tracking_string/sneaky_1x1.gif). When the email client tries to load the image, it will have to make an HTTP GET request to that webserver. The server then realizes that the GIF for “tracking_string” has been loaded and it saves that information somewhere. Query parameters, as in example.org/sneaky_1x1.gif?track_query=user12345, are also an important way of communicating details to the server.

If you generate a unique tracking_string for every email that’s sent out, you can then figure out who has opened the email just by reading those pixel access logs!
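Generating those unique strings is the easy part. Here is a minimal sketch, assuming a random token per recipient and the example.org path layout used above; build_tracked_emails is a made-up name and the HTML is stripped to the bare minimum.

```python
import secrets


def build_tracked_emails(recipients, base_url="https://example.org"):
    """Return (token -> recipient lookup, recipient -> HTML body)."""
    token_to_recipient = {}
    html_by_recipient = {}
    for address in recipients:
        token = secrets.token_urlsafe(16)  # unguessable, unique per email
        token_to_recipient[token] = address
        html_by_recipient[address] = (
            "<html><body><p>Hello!</p>"
            f'<img src="{base_url}/{token}/sneaky_1x1.gif" width="1" height="1" alt="">'
            "</body></html>"
        )
    return token_to_recipient, html_by_recipient


tokens, bodies = build_tracked_emails(["bob@example.org", "carol@example.org"])
for token, address in tokens.items():
    print(address, token)
```

The lookup table is what lets you turn a line in the pixel access log back into an email address later.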

Note: It doesn’t have to be a 1x1 pixel, or a gif. All you need is the recipient to load something, anything from your servers. A logo image would work just fine.

Caveat: The one problem with this method is that it depends on recipients loading your image. For security reasons, certain clients do not do this by default, and at the minimum you can usually change a setting to opt in/out of remote image loading. So, again, this data will not be perfect.
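On the server side, the logic is tiny. The following is a framework-free sketch of what a pixel endpoint might do: the handle_pixel_request function, its arguments, and the in-memory open_log list are assumptions for illustration, and a real setup would more likely just parse the webserver's access logs. The path layout matches the example.org/tracking_string/sneaky_1x1.gif pattern above.

```python
import base64
from datetime import datetime, timezone

# A commonly used 1x1 transparent GIF, base64 encoded.
PIXEL = base64.b64decode("R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7")


def handle_pixel_request(path, user_agent, remote_ip, open_log):
    """Record an open if the path looks like /<token>/sneaky_1x1.gif.

    Always returns the image so the email renders the same either way.
    """
    parts = path.strip("/").split("/")
    if len(parts) == 2 and parts[1] == "sneaky_1x1.gif":
        open_log.append(
            {
                "token": parts[0],
                "user_agent": user_agent,
                "ip": remote_ip,
                "at": datetime.now(timezone.utc).isoformat(),
            }
        )
    headers = {
        "Content-Type": "image/gif",
        "Content-Length": str(len(PIXEL)),
        # Ask clients not to cache, so repeat opens trigger new requests.
        "Cache-Control": "no-store",
    }
    return 200, headers, PIXEL


open_log = []
status, headers, body = handle_pixel_request(
    "/abc123/sneaky_1x1.gif", "ExampleMail/1.0", "203.0.113.7", open_log
)
print(status, open_log[0]["token"])
```

If the client never fetches the image, this function never runs, which is the whole caveat in one sentence.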

Cookie via tracking pixel

The response to the pixel’s HTTP GET request can also set a cookie on the client’s browser. Whether this does anything will depend on whether the mail client’s browser feature has cookie functionality. Very often it’s turned off by default, so it’s not wise to rely on it.

Obviously these cookies aren’t expected to pass to the main browser unless the email client and browser somehow share the same cookie store. I can’t even find any references online to such shared behavior.

So just about the only thing you can do with these cookies is IF you set and see the same cookie on multiple opens, you can know that it had been opened at least once before on the same setup. Maybe.
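That “have I seen this cookie before” check could be sketched like this, using the standard library's http.cookies. The cookie name opener_id and the pixel_cookie helper are hypothetical, and the whole thing only works if the mail client stores and returns cookies at all.

```python
import secrets
from http.cookies import SimpleCookie


def pixel_cookie(request_cookie_header):
    """Return (opener_id, seen_before, set_cookie_value_or_None)."""
    jar = SimpleCookie()
    jar.load(request_cookie_header or "")
    if "opener_id" in jar:
        # The client sent our cookie back: this setup has opened something before.
        return jar["opener_id"].value, True, None
    new_id = secrets.token_hex(8)
    out = SimpleCookie()
    out["opener_id"] = new_id
    out["opener_id"]["max-age"] = 60 * 60 * 24 * 365  # one year
    out["opener_id"]["path"] = "/"
    return new_id, False, out["opener_id"].OutputString()


first_id, first_seen, set_cookie = pixel_cookie(None)
again_id, again_seen, _ = pixel_cookie(f"opener_id={first_id}")
print(first_seen, again_seen)
```

The returned string would go out in a Set-Cookie response header alongside the image.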

Device via tracking pixel

Because user-agents are supposed to be sent in the header of every HTTP request, you can analyze it to figure out what device is opening the email. It’s too much to go into here, so for those curious, I go into the complexity of user-agents in a previous post here. For popular devices, it can allow an analyst to see down to specific models and variants.

Store a copy of these user agents over time and you can figure out just how many browsers and devices a person has access to. Freaky.
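As a rough illustration of that last point, here is a deliberately crude sketch that buckets user-agent strings by substring and counts distinct device families per recipient. The device_family rules are my own simplification, not how a real parser works, and the sample user-agent strings are only representative.

```python
from collections import defaultdict


def device_family(user_agent):
    """Very rough device guess from a user-agent string."""
    ua = user_agent.lower()
    # Order matters: Android user agents also mention Linux.
    rules = (
        ("iphone", "iPhone"),
        ("ipad", "iPad"),
        ("android", "Android"),
        ("windows", "Windows"),
        ("macintosh", "Mac"),
        ("linux", "Linux"),
    )
    for needle, family in rules:
        if needle in ua:
            return family
    return "unknown"


def devices_per_recipient(open_events):
    """open_events is an iterable of (recipient, user_agent) pairs."""
    seen = defaultdict(set)
    for recipient, user_agent in open_events:
        seen[recipient].add(device_family(user_agent))
    return {recipient: sorted(families) for recipient, families in seen.items()}


events = [
    ("bob@example.org", "Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like Mac OS X) AppleWebKit/605.1.15"),
    ("bob@example.org", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"),
    ("bob@example.org", "Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like Mac OS X) AppleWebKit/605.1.15"),
]
print(devices_per_recipient(events))
```

A proper user-agent library will do far better than substring matching, but the bookkeeping is the same.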

IP via tracking pixel

IP analysis is rather inconsistent thanks to dynamic IPs assigned via ISPs, mobile device IPs changing as they move around/enter/leave networks, VPNs, and proxies, but in the aggregate, they provide an idea of where people are from via geolocation. They’re usually able to tell you the country the request came from.

Some IPs are more durable than others: static server IPs and home IPs obviously change less often than those of mobile devices. VPNs and proxies can also have properties that let you at least know that they’re not “a normal user”. So if someone had this information, they could be thorough about investigating, but it’s a pain to do at scale without paying for a service to classify the IPs for you.

I often don’t bother looking into IPs because there’s not much I can do with the information even if I had it.

Clicks on links via tracking URLs/redirectors

Since we’ve got HTML, users can click on links which will open in their actual web browser (as opposed to whatever is internal to their mail client, which may have limited features). How can we know if a user clicked on a link, or what link was clicked?

URL redirection is the easiest way to do this. There are very many ways to do redirection.

The simplest way is to use the HTTP 302 response code “Found” (previously called “Moved Temporarily”) to make a browser redirect to another URL. Other 3xx response codes can work too, w/ varying nuances. So all you need to do is replace your original link with one that goes to myservers.com/tracking_string1, then set your webserver to send a 302 response w/ the actual intended destination.

Since you can make the link unique and you know exactly which email it was in, you can definitively say a click originated from that email. Just remember that emails can be forwarded, links can be shared.

You also get an entry in your webserver logs, with all the usual information, IP, user-agent, timestamp, referrer, etc. But that’s it. If you want more analytics, you’ll have to use other methods.
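The 302 approach boils down to a lookup table. A minimal, framework-free sketch follows; the redirect_response function, the shape of link_table, and the fallback URL are all assumptions for illustration, with tracking_string1 borrowed from the example above.

```python
def redirect_response(path, link_table, click_log, fallback="https://example.org/"):
    """Map /<token> to its destination and record the click.

    Returns (status_code, headers) for whatever webserver glue you use.
    """
    token = path.strip("/")
    entry = link_table.get(token)
    if entry is None:
        # Unknown token: send the visitor somewhere sensible, record nothing.
        return 302, {"Location": fallback}
    click_log.append({"token": token, "recipient": entry["recipient"]})
    return 302, {"Location": entry["destination"]}


link_table = {
    "tracking_string1": {
        "recipient": "bob@example.org",
        "destination": "https://datahelpers.org/",
    }
}
click_log = []
status, headers = redirect_response("/tracking_string1", link_table, click_log)
print(status, headers["Location"], click_log)
```

Remember that the “recipient” recorded here is really “whoever holds the link”, for the forwarding and sharing reasons mentioned above.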

The more powerful way is to make a page that both collects data and redirects the user, via a META tag, or JavaScript. While the user is briefly on your page, you can do more complicated browser fingerprinting.

There’s more technical bits to work through here. You have to do everything quickly because users HATE when things load slowly, and tracking redirects adds precious milliseconds to load times. You also may want to do some form of referrer scrubbing to hide the fact that you’re doing these redirects from downstream.
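For the page-based variant, here is a sketch of the kind of interstitial page a server could return. It uses a META refresh for the redirect and a referrer meta tag for the scrubbing; the interstitial_page helper is hypothetical, and any data-collection script you would run on the page is deliberately left out.

```python
from html import escape


def interstitial_page(destination):
    """Build a tiny HTML page that immediately forwards to destination."""
    safe = escape(destination, quote=True)
    return (
        "<!doctype html><html><head>"
        # Ask the browser not to send a referrer to the destination.
        '<meta name="referrer" content="no-referrer">'
        # Redirect after 0 seconds.
        f'<meta http-equiv="refresh" content="0; url={safe}">'
        "</head><body>"
        # Plain link as a fallback if the refresh is blocked.
        f'<a href="{safe}">Continue</a>'
        "</body></html>"
    )


page = interstitial_page("https://datahelpers.org/")
print(page)
```

Keeping the page this small is part of the point, since every extra byte and script adds to the delay users hate.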

Caveats: There’s very little to guard against this form of tracking if the URLs are hidden behind redirects. So it usually means you will not undercount clicks. However, thanks to the aforementioned link sharing, you can easily overcount: a single link can rack up many, many clicks. They’ll be actual clicks, but you won’t really be sure “who” is doing them.

Here’s an example from the email Substack sent last week on my 10x data scientist article (styling and stuff stripped out). Note the complicated hash string that takes up the bulk of the URL.

<a href="http://email.mg1.substack.com/c/eJwlUMuOwyAM_JpyRA55UA4ctt32NyIC3hRtAhGYjfL3S1ppJEtjax62hnCO6dBbzMRKxjR6p4eulaCAOQ1SWDkxn8efhLgav2i2lWnx1pCP4TwW8top9tKIQrWyU-BML0DIrkeB0KDr5TQADuy0GE1xHoNFjX-YjhiQLfpFtOVL-3URz4p937kzZF64bJgyj2muLPO6ikJF17TQC8UbrhoJ1--buqvb8_EQw6WDdW54LlMmY3-5jStLOpngDm5K3c5ngTddO4x1riV4OkYMZlrQaUoFGX1-8U5Lx4Y64J4XJML0Ic_OA4AaWHVysWoGbWMJ5MP8D_lucHE">datahelpers.org</a>

If you follow the link, you get 302 redirected to the correct site.

Cookies via clicked links

This starts going into your normal web analytics. But unlike the cookies you could potentially set via the tracking pixel, which is bound to a mail client that may not allow cookies, you’re most likely getting a real browser with a link click. That means that the cookie information is more durable and useful. At the very least, you can recognize if that same browser has visited the site before via email or not.

Plus you can do all your normal web analytics stuff on the user now that they’re on the site itself.

Web analytics once they get to your site

While the things above are all that you can get directly from an email, successfully getting someone to a site you control opens them up to much more tracking and analytics possibilities.

If your email is sending a person to a site that you own, and they log in, you can tie the clicks and the login together. This lets you tell a richer story about what the user is doing. Instead of “the user came to my site and bought a widget” now you can say “they got my email, clicked my big call to action link, and bought 10 widgets”.

For people who aren’t logged in, you get much less usable information. Very often their referrer is empty and this leads tools like Google Analytics to say the traffic is “(Direct)”, meaning they appeared on the page and GA doesn’t know why, so they must’ve been sent there directly. Obviously no human is hand typing long URLs into their browser, but we just don’t have any information as to where they got the link.

Thanks to this lack of referrer, from the standpoint of web analytics, an email client looks exactly the same as an SMS message, or a link from some other external application like Slack, etc. They ALL generally look like (Direct), so it can be a mess. You can always assign unique links to different communication channels to help mitigate this, but it’s still just a best-guess when links inevitably get shared.
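Assigning per-channel links can be as simple as appending a query parameter before the link goes out, then reading it back when the visit arrives. In this sketch the parameter name channel and both helper functions are made up; analytics tools have their own conventions for this.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def tag_link(url, channel):
    """Append a channel marker to a URL, preserving existing parameters."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("channel", channel))
    return urlunsplit(parts._replace(query=urlencode(query)))


def classify_visit(url, referrer):
    """Best-guess traffic source for one page view."""
    channel = dict(parse_qsl(urlsplit(url).query)).get("channel")
    if channel:
        return channel        # the link itself says where it was placed
    if referrer:
        return "referral"     # some other site sent them
    return "(direct)"         # no marker, no referrer: we just don't know


email_link = tag_link("https://example.org/post?id=7", "email")
print(email_link)
print(classify_visit(email_link, referrer=""))
print(classify_visit("https://example.org/post?id=7", referrer=""))
```

The marker travels with the link, so a tagged email link pasted into Slack will still be counted as email.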

The reason for the lack of a referrer is a bit complicated, but it’s not some grand conspiracy or privacy movement. Stand-alone mail clients sending users into the browser send no referrer information to the browser (they pretty much can’t). Webmail clients typically use HTTPS, and referrers are dropped when going from HTTPS to HTTP sites (but NOT HTTPS->HTTPS). They may also use a redirect page with a meta tag such as <meta name="referrer" content="no-referrer"/> (or something similar) to further scrub the referrer for privacy reasons.

Counting Stuff
