Interpreting Email Analytics is Handwavy

What you can do when you actually collect email data

Last week, we jumped off the deep end into just about every last scrap of data that could be collected from email. We noted that there are lots of little caveats and limitations in the fundamental technology. But we didn’t have time to go into how to actually use the data. That’s why we’re here.

Email tracking data has interpretation issues

The problem with email analytics is that we lack detailed visibility into what is going on at the user’s end, because the email client is completely divorced from our analytics systems. While we have the tools to collect quite a bit of data, how far we push the conclusions we draw from that data needs to be carefully considered.

Email Sharing

As mentioned previously, the sharing of links and emails will always be a ghost that hangs over all your data. You’ll just have to accept that it happens.

A shared email would manifest itself as more opens (and clicks) for a particular email than expected. While a normal human may open a given email 1, 5, maybe even 10 times in a week, at some point it’s going to be too much for a single person. If I saw an email get opened 50 times within a few days, I’d assume it was shared unless I had a very good reason to think one person would open it that many times.

At the same time, I probably wouldn’t care too much unless this was an email that was supposed to be private for some reason. Perhaps it was a subscriber-only email, and the same person is obviously sharing, repeatedly, across many emails. If I had a problem with that (not to say that I do), then this would be a way to identify and deal with such behavior.

For a small recipient pool, the simple statistic of Opens/Sends would get strongly distorted by sharing. It’s an average, and outliers skew averages hard. However, as the number of emails sent increases, the law of large numbers starts working in your favor. “On average” you’ll only have a relatively small number of sharers, and open rates become more predictable. It’ll be a relatively steady distortion.

A better way to mitigate the effects of sharing is to count the % of recipients that have opened the email at least once, effectively normalizing against the individuals. Then a rampant sharer would be on equal footing with anyone else. If you’re trying to judge how well your emails are doing, this could be a better metric to use.
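To make the difference concrete, here’s a minimal sketch comparing raw Opens/Sends against the share of recipients who opened at least once. The function name, the recipient ids, and the idea that each open event carries a recipient id are all assumptions for illustration; your own pipeline will look different.

```python
from collections import Counter

def open_metrics(sent_to, open_events):
    """Return (raw opens per send, fraction of recipients with at least one open).

    sent_to: recipient ids the email was sent to (hypothetical ids).
    open_events: one recipient id per recorded open.
    """
    recipients = set(sent_to)
    # Count opens per recipient, ignoring ids we never sent to.
    opens_per_recipient = Counter(r for r in open_events if r in recipients)
    raw_rate = sum(opens_per_recipient.values()) / len(recipients)
    unique_rate = len(opens_per_recipient) / len(recipients)
    return raw_rate, unique_rate

# Four recipients: ann opens once, bob's copy gets opened 50 times.
sent = ["ann", "bob", "cat", "dan"]
opens = ["ann"] + ["bob"] * 50
raw, unique = open_metrics(sent, opens)
print(raw, unique)
```

In this toy data the raw number says the average recipient opened the email almost 13 times, while the per-recipient number says half of them opened it at all. The second one is much closer to what you probably wanted to know.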

Link Sharing

On the bright side, a link being shared is something that could happen with any link on the internet, not just your email link, so you’re not at any particular disadvantage with respect to web analytics. That said, shared links could also present interesting challenges.

Shared links can happen multiple ways. The most obvious one is via the entire email being shared. The second most common way would be the user clicks a link, then copies the URL from their browser. Finally, sometimes people will just copy a link out of the email and share that.

In two of them, shared email and link copied from email, the links within the email are preserved verbatim. Browser-copied links are unique in that they will bypass any redirects because those aren’t visible to a typical end user. All of these states have ramifications on what your end statistics will look like.

We’re going to have to take a bit of a technical detour here to clear up just what information you can get out of a link.

When the original links are preserved

When the original link redirects are preserved, it means that any click will trigger all the tracking mechanisms that are attached to the links. So if your links encode who the email recipient was in order to identify who clicked, any clicks from people the link was shared with will register as that original recipient. (Security concerns, yay!)

Most often you will only see this by looking at the distributions of clicks. A shared email will stand out for having significantly more clicks than the long tail. As with many things, it follows a power-law-esque distribution because email shares are fairly rare events, and each sharer could potentially go viral.
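If you wanted a crude way to eyeball this, something like the sketch below would do. It assumes each click event is tagged with the recipient id baked into the link, and the cutoff (ten times the median, with a floor of 20 clicks) is an arbitrary choice of mine, not a recommended standard.

```python
from collections import Counter
from statistics import median

def flag_probable_shares(click_events, multiple=10, min_clicks=20):
    """Return {recipient_id: clicks} for recipients far above the typical count.

    click_events: one recipient id per click (hypothetical ids).
    multiple, min_clicks: arbitrary thresholds, tune them against your data.
    """
    counts = Counter(click_events)
    if not counts:
        return {}
    typical = median(counts.values())
    cutoff = max(min_clicks, multiple * typical)
    return {rid: n for rid, n in counts.items() if n >= cutoff}

clicks = ["ann"] * 2 + ["bob"] * 1 + ["cat"] * 3 + ["dan"] * 400
flagged = flag_probable_shares(clicks)
print(flagged)
```

The median is used instead of the mean because the very outliers we’re hunting for would drag a mean upward and hide themselves.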

This should temper your expectations on what your metrics are telling you. Much like IP addresses, on average, the assumption that 1 click is 1 person will hold up, but the outliers are massive and will throw things off. All it takes is one influencer with 50k followers to share a link before things explode.

Should we do anything about this?

Cheap answer? It depends.

More realistically, I don’t think the analysis level is the most effective place to address the issue. Many problems could be caught on the service analytics pipeline end before it gets saved as data.

Could someone build a complicated system that looks at the distributions of IP/Geos and click volume to spot anomalous behavior and declare that a link has been shared? Probably. But I haven’t come across a situation where knowing “person X shared my special email to a million people!” would be useful to me.

To start, if I were concerned with information being shared, I wouldn’t trust email. There are more secure (albeit more convoluted) ways of sending messages that can’t be easily shared with other people. And if I were so paranoid that I must guard against leaks, I wouldn’t rely on email metrics alone.

You could always force users to log in after they click the link, which would hard-gate the identity of the user. Then you could even decide to ignore (or mark and silo away) clicks that never log in.

When original links/redirects are not preserved

This is probably the more common link sharing case because it matches typical human workflows. People will click on a link in the email, their browser goes through any redirects built into your link, triggering all your standard click analytics, then they copy the URL out of the browser after they arrive at the landing page.

At first blush, it seems all the useful information (that you normally get via the redirect) is gone. This is largely true if you don’t own the landing page and ALL you see is the redirect. To the landing page, someone following the shared URL will arrive as a completely new and direct user. The landing page won’t even get a referrer header.

However, if you do control the final landing destination, there’s one more way to pass information on so that not everything is lost: put it in the final URL!

URL Query Parameters are great

According to RFC 3986, the current standard governing URIs (w/ a few updates that don’t matter here), a URL’s path ends at the first ? (or #). Whatever follows the ? is the query component, which by convention holds key=value query parameters separated by &.

An example can be seen with this totally innocent search on Amazon I did; note keywords=googly+eyes+large, plus the qid and sr parameters.

https://www.amazon.com/Sntieecr-Wiggle-Googly-Adhesive-Decorations/dp/B07H4FYX5V/ref=sr_1_11?keywords=googly+eyes+large&qid=1582573168&sr=8-11

You can generate these parameters, either embedded in the email links, or attached upon the redirect. They’ll be preserved for the most part because very few users will bother stripping out the parameters when they share. This way, your analytics can parse the parameters and read whatever identifying information you choose to put there.
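As a rough sketch of what that could look like with Python’s standard library, the snippet below tacks two tracking parameters onto a landing URL and reads them back out. The parameter names cid and rt, the example.com URL, and the token values are all made up for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

def add_tracking(url, campaign, recipient_token):
    """Return url with hypothetical cid/rt parameters added to its query."""
    parts = urlsplit(url)
    # Keep whatever parameters the URL already had.
    params = parse_qs(parts.query, keep_blank_values=True)
    params["cid"] = [campaign]
    params["rt"] = [recipient_token]
    query = urlencode(params, doseq=True)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))

def read_tracking(url):
    """Pull the tracking parameters back out on the landing page side."""
    params = parse_qs(urlsplit(url).query)
    return {key: params[key][0] for key in ("cid", "rt") if key in params}

tagged = add_tracking("https://example.com/landing?ref=footer", "2020-02-weekly", "a1b2c3")
print(tagged)
print(read_tracking(tagged))
```

Anyone who copies that URL out of their browser carries the campaign and token along with it, which is exactly the point.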

One caveat: if you use parameters, a bad actor can in theory inject arbitrary strings into them. So you have one more weird data issue to worry about.
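One possible mitigation (my suggestion here, not something the approach above requires) is to sign whatever you put in the URL so that tampered values can at least be spotted. The sketch below uses an HMAC; the secret key, the token value, and the choice to truncate the signature to 16 hex characters are all placeholder assumptions.

```python
import hashlib
import hmac

# Hypothetical secret; in practice this lives in server-side config, never in the email.
SECRET_KEY = b"replace-with-a-real-secret"

def sign(value):
    """Return a short hex signature for a tracking value."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # truncated to keep URLs short; an arbitrary trade-off

def is_trusted(value, signature):
    """True only if signature matches what we would have generated for value."""
    expected = sign(value).encode("ascii")
    # Compare as bytes so odd attacker-supplied characters can't raise an error.
    return hmac.compare_digest(expected, signature.encode("utf-8"))

token = "a1b2c3"
sig = sign(token)
print(is_trusted(token, sig), is_trusted("someone-else", sig))
```

This doesn’t stop anyone from stripping or mangling the parameters. It just lets your analytics tell “value we issued” apart from “value someone typed in”.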

Yet another sneakier URL way

Parameters can be stripped out pretty easily because they’re easily identified and are often optional. But what if you want to make it harder for people to mess with your data? You can slip analytics into the URL itself.

In the above Amazon link, https://www.amazon.com/dp/B07H4FYX5V/ will actually take you to the exact same product. The parameters, and the full text name of the product, are actually unnecessary. Amazon is using that extra info for other reasons that have nothing to do with getting the user to the product (e.g. SEO). You could take this idea and include other bits of tracking in your URLs. It’s much less obvious which bits are optional and which are not this way.

The downside to this (and there are always downsides) is that it creates more complexity in your URL structures. Also, if you’re doing various rewrite directives or complicated URL-based routing, this scheme makes your regex EVEN HARDER. So be careful what you wish for here. You could easily make a nightmare for yourself without meaning to.
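Here’s a small sketch of what a path-embedded tracking segment might look like, and of the extra regex burden it brings. The /read/<article_id>/<tracking> layout is entirely hypothetical; it isn’t how Amazon or anyone else in particular does it.

```python
import re

# Hypothetical layout: /read/<article_id>[/<tracking>][/]
# Only the article id is needed for routing; the tracking segment is optional.
ARTICLE_PATH = re.compile(
    r"^/read/(?P<article_id>\d+)(?:/(?P<tracking>[A-Za-z0-9_-]+))?/?$"
)

def parse_article_path(path):
    """Return (article_id, tracking or None), or None if the path doesn't match."""
    match = ARTICLE_PATH.match(path)
    if match is None:
        return None
    return match.group("article_id"), match.group("tracking")

print(parse_article_path("/read/42/a1b2c3"))
print(parse_article_path("/read/42"))
```

Even in this tiny case the pattern already has an optional group and an optional trailing slash to get right. Multiply that by every route you have and you can see where the nightmare comes from.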

Making sense of all of this stuff

Now that we’ve gone on a detour to show how you can use clever tricks to squeeze data out of shared links, we can get back into the analysis.

With all the additional information you can glean from links (assuming you can convince people to log in or tie their pseudo-identity to an action), you can broadly categorize activity into two big buckets: 1) actions likely done by intended recipients 2) actions done by other people. You can also choose to analyze everything just at a high aggregate level (which is what most people do).

For the bucket of activity you can reasonably link to intended recipients, you should be able to build a tight interaction chain with your other on-site analytics. At that point, the email merely becomes an extension of your analytics toolset, and you make some allowances for occasional lossy behavior as people switch from client to browser. You can do funnel analysis, etc., and generally expect your numbers to be fine. The few “weird” errors (people appearing later in a flow than they should have) can be treated as tracking bugs.

For the second bucket, unintended (shared) recipients, you can treat them as word-of-mouth referrals. Someone had to go through the trouble of sharing something, and recipients had to engage. That’s useful signal to have. It’s to your benefit to try to separate this unique bucket out from random other traffic, because their engagement level is likely unique.
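The bucketing itself can be very simple once a click carries both the recipient the link was issued to and whoever ended up logging in. The sketch below assumes exactly that; the field layout, the ids, and the extra “unknown” bucket for clicks that never log in are my own illustration.

```python
from collections import Counter

def bucket_click(intended_recipient, logged_in_user):
    """Classify one click by comparing the link's recipient to the logged-in user."""
    if logged_in_user is None:
        return "unknown"    # never logged in, so we can't say who this was
    if logged_in_user == intended_recipient:
        return "recipient"  # the person we actually emailed
    return "shared"         # someone else followed the recipient's link

# (intended recipient from the link, user id seen after login or None)
events = [("ann", "ann"), ("ann", "zed"), ("bob", None), ("bob", "bob"), ("ann", "yan")]
buckets = Counter(bucket_click(intended, actual) for intended, actual in events)
print(buckets)
```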

Finally, for the aggregate stats, you’ll have no choice but to break out the most powerful handwaving analysis weapon in the data science arsenal: (the assumption of) the law of large numbers. Given a large enough recipient base and repeated similar-ish emails, the behavior of the population should average out to something resembling predictability. On average, your open rates will wind up falling within a certain range, the click-through-rate will be in a certain range, same for the number of shares.

The reason why I call it a handwaving tactic is because we’re making a lot of big assumptions by invoking this law. The biggest ones being that 1) the behavior of users will be the same across emails (haha), 2) whatever tracking errors we have regarding emails will be consistent within our email tracking systems, and 3) the people we’re sending emails to are roughly the same. All of these are very questionable.

Is the email you sent out last week very similar to the one you sent this week? What do we even mean by “similar”? And did we make any changes to our email tracking stack? Did we add/fix any bugs? Did we get any new signups for emails? Have people opted out since last week?

The answer to all those questions is probably in a constant state of flux. But oddly enough, these large sources of error and noise all fade away in the aggregate. At least enough for making rough business decisions.

The hard part is walking the tightrope between accepting these handwaving assumptions so that we can use statistical tests that require them (independence assumption anyone?) to do comparisons, and rejecting the assumptions to say “this email is unnaturally viral and has broken the assumptions too far”. Like with any outlier-removal strategy, you can easily fall into the trap of cherry-picking your data if you do both.

Despite the risk, in a business context, there is a huge amount of latitude for reasonable people to disagree about where to draw the lines for what constitutes an “outlier” that needs to be separated out from the main data set. Maybe the outliers should be held in a category of their own and analyzed separately.

However you decide to handle outliers for your use case, you should be able to at least have a working idea of what “normal” stats look like for the email you’re going to send out today, and be surprised and intrigued if the actual stats deviate from it.
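One handwavy way to turn “a working idea of normal” into something you can check is to take the rates from past sends and flag anything that lands far outside them. The sketch below uses mean plus or minus two standard deviations, which leans on all the questionable assumptions discussed above; the history values and the width of 2 are invented for illustration.

```python
from statistics import mean, stdev

def expected_range(past_rates, width=2.0):
    """Return (low, high) as mean +/- width standard deviations of past rates."""
    center = mean(past_rates)
    spread = stdev(past_rates)
    return center - width * spread, center + width * spread

def is_surprising(rate, past_rates, width=2.0):
    """True if today's rate falls outside the range past sends would suggest."""
    low, high = expected_range(past_rates, width)
    return rate < low or rate > high

# Hypothetical unique open rates from five earlier, similar-ish emails.
history = [0.21, 0.19, 0.22, 0.20, 0.18]
print(expected_range(history))
print(is_surprising(0.35, history), is_surprising(0.20, history))
```

A flag here isn’t a verdict. It’s just the nudge to go look at whether the email went viral, the tracking broke, or the audience changed.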