Sessions for analysis, the eternal fiction

There's no escape

Jul 21, 2020

Anime Expo 2011, 1st Hatsune Miku Live Concert in the US. Finding a photo to express “session” is hard.

A decade ago, a good ML engineering friend of mine complained to me that “People need to understand that HTTP is a stateless protocol.” He was specifically complaining about people casually talking about measuring and utilizing “user sessions” as a concept while discussing how to analyze products.

Now, my friend knows that people use the concept of sessions for various purposes because it’s an extremely convenient and natural construct for humans. His point was exactly that: sessions are a constructed fiction. There is nothing within the HTTP protocol itself that imparts state. It’s why you can browse two or more pages of a site in multiple browser tabs, and everything largely works. The servers neither know nor care what you were doing before, after, or in parallel.

Session basic example Google Analytics

Let’s briefly go over the concept of a session in the context of web analytics for those who aren’t as familiar.

Probably the most well-known definition of a user’s web session comes from Google Analytics, the ubiquitous tracking script that’s used on the vast majority of sites because it allows site owners to get, for free, an advanced analytics package for their web site. Since it’s quite powerful for a free product, almost everyone uses it to start out. You often see it as the backup analytics package for larger sites because it runs in a completely separate tech universe from other packages, and can be used as a reference point.

By their definition, a user session is a collection of activity on a given site: the pageviews and other events fired off by the tracking code. A session starts when a user begins doing things on the site, and ends under the following conditions:

There’s no new activity within a 30 minute window
Midnight happens (Apparently based on your GA account’s view timezone settings)
A user enters via one campaign (an ad, search, etc.) but then re-enters via another campaign, even if the two visits would otherwise fall within the same session

Because of that 30 minute window, a user session can stretch for hours if a user keeps coming back, up to a full day if they go from midnight to midnight. Someone with an auto-refreshing browser tab could easily fulfill this condition.
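To make those three rules concrete, here is a rough Python sketch of how they might combine. This is not Google's code. The function name assign_sessions, the (timestamp, campaign) tuple layout, and the choice to treat timestamps as already being in the view's timezone are all assumptions made up for illustration.

```python
from datetime import datetime, timedelta


def assign_sessions(events, timeout=timedelta(minutes=30)):
    """Label one user's time-ordered events with session numbers.

    `events` is a list of (timestamp, campaign) tuples, where campaign is
    None for hits that did not arrive through a campaign. Timestamps are
    assumed to already be in the reporting timezone.
    """
    labels = []
    session = 0
    last_ts = None
    session_campaign = None
    for ts, campaign in events:
        new_session = (
            last_ts is None                      # very first hit
            or ts - last_ts >= timeout           # rule 1: inactivity timeout
            or ts.date() != last_ts.date()       # rule 2: midnight passed
            or (campaign is not None
                and campaign != session_campaign)  # rule 3: different campaign
        )
        if new_session:
            session += 1
            session_campaign = campaign
        labels.append(session)
        last_ts = ts
    return labels


start = datetime(2020, 7, 21, 23, 0)
events = [
    (start, "newsletter"),                         # session 1 begins
    (start + timedelta(minutes=10), None),         # still session 1
    (start + timedelta(minutes=20), "search_ad"),  # new campaign: session 2
    (start + timedelta(minutes=45), None),         # 25 minute gap: session 2
    (start + timedelta(minutes=65), None),         # 00:05 next day: session 3
]
print(assign_sessions(events))  # [1, 1, 2, 2, 3]
```

Notice that the last hit is only 20 minutes after the previous one, and it still lands in a new session purely because the clock rolled over.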

You can actually change the session timeout settings for Google Analytics, with a bit of work, but the vast majority of people use the default settings.

But session-ization doesn’t stop at 30 minutes

I have no idea how Google Analytics decided on the 30 minutes. While I’m sure that plenty of research went into it, the clean “30 minute” does have a very human “ehhh, about half an hour sounds good enough” feel to it too.

In fact, if you look at the docs for changing the timeout, the recommendation is to change it to be in line with any forced logoff time that you have (like you see with banking web sites). They also suggest potentially lengthening the timeout if you expect users to have to go through a lot of content (imagine hour-long video pages), or lowering it if there’s not much content for users.

As platforms change, definitions also change. Mobile analytics in particular tend to have short session timeouts. On Adobe Analytics, it defaults to 300 seconds. Flurry Analytics lets you set the session timeout for a mobile app as low as 5,000 milliseconds, which is sorta nuts: that’s 5 seconds between events, like a touch or scroll.

This brings up the whole point of this post: sessions are created through some pretty arbitrary decisions based on your specific use case. And analytics packages aren’t the only place you’ll be asked to work with the concept.

So sessions are fictions, why do we keep coming back to them?

I think it has to do with how humans process information. We naturally make use of the chunking strategy to group things together to boost our short-term memory capabilities. Sessionizing is a similar strategy of summarizing a large block of time and activity with a couple of shorthand strokes.

We don’t know how to process a 100-pageview-long sequence of events made by a user across 3 days. But in wanting to assign meaning to it, we say “the first 40 views were close together in time and were all within the vacuum section of the store, so the user was shopping for vacuums”. It puts a clear narrative around a chunk of events, gives context, and allows us to reason about it in a compact chunk that can be aggregated across users.

Once you have a narrative, like say “the user put something in their cart and is trying to purchase by starting checkout”, you can analyze all those checkout sessions and see if there are problems that can be improved. Human brains are much more used to reasoning about information in this way.

If you know that 99% of users who tried to check out failed to complete within the same session for some reason, that’s a red flag. Maybe something is horribly broken in your checkout process. It’d be even weirder if people eventually check out, but in a different session. Maybe there’s something broken in your session definition, or you’re selling a product that has lots of friction in the process.

At the same time, the convenience in reasoning with user sessions can be problematic.

People often want to conflate user tasks with user sessions, for example “I need a new vacuum so I’m going to look at Amazon today.” There are multiple ways to interpret that task. If you interpret it to be “user task is to place an order for a vacuum”, anyone who doesn’t buy a vacuum that day could be considered to have failed at their task. But the actual user journey from deciding they want a vacuum, to researching vacuums, to actually purchasing is likely to stretch across multiple sessions.

Since all we have is just a sequence of telemetry data grouped within time, there are a vast multitude of hypotheses and potential narratives we could attach to any session. For a user that suddenly stops a task mid-session, there’s no way to actually determine if a user was discouraged, accidentally in the wrong flow, or distracted. This is where we need to rely on qualitative methods to learn just what is going on.

Making your own sessions

There will come a point in time in your work life where you’ll have to make sessions out of nothing. In general, this process is a giant pain in the butt! It’s an involved process that takes a significant amount of time to try, test, and iterate multiple times until you wind up with something useful.

In terms of mechanics, with access to event log data, you can actually use some rather intensive SQL or code to identify and create “sessions”. Effectively you just group together events that are within X minutes of each other and mark them accordingly. For example, Randy Zwitch here talks about sessionizing data using PostgreSQL window functions to define the time boundaries and then mark them with IDs. It’s not super hard, though the specific quirks of your data may add a lot of complexity to the process. Doing it quickly and efficiently can also be a challenge that you have to specifically engineer around.
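If SQL isn't your thing, the same "group events within X minutes of each other" idea can be sketched in plain Python. This is a minimal illustration, not the approach from the linked post. The function name sessionize, the (user_id, timestamp) event shape, and the "user-N" session id format are all made up here.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def sessionize(events, timeout_minutes=30):
    """Attach a session id to every (user_id, timestamp) event.

    Follows the usual window-function recipe: sort by user and time, compare
    each event with the previous one for the same user (like LAG), and bump a
    per-user counter whenever the gap exceeds the timeout (like a running SUM).
    """
    timeout = timedelta(minutes=timeout_minutes)
    last_seen = {}                  # user_id -> timestamp of the previous event
    counters = defaultdict(int)     # user_id -> number of sessions started
    labeled = []
    for user_id, ts in sorted(events):
        prev = last_seen.get(user_id)
        if prev is None or ts - prev > timeout:
            counters[user_id] += 1  # first event or gap too large: new session
        last_seen[user_id] = ts
        labeled.append((user_id, ts, f"{user_id}-{counters[user_id]}"))
    return labeled


t0 = datetime(2020, 7, 21, 9, 0)
events = [
    ("alice", t0),
    ("alice", t0 + timedelta(minutes=5)),
    ("alice", t0 + timedelta(minutes=50)),  # 45 minute gap: new session
    ("bob", t0 + timedelta(minutes=1)),
]
for user_id, ts, session_id in sessionize(events):
    print(user_id, ts.time(), session_id)
```

The sort is the expensive part on real data, which is one reason the performance question above is worth testing rather than assuming.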

Sometimes session definitions are so complex, it’s more sustainable to offload the whole process into an ETL job instead of a massive SQL query. Code is generally easier to version control and debug than complex SQL. Depending on details about your data size, hardware, and infrastructure, one method might actually be higher performance than the other. You’ll have to try and compare to see.

Oftentimes, it’s much easier to mark things with a session_id as the data is streaming in instead of doing a large event table self-join or window functions. You could conceivably just have a caching system that remembers users who are within the timeout window and just mark any request that misses the cache as a new session start.
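A toy version of that caching idea might look like the sketch below. It uses an in-memory dict as a stand-in for whatever cache you would really use, and it assumes events for a given user arrive in time order. The class name SessionCache and its tag method are hypothetical.

```python
import itertools


class SessionCache:
    """Toy in-memory stand-in for a cache with per-user expiry."""

    def __init__(self, timeout_seconds=1800):
        self.timeout = timeout_seconds
        self._entries = {}              # user_id -> (session_id, last_seen)
        self._ids = itertools.count(1)  # source of fresh session ids

    def tag(self, user_id, now):
        """Return the session id for an event from `user_id` at `now` (epoch seconds)."""
        entry = self._entries.get(user_id)
        if entry is None or now - entry[1] > self.timeout:
            session_id = next(self._ids)  # miss or expired entry: new session
        else:
            session_id = entry[0]         # hit: the current session continues
        self._entries[user_id] = (session_id, now)  # refresh the timeout window
        return session_id


cache = SessionCache(timeout_seconds=1800)
print(cache.tag("alice", 0))     # 1
print(cache.tag("alice", 600))   # 1
print(cache.tag("bob", 700))     # 2
print(cache.tag("alice", 3000))  # 3
```

A real deployment would also need to evict stale entries and cope with late or out-of-order events, which this sketch ignores.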

But setting up a streaming-based system requires that you know all the definitions to begin with (or be willing to accept marking and saving bad sessions in your data). The initial research to get to that point usually means you’ll have to do it “the hard way” at least once, if not multiple times.

But mechanics of code aside, what would you have to consider when you do embark on defining your own sessions?

First is obviously defining “what counts as activity”. I’ve had to work on projects where only very specific user actions counted as activity, and others where just about any user input, even from robots/scripts, would count. It really depends on what questions you’re looking to answer.

Then you’ll have to define at what point users start and end sessions. This usually has a ton of unique-to-you edge cases that you have to work through, like the “new campaign” case for Google Analytics. Inactivity timeouts are the first place to start, but you’ll have to do a lot of manual research to figure out where to set that timeout.

I’ve had projects where a typical user session might be a multi-day journey, with potential system failures, gaps for sleep, and restarts. Meanwhile, typical user behavior with mobile apps is very brief and fragmented, so those sessions will also look different. Then you have to think about times when people deliberately signal intent to leave, continue, or start a new session, like explicit logon and logoff actions. Should you use that information? Some systems do, some don’t.

You often have to try a definition, then attempt to run some analysis using that definition to see whether it is useful. What’s the longest session length? What’s the shortest? What’s the distribution of times? Are there any interesting patterns that come out when you look at it through the session lens?
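Those first-pass questions are cheap to answer once you have a list of session lengths. Here is a small sketch using only the standard library. The function name summarize_durations and the sample numbers are invented for illustration.

```python
from statistics import median, quantiles


def summarize_durations(durations_seconds):
    """Return basic descriptive stats for session lengths given in seconds."""
    ordered = sorted(durations_seconds)
    p25, _, p75 = quantiles(ordered, n=4)  # quartile cut points
    return {
        "count": len(ordered),
        "shortest": ordered[0],
        "longest": ordered[-1],
        "median": median(ordered),
        "p25": p25,
        "p75": p75,
    }


# Made-up session lengths: a few near-zero sessions and one two-hour outlier.
sample = [0, 5, 12, 40, 95, 300, 1800, 7200]
for name, value in summarize_durations(sample).items():
    print(name, value)
```

Looking at the shortest, longest, and the quartiles side by side is usually enough to spot a suspicious pile of zero-length sessions or a handful of absurdly long ones.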

You’re probably going to find patterns that make you wonder “what’s going on, is this right?” For example, a huge cluster of very short sessions. You’ll wind up going on a mini research trip to figure out whether that cluster “is real” or some artifact of your session definition (maybe the timeout is set too short).

Then you’ll likely want to try a couple of other definitions, run similar analyses, and compare. It’s essentially doing a sensitivity analysis on the definitions. The problem is that there’s no single well-defined optimization function to use to refine the definition parameters. The art in this is figuring out what sorts of distributions seem the most “correct-ish” and useful.
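One simple way to start that comparison is to sweep the timeout and watch how the session count moves. The sketch below does this for a single user's events. The helper names count_sessions and timeout_sweep, and the timestamps, are made up for illustration.

```python
def count_sessions(timestamps, timeout_seconds):
    """Count sessions in one user's sorted epoch-second timestamps."""
    if not timestamps:
        return 0
    gaps = (later - earlier for earlier, later in zip(timestamps, timestamps[1:]))
    # Every gap longer than the timeout closes one session and opens another.
    return 1 + sum(gap > timeout_seconds for gap in gaps)


def timeout_sweep(timestamps, candidate_timeouts):
    """Map each candidate timeout to the number of sessions it produces."""
    return {t: count_sessions(timestamps, t) for t in candidate_timeouts}


# Made-up event times with gaps of 20, 80, 300, 2100, 100 and 6400 seconds.
timestamps = [0, 20, 100, 400, 2500, 2600, 9000]
print(timeout_sweep(timestamps, [5, 300, 1800, 3600]))
# {5: 7, 300: 3, 1800: 3, 3600: 2}
```

A stretch where the count barely changes, like 300 to 1800 seconds here, hints that the definition isn't very sensitive in that range. A stretch where it swings wildly is where the manual digging starts.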

How do you know you’re done?

Probably never. But you will get tired of doing the research, which is about as good a place to stop as any.

In the end, you’re going to still feel a bit unsure about the quality of your session definitions, but unable to find any obvious flaws in the system as you attempt to apply it to problems. This point is probably the best time to just call the project done. Any further issues you find along the way can be incorporated into an update.

The increased friction you’re feeling each iteration is the slope of diminishing returns.

Counting Stuff