It’s coming to the end of the year, which means it’s “plan for next year” season! Which, for us data folk, also means “report on what has happened to help plan” season. Of course, all this means that data teams are experiencing an uneven period of relative quiet (because everyone is busy planning and winding down the year) and hectic busyness (because of all that planning, projecting, and anticipating of speedbumps in the future).
That brings us to this question here:
There’s a lot of stuff to unpack:
What to do when your ongoing survey changes?
How to best handle reporting when such changes are in play?
When should we invite this trouble by changing our measurement?
Before I go into this, I should take a second here with a disclaimer — survey methodology is a HUGE, super well-studied field because so many scientific disciplines rely heavily upon it. I only know one tiny slice, and don’t even consider myself an expert on that tiny slice. So when in doubt, consult expert survey methodologists and texts. Their opinions override mine by a long shot. I just run shady bootleg surveys measuring very trivial things.
Changing surveys always causes trouble for researchers
In academia, questionnaires that have been validated (we’ll get to this later) are changed with great care and reluctance. The primary reason is that changes, even seemingly minor ones like question ordering or single-word swaps, can have significant and unpredictable effects on how people answer the questions. When you’re using a well-studied psychological construct (like, say, personality via the Big Five) you want to reuse the exact same instrument so that you don’t risk invalidating your entire study over a trivial, avoidable issue.
The problem is that in industry we rarely have well-studied, exactly worded survey instruments in common use… (ok, NPS technically is an oft-used one, but I’d argue it’s neither super well studied nor well validated). So we very often make up our own questions on the spot, project to project. Then, since things are constantly changing in industry, we’re often put in a position to consider changing a survey for various reasons.
The biggest issue is that, because small changes can have unexpected big effects, you have to be extremely careful when comparing one version to another. (Pew has a brief article about writing survey questions that touches on some of these topics.)
Doing surveys the “right” way
In an academic setting, there’s a lot of work that needs to be done to “validate” a questionnaire instrument. For example, let’s say I’m trying to understand whether a user is satisfied with using our product. The theoretical construct we’re trying to measure is “satisfaction”. The operationalization of it, aka the practical realization of a measurement of satisfaction, will be the question “How satisfied are you with [product]?” measured on a familiar 5-point scale of “Very dissatisfied” - “Neutral” - “Very satisfied”.
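To make the measurement side concrete, here’s a minimal sketch of what this operationalization might look like in code. The exact scale labels and the “top-2-box” summary are illustrative choices of mine, not the only way to score a 5-point item:

```python
# A hypothetical encoding of the 5-point satisfaction item.
LIKERT_5 = {
    "Very dissatisfied": 1,
    "Dissatisfied": 2,
    "Neutral": 3,
    "Satisfied": 4,
    "Very satisfied": 5,
}

def pct_satisfied(responses: list[str]) -> float:
    """Share of respondents choosing 'Satisfied' or 'Very satisfied' (top-2-box)."""
    scores = [LIKERT_5[r] for r in responses]
    return sum(s >= 4 for s in scores) / len(scores)

# Made-up responses, purely for illustration
sample = ["Satisfied", "Neutral", "Very satisfied", "Dissatisfied", "Satisfied"]
print(f"{pct_satisfied(sample):.0%} satisfied")  # -> 60% satisfied
```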
There are of course multiple ways to operationalize the same concept of satisfaction. For example, we could ask about dissatisfaction instead, and if we think of satisfaction as existing on a bipolar scale, then dissatisfaction ratings should move in the opposite direction from satisfaction ratings. We could also have a 7- or 10-point scale, or one that starts at neutral and goes up to “super ultra happy and satisfied!” if we wanted.
The point of validation is to prove to ourselves and our peers that what we set out to measure is actually what’s being measured, sometimes called construct validity. The instrument needs to behave consistently with how we theorize “satisfaction” works, going up when users are satisfied, down when they’re not, and not changing just because it’s Tuesday. Maybe we need to test various wordings of the question to see which gives us the most consistent and true results.
We also need the survey to be reliable, giving the same results when the same people are asked again and again. It’s not helpful if a user says they’re satisfied one day, then a few hours later (with nothing significant changing) says they’re dissatisfied. There are also all sorts of biases (response bias, sampling bias, non-response bias, etc.; a nice list of many is available here) that we need to be aware of in our survey design.
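As a rough illustration, here’s what a test-retest reliability check could look like, assuming you could somehow survey the same users twice with nothing meaningful changing in between (the paired data here is invented):

```python
# Sketch of a test-retest reliability check with invented paired data.
from scipy.stats import pearsonr

# 1-5 scores from the same respondents at time 1 and time 2
time1 = [4, 5, 3, 2, 4, 5, 3, 4]
time2 = [4, 4, 3, 2, 5, 5, 3, 4]

r, p = pearsonr(time1, time2)
print(f"test-retest correlation r = {r:.2f} (p = {p:.3f})")
# High r: the instrument gives stable answers.
# Low r: respondents flip-flop even when nothing has changed.
```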
It takes lots of research work, sometimes including full individual studies, to validate the many pieces of a given survey. And guess what happens if you make a change? You probably need to re-validate the survey to show that it functions consistently with the previous version, and to understand how things change between versions. Even if you don’t do every validation test to the same extent and detail, you still have to show that it measures what you intend to measure.
This is why it can take years for a new version of a well-studied survey instrument to come out. For example, if your questionnaire is used as a potential assessment tool for psychopathy, with all sorts of potentially serious consequences for subjects, you don’t screw around with it just for fun.
We rarely do any of that in industry
I know there’s a bunch of serious researchers with academic research backgrounds who do take survey methodology extremely seriously, people involved in political polling and market research come to mind. I’m not talking about the wonderful work of those folk.
I’m talking about people like myself, on the front lines making cheap disposable surveys to measure stuff like customer satisfaction, documentation quality, etc. that might get looked at for a few quarters and then changed. By the time you could validate all the survey questions, the question at hand might already be irrelevant. So we cut corners. LOTS of corners.
About the only thing we’ll do is make sure the survey questions pass the initial “face validity” test where we check that on the surface it seems to match what we want. For example, our question about satisfaction actually uses the word “satisfied”, and not “happy” or something else. It’s an open question whether the reader’s understanding of “satisfied” matches our notion of “satisfaction”, but it at least sounds pretty close.
And of course, if we can’t be bothered to validate our survey at the very start, we certainly aren’t likely to validate any changes going out. So when we’re suddenly put in a situation where we’re not sure what has changed, or due to what factors, we’re stuck. Oh no.
What CAN we know?
Let’s say we took our 1-question satisfaction survey and decided to change it. Instead of “How satisfied are you with [product]?” we asked “Did [product] satisfy your needs today?”, again on a 5-point scale. Why? Because we somehow thought it was a better measure for whatever we were doing. *Shrug*
If we’re willing to make the (very big) assumption that the underlying construct of “satisfaction” hasn’t changed (as in, we haven’t changed the product experience at all), then we can attribute most of any change in the measurement value (say, 40% “satisfied” dropping to 30% “satisfied” after moving to the new version) to the change in how people are interpreting the question.
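If you wanted to at least check that a shift like 40% to 30% is bigger than sampling noise, a two-proportion z-test is one way to do it. This is a sketch with made-up sample sizes, and note that a “significant” result only tells you the numbers differ, not why:

```python
# Two-proportion z-test on a hypothetical 40% -> 30% "satisfied" shift.
from math import sqrt
from scipy.stats import norm

n_old, sat_old = 500, 200   # old wording: 200/500 = 40% satisfied
n_new, sat_new = 500, 150   # new wording: 150/500 = 30% satisfied

p1, p2 = sat_old / n_old, sat_new / n_new
p_pool = (sat_old + sat_new) / (n_old + n_new)             # pooled proportion
se = sqrt(p_pool * (1 - p_pool) * (1 / n_old + 1 / n_new))
z = (p1 - p2) / se
p_value = 2 * norm.sf(abs(z))                              # two-sided
print(f"z = {z:.2f}, p = {p_value:.4f}")
```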
The problem is the big assumption that nothing has changed in the underlying construct, or in the measurement characteristics. If your sampling population demographics shift in any way, that could mess things up. If the site changes in any way, you’re done. The assumption becomes increasingly difficult to justify over any significant period of time. Also, since we typically implement these surveys in order to measure the effect of product changes, it would defeat the purpose to assume that entire reason for existence away!
Once two (or more) things are moving simultaneously, the underlying satisfaction as well as the question itself, we can’t know what caused which effect. Imagine if you launched your new survey and immediately after, the site went down for a day due to the datacenter catching fire. Did satisfaction go down because of the survey, or the outage? Even if you exclude the time around the outage as an outlier, just how do you decide what is and isn’t an outlier?
The only way to work around this problem would be to run a test where you have BOTH versions of the question running in parallel, either in a within-subjects or between-subjects design. Then you can at least figure out how the two measures relate to each other, to give people reference points. For example, if we find that there’s a consistent 10-point gap between the two, we can translate between them in reporting.
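Here’s a bare-bones sketch of the between-subjects flavor of this, with hypothetical function names: deterministically split users across the two wordings, then treat the steady gap between the two rates as a rough translation factor.

```python
# Sketch of a between-subjects parallel run. Names are hypothetical.
import hashlib

def assign_version(user_id: str) -> str:
    """Deterministic 50/50 split so a given user always sees the same wording."""
    digest = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    return "v1" if digest % 2 == 0 else "v2"

def translation_gap(v1_rate: float, v2_rate: float) -> float:
    """Offset to add to v2 numbers to put them on the v1 scale."""
    return v1_rate - v2_rate

print(assign_version("user-123"))  # stable across sessions

# e.g. over the same period, v1 reads 40% satisfied and v2 reads 30%
gap = translation_gap(0.40, 0.30)
print(f"add {gap:+.0%} to v2 results when comparing against v1 history")
```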
But what if I totally didn’t plan this and I didn’t run a test?
Well, it’s time to get very creative. One option is getting very handwavy with a pre/post analysis, because there’s nothing else we have access to. It’s the least ideal solution, but it’s something.
You can look at the period right after the launch of the new survey and compare it against the old data, going only as far as you can reasonably sustain the “nothing besides the survey has changed” argument. That’s as good a reference point as you’re ever going to get.
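As a sketch, that handwavy pre/post comparison might look something like this, assuming a known switch date and a window only as wide as the “nothing else changed” argument holds (all names, dates, and field layouts here are hypothetical):

```python
# Sketch of a pre/post window comparison around a survey wording switch.
from datetime import date, timedelta

SWITCH = date(2021, 10, 1)   # hypothetical v2 launch date
WINDOW = timedelta(days=14)  # only as wide as "nothing else changed" holds

def window_rate(rows: list[tuple[date, int]], start: date, end: date) -> float:
    """Top-2-box rate (score >= 4) for responses dated in [start, end)."""
    scores = [score for d, score in rows if start <= d < end]
    return sum(s >= 4 for s in scores) / len(scores)

# rows: (response_date, 1-5 score) pooled across both survey versions
# pre  = window_rate(rows, SWITCH - WINDOW, SWITCH)
# post = window_rate(rows, SWITCH, SWITCH + WINDOW)
# crude_gap = pre - post   # report with loud caveats
```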
The key point is that no matter how you arrive at this rough translation factor, if you even find one, you must state very clearly up front that the surveys have changed and that comparisons are unreliable guesses at best. Sometimes in these situations, it helps to have the exact differences on hand to highlight if the question comes up. Stakeholders, unless they also have research backgrounds, are mostly going to be trusting the surface validity of the measures, so the level of rigor required is lower than you would expect.
But despite all the smoke and mirrors, we’ll never really know what happened when we swapped questions.
So why do this? Why make changes? Why report them this way?
Because what we’re interested in has shifted somewhat. We’ve learned that our initial wording wasn’t yielding useful results and want to make them better. Or we have some other perfectly reasonable reason to submit ourselves to this extra work. I assume that if we’re even having this discussion about changing a long-running survey, we believe it’s fundamentally a good idea.
But if you think about it, it’s not the actual survey questions and data collection itself that’s causing us the pain. It’s actually the reporting of the survey results. We want to run tests and characterize the differences between version 1 and version 2 because we’re reporting on some long term trends and want to account for the change.
Hold up. We may want to account for the change, but is it actually necessary?
What’s stopping us from simply saying that “we thought v1’s questioning gave bad results [example], so we switched to v2. The results aren’t directly comparable but we think they’re more reflective of reality”? Just call out that a big change happened and find a way to move on with the analysis.
This method actually goes over pretty well in less rigorous contexts. Since the construct of “satisfaction” is always nebulous and prone to fickle user behavior, people tend to (rightly) care more about trends over time than about individual measured values. Sure, doing a big metric reset like this takes away your ability to see trends for a few cycles, but if you’re measuring often enough it might not matter. Otherwise, it might be worth planning a small validation study as a cross-check.
About this newsletter
I’m Randy Au, currently a Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. The Counting Stuff newsletter is a weekly data/tech blog about the less-than-sexy aspects of data science, UX research, and tech, with occasional excursions into other fun topics.
All photos/drawings used are taken/created by Randy unless otherwise noted.
A curated archive of evergreen posts can be found at randyau.com
Supporting this newsletter:
This newsletter is free, so share it with your friends without guilt! But if you like the content and want to send some love, here are some options:
Tweet me - Comments and questions are always welcome, they often inspire new posts
A small one-time donation at Ko-fi - Thanks to everyone who’s supported me!!! <3