Attention: As of January 2024, we have moved to counting-stuff.com. Subscribe there, not here on Substack, if you want to receive weekly posts.
November 30th marks the one-year anniversary of ChatGPT being thrust upon the world and throwing gasoline on top of what had already been a growing bonfire of “things being branded as AI”. Since then, the entire world has been locked in a global rehashing of a lot of the discussion around Searle’s famous Chinese Room Argument, while people rediscover that humans will form emotional attachments to simple computer scripts coded in the 1960s, let alone to models with billions of parameters.
On average, my undergrad readings in philosophy of mind have made me a skeptic of a lot of the things billed as “AI” these days. But this past Friday, Benn wrote yet another great post that highlights one of the actual use cases of LLMs that I don’t think is complete garbage. The setup is (paraphrased) as follows:
Imagine you quit your job and open up a bar. You’ve used all your skills as a data scientist to make sure you collect maximal high quality data about every aspect of the bar’s operation. Money, expenses, inventory, and supplier prices are all perfectly recorded, and you’ve got a customer loyalty program that lets you attribute all sorts of demographic data to every purchase. But the bar isn’t doing well; it’s just not bringing in enough customers to break even, though you’re always sorta close.
You take a vacation for a week to gather your thoughts about how to turn the business around, and during that time the friend you left in charge bribed customers with free beer to record 2-minute video feedback about what they thought of the bar. By the time you come back, you’ve got over a day’s worth of video sitting on a laptop waiting for you to review.
But at the same time, the bar’s board of directors (this is apparently a Very Serious bar venture) has given you an ultimatum — give us a comprehensive plan to turn the business around by tomorrow or you’re out.
The question Benn poses is this: under tight deadline pressure to deliver insights, which data source would you use to save your bar and your job?
He notes that it is obvious that the answer to the turnaround question lies somewhere in those hours of rambling videos, not hiding in the terabytes of financial/behavioral data logs you’ve collected in your panopticon of a bar. But the data industry has, for the past many decades, largely sold solutions for tackling all that carefully collected structured data. The tools are almost exclusively in this space, and if you believe that funding is aligned with utility, you’d want to be using the structured data. Squeezing meaning out of hundreds of hours of video isn’t in the domain of “math” nor “databases”; that’s social science. That doesn’t scale!
I personally attribute some of my success as an analyst to the fact that I’m usually more willing than my colleagues to identify that a given dataset is too unstructured to analyze at scale with a fancy technique and just sit down and start burning hours (and brain cells) manually reading and coding freeform text responses. Managing to squeeze some themes out of a pile of data, even with a boatload of caveats, is sometimes extremely important in tight deadline situations.
Benn points out that the gap between structured and unstructured data tooling is going to need to change. LLM-based models provide a novel, seemingly promising, path for tackling unstructured data and making it accessible. Maybe the industry is ready for this shift, maybe not. But we can’t ignore it because it’s definitely coming.
The traditional way to do this ugly social science work was to roll up your sleeves and do the things that “qualitative” (*gasp*) researchers did — sit down and watch those videos, take notes, and synthesize the notes into findings identifying common themes and patterns. The primary cost of such methods is the sheer time it takes. Humans can only process information so quickly, and we also need rest. Naturally, people have tried to scale things by employing multiple humans who agree on a coding scheme. But once you add more people, the next problem is coordination: there is always ambiguity that needs to be dealt with. Different adults — even highly trained ones — can disagree about how to interpret certain ambiguous instances, and so someone (namely, you) needs to develop the rules for what gets labeled what.
In the many years prior to large language models, there have always been people working to make the analysis process less tedious. Word clouds, clustering with word embeddings, topic extraction, and various other NLP methods were all used with varying levels of success to lighten the burden a tiny bit. LLMs, with their ability to give seemingly useful summaries of text, have been tested on various tasks like identifying themes and coding text. The current literature is not even a year old, so it’s thin and will change rapidly, but the couple of papers I’ve seen all seem to agree that LLM-based text analysis techniques can be as good as human raters, sometimes even as good as expert raters, under certain conditions. It’s far from a set-and-forget type thing: the prompts have to be crafted well, you often have to provide the coding scheme or some examples, and there are many other important implementation details. I’m sure we haven’t really found where the limits of such methods are yet.
But the overall takeaway for me is that it is at least possible to take away a significant portion of the drudgery associated with working with unstructured data. For example, even if I assume that an LLM is garbage at identifying the themes in my particular problem, it at least seems possible to put together a system that can help me filter out the inevitable uninformative junk responses that come in from survey forms. Things like short glib answers to complex questions, complete nonsense, etc.
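To make that concrete, here's a minimal sketch of what such a junk filter could look like. The prompt wording is made up for illustration, and the `ask_llm` helper is a hypothetical stand-in for whatever model or API you actually use; the point is only that the filtering logic wrapped around the model call is mundane.

```python
from typing import Callable

# Hypothetical screening prompt for open-text survey responses.
JUNK_PROMPT = """You are screening open-text feedback about a bar.
Label the response JUNK if it is nonsense, a non-answer, or a short glib
reply that ignores the question. Otherwise label it KEEP.
Reply with exactly one word: JUNK or KEEP.

Response: {response}"""


def filter_junk(responses: list[str], ask_llm: Callable[[str], str]) -> list[str]:
    """Keep only responses the model does not flag as junk.

    `ask_llm` is whatever function you have that sends a prompt to an LLM
    and returns its text reply -- deliberately left abstract here.
    """
    kept = []
    for r in responses:
        label = ask_llm(JUNK_PROMPT.format(response=r)).strip().upper()
        if label != "JUNK":
            kept.append(r)
    return kept
```

Spot checking what gets thrown away is still on you, of course, which is roughly the amount of supervision I'd expect to keep doing anyway.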
Using LLMs this way isn’t a particularly novel idea. I’ve heard multiple people from multiple companies say they’re experimenting with such tools already. There are clearly researchers publishing about this out in academia. And of course there are a myriad of startups trying to make and sell products in this space right this moment, with more to come.
But I’m not an AI/ML expert, nor do I want to talk about the many implementation and training details that are involved in building and testing such systems. So I’m gonna do the philosophy student thing of just accepting the premise to an extreme and seeing whether that leads us to interesting spaces. For the sake of argument, let’s say that in a couple of years, LLMs for this particular use case are perfected. We now have, for the price of some GPUs, electricity, and a maximum of three hours of setup work, the ability to summon an infinite number of human graduate students, all identically trained with my study-specific instructions for coding themes. Said students are even trained to note down unfamiliar new themes for further analysis later. I can then tell said army of grad students to read an infinite amount of open-text customer feedback (or transcripts of video/audio/whatever).
Under these ideal conditions, what do I get for that hypothetical failing bar owner scenario? And what do I not get?
Obviously, we get the ability to go through hundreds of hours of video with the help of some auto-transcription ML models (which are also increasingly decent). We can work with the unstructured data now at massive scale. How are we going to make that actually useful instead of a waste of electricity?
In the bar problem, our goal is to find ideas likely to turn the bar around. Inside the feedback, we expect to find many examples of why people are unhappy with the place, or things they believe should be improved. For example, maybe they dislike the ambience, or one of the late-night bartenders mixes up drinks all the time, or maybe the line to the bathroom rivals that of a theme park.
The plan of attack is straightforward: identify a list of themes, then quantify how often each theme comes up. Some subset of the most common themes is likely going to be part of the answer for turning the bar experience around. Put the ones you think are going to be effective into your turnaround plan.
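In code, the quantifying half of that plan is almost embarrassingly simple once each piece of feedback has been coded against a scheme. The theme labels below are made up for illustration; the coded lists are whatever your human or LLM coders produce.

```python
from collections import Counter

# Hypothetical output of the coding step: each response was tagged with
# zero or more themes from the coding scheme.
coded_responses = [
    ["bathroom_line", "ambience"],
    ["drink_mixups"],
    ["ambience", "drink_mixups"],
    [],  # a response that matched no theme in the scheme
]

# Tally how often each theme appears across all responses.
theme_counts = Counter(theme for codes in coded_responses for theme in codes)

for theme, n in theme_counts.most_common():
    print(f"{theme}: {n} mentions ({n / len(coded_responses):.0%} of responses)")
```

The hard part, as the rest of this post argues, is everything that happens before this snippet: deciding what the themes should be in the first place.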
The rule of thumb for qualitative studies is that if multiple people are saying the same thing without any prompting, that’s usually a sign of a huge problem, because it’s like running a survey and finding 90% of people responding the same way. We use this rule of thumb because we can only run a handful of qualitative interviews a week and read only so much feedback a day. With our new infinite coder pool, we can overcome that limitation as far as our recruitment budget allows.
So we set our magical grad student army upon the videos in search of themes to quantify. But here’s where I think we would hit upon a problem. Where does the original list of issues to quantify, the coding scheme, come from? From us, the researchers. I am still skeptical that LLMs have shown the ability to generate a coding scheme on their own. Don’t get me wrong here, they can certainly be asked to generate a list of themes summarizing a pile of text. What I don’t know is whether they, or even an army of human grad students, can generate a coding scheme for me that is fit for my specific research question without significant input and supervision from myself.
I’ll unpack that a bit more. As the bar owner, I’ll have a bunch of hypotheses for why my bar is failing. If I’m a really experienced bar owner, I’d have a fairly comprehensive list built off my priors. But part of the reason we need to dig into the feedback to begin with is because our existing knowledge doesn’t seem to be working — otherwise the bar wouldn’t be failing, since we haven’t been sitting on our hands all this time. We are missing something. And thus, what we demand of the unstructured data is this — we want to be surprised. The most valuable insight the feedback can give us is a spark of inspiration, something we have not considered. This is a tricky problem because we’re not just looking for the most common or obvious stuff. We might already know many of the obvious things and might even consider them unhelpful noise.
Tell an army of grad students to pull out themes from a bunch of unstructured data and they’ll come up with a decent list since they’re smart people. But the act of summarizing data means they’re deciding what to throw out and what to keep. Since they don’t have access to the exact knowledge within my brain, they’re necessarily going to have trouble surprising me without just throwing every possible theme they can identify at me. It’s up to me as the lead researcher to pick out which themes I actually want to analyze.
This same process happens when researchers are building the coding schemes from scratch. You first go through the data and propose themes. Then you realize some themes aren’t useful, while others need to be broken down into more specific ones. Then you iterate and do it again and again until you have a coding scheme that’s suited to purpose. You can speed the process up with the help of manpower or LLMs, but that work of directing the classifications to the task at hand must always be done.
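If you squint, that iteration looks something like the loop below: a sketch only, with `code_sample` standing in for however you apply the current scheme (humans, LLMs, or both) and `review` standing in for the researcher's judgment, which is the part that can't be automated away.

```python
# Sketch of the iterate-until-fit-for-purpose loop described above.
# `code_sample` and `review` are hypothetical stand-ins, not real APIs.

def refine_scheme(sample, scheme, code_sample, review):
    """Repeatedly code a sample of data and let the researcher revise the scheme."""
    while True:
        coded = code_sample(sample, scheme)  # apply the current themes to the sample
        revised = review(scheme, coded)      # researcher merges, splits, or drops themes
        if revised == scheme:                # no changes means it's fit for purpose
            return scheme
        scheme = revised
```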
To put it into AI-speak, all AI models require some kind of context to do their work. In situations where we want to be surprised, a significant part of the context lies within our heads, with little way to articulate it to an outsider. To broaden the problem even more, out of all the possible issues people can have, even the surprising ones, only a subset can be meaningfully addressed by the owner. There’s little to be done if the bar is struggling because the city is doing street construction nearby and it reroutes traffic in the area every weekend. The only way to overcome this context hurdle is to do a lot of work to actively explore and iterate approaches.
But once the coding scheme has been developed, the power of our infinite army can be unleashed upon the unstructured data. Now we can count how often various themes come up and use that information to analyze why our bar is failing. We can generate hypotheses about bar health and use the data to test them. Our stance in dealing with the unstructured data has shifted — we’re daring the world to surprise us by proving our clearly measured hypotheses false. The only thing that really stops us from accomplishing our goal here is the tedium of counting up the data, which is solved by our magical army.
So anyways, to focus everything back. I’m proposing that when trying to understand and solve a business problem with data, we’re actually doing one of two very different things depending on what stage of analysis we’re in:
We are asking to be surprised
We are daring the world to surprise us, but prefer not to be surprised
If you’re familiar with design process language, we’re ideating and expanding our possibilities first before testing and narrowing down later. Having an infinite army of human-like coders working on unstructured data can most definitely help with both aspects, but the kind of help the army can provide is different for each. The needs are different, the use cases will be different, and most importantly the experiences will be different. I think this is important to recognize if the industry as a whole is going to dive into a big “let’s conquer unstructured data!” rush, because it’s very easy, and plausible, to attempt to do the opposite of what you need. The tools are very similar, but you have to wield them with different intentions. There still needs to be someone at the helm intelligently making sure the army is continuously directed towards solving the actual problem at hand.
I’m sure the market for such tools will keep getting bigger as the techniques mature. We’re also going to find out just how these tools break down when applied to broader questions. As a researcher though, I’m going to just celebrate that I can eventually do less tedious work outside of spot checking.
Community sharing stuff
pyOpenSci is a non-profit looking to provide peer review of scientific Python packages to help open source science, as well as helping scientists share their code and other stuff. They’re looking to add volunteer editors w/ spatial data experience to their process. Editors lead the review process for 3 to 4 packages a year.
Details about what editors do are here. And there’s a form to submit if you’re interested in helping out.
Bluesky DataFeed
If you’re currently on Bluesky, I’ve created a feed for data people to share stuff. Just ping me there (@randyau.bsky.social) and I’ll opt you into the list that currently feeds it. Once you’re on the list, a non-reply post that uses either the 📊 or 📈 emoji will automatically get shared on the feed.
If you AREN’T on Bluesky, a bunch of us on the discord have invite codes and can share one if you kindly ask. It’s not currently a replacement for Data Twitter, but maybe one day it can be.
Standing offer: If you created something and would like me to review or share it w/ the data community — just email me by replying to the newsletter emails.
Guest posts: If you’re interested in writing a data-related post to show off work or share an experience, or if you need help coming up with a topic, please contact me. You don’t need any special credentials or credibility to do so.
About this newsletter
I’m Randy Au, Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. Counting Stuff is a weekly newsletter about the less-than-sexy aspects of data science, UX research and tech. With some excursions into other fun topics.
All photos/drawings used are taken/created by Randy unless otherwise credited.
randyau.com — Curated archive of evergreen posts.
Approaching Significance Discord —where data folk hang out and can talk a bit about data, and a bit about everything else. Randy moderates the discord. We keep a chill vibe.
Support the newsletter:
This newsletter is free and will continue to stay that way every Tuesday. Share it with your friends without guilt! But if you like the content and want to send some love, here are some options:
Share posts with other people
Consider a paid Substack subscription or a small one-time Ko-fi donation
Get merch! If shirts and stickers are more your style — there’s a survivorship bias shirt!
Great article!
I'll have to read again tomorrow, to digest everything. There is valuable insight here.
I really like this "surprise us with something new" vs "surprise us by telling us what's old is wrong" framing. Before last week, I honestly would've thought that LLMs would've been pretty good at the former thing - and even more so after reading this tweet from an OpenAI researcher who talked about how LLMs "dream": https://twitter.com/karpathy/status/1733299213503787018
But I read a book last week by a guy who did a bunch of (quantitative) analysis with OpenAI, and he said that was exactly what it was bad at, which was surprising to me. But it does make me wonder - maybe the dream analogy is actually closer to the truth than I realized, because dreams aren't really new experiences; they're repackagings of our experiences in a new configuration. It's old material, jumbled up. Even if LLMs have other material to draw from (they are trained on everything anyone knows, after all), the way they work basically keeps those dreams localized around the ideas they're being asked about. So it's not quite like they're limited by their experience, but they're limited by a kind of distance. And though they can bridge really big distances more easily than we can - eg, write a Shakespearean sonnet about McDonalds' french fries, in the style of a pirate - they have to be told to do it. There aren't, I don't think, the spontaneous connections that happen so easily for people.
(And trying to provide that stuff is very hard. On a few occasions, I've tried to use ChatGPT for help on something, and realized that the most useful part of the exercise is typing out all the context to stuff into the prompt. It's a form of rubber ducking, where going through the process of explaining was much more useful than what it gave back.)