Attention: As of January 2024, we have moved to counting-stuff.com. Subscribe there, not here on Substack, if you want to receive weekly posts.
Over 5 years ago, I worked at a little clothing e-commerce site/brand as their data person. It was a great little team that designed and sold kids' clothes in mostly solid colors and long-lasting designs, and I learned all sorts of interesting stuff about things like factories, warehouses, and clothing production while there.
So in between projects there I had some time to do a bit of independent data research. There wasn't a specific goal or need in mind, so I happened to ask myself whether it was possible to see in the sales data how many kids our customers had.
The data setup was simple enough: there was a set of tables with every customer order and the items they ordered. All the clothing items had a value for the size, so I figured I could use the clothing size as a proxy for the presence of a specific (size of) child. Since all the clothes were designed and made internally, all the size numbers were consistent. With just that information alone, I figured I could tell if a customer was buying for at least one kid of a certain height, though obviously I couldn't tell if they were buying for one kid or quintuplets.
Even with the caveats, detecting a second kid should be relatively easy: just look for single orders containing items in sizes that are far enough apart that the same kid probably isn't wearing both. There are many ways to model this depending on what follow-up analyses you want to do, but a simple way is to take all the items in an order and compute the pairwise differences between their sizes to see the most common size gaps.
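As a rough illustration of that pairwise approach, here is a minimal Python sketch. The order IDs and sizes are made-up sample data, the sizes are assumed to already be ordinal integers, and `size_gaps` is a hypothetical helper name, not anything from the original analysis.

```python
from collections import Counter
from itertools import combinations

# Made-up sample data: order id -> ordinal sizes of the items in that order.
orders = {
    "order_a": [3, 3, 5],
    "order_b": [2, 4],
    "order_c": [6],
    "order_d": [1, 2, 4],
}


def size_gaps(sizes):
    """Return the absolute size difference for every pair of items in one order."""
    return [abs(a - b) for a, b in combinations(sizes, 2)]


# Tally how often each gap shows up across all orders.
gap_counts = Counter()
for sizes in orders.values():
    gap_counts.update(size_gaps(sizes))

for gap, count in sorted(gap_counts.items()):
    print(f"gap of {gap} sizes: {count} pairs")
```

One design choice to be aware of: an order with several items in the same size contributes several pairs, so you may prefer to deduplicate sizes within an order before taking differences.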
Side note, kids' clothing sizes are nonlinear, with babies getting ranges like 0-3 or 0-6 months, 6-12mo, 12-18mo, etc., while slightly older kids get whole numbers (Toddler 1, 2, 3) that start off roughly correlated with age but eventually give up and become arbitrary number sizes. The only way to make sense of it all is to just assign them arbitrary ordinal integers and do arithmetic on them that way.
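The ordinal trick might look something like the sketch below. The size chart here is invented for illustration (the real labels and their order would come from the brand's own catalog), and `SIZE_ORDER`, `SIZE_RANK`, and `size_gap` are hypothetical names.

```python
# Invented size chart, listed smallest to largest. Illustrative labels only.
SIZE_ORDER = [
    "0-6mo", "6-12mo", "12-18mo", "18-24mo",
    "T1", "T2", "T3",
    "4", "5",
]

# Assign each label an arbitrary ordinal integer based on its position.
SIZE_RANK = {label: rank for rank, label in enumerate(SIZE_ORDER)}


def size_gap(label_a, label_b):
    """Number of size steps between two labels, ignoring direction."""
    return abs(SIZE_RANK[label_a] - SIZE_RANK[label_b])


print(size_gap("12-18mo", "T2"))  # three steps apart in this made-up chart
```

The integers carry no meaning beyond order, so a gap of 2 means "two size steps", not two years or two inches.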
Anyways, given this data setup, to no one’s real surprise, I found that there was a very noticeable increase in orders that included sizes roughly 1, 2 or 3 sizes apart, centered around 2. That roughly matches how siblings in the US are, on average, about 2 years apart. While there’s a fair amount of handwaving involved in this preliminary report, it seemed I had something interesting to share: we can tell how many differently aged kids a customer typically shops for.
But what could we actually DO with this one single insight?
It’s not clear.
This whole project started off as a self-led fishing trip. I had a hunch that the data might be able to show an interesting fact, had some downtime between projects to explore, and found that the data did support my hypothesis. There wasn’t a product lead asking these questions in order to develop some grand new feature. There wasn’t even a business case set up to use the information in some kind of marketing campaign or recommendation engine. It was just a cool factoid with zero context.
And therein lies a lot of the danger with data fishing trips — they’re most likely to fail before you find anything interesting, but in the rare event that you do find something worth mentioning the most common question afterwards is “and now what?”. This is a very far cry from the mental image of throwing a proverbial pearl of knowledge out to our teammates — it’s more like I was lobbing a seeded pearl oyster at them. There’s definitely a pearl inside the ugly, rough-looking thing, but the pearl inside is of unknown quality and everyone first has to do the work to shuck and extract the value out.
I’m sure that you all can think of creative uses for a kids’ clothing retailer to get a signal that a customer might be purchasing for kids of multiple sizes. Marketing emails are an easy answer, assuming you can find a way to word them without being super creepy about it (hint: this is difficult). It’s also likely to be useful information for any recommendation system involved (of which the company had none at the time; “what’s popular” is hard to beat).
While the research needed to arrive at the insight might have been a lot of work, it’s just the start. That work is vastly eclipsed by what’s needed to turn that seed of an idea into something useful and real. Think of the product vision, infrastructure, engineering, design, and marketing resources needed to turn the idea into a full-on recommender system that doesn’t yet exist. How many hours of work does that equal? What other work would have to be put on hold to chase the idea? Think about how you’ll have to convince all the people who make decisions for those teams to get on board with trying to realize that one idea.
Back when I was a junior analyst and data person, I could only really see the narrow view of my immediate work. It was hard enough to make sure that the reports I shared were of good quality and at least as correct as I could make them. So understandably at the time, it felt like a lot of the ideas and insights I did manage to uncover didn’t really materialize in any way. Anything that did come from my work arrived maybe two years later and had gone through so many iterations that I didn’t even recognize what influence I could have had on it.
I was probably 8 years into my data career before I managed to stumble onto an insight that unintentionally caught the attention of some product and engineering leads. Those folk then quickly got their teams to try out ideas starting from that seed, and one of those experiments eventually sparked a 40% year-on-year growth trend for the company. None of that was due to anything I personally did besides sharing some charts around at the right meetings. Those meetings were set up by my boss, who saw the potential in something I had accidentally found and didn’t fully grasp the meaning of.
Nowadays, with many more years behind me, I have a much better view of how important it was to do all the sharing and attempting to inspire other people with things that I find. The utility of an idea is practically zero unless effort is put in to get an organization to work on the idea together. Lobbing things over fences rarely does anything. So, as a more senior data person now, I have to take on more of the work of sharing and helping other people see the value of a particular piece of data insight. More junior data scientists are too busy doing the hard work of making sure their own analyses and models are correct, so they don’t have the mental bandwidth to do more.
As data people, we often take on the role of connector. We bridge across teams because we speak everyone else’s professional languages — because we needed to learn their language to understand their data. For much of my career, I had thought that much of the value of that position came from listening and sharing information. But the trust and reputation that we build over time doing that connecting work actually serves another purpose — people are willing to listen when we speak. We should make judicious use of that sometimes, especially when people get excited about an insight but need help getting even more people on board.
It’s pretty darned cool when it all comes together.
Standing offer: If you created something and would like me to review or share it w/ the data community — just email me by replying to the newsletter emails.
Guest posts: If you’re interested in writing a data-related post to show off work or share an experience, or you need help coming up with a topic, please contact me. You don’t need any special credentials or credibility to do so.
About this newsletter
I’m Randy Au, Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. Counting Stuff is a weekly newsletter about the less-than-sexy aspects of data science, UX research and tech. With some excursions into other fun topics.
All photos/drawings used are taken/created by Randy unless otherwise credited.
randyau.com — Curated archive of evergreen posts.
Approaching Significance Discord: where data folk hang out and can talk a bit about data, and a bit about everything else. Randy moderates the Discord.
Someone said of science that the “Eureka!” moments turn out to be far less important than the “now, that’s interesting” moments.