Reflecting on how system complexity grows
Flashbacks of past pain
Last week, I took part in a brief but interesting (to me) Twitter thread.
It’s not a long thread, but the gist was that a couple of us data folk with experience in physical goods processing/shipping started… reminiscing… about the mind-destroying complexity of working with data in such environments. The main thrust was that the business logic became extremely convoluted, and reporting anything accurately took a big effort.
The conversation stuck with me over the weekend, and some gears started turning. I’ve worked across a couple of industries and with all sorts of data before. Why is it that the huge mess of data in a giant global cloud ecosystem seems simpler than that of a tiny startup that makes clothes?
Why is it that data systems handling logistics so easily become nightmares compared to systems handling digital services, like software products and subscriptions? There doesn’t seem to be anything special about physical goods per se; you see similar complexity in things like healthcare and book classification systems. But there definitely seems to be something driving the complexity beyond simply adding new features and services, and that’s curious.
A brief dive into physical goods handling
A number of years ago, I worked as a data engineer at Primary, an e-commerce site that sells colorful kids’ clothing. It was a fun startup in NYC, and while I didn’t have a kid at the time, they had a good product, good execution, and the people were fun. What was super fascinating to me about the company was that they were quite vertically integrated (to the extent that a < 30 person startup could vertically integrate): they designed the clothing in-house, had someone source cloth from factories, contracted other factories for the clothing production, accepted the completed shipments at a warehouse vendor, then sold it online at their store. The whole setup gave them cost benefits and overall supply chain flexibility as their product needs changed over time.
From my very limited understanding of the apparel industry, this sort of integration isn’t typically done unless you’re a big brand with lots of sales volume, employees, and overall industry knowledge. Smaller producers and designers would typically have factories and other middlemen handle some of that work, for an extra cost.
This was all really fascinating, but the complexity it added to data was beyond my ability to grasp. The parts I mainly worked with had to do with the e-commerce store, which was simple enough. There was a table of orders, a table tracking shipping status, a table of customers, a table of various addresses, and a table of stock and inventory. All of the data was handled by the one e-commerce app doing the usual e-commerce app things. I was primarily working on reporting and analysis and running/calling experiments during my time there.
In terms of a web app, it wasn’t too complicated: customers ordered items, payment was handled, things shipped. Reporting on that stuff was generally pretty simple, with only a small handful of the usual gotchas. It was when we started having an increasing desire to look at how that e-commerce system interfaced with all the other systems that complexity exploded.
The finance team wanted to track the average cost per item sold, because it’s important for knowing whether the operating margins were changing favorably or not. Except clothing would come in huge pallets a couple of times a year. Each shipment would have a different associated cost due to the specific contracts in place, materials used, and even shipping costs for that particular pallet. There was enough change between shipments that it mattered.
Retail customers obviously don’t buy whole pallets of clothing at once, so you have to break the pallets up at the warehouse. And since we wanted to avoid stock-out situations, we’d try to arrange things so that the pallet of new clothing items arrived before the old stock ran out.
So, from where I sat, seeing just a database of orders coming in and shipments going out, how could we take the system to the point of reporting some kind of average cost per item sold? Effectively, it’d mean knowing how to associate each outgoing item with a specific pallet, but the existing database didn’t have any concept of a “production lot number” to attribute shifting costs to. The shipments were also tracked via spreadsheets, since they moved very slowly by boat.
To keep sane, we’d have to abstract away some of those details and assume items were sold on a First-In-First-Out basis. Then we’d somehow have to track the stock levels in the warehouse to know when enough items had been used up to switch to the next pallet of the item, and update the cost tables in production accordingly. Except the warehouse vendor had !@#$!@$ for data integration, so instead they just FTP’d over a CSV file of outgoing shipments roughly twice a day.
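The First-In-First-Out assumption above can be sketched in a few lines. This is a hypothetical toy, not the actual system: the lot names, quantities, and unit costs are invented for illustration, and a real version would have to handle the warehouse CSV feed, partial lots, and reconciliation drift.

```python
from collections import deque

# Illustrative inventory: each pallet (lot) arrives with its own landed
# unit cost. Oldest lots sit at the front of the queue.
pallets = deque([
    {"lot": "spring", "qty": 500, "unit_cost": 4.10},
    {"lot": "fall",   "qty": 500, "unit_cost": 4.55},
])

def cost_of_sale(qty_sold):
    """Consume inventory FIFO and return the cost attributed to the sale."""
    total = 0.0
    while qty_sold > 0:
        lot = pallets[0]
        take = min(qty_sold, lot["qty"])
        total += take * lot["unit_cost"]
        lot["qty"] -= take
        qty_sold -= take
        if lot["qty"] == 0:
            pallets.popleft()  # lot exhausted, roll over to the next pallet
    return total

# Selling 600 units spans two lots: 500 at $4.10, then 100 at $4.55.
print(round(cost_of_sale(600), 2))
```

The awkward part in practice is exactly the rollover step: knowing *when* a lot is exhausted requires trusting stock counts that, as noted above, drift out of sync.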
Oh, and inventory counts sometimes don’t quite match your online system (the stock count is off by a little for various reasons). It costs money (and time) to have the warehouse do a stock reconciliation to get accurate numbers, so we couldn’t sync the systems constantly either.
Oh, and did I mention that sometimes items get returned, and after cleaning, usable stock might go back on the shelf? What’s the original cost of that item then?
Finance’s way of dealing with this whole giant mess was… to not engage with the problem directly. They zoomed WAY out and just tracked items going in and out on a big monthly spreadsheet, using weighted averages where needed. Their team managed to get it to a reasonable estimate, but they usually only managed to calculate the costs at the end of each month when they closed the books. They had come to me wanting to turn it into a more automated, continuous process.
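The weighted-average shortcut is worth spelling out, since it’s the thing that makes the spreadsheet approach tractable. A minimal sketch, again with invented numbers: instead of tracking which pallet each sold item came from, you blend everything received in the period into one average unit cost.

```python
# Hypothetical monthly receipts: (units received, unit cost per item).
# The numbers are illustrative only.
receipts = [
    (500, 4.10),
    (500, 4.55),
]

units = sum(qty for qty, _ in receipts)
# One blended cost for the whole period, no lot attribution needed.
weighted_avg_cost = sum(qty * cost for qty, cost in receipts) / units
print(round(weighted_avg_cost, 3))
```

The trade-off is exactly what the finance team accepted: you lose per-pallet accuracy, but you no longer need the unattainable item-to-lot mapping, so the books can close with data you actually have.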
In the end, I never even delivered plans for a solution, because the company was going to integrate a big, expensive Enterprise Resource Planning (ERP) software solution to help track and manage these details all the way down to the factory/cloth-production level (such software actually involved the suppliers integrating and updating statuses/costs/etc. in the ERP system!). My plan was to wait until we had the ERP in place and then see what APIs would be available to augment our production site with. That beats inventing a giant Rube Goldberg machine to solve the problem, only to have it thrown out later by a product designed to tackle that exact problem.
How things got so complicated
In the story above, the web app part had an extremely simple view of the world. Just about anyone with some coding experience can roughly see how the basic site and data model are put together, even with the various wrinkles involved like customers getting discount codes, multiple addresses, returning items, etc. In general, the engineers behind the software kept things simple, because simple is comprehensible and works.
So, after pondering this for a weekend, I believe my answer is: explosive complexity increases often come from having to handle “another system’s concerns”.
The concerns of the “run the store app” team are largely independent of the concerns of the manufacturing team, the finance team, the online advertising team, and the customer support team, even though all those teams (and more) come together to create the whole.
Everyone builds tools and models of their world that track what they care about, to the exclusion of everything else. The ad team cares whether customers who arrived from ads eventually purchase, but they usually don’t need to worry about whether the orders shipped. The finance team cares about expenses, revenue, and overall accounting, as well as how orders are handled and reported on statements. But they’re not usually interested in the mechanics of how orders are taken, shipped, and handled.
The complexity increase comes when these universes collide and are forced to reconcile their views with each other. Edge cases that were easily ignored must now be fully accounted for. Business logic pops up that seems completely arbitrary: why would returned items suddenly take on the most recent unit cost instead of the cost of the pallet they originally came from? Why are we ignoring the cost of the return? Because it’s already hard enough as it is, and doing it “right” would be even harder.
What you wind up having to do is create a system that is a superset of all the intersecting ones. At minimum, it needs the primitives in place to express all the concepts of every system: orders and inventory, pallets and outbound shipments to customers, costs and revenue and ads.
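One way to see what “superset of primitives” means concretely is to sketch the entities such a unified system would need. These type and field names are invented for illustration; the point is that the bridging record at the bottom belongs to no single team’s original model, yet it’s what the combined questions require.

```python
from dataclasses import dataclass

@dataclass
class Pallet:
    """Manufacturing/logistics concern: a lot with its own landed cost."""
    lot_id: str
    qty_received: int
    unit_cost: float

@dataclass
class Order:
    """Storefront concern: what the customer bought and paid."""
    order_id: str
    sku: str
    qty: int
    revenue: float

@dataclass
class OrderLotAttribution:
    """The new glue primitive neither team had: which lot an order's
    items were drawn from, so finance can compute margin per order."""
    order_id: str
    lot_id: str
    qty: int

# A margin question forces the bridge into existence:
order = Order("o-1", "striped-tee", 2, 24.00)
lot = Pallet("spring", 500, 4.10)
link = OrderLotAttribution(order.order_id, lot.lot_id, order.qty)
margin = order.revenue - link.qty * lot.unit_cost
print(round(margin, 2))
```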
The data model for such a system looks very much like a complex data warehouse that isn’t really a data warehouse, because there are transactional bits throughout. This is where you start seeing people talk about Reverse ETL, glue layers, and other recent patterns of data design.
Universes colliding are more complicated than simple feature creep
I’m sure you’re thinking that systems that have had time to build up a large feature set must qualify as complicated too. After all, a system that allows 1 action can never be as complex as a system that allows 100 actions. But the complexity from having two universes of concern collide feels different from continuous feature development.
Feature development is typically an extension of the existing system. It’s designed from the very beginning to interoperate with what came before. Hours of engineering meetings are had to hash out how things play well together. Even if there’s a certain amount of refactoring involved, the fundamental assumptions about what’s important and how transactions work stay consistent.
Meanwhile, there’s never a guarantee that different teams’ concerns share any assumptions or primitives at all. One system sees every event through the lens of credit card numbers; another only speaks user logins. All of that has to be somehow reconciled before we can analyze and report on the data such a system generates. And we haven’t even poked into the messy logic that links all the states of the systems together.
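Even the simplest version of that reconciliation needs a bridge table before the two systems can be joined at all. A toy sketch, with made-up identifiers: one system keys events by card token, the other by user login, and answering “revenue per user” requires building the mapping neither system maintains.

```python
# System A (payments): events keyed by an opaque card token.
payments = {"card_9f3a": 120.00, "card_77bc": 45.50}

# System B (identity): knows which login owns which card token.
card_owner = {"card_9f3a": "user_42", "card_77bc": "user_7"}

# The bridge: re-key one universe in terms of the other's identifier
# before any joint reporting is possible.
revenue_by_user = {}
for card, amount in payments.items():
    user = card_owner[card]  # real systems: unmapped cards, shared cards...
    revenue_by_user[user] = revenue_by_user.get(user, 0.0) + amount

print(revenue_by_user)
```

And this is the clean case; the comment hints at where it degrades (tokens with no known owner, one card shared by two logins), which is exactly the arbitrary-seeming business logic described earlier.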
So what’s the good of thinking about this?
Having pondered this idea for a couple of days, I think the most important thing to come out of this line of thought is predicting the complexity of systems in the future, because every system we use today is usually growing and changing over time.
As a system that generates data grows and picks up features, it’s bound to become more complex to analyze. But such changes are rather linear, and can be easily picked up on the fly by extending existing knowledge. It’s nothing to be overly worried about.

Meanwhile, joining multiple concerns together into a single system brings what feels like an exponential increase in complexity: the new unified system is inevitably going to be more complicated than the sum of its parts. The work just to learn to use such a new system (let alone build it) is mind-blowing.
And this is why I do my very best to avoid building Yet Another Data Warehouse.
No one sent in anything for me to share/review this week. So if you have anything for next week, send them on over~
About this newsletter
I’m Randy Au, currently a Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. Counting Stuff is a weekly data/tech newsletter about the less-than-sexy aspects about data science, UX research and tech. With excursions into other fun topics.
All photos/drawings used are taken/created by Randy unless otherwise noted.
Curated archive of evergreen posts can be found at randyau.com
Standing offer: If you created something and would like me to review or share it w/ the data community — my mailbox and Twitter DMs are open.
Supporting this newsletter:
This newsletter is free, share it with your friends without guilt! But if you like the content and want to send some love, here’s some options:
Tweet me - Comments and questions are always welcome, they often inspire new posts
A small one-time donation at Ko-fi - Thanks to everyone who’s supported me!!! <3
If shirts and swag are more your style there’s some here - There’s a plane w/ dots shirt available!
I like to grab publicly available data and analyze it. Tax and assessment records, police incidents, covid case/vax/hospital rates, etc. I realized that all datasets are built for specific purposes, with a simplified model of the world. Which is often not the model I want to ask *my* questions against. For example, police incident records have the precinct of the officer and the address of the location. It became clear that the police wanted to track incidents that officers were involved in, while I wanted to track incidents that occurred locally. There were a small number of incidents that occurred well outside the city, some even in other states. So I had to eliminate those incidents. Trying to use precinct to disambiguate addresses (no zip code!) was also a trial, since it was the precinct the cop was based in, not the precinct of the incident, so that involved looking at neighboring precincts. Data fun and games.
I've been doing a lot more reading on complexity and scaling systems, and this reminds me of the adage (I'm butchering this) that "tightly coupled systems can't scale". There was a great article on it in Towards Data Science recently: https://towardsdatascience.com/designing-data-systems-complexity-modular-design-384b28fec672