Thursday posts are short, off-the-cuff musings for paid subscribers of the newsletter that mostly touch on whatever’s on Randy’s mind that week.
Earlier in the week, I put in an order for some niche items at an online store. Then, after about a business day, I got the email saying my order had shipped, with a tracking number attached. I, of course, clicked the provided link to check out the tracking information on the carrier’s web site. And, to no one’s surprise, the tracking number couldn’t be found in the carrier’s system yet.
Having worked on similar e-commerce systems before, I have a pretty good idea what happened. The store used Shopify as their e-commerce vendor, and the order handling system sent me the email more or less when an employee at the store printed out the shipping labels and packing slips needed to fulfill the order. For any number of reasons, the box with all my items in it had not yet been handed over to the shipping carrier to be scanned and entered into their system.
The end result of all this is that I’m somewhat frustrated as a customer, since I was promised information that doesn’t exist yet. The obvious ideal would be for the store not to send me my shipping information until the moment the package is handed over to the shipping carrier and entered into the tracking system. For extremely large retailers with highly automated logistics, this can more or less be done. But for a smaller retailer, the solution gets significantly messier.
And so, this week has me thinking about the processes that link digital data and physical reality.
For a lot of my work, and I think a lot of the work many data scientists do, the flow of information and data usually runs in a single direction: things originate in the physical world and become digital data, or the data is digital from the beginning and never leaves. This often manifests as customers and end users inputting data into a system, or sensors sending telemetry into our data pipelines. Because of this focus, we’re used to thinking about the intricacies of data collection. When we encounter quirky data, we know to question whether our sensors are wrong or whether we’re seeing user error or bugs.
But how often do most of us think about data flowing in the other direction, where digital systems trigger some process out in the real world (like an employee going to the warehouse to pick and pack an order), which can then trigger another data event later? We’ve suddenly gone from the realm of “data work” into some form of process design.
Suddenly, all sorts of complicated things come into play, namely human behavior. As an example, take the store I ordered from, the one that sent that “too early” shipping email. We could probably improve the process by not sending the email when the label is first printed, but waiting until the item is packed and put on the loading dock. Workers could scan the packages at that point with a custom system, which would send off the emails and shrink the gap between the email and the shipping truck’s pickup. Solved!
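The logic of that fix is small enough to sketch. Here’s a minimal, hypothetical version in Python (all function and field names are mine, not anything Shopify actually exposes): the label-printing step records state but deliberately doesn’t email, and the dock scan is what triggers the notification.

```python
from datetime import datetime, timezone

def on_label_printed(order):
    # Record that the label exists, but deliberately do NOT email
    # the customer yet -- the tracking number isn't real to the
    # carrier at this point.
    order["label_printed_at"] = datetime.now(timezone.utc)
    order["status"] = "label_printed"
    return order

def on_dock_scan(order, send_email):
    # The dock scan is the trigger: the box is packed and staged,
    # so the tracking number is about to mean something.
    order["scanned_at"] = datetime.now(timezone.utc)
    order["status"] = "awaiting_pickup"
    send_email(order["customer_email"], order["tracking_number"])
    return order

# Walk through the happy path with a stub email sender.
sent = []
order = {"customer_email": "a@example.com", "tracking_number": "1Z999"}
on_label_printed(order)
assert not sent  # no email at label-printing time
on_dock_scan(order, lambda addr, trk: sent.append((addr, trk)))
```

The point of the sketch is just where the side effect lives: the email is attached to the scan event, not the label event.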
But what happens if there’s an inevitable mistake and an outgoing box doesn’t get scanned? Now the customer won’t ever get their email, and the item will just show up unexpectedly at their door. There’d probably be some customers who get angry and complain, maybe even initiate a chargeback.
Moreover, if the humans in the system get overworked, not only will the chance of errors increase, they might modify the process entirely to suit their needs. Maybe they gather a list of all the shipping labels on the dock and wait until the end of the day to feed them into the system all at once. Maybe this is compatible with whatever process you’ve defined, maybe it isn’t.
Why not have the shipping carrier send the email when THEY scan it instead — that’d be so much more reliable! Except that involves passing personal information to the carrier and opens huge privacy questions, assuming the carrier is willing to take that information in the first place.
Even for a simple example with a simple goal, this stuff gets very difficult and requires a lot of thinking, testing, and running trials. Imagine more complex processes that require feedback loops between the physical and digital spaces. It’s something outside the usual skills expected of a typical data scientist.
But don’t be surprised if we’re asked to contribute to the design of such processes. Since we’re the primary consumers of the data generated by such systems, our input is often sought out during the design phase. We know enough to put into a specification that we need certain kinds of timestamps with certain kinds of guarantees in order to create data products that give a certain result. For example, we can tell designers that we need the best estimate of when a package enters the shipping carrier’s systems so that we can send our emails out at the right time. It’s up to the process designers to figure out the best way to give us that information while balancing everything else they need.
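That kind of spec can be surprisingly concrete. As a hypothetical sketch (field names and guarantees are illustrative, not from any real system), here’s what “certain timestamps with certain guarantees” might look like written down as a data structure:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class ShipmentEvent:
    order_id: str
    tracking_number: str
    # Guarantee: exact, system-generated at label print time.
    label_printed_at: datetime
    # Guarantee: best estimate only -- a dock scan or the carrier's
    # first scan, whichever the process can actually provide.
    carrier_received_at: Optional[datetime]
    # Guarantee: exact, set when the notification email actually goes out.
    notified_customer_at: Optional[datetime]

def ready_to_notify(event: ShipmentEvent) -> bool:
    # Only email once there's evidence the carrier has (or is about
    # to have) the package in its system, and we haven't emailed yet.
    return event.carrier_received_at is not None and event.notified_customer_at is None

evt = ShipmentEvent("order-1", "1Z999", datetime.now(timezone.utc), None, None)
before = ready_to_notify(evt)   # label printed, carrier hasn't seen it
evt.carrier_received_at = datetime.now(timezone.utc)
after = ready_to_notify(evt)
```

Writing the guarantees next to the fields is the whole trick: the process designers can see exactly which timestamps need to be exact and which can be best-effort estimates.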
I suppose, after being forced to take on the role of data engineer for a bunch of things and to take ownership of so many data-collection decisions, it’s refreshing to be able to defer to other people who have more experience building out business processes. You can learn a lot from watching experts in design, human factors, usability research, and other fields contribute the knowledge that makes stuff work.
And then, after a ton of work, equipment expense, and employee training, we might have a system that won’t annoy a customer with an email sent a day too early.
Totally worth it.