I write fairly often about how having an understanding of how complicated systems function can prove to be handy at surprising times. But I normally write about the subject in the general case because it’s hard to go into the details about proprietary systems at a current or previous employer. I don’t have clearance from whatever powers that be to share various technical details of things, plus there’s a lot of context to have to convey.
But as (un)luck would have it, I got to experience something as a complete end user of a third party system, so I can talk about it to my heart’s content! So let’s get into it!
Intro to some banking screwups
My spouse and I are in the process of getting our finances in order for some large purchases in the summer, moving stuff around various accounts. At the same time, a very high 4-digit payment was coming due for a credit card. Spouse had forgotten that they had already scheduled a payment for that 4-digit charge coming due, and accidentally initiated a duplicate payment to the CC. (The CC’s website UI didn’t have any indication that there was a scheduled payment, but that’s a crappy UX rant for another day.)
So, what happened was that duplicated 4-digit payment requests got sent 1 day apart to our checking account bank. That particular bank account didn’t happen to have the 5-figures of cash sitting there to cover the double charge, so we got thrown into insufficient funds and we get notified that Bad Stuff is happening.
A fix is attempted
So, having noticed that the problem was a duplicated payment, spouse puts in a call to the CC bank and tries to get one of the charges reversed. Everyone agrees that this was a silly mistake, CC bank’s rep puts in some kind of refund request. We wait a day for it to process, and it hits the account the next day, problem solved!
Yay! We’re done. Newsletter post over! Right?
Now, things get properly complicated
A couple of days after the fix, I check the bank account again, and… we’re at a negative balance again.
W. T. F.
Apparently, CC company had initiated a RETRY PAYMENT for the same large 4-digit sum… It’s the good ol’ “hey, if a payment request fails, just try again a few days later” tactic. 99% of the time, for legitimate payments there’s usually no problem with retrying in this way. Sometimes bank/CC systems apparently behave oddly and have transient errors, and if you just retry a failed payment, it’s very likely to go through the second time. This largely works if the payment request was made in good faith. I know of multiple e-commerce platforms that do this with no negative effects on consumers. It often works better for the customer because the customer doesn’t get a service/product interruption for random bank processing issues.
But this time around, the payment request had been an error, so it shouldn’t have retried. Nevertheless, it did and now we have to deal with the aftermath. Whatever way CC-bank’s representative had initiated their initial payment reversal, it didn’t communicate to the rest of their billing/payments system that the retry should not be attempted. It’s possible the representative, a first-tier support person, might not understand the payment processing system to that level of detail.
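Here's a rough sketch of how a refund can fail to stop a retry. This is a guess at the general shape of the problem, not CC-bank's actual system: the `Payment` class, the status strings, and both refund functions are hypothetical, and the 9000 amount is made up.

```python
from dataclasses import dataclass


@dataclass
class Payment:
    amount: int              # hypothetical dollar amount
    status: str = "failed"   # "failed", "retried", or "cancelled"


def refund_only(payment, ledger):
    # Credits the customer but leaves the payment record untouched,
    # so a later retry job still sees a failed payment.
    ledger.append(("refund", payment.amount))


def refund_and_cancel(payment, ledger):
    # Credits the customer AND tells the billing side not to retry.
    ledger.append(("refund", payment.amount))
    payment.status = "cancelled"


def run_retry_job(payments, ledger):
    # The "just try again a few days later" tactic: retry anything failed.
    for p in payments:
        if p.status == "failed":
            ledger.append(("retry_debit", p.amount))
            p.status = "retried"


# Scenario A: refund without cancelling, the retry still fires.
ledger_a = []
dup_a = Payment(9000)
refund_only(dup_a, ledger_a)
run_retry_job([dup_a], ledger_a)
print(ledger_a)  # [('refund', 9000), ('retry_debit', 9000)]

# Scenario B: refund and cancel, the retry job skips the payment.
ledger_b = []
dup_b = Payment(9000)
refund_and_cancel(dup_b, ledger_b)
run_retry_job([dup_b], ledger_b)
print(ledger_b)  # [('refund', 9000)]
```

The two refund paths look identical to the customer on the day of the refund. The difference only shows up days later, when the retry job runs.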
Spouse, who knows nothing about banking systems, or data processing systems in general, obviously does not understand this (nor would any layperson be expected to). Spouse is understandably upset about the mess. So they put in a call to the CC bank, explain that we now have a 3rd charge and we’d like our money back ASAP.
The rep on the other end gets highly confused at the situation, claiming that CC-bank’s refund had been rejected by our $$-bank. Currently, from the rep’s point of view, our CC has a huge 4-digit positive credit on it, leading them to that conclusion. So to correct the situation they offer to either mail us a check refunding that money, or direct deposit it back to our $$-bank. We agree to the check for now, knowing it’ll take a couple of days to process because of the weekend, which gives me time to go talk to the other bank.
While this is going on, I’m talking to $$-bank and asking what they see from their end. $$-bank says they see the original 2 charges, then the speedy reversal, and finally the 3rd erroneous charge. Having a very rough sense of how the ACH system works (we’ll get to this later) I ask what would be the best way to fix this crazy situation. Would it be best to wait a few days to let things settle?
$$-bank agrees that waiting is probably the best option. Our bank account has a negative balance, so $$-bank won’t honor the payment request, and it’ll eventually refuse to pay CC-bank and return the 4-digit amount to our account. That will put all the balances across all banks back where they belong. It’ll trigger an insufficient funds fee, but they’ll take care of that for us. Great.
Figuring things out and coming to a resolution
Getting confirmation from $$-bank that “wait and see” is a good strategy, I call up CC-bank again. We get the first-tier rep to take a look at the conversation record from a few minutes prior. The rep wisely wants nothing to do with our case and instead transfers us to the payments team. I don’t blame them one bit.
We luckily get transferred directly to a payments team supervisor. I explain the situation and how we’re considering waiting and seeing how the accounts settle. They look at the records and see that our CC account is in a chaotic state, w/ various payments and refunds queued up. The supervisor slowly untangles the mess, and ultimately agrees that cancelling all the pending actions at CC-bank and waiting for the accounts to settle would probably put us in a good spot. There’ll be some refused-payment fees but they’ll waive those too.
Then we just wait a couple of business days to have it cleared up. Phew. Now we just wait and hope the dust settles the way we expect it.
Update: as of Monday EoD, the accounts did actually settle and things seem fixed now
So, what actually happened?
Much of this chaos comes from the fact that none of these payments happen in real time. They’re ACH (Automated Clearing House) payments, and thus batch processed. Yes, it’s 2021 and the primary vehicle for transferring small sums (under 100k) around the US banking system is a batch-processed settlement system that traces its origins to banks physically bringing paper checks to clearing houses to clear transactions. The mechanics, while having long been digitized, are largely the same.
To make things more interesting, these clearing house systems aren’t batched daily any more, but in multiple batches per day, in an effort to clear transactions faster while still not being real time (which would require a different protocol than the existing system). Participating banks send a bunch of transactions together as a batch, the clearing house takes all the batches, does the math, and tells all the banks what their net differences are and where all the debits and credits are supposed to go. Then it all settles and balances are reflected correctly. So, depending on when certain credits and debits are initiated, things could be considered to fall on different business days, and thus settlement of accounts can lag.
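The "does the math" step can be sketched in a few lines. This is a toy illustration of netting, not how any real ACH operator implements it. The bank names, the amounts, and the `net_positions` function are all invented for the example.

```python
from collections import defaultdict


def net_positions(batches):
    """Sum every transaction in every batch into one net figure per bank.

    Each transaction is (paying_bank, receiving_bank, amount).
    A negative net means the bank owes money at settlement.
    """
    net = defaultdict(int)
    for batch in batches:
        for payer, payee, amount in batch:
            net[payer] -= amount
            net[payee] += amount
    return dict(net)


# Three hypothetical banks each submit one batch for the same window.
batches = [
    [("A", "B", 100), ("A", "C", 50)],  # batch from bank A
    [("B", "A", 30)],                   # batch from bank B
    [("C", "A", 20)],                   # batch from bank C
]

print(net_positions(batches))  # {'A': -100, 'B': 70, 'C': 30}
```

The net figures always sum to zero, since every debit has a matching credit. None of the banks learns its final number until the whole window has been processed, which is where the lag comes from.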
Europeans reading this are probably horrified to hear any of this because they’re likely used to real-time transfer systems, like the ECB’s TARGET. These systems will transfer money between accounts super cheaply and practically instantly, with little settlement/credit risk (bounced payments), because they actually clear in real time. Currently, the US’s largest real-time payments network, The Clearing House’s RTP network, has decent coverage but not all banks participate, and the Federal Reserve’s FedNow RTP network is scheduled to launch in 2023.
The key insight we needed was to understand that we’re dealing with a high latency system. If we flailed around and initiated too many actions in a desperate attempt to fix things, we could have made things worse. For example, if we allowed CC-bank to mail us a refund check, and then $$-bank refuses payment while the check is in the mail, we’d suddenly have a negative balance at CC-bank, and would need to correct the situation there again.
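To make the refund check example concrete, here's a small sketch that replays both strategies against CC-bank's view of our account. Everything in it is an assumption for illustration: the amount `X`, the day numbers, and the event labels are invented, and a positive balance means CC-bank thinks it owes us money.

```python
def credit_over_time(events, start):
    """Apply (day, label, delta) events in day order.

    Returns a list of (day, label, balance_after_event).
    """
    balance = start
    history = []
    for day, label, delta in sorted(events):
        balance += delta
        history.append((day, label, balance))
    return history


X = 9000   # hypothetical stand-in for the "very high 4-digit" amount
start = X  # CC-bank currently shows the retried payment as a credit

# The in-flight event nobody can see yet: $$-bank refusing the payment.
returned = (3, "payment returned unpaid by $$-bank", -X)

flail = [(1, "refund check mailed", -X), returned]
wait = [returned]

print(credit_over_time(flail, start)[-1][2])  # -9000: we now owe CC-bank
print(credit_over_time(wait, start)[-1][2])   # 0: everything nets out
```

The refund check looks like the right fix on day 1, because the balance hits zero. It only becomes wrong once the slower event lands on day 3.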
The problem with this whole situation is that the end users of these complex systems, us and the various bank representatives, need to understand these details in order to come to the right conclusions to fix the problem without bad side effects. The bank representatives should have been trained about this, especially at the higher levels, but end users like us have no real business knowing this stuff. The only reason I know enough to realize flailing was a bad idea was because I spend too much time on the internet and happened to read about old fashioned paper check clearing one day.
Without that knowledge, a customer would have reasonably demanded that the money be returned to the proper accounts ASAP and taken some of the actions suggested by the lower tier reps that don’t understand what happened. Those actions would have seemingly fixed the problem in the short term, but would then cause even more imbalances and headaches. It’s like trying to steer a car that’s already skidding on ice by frantically jerking the wheel — it doesn’t help you regain control.
A lesson in working with unfamiliar systems
We deal with lots of crazy systems as data scientists, and very often we’re given very little guidance on the operating properties of the system. If we’re lucky we can talk to engineers that can explain the system to us. If we’re not lucky, we need to figure out how this alien system functions like it’s an archaeology puzzle, or do a bit of black box reverse engineering.
Hopefully, this example today illustrates that even when we go into a system blind, we need to do our best to figure out the system before we start making lots of measurements and drawing conclusions about it.
Very often, you can get a good sense of how things work based on observing the system behavior and the data trail that is left behind by a system. For example, you can notice that even if you initiate a payment on one bank, it doesn’t instantly appear in another bank’s records. There’s a lag that can take up to a couple of days, but sometimes they’re faster. Dates for debits and credits don’t quiiiite align perfectly but they’re close. Various money transfers for some reason take “business days” instead of normal human time. There are lots of little signs of what’s going on under the surface, but unless you engage with them and think it through they’re all very easy to overlook.
And yet, each of those little signals does hint at what is going on under the hood. And while you can’t figure out all the intricate details of such a complex system from these surface observations, you can at least understand enough to know some of the operating properties. This is usually enough information for you to create a hypothesis about the data generation process that you can test by looking at further traces in the data.
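As one small example of testing a hypothesis against the data trail, take the guess "transfers settle one business day later." The sketch below compares calendar-day lag with business-day lag for a couple of made-up observations. The dates and the `business_days_between` helper are invented for illustration, and the helper ignores bank holidays.

```python
from datetime import date, timedelta


def business_days_between(start, end):
    """Count weekdays after `start` up to and including `end`.

    Holidays are ignored, which a real analysis would need to handle.
    """
    days = 0
    current = start
    while current < end:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            days += 1
    return days


# Hypothetical (initiated, posted) date pairs read off two statements.
observations = [
    (date(2021, 3, 4), date(2021, 3, 5)),  # Thursday -> Friday
    (date(2021, 3, 5), date(2021, 3, 8)),  # Friday -> Monday
]

calendar_lags = [(posted - sent).days for sent, posted in observations]
business_lags = [business_days_between(s, p) for s, p in observations]

print(calendar_lags)  # [1, 3]
print(business_lags)  # [1, 1]
```

The calendar lags look noisy, but the business-day lags are constant. In real data, that kind of pattern would be a decent hint that something batch-like on a business-day schedule is sitting underneath.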
As for how one learns to do this sort of data archaeology… I’m not sure yet and will have to think about it.
About this newsletter
I’m Randy Au, currently a Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. The Counting Stuff newsletter is a weekly data/tech blog about the less-than-sexy aspects of data science, UX research and tech, with occasional excursions into other fun topics.
All photos/drawings used are taken/created by Randy unless otherwise noted.
Supporting this newsletter:
This newsletter is free, share it with your friends without guilt! But if you like the content and want to send some love, here’s some options:
Tweet me - Comments and questions are always welcome, they often inspire new posts