Over the past couple of days, NYC has had a sudden cold snap that brought us to about freezing temperatures. It’s somewhat colder than the typical range for mid-November, but this sort of weather would be an everyday occurrence by around January. The reason I bring this up is that this is the first year in a great many where I’m not fighting and fiddling with the heating system at home, because everything is actually working as intended. It was a multi-year journey of trying to understand this ancient, bizarre, critical system that lurked within my walls.
More specifically, the system had various issues over the years that left certain rooms in the home cold. It took years of experimenting, making repairs, and painful guesswork before I finally learned enough to feel confident in running the danged system.
What strikes me about the whole process is how utterly painful it was to poke and prod at what was essentially a black-box system and reverse engineer what was going on. I would have expected that years of examining black-box data systems would have made me good at this other thing too, but nope!
Some context
The housing stock in NYC is generally very old — that means the heating systems built into them are reflections of their times. The home my parents raised me in was built in the 1920s and uses steam and huge cast iron radiators as the source of heat in the house. My home is slightly more modern, dating from the postwar period, and uses what’s called a “hydronic” system, which means that there’s a furnace that heats water, which circulates in a closed loop to smaller baseboard radiators.
When such a system is brand new and everything is in working order, these systems are pretty nice. They’re not as noisy as ancient steam systems, which have frequent bizarre banging/knocking/hissing sounds due to water hammer and steam venting. Water can store a lot of heat energy so the system tends to heat and cool in smooth curves.
But the one problem with these hydronic systems is that they can frustratingly break down if a bubble of air develops in the pipes and effectively cuts off the circulation of hot water. That can leave one or more radiators dead cold. Fixing the issue requires getting rid of the air in the system. There’s a lot of running around operating tiny rusty valves to bleed air out of high spots near radiators, messing with water pressure in an effort to flush the air to a vent, and even completely draining and refilling the system with water in hopes the filling process will remove the bubbles. But water straight from the tap contains plenty of dissolved air that eventually separates out, so while a refill will fix the problem in the short term, you’re always hoping that the system’s venting equipment can bleed out any air that comes out of solution before the problem reappears.
Trying to figure out the system blind
Even in the first winter that we spent in this place, we realized that there was one room that was consistently colder than all the rest. There was also a small room next to it that was only marginally better. And thus began the debug process.
A quick googling around brought to light that there was likely an air bubble that locked off that loop of heating so I ran around venting air out of all the radiators. Air did come out, but it didn’t really bring the heat we wanted. Instead it just sputtered some rusty heater water all over the place.
Next step was to drain and fill the whole system. That involved hoses and dumping rusty hot water out in a fairly intricate dance of turning valves off or on at various points to control how pressurized water flows through the system without risking cracking a hot furnace by dumping cold water on it suddenly. Refilling took an oddly long time, but that gave me time to run up and down the house undoing air release valves to try to get things running. That… to my great frustration… SOMETIMES worked.
I got so tired of running around venting air that I replaced a bunch of the manually operated air vents with automatic float vents. Those are more convenient but will eventually fail after a bunch of years so I’ve just created a pile of tech debt for myself.
It was around here that I started formulating all sorts of theories about what was going on. With all the fiddling around, depending on what seemed to be dumb luck, sometimes the whole front half of the house would be dead cold, and I’d have to mess with the system again. There must be forks in the piping that brought heat to the front versus back of the house. You could tell this because of what things got hot first and how radiators closer to the furnace would be hotter than ones further away.
Since sometimes the cold radiator didn’t seem to have any water inside of it, I’d started thinking that maybe the system wasn’t set at the correct pressure level to make sure that water got to where it needed to go. I tried tweaking the pressure higher or lower, using formulas found online to figure out how much pressure I needed to lift hot water high enough. Those attempted fixes, again, worked inconsistently.
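For a sense of what those online formulas look like, here is a minimal sketch. It assumes the standard hydrostatic figure of roughly 0.433 psi per foot of water column plus a few psi of headroom at the top of the system. The function name, the 4 psi margin, and the 18 ft height are illustrative assumptions, not measurements from the house in this story.

```python
# Hydrostatic pressure exerted by a 1 ft column of cold water, in psi.
PSI_PER_FOOT_OF_WATER = 0.433


def cold_fill_pressure_psi(height_ft: float, margin_psi: float = 4.0) -> float:
    """Estimate the cold fill pressure needed to lift water to the top radiator.

    height_ft:  vertical distance from the pressure gauge to the top of the
                highest radiator (a hypothetical measurement).
    margin_psi: extra pressure kept at the top of the system; 4 psi is an
                assumed rule-of-thumb value, not a figure from the post.
    """
    if height_ft < 0:
        raise ValueError("height_ft must be non-negative")
    return height_ft * PSI_PER_FOOT_OF_WATER + margin_psi


# Assumed example: highest radiator sits about 18 ft above the gauge.
print(f"{cold_fill_pressure_psi(18):.1f} psi")  # prints "11.8 psi"
```

A number like this only says whether water can physically reach the top radiator. It says nothing about whether water is actually circulating through it.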
Through countless experiments and observing how heat would flow from one pipe to another (learned through a lot of running between rooms chasing water flow like a maniac), I’d built up a rough guess at how the pipes would go from room to room, floor to floor.
Things got so messed up that eventually I opted to install a valve in the cold room so that I could drain the system to a bucket in that room once I over-pressurized everything. More often than not, the process would dump cold water out for an eternity while every other room in the house was warm. The valve would finally spew warm/hot water out as a sign that I’d drained all the dead, unmoving cold water out of that part of the loop and finally heat would come to that radiator… Except it didn’t. Water would just sit dead in that section, though if I somehow managed to get the pressure levels juuuuuuuust right, I’d be blessed with heat for a couple of days before the system would go out of whack again and I’d have to repeat the process. It was maddening. We gave up using that room during the winter months.
So, in my head, I had this loop of heating pipe that was constantly not getting enough water flow unless conditions were ideal. It was clear that this only happened on the furthest section from the main heating system. I had no idea how to fix it.
Finally, a catastrophic failure explains it all
Since the cold rooms weren’t being used very much, we actually limped along in this weird state for a bunch of years. But one day, the heating system finally had a critical failure: the circulation pump that was responsible for moving water through the pipes stopped making any noise and the whole house was slowly freezing. Since I am most definitely not capable of replacing such a critical component, the pros get called in.
They come and note that the pump is really dead, which was expected. But they also notice that the water regulator, which lets water into the heating system up to a specific pressure, is also acting weird. Finally, when trying to close off water so that they can replace the components, they find out that a gate valve upstream of the heating system has corroded/worn itself to shreds.
In the end, it seemed that the broken valve had been flushing chunks of corroded brass down the line, which badly gummed up the intake regulator and caused the really slow filling times I had experienced. Then, the bits probably circulated in the heating system and slowly broke the circulation pump over the course of years. Since the circulator had been very slowly getting ground up by chunks of metal bouncing around in the mechanism, it had been too weak to move water through the whole house. The cold room, being on the most distant branch of piping, just didn’t see enough circulation force and stayed cold despite my constant efforts.
Now, with the pump, regulator, and ruined valve replaced, the heating system has largely stopped giving me trouble. The cold room isn’t permanently cold any more.
With perfect hindsight, recognizing that the intake was really slow might have given a hint that the system was slowly dying from that end. But with my patchwork understanding of how this system was put together, I wouldn’t have known what to do with that bit of information. Without the large catastrophic system failure to highlight the issue, I would have been very unlikely to have opted for the expense of replacing some major components.
Ok ok, let’s bring this back — inferring system behavior is a VERY hard craft!
I didn’t just go on a giant story about plumbing purely for the fun of it (fun was only part of the reason). The process has an interesting parallel to what data folk have to sometimes do with black box systems that generate data in mysterious ways. That weird CSV file that comes from the warehouse? The data dump from some newly acquired subsidiary that you now have to report against? Sometimes that kind of work is very much like debugging a broken furnace.
The key point here is how I was faced with this unfathomable relic from the 1950s that I could only interact with from a limited set of access points. I had to build a mental map of the system by literally running around using my hands to feel for heat in radiators and walls. Everything was guesswork and conjecture.
While debugging the thing, all the guides and knowledge about these systems available online resembled the disjointed chunks of knowledge found on Stack Overflow for coding problems. There’s knowledge there, but you have to read, digest, figure out what bits appear relevant to your own situation, and then experiment.
So the strategy for progressing was to experiment and take notes in my head. I would try one suggested remedy, see how everything reacted, then repeat over and over as my repair attempts failed. Sometimes it would work for a short period, sometimes it wouldn’t even work the first time. Eventually, each observation would provide a set of constraints that narrowed down how the system must be set up in order to explain the behavior.
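That narrowing-down process can be sketched as plain hypothesis elimination. This is a toy illustration only: the three candidate pipe layouts, their names, and the two observations are all made up for the example, and each "theory" is reduced to a rule saying which patterns of warm and cold rooms it allows.

```python
from typing import Callable, Dict, List

# One observation: which parts of the house were warm at the time.
Observation = Dict[str, bool]

# Hypothetical candidate theories. Each returns True if the observation
# is something that layout could produce.
THEORIES: Dict[str, Callable[[Observation], bool]] = {
    # One big loop: everything is warm or cold together.
    "single_loop": lambda o: o["front"] == o["back"] == o["far_room"],
    # Front/back fork: the far room (in the front) always matches the front.
    "front_back_fork": lambda o: o["far_room"] == o["front"],
    # Fork with a weak far branch: the far room can stay cold while the
    # front is warm, but it is never warm while the front is cold.
    "fork_with_weak_far_branch": lambda o: not (o["far_room"] and not o["front"]),
}

OBSERVATIONS: List[Observation] = [
    {"front": False, "back": True, "far_room": False},  # front half dead cold
    {"front": True, "back": True, "far_room": False},   # only far room cold
]


def consistent_theories(theories, observations) -> List[str]:
    """Return the names of theories that no observation has ruled out."""
    return [
        name
        for name, allows in theories.items()
        if all(allows(obs) for obs in observations)
    ]


# Show how the set of surviving theories shrinks with each observation.
for n in range(len(OBSERVATIONS) + 1):
    print(n, consistent_theories(THEORIES, OBSERVATIONS[:n]))
```

Note that the surviving theory is only "not yet refuted". A new observation can still knock it out.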
Unlike most of the software systems that we work with at our jobs, I had no one I could ask questions about this system. The previous owner and the builders are long gone and obviously left no blueprints. I don’t even know if it had been competently installed or just slapped in as a budget afterthought. The system itself is old and rusty, so it’s definitely not running up to original spec and you just gotta account for a certain amount of weirdness. It’s probably the most frustrating system I’ve had the displeasure of working with simply due to its sheer inscrutability.
Each system is like a little science project in and of itself. Every interaction gives you a set of hints and constraints that help narrow down what the system is actually doing. But the actual explanation is a mere figment of imagination and conjecture in your head. I THOUGHT it was reasonable that the builder separated the front and rear halves of the house in a fork early on, and the system appears to operate that way for the most part, but there are some instances that seem to indicate there are more weird little loops and branches in the walls than I understand. All I have is this “theory of how heat works in my house” that seems to fit the limited data I have, and hasn’t been refuted yet. Until the next thing breaks in an unexpected way.
This is the EXACT same process I employ when I get thrown at a weird data table/export that I have to wrangle and don’t have anyone to ask about. Every row is an observation that is a reflection of the normal operation of some system built by a hopefully competent engineering team. They probably used reasonably standard design patterns and collect/generate data along certain patterns. Closely reading the data and how it evolves should provide constraints about how things worked — at least enough to let me guess how the system works and make a usable analysis.
As an example, imagine you’re staring at the database of an e-commerce site. You can see orders being made and put into the orders table. There’s a status field, as well as a reference to a shipping table. By examining the timestamps of all these items, you might be able to piece together how the order table only updates the status field when all the shipments of an order (which could cover multiple packages at once) have been marked as shipped and not a moment before. It seems like a workable theory and seems to fit all the example data you have. So you build out dashboards in accordance to those assumptions, and everyone’s happy.
Until one day, when you finally manage to get in contact with an engineer of the software (let’s assume it’s some 3rd party software that your store bought). You ask them how the order/shipments tables work and they blink in surprise. Why, the shipment and order statuses are actually handled by an order cleanup job that triggers whenever certain conditions are met: all packages shipping is one of the conditions, but so are payment issue resolutions, order returns and cancellations, and even just a 7-day timeout counter. Your theory only really covered the first, most common case, and missed everything else. So all your reports are… mostly correct but slightly broken in some places. Now you have to reevaluate everything.
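One way to make a theory like this less fragile is to write it down as an explicit check and run it against the data, so contradicting rows surface on their own. Below is a minimal sketch under invented assumptions: the table layout, field names (`status`, `status_updated_at`, `shipped_at`), status values, and sample rows are all hypothetical, and the theory being tested is the simple one from the example (an order closes exactly when its last package ships).

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample rows standing in for the orders and shipments tables.
ORDERS = [
    {"order_id": 1, "status": "closed", "status_updated_at": datetime(2022, 11, 2, 9, 30)},
    {"order_id": 2, "status": "open", "status_updated_at": None},
    {"order_id": 3, "status": "closed", "status_updated_at": datetime(2022, 11, 10, 0, 0)},
]
SHIPMENTS = [
    {"order_id": 1, "shipped_at": datetime(2022, 11, 1, 15, 0)},
    {"order_id": 1, "shipped_at": datetime(2022, 11, 2, 9, 30)},
    {"order_id": 2, "shipped_at": datetime(2022, 11, 3, 8, 0)},
    {"order_id": 2, "shipped_at": None},  # still waiting to ship
    {"order_id": 3, "shipped_at": None},  # never shipped, yet order 3 is closed
]


def contradicts_theory(order, order_shipments) -> bool:
    """Theory: an order closes exactly when its last package ships."""
    all_shipped = bool(order_shipments) and all(
        s["shipped_at"] is not None for s in order_shipments
    )
    if order["status"] != "closed":
        # Theory says a fully shipped order should already be closed.
        return all_shipped
    if not all_shipped:
        # Closed while packages are unshipped: the theory cannot explain this.
        return True
    last_shipped = max(s["shipped_at"] for s in order_shipments)
    return order["status_updated_at"] != last_shipped


def find_contradictions(orders, shipments):
    """Return the ids of orders whose rows the theory cannot explain."""
    by_order = defaultdict(list)
    for s in shipments:
        by_order[s["order_id"]].append(s)
    return [
        o["order_id"]
        for o in orders
        if contradicts_theory(o, by_order[o["order_id"]])
    ]


print(find_contradictions(ORDERS, SHIPMENTS))  # prints [3]
```

A check like this can only flag rows the theory fails to explain. It cannot say why, and an empty result still isn't proof that the theory is right.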
This is the risk we take when we have to work with black box systems. It’s the same epistemic problem that faces any scientific endeavor. In artificial contexts like data collection systems, we can often work around it by questioning the people who built the systems, but in the cases where that’s simply not an option, we have to fall back on this rather error-prone method. We’re always one contradicting data point from collapse.
Oh, and if that doesn’t worry you enough, consider large distributed systems that are complex enough to exhibit some emergent behavior. Now, there’s no one you can ask who understands the system well enough to describe what everything does.
If you’re looking to (re)connect with Data Twitter
Please reference these crowdsourced spreadsheets and feel free to contribute to them.
A list of data hangouts - Mostly Slack and Discord servers where data folk hang out
A crowdsourced list of Mastodon accounts of Data Twitter folk - it’s a big list of accounts that people have contributed to of data folk who are now on Mastodon that you can import and auto-follow to reboot your timeline
Standing offer: If you created something and would like me to review or share it w/ the data community — my mailbox and Twitter DMs are open.
New thing: I’m also considering occasionally hosting guest posts written by other people. If you’re interested in writing a data-related post to show off work or share an experience, or you need help coming up with a topic, please contact me.
About this newsletter
I’m Randy Au, Quantitative UX researcher, former data analyst, and general-purpose data and tech nerd. Counting Stuff is a weekly newsletter about the less-than-sexy aspects of data science, UX research and tech. With some excursions into other fun topics.
All photos/drawings used are taken/created by Randy unless otherwise credited.
randyau.com — Curated archive of evergreen posts.
Approaching Significance Discord - where data folk hang out and can talk a bit about data, and a bit about everything else. Randy moderates the Discord.
Support the newsletter:
This newsletter is free and will continue to stay that way every Tuesday, so share it with your friends without guilt! But if you like the content and want to send some love, here are some options:
Share posts with other people
Consider a paid Substack subscription or a small one-time Ko-fi donation
Tweet me with comments and questions