Postmortem: The Boiler Incident

Hrvoje Rancic

The following is an attempt to clarify the circumstances of a production incident that happened more than 10 years ago; it is also a lesson to all of us to decouple our production servers from boilers in distant Balkan countries.

A couple of weeks ago, while we were having a casual conversation over a cup of coffee, one of our veteran engineers said something that piqued my interest.  

“This reminds me of the time the boiler broke down in the Zagreb office and production went down.” – he remarked casually.

“I beg your pardon?” – I replied, my left eyebrow rising slightly to signal disbelief. 

“We experienced a major service degradation across all products served from the Frankfurt data center.” – he continued in a neutral tone of voice, as if he were reading an airport announcement for an upcoming flight. 

“You can’t be serious!” – I exclaimed in a shocked tone of voice, as if I had just realized I was late for that flight. 

“Of course I am serious!” – he said.

Here’s the full story.

Boiler in Zagreb crashed the servers in Frankfurt

“You remember the old office building in Zagreb, the skyscraper with the giant advertisement for toilet paper?” – he asked me. 

“Of course I do, I worked there for a couple of years. A very appropriate ad for the state the building was in.” – I answered. 

“Well, there was an old, rusty water heater in the basement, and it broke down.” – he said. 

“The next thing we knew, we were getting a ton of angry calls from our customers. I don’t remember the details, but it was all connected somehow – a boiler in Zagreb crashed the servers in Frankfurt.”

“Sounds plausible.” – I stated calmly, convinced he was pulling my leg.

“Who can tell me more about it?” – I asked, provoking him to continue, which he did without hesitation.  

The authorization server went down first

“I think the authorization server was the first to crash; it was handled by the One API team. The team was disbanded years ago, but most of the people who worked on it still work here, so you can try talking to them.” – he gave me names, and I jotted them down in my notebook.  

It made sense that an issue with the authorization server would affect all the products. It is an obstacle all aspiring requests must overcome, whatever their place of origin may be, if they wish to mature and become successful responses one day.

But how does the boiler fit into the picture? Did we start as a water heater provider and then pivot to the communication platform space?

The authorization server code I checked out provided no clues, so I sent a Slack message to the former One API team lead.

“I remember nothing!” – she replied and moved on with her duties, ignoring my further prompts, so I decided to continue my investigation elsewhere.

Boredom led me into the dark archives of Jira

With coding agents and other AI tools doing my work these days, and nothing better to do, I started digging through the git history and the Jira archives.

After a couple of hours I found a clue: a ticket with the description “Reimplement the logging and email adapter, so that we avoid issues such as the one we had when there was no electricity in the Zagreb office.”

There was a link to a pull request attached to the ticket; the username of the author belonged to a prominent figure, so I sent a Slack message to our Engineering Director: “We need to talk about a pull request you made to the auth server ten years ago.” Shortly after, he invited me to his office. 

“It was a warm spring day, the sun was shining, the birds were chirping, the children were playing, and production was down. We were drinking in the pub.” – he started reminiscing, “Production had a habit of crashing each time we went for a beer.” He confirmed that there was a serious outage and that it was connected to an email server; he couldn’t quite remember how the boiler fit into the story.  

“Why was the authorization server sending emails?” – I asked.  

“At the time we had implemented a brute force attack detection mechanism that could ban or lock attackers. Sometimes a legitimate customer would get banned, so they would contact their dedicated account manager for help. When a user – or, to be more precise, a combination of a username and an IP address – was blocked, the authorization server would automatically send the account manager the data that might come in handy.”
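
Reading between the lines, the detection mechanism itself was probably nothing fancier than a failure counter keyed by username and IP address. Here is a back-of-the-napkin sketch in Java – the threshold, the key format and the class name are my guesses, not the real implementation:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// A rough guess at the shape of the old brute force guard: count failed logins
// per (username, IP) pair and block once a threshold is crossed. The threshold
// and all names here are assumptions made for this sketch.
public class FailedLoginCounter {

    private static final int MAX_FAILURES = 10;

    private final Map<String, Integer> failures = new ConcurrentHashMap<>();

    // Returns true when this (username, IP) pair should be blocked; the caller
    // then blocks the pair and notifies the dedicated account manager.
    public boolean recordFailure(String username, String ip) {
        int count = failures.merge(username + "|" + ip, 1, Integer::sum);
        return count >= MAX_FAILURES;
    }

    // A successful login clears the counter for that pair.
    public void recordSuccess(String username, String ip) {
        failures.remove(username + "|" + ip);
    }
}
```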

“I see, so you mean something like a detailed description of our current algorithm, automation and telemetry that helps resolve the issue – edited out here for security reasons, so as not to breach the NDA via a public magazine article?” – I asked, only to show off my understanding of our security infrastructure, in hopes he would be impressed and commend me to someone responsible for raising my salary.  

“It was nowhere near that complex; we were not operating at this scale, and such issues would not happen very often, unless someone got too enthusiastic with pen testing.” – he responded bluntly, with a look on his face that I interpreted as him having no intention of commending my deep understanding of our security infrastructure to anyone responsible for a pay raise.

Then he shared a couple of interesting production outage stories from the early days and advised me to talk to the company’s first system administrator, since that gentleman had been maintaining the email server at the time.  

Let’s pay a visit to our SRE team

When I talked to the system administrator, he shared stories that put this one to shame: duct-taped disks, convincing people to put servers somewhere the rain couldn’t reach them, crashing the premises of a third-party vendor (a euphemism for a guy with a couple of servers at home who rents them out to small companies) who had taken a server offline for a firmware update during peak traffic hours.

He could not remember the boiler issue, but he recalled an unrelated air conditioner incident one summer in the same office building, when the air conditioner cooling the centralized control system for the other air conditioners overheated because it had been installed on the rooftop.  

I decided to pay a visit to our SRE department, as their lead engineer had been a member of the One API team once upon a time.

The SRE department makes sure that our global platform, spanning several continents and dozens of data centers, works reliably. They work on early detection and quick mitigation of the service degradations that can happen as we make thousands of changes to production every day.

Surely, their lead engineer has nothing better to do than to talk to me about a production issue from a decade ago.  

There was some electrical maintenance in the office building, and right before it someone reconfigured the auth server to send e-mails to a standby instance in the Zagreb office; twenty-one thousand four hundred and eighty-seven emails were stuck in the queue. Can’t talk now, the sharks are eating the network cables again.

…he said, hurrying off with his coffee. 

Boiler blew, mail server died, everything froze

I finally connected all the dots and found someone to confirm the theory.

Our principal engineer – once a junior contributor to the auth server – was back from parental leave and explained what had happened:

The same thread that banned the user also sent the email synchronously via RPC, with no timeout. When the boiler broke, it tripped the circuit breaker, cutting power to the floor. The mail server was down, and so were we – waiting in the pub for the electricity company. When customers complained, we were one thread dump away from realizing all request-handling threads were stuck waiting on the mail server.
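
In code, the failure mode he described would have looked roughly like the toy reconstruction below. The names are invented and the thread pool is a stand-in for the real request handling, but the shape of the problem is the same: the ban and the email share a thread, and the email call has no timeout.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// A toy reconstruction of the failure mode. The names and the thread pool are
// stand-ins, not the real auth server code.
public class BlockingBanNotifier {

    public interface MailClient {
        void send(String to, String body); // blocking RPC, no timeout configured
    }

    private final MailClient mailClient;

    public BlockingBanNotifier(MailClient mailClient) {
        this.mailClient = mailClient;
    }

    // Runs on the request-handling thread after too many failed logins.
    public void onTooManyFailures(String username, String ip) {
        ban(username, ip);
        // The same request thread now calls the mail server synchronously.
        // If the mail server is powered off, this call never returns and the
        // thread is gone from the pool for good.
        mailClient.send("account-manager@example.com",
                "Blocked " + username + " from " + ip);
    }

    private void ban(String username, String ip) {
        System.out.println("banned " + username + " from " + ip);
    }

    public static void main(String[] args) {
        // A mail server that never answers, e.g. one whose rack has lost power.
        MailClient deadMailServer = (to, body) -> {
            try {
                Thread.sleep(Long.MAX_VALUE);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        BlockingBanNotifier notifier = new BlockingBanNotifier(deadMailServer);
        ExecutorService requestPool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 4; i++) {
            final int n = i;
            requestPool.submit(() -> notifier.onTooManyFailures("user" + n, "10.0.0." + n));
        }
        // All four "request" threads are now parked inside send(); the program
        // never exits, and a thread dump shows them waiting on the mail server.
    }
}
```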

“The circuit breaker broke, but there was no circuit breaker around the call to the mail server!” – I joked nerdishly.

“Exactly! After the electricity company fixed the issue, someone duct-taped the circuit breaker in order to increase reliability. If we had had those things back then, can you imagine what the status page and the post-incident review would have looked like?” – he asked, laughing, as we continued to entertain that thought. 

Status page – Frankfurt: Service degradation across all products 

We are experiencing service degradation across all products due to a boiler malfunction. The plumber is on his way to the office building, and we expect to resolve the problem shortly. We apologize for any inconvenience this may cause. 

Postmortem: Boiler malfunction (incident #420)

Date: April 1, 2010 

Authors: Alice, Bob 

Status: Complete, action items in progress 

Summary: The boiler broke down and all products experienced service degradation for 2 hours. 

Impact: A large number of messages never went out; there was a noticeable financial impact, plus a new valve for the boiler and 1,000 dollars for one hour of the plumber’s work. 

Root causes: Cascading failure due to a water heater malfunction in the Zagreb office building. The floor lost electricity, and so the email server went down. At the time, the authorization server in Frankfurt was configured to send emails through that email server. The calls to the email server were made synchronously with no timeout, so all threads ended up blocked on calls to the email server, and there were no threads available to serve incoming requests. 

Trigger: Malfunction in the boiler; waiting for the plumber’s analysis. 

Resolution: The authorization server was reconfigured to use the email server instances in the Frankfurt data center. 

Detection: The customers complained before we detected the issue.  

Action items: 

  1. Batch emails and send them asynchronously (see the sketch after this list) 
  2. Introduce a circuit breaker around the RPC call 
  3. Duct tape the circuit breaker of the water heater 
  4. Put the electrician’s and the plumber’s phone numbers on speed dial 

Lessons Learned

Don’t couple production traffic to the boiler.  

Timeline

2010-04-01 (all times UTC) 

  • 9:45 – deployment of the authorization server in Frankfurt; due to a configuration mistake, it starts using the mail server in Zagreb 
  • 10:32 – the boiler burns out and triggers an issue with the electrical supply to the building; the rack with the email server also loses power 
  • 10:45 – the office building manager calls the electricity company  
  • 10:50 – the One API team goes for a beer in the nearby pub 
  • 11:05 – a young and enthusiastic application security engineer hears about the new brute force detection mechanism and decides to test it 
  • 11:30 – electricians blame the plumbers  
  • 11:35 – OUTAGE BEGINS: many request-handling threads are waiting for a response from the email server; the authorization server starts throttling incoming requests 
  • 11:45 – customers start calling customer support; the line is busy – we are trying to reach the plumber 
  • 11:50 – INCIDENT BEGINS: all request-handling threads are in a waiting state, no traffic is processed 
  • 12:00 – in the pub, the One API team jokes about how production went down the last two times they went for a beer 
  • 12:05 – plumber answers the telephone and starts negotiating his hourly rate 
  • 12:30 – plumber agrees on an hourly rate 
  • 12:45 – the One API team notices missed calls from the customer support team 
  • 12:55 – the One API team rushes back to the office and bumps into a team of electricians telling them they can’t go in right now 
  • 13:20 – the One API team takes a thread dump and figures out the issue 
  • 13:35 – the configuration of the authorization server is changed 
  • 13:45 – OUTAGE MITIGATED: the service starts recovering  
  • 13:50 – INCIDENT ENDS  
  • 15:55 – the plumber enters the building  
  • 16:05 – the plumber leaves the building after replacing the valve on the boiler 