Good runbooks are a MUST – unless you want to risk a heart attack


It’s a dull, rainy Friday. You finish coffee with your bestie and head home with your work laptop. On-call isn’t great, but at least it pays extra.
As soon as you get home, the company phone starts ringing. You find it and feel a mild panic – it’s Opsgenie. You know something is wrong and that time is critical. You grab your laptop, turn it on, find the alert that fired, and start troubleshooting.
You get an alert for increased errors on the spam service, follow the runbook, and realize you can’t fix it, so you call the team. They handle it, and you still have the rest of the night free.
Has this ever happened to you? Well, it can if you have good runbooks!
We made our first runbooks, and oh, they were great
When you react to alerts, stress is inevitable, and it makes you miss clues to the root cause. More often than not, you end up reading the documentation of how something works instead of looking for what broke.
So, let me take you back to the time when we created our first runbook.
When I first started as a Reliability Operations Engineer, the team was newly formed, so we didn’t have any alerts or runbooks. But we knew we were blind on many fronts and needed a way to catch platform issues.
This led us to inherit some alerts from teams like SRE and start creating our own. We were proud of our alerts, cheering when they caught incidents. Looking back, they were mediocre, but at the time, we didn’t know better.
It was smooth when alerts came during our morning shift, as every dev team was online, and we could ping them for clarification on issues.
This grew into a very fruitful process that we still follow almost daily – we receive an alert, notice it’s for a specific service, ping the dev teams for information and clarification, and then decide how that information affects what we want the alert to catch.
Was the alert useful? Did it notify us of an actual issue, or did it alert us to normal behavior for that service/location? We can then exclude errors from specific locations where they’re expected or fine-tune the thresholds for the alert condition.
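To make that concrete, here’s the kind of note such a review might produce. The service, location, and numbers are invented for illustration – your own reviews will look different:

```
# Hypothetical alert-review note – the service, location, and numbers are made up
Alert review: "Spam service – increased error rate"
- Fired at 14:05; the errors came only from the eu-west probes.
- Dev team says eu-west runs a canary build, so elevated errors there are expected.
- Decision: exclude eu-west from the alert condition and raise the error-rate
  threshold from 1% to 3%, so normal background noise no longer pages anyone.
```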
Responding to our first on-call alerts nearly gave me a heart attack
I still remember it clearly – we had just started on-call rotations, and it was my turn. The previous week was calm, so I expected the same. After work, I did my groceries and chores, keeping my laptop and company phone nearby. Everything was quiet, so I showered, got ready for bed, and slept. Then, at 1:30 AM, my company phone rang.
Waking up disoriented, I stumbled through the dark, avoiding the lights so I wouldn’t wake my girlfriend – she really values her beauty sleep. I grabbed the phone, saw an Opsgenie alert, and mild panic set in. I headed to the living room, grabbed my laptop, and started troubleshooting.
The alert was named ‘Whitelabel CUP DOWN,’ and I had never seen or heard of it. Normally, I’d start troubleshooting and look for what went wrong, but here, I had no idea where or what to search for. The pressure was mounting, as reacting to this alert was my responsibility, and I didn’t know what to do.
For the next 45 minutes, I stumbled through outdated Confluence pages and old Slack threads, trying to find anything about Whitelabel CUP. I was lost, unsure which logs to check or if it was impacting clients.
At that point, I decided I couldn’t just keep searching the documentation; I needed to act. But it was past 2 AM – what if it wasn’t an issue, and I woke someone up for no reason? Who should I even call?
To my relief, the alert resolved itself. I was ecstatic it was over, but what if it happens again? Where are the logs? What if someone’s already looking into it? What if clients have complained? I had so many questions, so I wrote them down, closed my laptop, and went back to sleep, hoping it wouldn’t trigger again. And it didn’t.
How can we improve this?
The next morning, I admitted my struggles in the daily stand-up, feeling hopeless because I didn’t know how to verify whether it was a real issue or a false positive, where to find the logs, or who to call. As a team, we decided to dedicate time to figuring out how to handle this better in the future.
The root cause was my lack of knowledge about Whitelabel CUP and how issues with it manifest.
This can be addressed in two ways: as a team, we could all invest time to learn about it, but that would require significant man-hours, which isn’t the most efficient approach, especially when a new person joins. Alternatively, we could assign one person to learn the details, create a runbook, and then share it with the team.
This is how to create a runbook
The task I was working on started as documenting Whitelabel CUP but evolved into creating a runbook with clear handling steps: where to check the logs and, most importantly, who to call if we observe problems – so that once we receive the alert, we have a flow that helps us troubleshoot it successfully and in a timely manner.
A prerequisite for good runbooks is having good alerts. Because if your alert is bad, the runbook you create for it will, at best, be just as bad. So, for the sake of this article, I’ll skip this part – I’m assuming you already have good alerts, right? RIGHT?
I’ll share some generic tips that worked well for us, though your needs may vary since every alert is different.
Runbooks should be written as clearly and unambiguously as possible – so even the newest team member can successfully troubleshoot the alert.
The first step in a runbook is to quickly assess whether it’s a false positive or a real issue. This might involve checking synthetic monitoring, seeing if the same issues appear there, looking at a different data source, or simply checking how production traffic is behaving.
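For example, the opening section of a hypothetical runbook could look something like this – the links are placeholders in the same spirit as the example links later in this article, so swap in whatever your team actually uses:

```
<!-- Hypothetical runbook snippet; all links are placeholders -->
## Step 1: Is it a false positive?
- Do the synthetic monitors show the same errors? Check here: synthetic-monitoring.com/spam-service
- Is production traffic actually affected? Check here: grafana.com/spam-traffic
If both look normal, it's most likely a false positive – note it down so the
alert can be tuned later, and don't escalate at 2 AM.
```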
After the ‘false positive check’ comes the juicy part of the runbook. You can direct people to check logs or Grafana panels to confirm whether a hypothesis is true or false. The best approach is to ask a question and provide the answer right in the runbook.
For example, the most common one: ‘Is there packet loss towards this location? Check here: grafana.com/packetloss.’ This way, you’re asking the question and providing an easy way to check, all in the runbook.
Next, ask, ‘Is only service X affected? Check here: all-service-logs.com.’ This helps determine if it’s just one service or multiple. For service-specific alerts, include mitigating actions like, ‘Find the affected instance here: list-of-instances.com. Redeploy the instance with the highest error rate here: instance-manager.com.’
The goal is to capture the ideal troubleshooting process, or get as close to it as possible, while making it quicker to execute. By providing exact links, you save the time you’d otherwise spend typing queries and hunting for the right Grafana panel.
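Put together, the troubleshooting part of that same hypothetical runbook might read like this – again, every link and contact is a placeholder, reusing the example addresses above:

```
<!-- Hypothetical runbook snippet; links and contacts are placeholders -->
## Step 2: Troubleshooting
1. Is there packet loss towards this location? Check here: grafana.com/packetloss
2. Is only service X affected? Check here: all-service-logs.com
   - If other services show the same errors, it's likely a platform or network
     issue – engage the infrastructure on-call.
3. If only service X is affected:
   - Find the affected instance here: list-of-instances.com
   - Redeploy the instance with the highest error rate here: instance-manager.com
4. Still broken, or clients impacted? Call the owning team via their Opsgenie
   escalation – don't keep searching the docs on your own.
```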
Good runbooks are an ongoing, iterative effort
Effort is appreciated in all things, and runbooks are no exception. Creating a good runbook seems simple at first, but once it’s used, you’ll realize there are gaps, confusing parts, or areas for improvement.
With experience handling that alert, you’ll be able to streamline things, adding steps or excluding ones you initially thought were necessary. Keep in mind, you’re not writing documentation, but a quick guide to help you get to the root cause and take action – whether that means mitigating the issue yourself or engaging the responsible team.
Good runbooks are never really finished – we update them whenever something’s missing or outdated. Done right, they speed up issue detection and mitigation, and they help onboard new team members. Just give them a runbook and let them ask questions – it’s a great way to gather fresh insights.
If you have alerts, try writing runbooks for them – they’ve made our lives easier, and they might do the same for you.