When systems go down, devs still juggle 10 tabs. PagerDuty says MCP fixes that

Ivan Pelivanovic

Production incidents are a context problem. By the time an engineer understands what’s happening, they’ve already bounced across several different tools – and the incident is still ongoing. PagerDuty thinks MCP is the fix.

When incidents hit production systems, engineers rarely stay inside one tool for long. They jump from logs to dashboards to runbooks, trying to reconstruct what is actually happening.

Talking to other builders, it seemed like almost everybody faces this context-switching problem.

Rocío Bayon (Product Manager) and Sebastian Villanelo (Sr. Forward Deployed Engineer) from PagerDuty think MCP is how you fix it.

PagerDuty built their MCP to cut context switching

Rocío explained that their MCP is solving the issue of context switching:

When an incident hits, the engineer has to go between 5 to 10 different tools to understand what’s happening.

That’s the real problem they’re trying to solve.

PagerDuty’s framing of MCP was interesting: neither Rocío nor Sebastian described MCP as just another integration layer. They framed it as connective tissue that gathers logs, alerts, runbooks, and incident context into a single workflow.

What the MCP does, it brings all that context into one platform where engineers are usually already working.

Most engineering organizations already have enormous amounts of observability data. The real problem is that it is scattered across systems, and engineers end up reconstructing operational context manually during incidents.

Retrieve what you need, nothing more

Sebastian framed the problem as signal retrieval. Rather than feeding the model more information, the goal is pulling the relevant operational state around a specific incident.

If you have the right parameters or the queries and all this stuff, you will retrieve the exact information that you need.

That means narrowing context around the actual incident window. When an incident hits, it retrieves information around that time only, Sebastian explained.
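To make the idea of an incident window concrete, here is a minimal sketch in Python. It is not PagerDuty's implementation or API: the `Event` record, the `events_in_incident_window` function, the sample data, and the default window sizes are all hypothetical and chosen only for illustration. It shows the general technique of keeping only the events that fall close to the start of an incident, whichever tool they came from.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    """Hypothetical, tool-agnostic record (a log line, an alert, a deploy)."""
    source: str
    timestamp: datetime
    message: str


def events_in_incident_window(
    events,
    incident_start,
    before=timedelta(minutes=15),  # assumed look-back, not a PagerDuty default
    after=timedelta(minutes=5),    # assumed look-ahead, not a PagerDuty default
):
    """Return only the events near the incident start, oldest first."""
    window_start = incident_start - before
    window_end = incident_start + after
    selected = [e for e in events if window_start <= e.timestamp <= window_end]
    return sorted(selected, key=lambda e: e.timestamp)


# Made-up sample data from three imaginary sources.
incident_start = datetime(2024, 1, 1, 12, 0)
events = [
    Event("logs", datetime(2024, 1, 1, 9, 0), "nightly job finished"),
    Event("alerts", datetime(2024, 1, 1, 12, 1), "5xx rate above threshold"),
    Event("deploys", datetime(2024, 1, 1, 11, 50), "checkout-service deployed"),
    Event("logs", datetime(2024, 1, 1, 15, 0), "cache warmed"),
]

relevant = events_in_incident_window(events, incident_start)
for event in relevant:
    print(event.timestamp.isoformat(), event.source, event.message)
```

In this sketch, the 09:00 and 15:00 log lines are dropped and only the deploy and the alert are kept, so whatever consumes the result (an engineer or a model) sees a much smaller slice of context.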

That also changes how they think about efficiency: reducing context switching directly affects operational speed, token usage, and cost.

You will see that information only with one call. And that saves a lot of tokens and time. That’s money and time.

Photo: Lea Lobor

AI helps but engineers still decide

Still, both of them were careful not to frame AI as autonomous incident management.

Rocío repeatedly emphasized that MCP and AI systems are primarily helping with context gathering and operational visibility, while engineers remain responsible for the high-risk decisions:

The AI is helping you, but the engineer is the one who is assessing and making decisions where there’s a high risk.

That human layer is intentional. PagerDuty’s broader vision seems less about replacing on-call engineers and more about reducing the operational overhead surrounding incidents. Their MCP systems help gather information, surface relationships between systems, and accelerate investigation workflows, but humans still decide what actually happens next.

Rocío also mentioned that their SRE agent is designed to support larger incident workflows beyond information retrieval:

It can also help you trigger those incident workflows. So it can help you resolve the incident. And it learns as it goes.

“MCP – the connective tissue between tools”

I asked Rocío and Sebastian how MCP fits into the tools they already use without becoming just another silo.

Both of them clearly framed MCP as anti-silo infrastructure, since it brings everything into one place. Rocío called MCP “the connective tissue between all these different tools.”

That framing probably captures the broader architectural challenge better than anything else in the interview.

Modern incident response already spans dozens of systems: observability platforms, deployment pipelines, CI/CD tooling, ticketing systems, infrastructure management, and communication layers.

AI systems inherit that fragmentation unless something explicitly connects operational state.

Engineers trust systems that behave predictably

Sebastian mentioned that teams often react very differently to MCP systems. Some embrace them immediately while others remain skeptical, especially around security and predictability. For him, trust improves once systems consistently produce expected outcomes:

When a person or a teammate says “ah, I’m retrieving what I’m expecting to retrieve”, that will help them to trust it.

A lot of AI tooling discussions still focus on model capability, reasoning quality, or benchmark performance. But operational systems are usually adopted much more pragmatically. Engineers trust systems that behave predictably, retrieve the right operational context, and fit into workflows they already rely on.
