AI & AUTOMATION 6 min read Mar 12, 2026

MCP Agents in Cloud Operations: How We Cut L1 Incidents by 73%

We connected Claude via MCP to our infrastructure stack. Here's what happened when AI agents started diagnosing OpenStack issues autonomously.

Ana Đorđević

AI Engineering Lead

For the first 18 months of running PLATFORMA in production, every infrastructure alert followed the same path: PagerDuty fires, engineer wakes up, SSH into the node, check logs, find the problem, fix it, write a postmortem. Average resolution time: 47 minutes. At 3 AM, closer to 90.

We kept asking ourselves the same question: 80% of these incidents follow the same diagnostic pattern, with the engineer checking the same five things in the same order. So why can't a machine do this?

Enter MCP

Model Context Protocol gave us the missing piece. We'd experimented with LLM-based diagnostics before, but the problem was always context. A model that can't read your actual logs, query your actual metrics, or inspect your actual OpenStack state is just guessing. MCP changes that — it gives the model structured, authenticated access to real infrastructure data.

Our MCP server exposes three tool categories: OpenStack operations (list instances, check hypervisor health, read Nova logs), monitoring queries (Prometheus range queries, alert status), and platform state (tenant info, provisioning queue, recent orders).
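To make the shape of this concrete, here is a minimal sketch of a tool registry organized around those three categories. The decorator, tool names, and payloads are all illustrative stand-ins, not the actual PLATFORMA MCP server or the MCP SDK; real implementations would return live Nova, Prometheus, and platform data.

```python
from typing import Callable, Dict

# Registry mapping "category.tool_name" to a callable, mirroring the
# three tool categories described above.
TOOLS: Dict[str, Callable[..., dict]] = {}

def tool(category: str):
    """Register a function as an MCP-style tool under a category."""
    def wrap(fn):
        TOOLS[f"{category}.{fn.__name__}"] = fn
        return fn
    return wrap

@tool("openstack")
def list_instances(hypervisor: str) -> dict:
    # In production this would call the Nova API; stubbed here.
    return {"hypervisor": hypervisor, "instances": ["vm-01", "vm-02"]}

@tool("monitoring")
def prometheus_range_query(query: str, minutes: int) -> dict:
    # Stand-in for a Prometheus HTTP range query.
    return {"query": query, "window_min": minutes, "samples": []}

@tool("platform")
def tenant_info(tenant_id: str) -> dict:
    # Stand-in for a platform-state lookup.
    return {"tenant": tenant_id, "plan": "standard"}
```

The registry is the point: the model never sees anything that isn't explicitly registered here.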

The Diagnostic Loop

When an alert fires, the agent receives the alert payload and starts investigating. A typical flow: CPU alert on compute-node-07 → query Prometheus for the last 30 minutes of CPU data → list all instances on that hypervisor → identify the instance consuming 94% of host CPU → check if it's a legitimate workload or a runaway process → cross-reference with the tenant's order to verify resource limits → take action or escalate.
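The flow above can be sketched as plain code. Every helper here is a hypothetical stand-in for an MCP tool call, with canned data in place of live Prometheus and Nova results; the decision logic is what matters.

```python
def query_cpu_history(node, minutes):
    # Stand-in for a Prometheus range query: host CPU % over the window.
    return [88, 91, 94]

def list_instances(node):
    # Stand-in for a Nova query: instances on this hypervisor.
    return [
        {"name": "vm-batch-01", "tenant": "t-acme", "host_cpu_pct": 94},
        {"name": "vm-web-02", "tenant": "t-beta", "host_cpu_pct": 3},
    ]

def lookup_order(tenant):
    # Stand-in for a platform-state lookup of the tenant's order.
    return {"tenant": tenant, "cpu_limit_pct": 50}

def diagnose_cpu_alert(alert):
    node = alert["node"]
    history = query_cpu_history(node, minutes=30)
    sustained = min(history) > 80  # high for the full window, not a blip
    suspect = max(list_instances(node), key=lambda i: i["host_cpu_pct"])
    order = lookup_order(suspect["tenant"])
    over_limit = suspect["host_cpu_pct"] > order["cpu_limit_pct"]
    action = "remediate_or_escalate" if (sustained and over_limit) else "none"
    return {"node": node, "suspect": suspect["name"], "action": action}
```

Diagnosis reduces to a pipeline of tool calls plus a small amount of cross-referencing logic, which is exactly why it automates well.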

MCP AGENT PIPELINE: Alert → Diagnose → Fix
73% of L1 incidents auto-resolved · 3.2 min average resolution · 12 issues prevented

The key insight is that the agent doesn't need to be creative. It needs to be thorough and fast. It checks every possibility in parallel, something a human can't do at 3 AM. And it never forgets to check the obvious things — DNS, disk space, certificate expiry — that tired engineers skip.
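Fanning out the checklist is straightforward with a thread pool. The checks below are hypothetical stubs returning (name, ok); real versions would each call an MCP tool.

```python
from concurrent.futures import ThreadPoolExecutor

def check_dns(node):
    return ("dns", True)

def check_disk_space(node):
    return ("disk_space", True)

def check_cert_expiry(node):
    return ("cert_expiry", False)  # pretend the cert check fails

CHECKS = [check_dns, check_disk_space, check_cert_expiry]

def run_checks(node):
    """Run every check in parallel; return the names of failed checks."""
    with ThreadPoolExecutor(max_workers=len(CHECKS)) as pool:
        results = list(pool.map(lambda check: check(node), CHECKS))
    return [name for name, ok in results if not ok]
```

Because the checks are independent reads, running them concurrently is safe, and the checklist never gets shortened by fatigue.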

Results After 90 Days

73% of L1 incidents are now resolved by the agent without human intervention. Average resolution time dropped from 47 minutes to 3.2 minutes. The agent handles the boring, repetitive diagnostics. Engineers handle the interesting, complex problems that actually require creativity.

We also discovered something unexpected: the agent catches issues before they become incidents. By running periodic health checks rather than waiting for alerts, it identified 12 potential problems in the first month that would have become customer-facing issues.
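The proactive mode is conceptually just the reactive checklist on a timer. A hedged sketch, reusing stub checks of the same (name, ok) shape; the interval and check names are assumptions, not our production configuration:

```python
import time

def check_disk_space(node):
    return ("disk_space", True)

def check_cert_expiry(node):
    # Pretend one specific node has an expiring certificate.
    return ("cert_expiry", node != "compute-node-07")

def health_sweep(nodes, checks, interval_s=300, rounds=1):
    """Walk every node through every check; collect (node, check) failures."""
    findings = []
    for round_no in range(rounds):
        for node in nodes:
            for check in checks:
                name, ok = check(node)
                if not ok:
                    findings.append((node, name))
        if round_no < rounds - 1:
            time.sleep(interval_s)
    return findings
```

Findings from a sweep become tickets instead of pages, which is the difference between a prevented issue and an incident.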

What We Learned

Don't try to make the agent handle everything. Our first iteration tried to be too clever — it would attempt complex remediation actions and sometimes make things worse. The current version has strict guardrails: it can diagnose anything, but it can only take safe remediation actions (restart a service, clear a queue, scale a resource). Anything destructive gets escalated to a human with a full diagnostic report.
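The guardrail reduces to an allowlist check at the execution boundary. A minimal sketch, assuming the three safe action names above; anything not on the list ships the diagnostic report to a human instead of running:

```python
# Only these actions may run without a human in the loop.
SAFE_ACTIONS = {"restart_service", "clear_queue", "scale_resource"}

def execute(action, target, report):
    """Run a remediation only if it is on the safe allowlist;
    everything else escalates with the full diagnostic report attached."""
    if action in SAFE_ACTIONS:
        return f"executed {action} on {target}"
    return f"escalated to on-call: {action} on {target}\n{report}"
```

Note the asymmetry: the agent may diagnose anything, but the allowlist gates only the write path, so a bad diagnosis can waste time but not break infrastructure.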

MCP is the right abstraction layer. The model doesn't need raw SSH access. It needs structured tools that return clean data. The MCP server is our security boundary — it controls exactly what the agent can see and do.
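One way to picture that boundary: the server validates every tool call against a declared schema before anything touches the infrastructure, so the model can only express requests the operator explicitly exposed. The tool names and schema format here are hypothetical:

```python
# Declared surface area: tool name -> expected parameter types.
ALLOWED_TOOLS = {
    "read_nova_log": {"node": str, "lines": int},
    "restart_service": {"node": str, "service": str},
}

def call_tool(name, **kwargs):
    """Reject unknown tools and malformed arguments at the boundary."""
    params = ALLOWED_TOOLS.get(name)
    if params is None:
        raise PermissionError(f"tool {name!r} not exposed to the agent")
    for param, typ in params.items():
        if not isinstance(kwargs.get(param), typ):
            raise TypeError(f"{name}: {param} must be {typ.__name__}")
    return {"tool": name, "ok": True}
```

With SSH, the blast radius is the whole node; with a validated tool surface, it is exactly the set of calls you wrote down.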