Error Triage & Root Cause Analysis with AI Agents

In the middle of my day, between two meetings, an error alert fired. I stopped what I was doing and opened it.

The session had completed successfully. The alert was simply a warning that the video had been cutting in and out.

Okaya is a conversational operational readiness system designed to help individuals and organizations manage stress in high-impact careers. A session is a guided, private conversation using voice and video, so even non-fatal errors are worth paying attention to.

By the time I looked more closely, the system already had an explanation. An agent had pulled logs from frontend and backend systems, checked the database outcome, correlated events, and identified the likely root cause:

Opera temporarily disables the camera when a user switches tabs.
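To make the correlation step concrete, here is a minimal sketch of how events from two log sources might be merged into one timeline and checked for this pattern. It is not the actual agent. The `Event` shape, the event names (`tab_hidden`, `camera_track_muted`, `video_frames_stalled`), and the two-second window are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    session_id: str
    source: str   # hypothetical labels: "frontend" or "backend"
    name: str     # hypothetical event names, e.g. "tab_hidden"
    at: datetime


def timeline(session_id, *streams):
    """Merge one session's events from several log sources, ordered by time."""
    merged = [e for stream in streams for e in stream if e.session_id == session_id]
    return sorted(merged, key=lambda e: e.at)


def camera_drops_after_tab_switch(events, window=timedelta(seconds=2)):
    """Return (hidden, muted) pairs where the camera muted soon after a tab switch.

    Expects events already sorted by time, as returned by timeline().
    """
    pairs = []
    last_hidden = None
    for e in events:
        if e.name == "tab_hidden":
            last_hidden = e
        elif e.name == "camera_track_muted" and last_hidden is not None:
            if e.at - last_hidden.at <= window:
                pairs.append((last_hidden, e))
            last_hidden = None  # each tab switch explains at most one mute
    return pairs


# Illustrative data only; not real logs.
t0 = datetime(2024, 1, 1, 12, 0, 0)
frontend = [
    Event("s1", "frontend", "tab_hidden", t0),
    Event("s1", "frontend", "camera_track_muted", t0 + timedelta(milliseconds=300)),
    Event("s2", "frontend", "tab_hidden", t0),
]
backend = [Event("s1", "backend", "video_frames_stalled", t0 + timedelta(seconds=1))]

events = timeline("s1", frontend, backend)
pairs = camera_drops_after_tab_switch(events)
```

A match here is only a hint toward a cause. Pairing it with the browser recorded for the session is what would point at browser-specific behavior.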

That was new to me. I didn’t realize Opera behaved differently than Safari, Edge, or Chrome. It’s hard to know everything, and browser-specific behavior like this is often poorly documented and hard to find with a web search.

In the past, an alert like this might have taken half a day to fully understand. I would have been pulled out of my workflow, moving between logs, events, and database records, trying to piece together what happened.

Why did I create this agent?

I’ve performed this same process hundreds of times across the companies I’ve worked for. Pulling engineers away from their work just to investigate alerts is a huge time sink, especially when the data is scattered across systems.

So I built an agent that runs automatically when a session completes. It gathers relevant context, summarizes what happened, and explains why in plain language.
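The shape of such a hook can be sketched in a few lines. This is an illustration under assumptions, not the production code: `SessionContext`, `build_prompt`, and `on_session_complete` are hypothetical names, `summarize` stands in for whatever model call produces the explanation, and skipping sessions with no warnings is a choice made for this sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class SessionContext:
    session_id: str
    db_status: str                                  # outcome recorded in the database
    warnings: list = field(default_factory=list)    # non-fatal alerts for the session
    log_lines: list = field(default_factory=list)   # already-correlated log excerpts


def build_prompt(ctx: SessionContext) -> str:
    """Assemble the gathered context into a single request for the summarizer."""
    lines = [
        f"Session {ctx.session_id} finished with database status '{ctx.db_status}'.",
        "Warnings: " + "; ".join(ctx.warnings),
        "Relevant log lines:",
        *ctx.log_lines,
        "Explain in plain language what happened and the most likely cause.",
    ]
    return "\n".join(lines)


def on_session_complete(
    ctx: SessionContext, summarize: Callable[[str], str]
) -> Optional[str]:
    """Return a triage report, or None when there is nothing to explain."""
    if not ctx.warnings:
        return None  # assumption: clean sessions need no report
    explanation = summarize(build_prompt(ctx))
    return f"[{ctx.session_id}] {ctx.db_status}: {explanation}"
```

Passing `summarize` in as a plain callable keeps the gathering logic testable without a model: a stub such as `lambda prompt: "stub explanation"` is enough to exercise the hook.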

One important note: AI was very helpful here, but it didn’t work on its own. A human still had to design the system with observability in mind and decide how the triage automation should work.

Originally published on LinkedIn — view the original post for comments and reactions.
