Robert Truesdale

Real AI Workflows That Work in IT Operations

Most "AI for IT" content is either breathless hype or so vague it couldn't help you debug a stuck cron job. I've been running AI workflows in production since 2026, and I'm here to tell you what actually moves the needle—and what just burns budget and attention.

This isn't about replacing your job. It's about offloading the repetitive noise that eats your day. Here's what works, what breaks, and where to actually spend your time.

What AI Actually Handles Well in IT Ops

The honest answer: pattern matching at scale. AI excels at processing large volumes of text, classifying it, and extracting structure. It is not a replacement for a sysadmin who understands a system. It cannot troubleshoot a flaky SAN connection or know that the "storage" team broke something again.

What it does handle well:

  • Parsing and summarizing log volumes that would take hours to read
  • Generating first-draft documentation from configs and runbooks
  • Classifying and routing alerts to reduce noise
  • Turning rough notes into usable procedures

The key phrase is "first draft." AI produces output that needs a human who knows the system. If you expect it to do the job end-to-end without oversight, you're setting yourself up for a bad time.

Log Analysis Without the Eye Bleed

Here's a workflow I run weekly: take 72 hours of app logs from our staging environment, feed them through a lightweight Claude instance, and get a summary of what's actually breaking versus what's noise.

Before: I'd open Splunk or the ELK stack, poke around, try to remember the right query, and waste 20 minutes just finding the signal.

After: A script pulls the last 72 hours, sends it to the AI with a prompt that says "list the top 5 error patterns, group by frequency, ignore timeouts under 5 seconds, and flag anything that appeared in the last 24 hours but not before."

What I get back is a paragraph and a bullet list. I can scan it in 30 seconds and decide whether to dig deeper.

The tradeoff: The prompt matters. A vague prompt gives you vague results. I spent three iterations tuning mine to ignore the expected timeouts we know about and focus on new patterns. You will too. Budget an hour to tune this, not 10 minutes.
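A minimal sketch of what such a script could look like, not the script from this workflow. It assumes each log line starts with an ISO 8601 timestamp, and it takes the model call as a plain function (the hypothetical `send` parameter) so the sketch stays independent of any particular API client. The `known_noise` list is where the tuning described above would end up.

```python
from datetime import datetime, timedelta, timezone

# Prompt wording follows the workflow described above; known_noise is an
# illustrative addition for the expected patterns you tune out over time.
PROMPT_TEMPLATE = (
    "List the top 5 error patterns in the logs below, grouped by frequency. "
    "Ignore timeouts under 5 seconds. "
    "Flag anything that appeared in the last 24 hours but not before.\n"
    "Known noise to ignore: {known_noise}\n\n"
    "Logs:\n{logs}"
)


def recent_lines(lines, now, hours=72):
    """Keep lines whose leading ISO 8601 timestamp falls inside the window."""
    cutoff = now - timedelta(hours=hours)
    kept = []
    for line in lines:
        stamp = line.partition(" ")[0]
        try:
            when = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # no parseable timestamp (e.g. stack trace continuation)
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
        if when >= cutoff:
            kept.append(line)
    return kept


def summarize(lines, send, now=None, known_noise=()):
    """Build the prompt from the last 72 hours of logs and hand it to send()."""
    now = now or datetime.now(timezone.utc)
    prompt = PROMPT_TEMPLATE.format(
        known_noise=", ".join(known_noise) or "none",
        logs="\n".join(recent_lines(lines, now)),
    )
    return send(prompt)
```

In practice `send` would wrap whatever client you use to reach the model. Keeping it as a parameter also makes it easy to test the prompt with a stub before spending anything on API calls.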

What breaks: Logs contain sensitive data. They always do. You need to scrub PII and credentials before the logs leave your network. I use a simple sed script to hash IPs and strip obvious tokens before they go anywhere near an API. This takes 15 minutes to set up and saves you a compliance headache.
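A rough Python sketch of the same scrubbing idea (this is not the sed script itself). The regular expressions and the `salt` value are illustrative assumptions. They cover IPv4 addresses, bearer tokens, and simple `key=value` secrets only, so treat this as a starting point and extend it for whatever your logs actually contain.

```python
import hashlib
import re

# Illustrative patterns only; real logs will need more of these.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
BEARER = re.compile(r"(?i)\bbearer\s+[A-Za-z0-9._~+/=-]+")
SECRET_KV = re.compile(
    r"(?i)\b(password|passwd|secret|token|api[_-]?key)(\s*[=:]\s*)\S+"
)


def hash_ip(match, salt):
    """Replace an IP with a stable salted hash so identical IPs still correlate."""
    digest = hashlib.sha256((salt + match.group(0)).encode()).hexdigest()
    return "ip-" + digest[:12]


def scrub(line, salt="change-me"):
    """Hash IPv4 addresses and redact obvious credentials in one log line."""
    line = IPV4.sub(lambda m: hash_ip(m, salt), line)
    line = BEARER.sub("Bearer [REDACTED]", line)
    line = SECRET_KV.sub(r"\1\2[REDACTED]", line)
    return line
```

Hashing rather than deleting the IP keeps the ability to see that the same client shows up in several lines. The IPv4 space is small enough to brute-force, so the salt has to stay secret for the hash to mean anything. The IPv4 pattern will also catch four-part version strings such as 1.2.3.4, which is usually an acceptable cost.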

First-Draft Documentation from Configs

We have a mess of Ansible roles, Terraform modules, and random shell scripts that evolved over seven years. The documentation is either missing, wrong, or both.

I started feeding the actual config files into AI with a simple prompt: "Given this Terraform module, write a usage guide explaining what each variable does, what dependencies exist, and any known gotchas." The output isn't perfect. But it's 80% of the way to a usable doc.

This works because the AI can read code and extract meaning. It doesn't understand our business logic, but it can parse the variable names, the comments, and the structure to produce something a human can edit.

The catch: It makes things up. Not constantly, but enough that you cannot skip the review step. I've seen it invent module outputs that don't exist. Always verify against the actual code.
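One narrow part of that review can be automated. The sketch below checks output names claimed in a draft against the `output` blocks that actually exist in the module source. It is a regex over the text, not a real HCL parser, and it assumes you have already pulled the claimed names out of the draft by hand or by some other means. The function names are hypothetical.

```python
import re

# Matches lines like:  output "vpc_id" {
OUTPUT_DECL = re.compile(r'^\s*output\s+"([^"]+)"\s*\{', re.MULTILINE)


def declared_outputs(tf_source):
    """Return the set of output names declared in Terraform source text."""
    return set(OUTPUT_DECL.findall(tf_source))


def invented_outputs(tf_source, claimed):
    """Return claimed output names that the module does not declare."""
    return sorted(set(claimed) - declared_outputs(tf_source))
```

An empty result only means the names exist. It says nothing about whether the draft describes them correctly, so the human read-through still has to happen.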

Where this saves time: Onboarding new team members. Instead of explaining the provisioning process for the third time, I point them at the AI-generated draft and say "this is about 80% accurate, let me know what's wrong." Much faster than starting from zero.

Alert Routing That Isn't a Firehose

We get hundreds of alerts a day. Most of them are noise: transient failures, retry cycles, things that resolved themselves. Before AI, we had static rules: if the severity is X and the message matches pattern Y, page the on-call.

The problem: static rules don't learn. A new error pattern slips through until someone manually adds a rule.

I set up a workflow where incoming alerts get classified by AI before routing. The prompt says: "Classify this alert as critical/warning/info. If critical, explain why in one sentence. If it's a known transient pattern, say 'ignore—expected behavior.'" Then it routes accordingly.
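The classify-then-route step could be sketched like this. The prompt is a close paraphrase of the one above. The route names and the `classify` parameter (any function that sends a prompt to the model and returns its reply) are assumptions made for illustration. The one design choice worth copying is the fallback: a reply that cannot be parsed is treated as critical, so a confused model pages someone rather than silently dropping an alert.

```python
PROMPT = (
    "Classify this alert as critical/warning/info. "
    "If critical, explain why in one sentence. "
    "If it is a known transient pattern, say 'ignore: expected behavior'.\n\n"
    "Alert: {alert}"
)

# Hypothetical destinations; substitute your own paging and ticketing hooks.
ROUTES = {
    "critical": "page-oncall",
    "warning": "ticket-queue",
    "info": "log-only",
    "ignore": "drop",
}


def parse_label(reply):
    """Use the first word of the reply as the label; anything else is critical."""
    words = reply.strip().lower().split()
    first = words[0].strip(".,:;'\"") if words else ""
    return first if first in ROUTES else "critical"


def route_alert(alert, classify):
    """Classify one alert and return (label, destination)."""
    label = parse_label(classify(PROMPT.format(alert=alert)))
    return label, ROUTES[label]
```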

Results: Critical alert volume dropped about 40%. Not because the alerts went away, but because the AI correctly identified the ones that could wait until morning versus the ones that needed a 2 AM call.

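Failure mode: It misses things. A new kind of failure that doesn't resemble anything the model has seen, in its training or in your prompt, can get misclassified. You need a feedback loop: when the AI gets it wrong, you record the correction and feed it back in, for example as a labeled example in the prompt, so that future classifications improve. The model does not learn from your corrections on its own. Plan for this. You're building a learning system, not installing a magic box.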
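One simple way to build that feedback loop, sketched under assumptions: store each human correction as a JSON line, then prepend the most recent corrections to the classification prompt as labeled examples. The file layout, field names, and function names here are all hypothetical. This is few-shot prompting, not retraining, so the model itself does not change.

```python
import json
from pathlib import Path


def record_correction(path, alert, wrong_label, right_label):
    """Append one human correction to a JSON Lines file."""
    entry = {"alert": alert, "model_said": wrong_label, "correct": right_label}
    with Path(path).open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")


def load_examples(path, limit=20):
    """Return up to `limit` of the most recent corrections, oldest first."""
    file = Path(path)
    if not file.exists():
        return []
    lines = file.read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines if line.strip()][-limit:]


def examples_block(examples):
    """Render corrections as labeled examples to put ahead of the prompt."""
    if not examples:
        return ""
    rows = [f"Alert: {e['alert']}\nLabel: {e['correct']}" for e in examples]
    return "Previously corrected examples:\n" + "\n\n".join(rows) + "\n\n"
```

The `limit` keeps the prompt from growing without bound, which matters for both cost and context size.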

The Maintenance Reality Nobody Talks About

Here's what nobody tells you about AI workflows: they need maintenance.

Prompts drift. The language model updates and changes behavior. Your systems change and the old classification categories don't apply anymore. You will spend time, maybe an hour a week, keeping these workflows functional.
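One way to notice drift before it bites is a small regression check: keep a hand-labeled set of alerts and re-run it whenever the prompt or the model changes. This is a suggested technique, not something described above. The `classify` parameter is a hypothetical function that takes an alert and returns a label.

```python
def regression_report(golden, classify):
    """Compare classifier output with hand-labeled (alert, expected) pairs.

    Returns (accuracy, misses), where each miss is (alert, expected, got).
    """
    misses = []
    for alert, expected in golden:
        got = classify(alert)
        if got != expected:
            misses.append((alert, expected, got))
    accuracy = 1 - len(misses) / len(golden) if golden else 1.0
    return accuracy, misses
```

A few dozen labeled alerts is enough to catch a prompt change that quietly shifts "critical" toward "warning".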

This is normal. It's the same maintenance burden as any automation. But the AI hype makes people think they can set it and forget it. They can't.

Also: cost. API calls add up. A workflow that processes 10,000 logs a week might cost $50/month now. That's reasonable. But if you scale up to real-time processing across all systems, you're looking at real money. Track your usage. Set budgets. Don't let a runaway script burn through your budget over a weekend.
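A budget guard can be very small. The sketch below tracks estimated spend from token counts and raises once a limit is crossed, so a looping script stops instead of running all weekend. The per-million-token rates are placeholders you would fill in from your provider's current pricing, and the class and method names are hypothetical.

```python
class BudgetExceeded(RuntimeError):
    """Raised when recorded spend passes the configured limit."""


class BudgetGuard:
    def __init__(self, monthly_limit_usd, usd_per_million_input, usd_per_million_output):
        self.limit = monthly_limit_usd
        self.in_rate = usd_per_million_input    # placeholder rate, not a real price
        self.out_rate = usd_per_million_output  # placeholder rate, not a real price
        self.spent = 0.0

    def record(self, input_tokens, output_tokens):
        """Add the estimated cost of one call; raise if the limit is now exceeded."""
        cost = (input_tokens * self.in_rate + output_tokens * self.out_rate) / 1_000_000
        self.spent += cost
        if self.spent > self.limit:
            raise BudgetExceeded(
                f"spent ${self.spent:.2f} of ${self.limit:.2f} monthly budget"
            )
        return cost
```

Call `record` after every model call and let the exception propagate. This is a local estimate only, so it complements any spending limit your provider offers rather than replacing it.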

What I Would Do First

If you're new to AI in IT ops, start small. Pick one repetitive task you do manually and try to automate it.

Good candidates:

  • Weekly log review summaries
  • Generating config docs from code
  • Classifying a specific type of alert
  • Parsing vendor emails to extract relevant info

Bad candidates (don't start here):

  • Anything requiring real-time decision-making without oversight
  • Replacing your monitoring stack
  • "Just ask AI to manage my infrastructure"

Start with the log summary workflow I described. It's low-risk, gives you quick wins, and teaches you how to write prompts that actually work. Once you've got one workflow running reliably, expand from there.

The goal isn't to automate everything. It's to automate the stuff that wastes your time so you can focus on the work that actually needs a human who understands the system. That's what AI does well in IT ops—not magic, just grunt work at scale.