Why Automation Breaks (And the Boring Checks That Save You)

Automation is great until it isn't. That's the part nobody tells you when they're selling you on the promise of pipelines that run themselves. I've been running IT operations for two decades, and I've seen more automation fail at 2am than I care to count. Not because the tools are bad—most of them are fine. But because automation lives in a world where things change, and nobody wrote a script to handle the thing that just changed.

Here's what I've learned: the boring checks are what save you. Not the clever automation. Not the AI wrapper. The boring checks.

What Actually Breaks Automation

Let me tell you about the Jenkins job that ran every night for three years without issue. It processed billing files, pushed data to the database, generated reports. Beautiful. Then one Tuesday the API endpoint it called changed its response format. No warning. No changelog posted. The job "succeeded" because the exit code was 0—it just wrote garbage to the database for six hours before someone noticed.

That's the thing about automation: it does exactly what you told it to do, even when that's the wrong thing.

Common failure modes I've seen:

Credentials expire and nobody notices until auth fails at 3am
API rate limits get hit and the script hangs silently
Disk fills up and the automation just stops logging anything useful
Dependencies update and break backward compatibility
Network blips cause partial runs that look successful

The automation doesn't tell you it failed. It tells you it ran. That's not the same thing.
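
One way to make "it ran" and "it worked" at least a little closer together is to wrap each step so that a hang or a non-zero exit gets reported as a failure instead of vanishing. This is a rough Python sketch, not a drop-in tool: the names run_step and StepResult are made up for illustration, and the timeout is whatever you decide is too long for that step.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class StepResult:
    ok: bool
    reason: str


def run_step(cmd: list[str], timeout_s: float) -> StepResult:
    """Run one command; report a hang or a non-zero exit as a failure."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left hanging.
        return StepResult(False, f"timed out after {timeout_s}s")
    if proc.returncode != 0:
        return StepResult(False, f"exit code {proc.returncode}: {proc.stderr.strip()}")
    return StepResult(True, "exit code 0")


if __name__ == "__main__":
    # A child process that sleeps too long stands in for a hung script.
    hung = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_step(hung, timeout_s=1))
```

Keep in mind that an exit code of 0 is necessary, not sufficient. The Jenkins job above would have passed this wrapper; you still have to check what the step produced.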

The Monitoring Trap

Everyone says "you need monitoring." They're not wrong, but they're not right either. The problem is most monitoring is built for the happy path. It alerts you when something is down. It doesn't alert you when something is quietly wrong.

I learned this the hard way with a content pipeline I built for a site that needed daily updates. The automation pulled data from three different sources, merged it, formatted it, and pushed it live. For months it worked perfectly. Then one source started returning null values for about 30% of records. The pipeline didn't crash—it just propagated nulls to the live site. I got a call from the content team asking why the site looked "broken."

The monitoring was watching for failures. It wasn't watching for wrongness.

Now I check for three things that monitoring tools rarely catch: data shape (is the output what I expect?), data volume (did I get roughly the same number of records as yesterday?), and data quality (are there obvious anomalies like nulls in critical fields?).
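
Here's a minimal sketch of those three checks in Python. The function name, the field names, and both thresholds (a 50% volume drop, a 5% null rate) are assumptions for illustration; tune them to your own data.

```python
from typing import Any


def check_output(
    records: list[dict[str, Any]],
    required_fields: list[str],
    yesterday_count: int,
    max_volume_drop: float = 0.5,  # assumed threshold
    max_null_rate: float = 0.05,   # assumed threshold
) -> list[str]:
    """Return a list of problems; an empty list means the output looks sane."""
    problems: list[str] = []

    # Shape: every record should carry every field we expect.
    malformed = sum(1 for r in records if any(f not in r for f in required_fields))
    if malformed:
        problems.append(f"shape: {malformed} record(s) missing required fields")

    # Volume: compare today's count to yesterday's.
    if yesterday_count > 0:
        drop = 1 - len(records) / yesterday_count
        if drop > max_volume_drop:
            problems.append(
                f"volume: {len(records)} records today vs {yesterday_count} yesterday"
            )

    # Quality: nulls in critical fields (a missing field counts as null here).
    if records:
        for field in required_fields:
            nulls = sum(1 for r in records if r.get(field) is None)
            rate = nulls / len(records)
            if rate > max_null_rate:
                problems.append(f"quality: {field} is null in {rate:.0%} of records")

    return problems
```

If the list comes back non-empty, fail the run with a non-zero exit code before anything gets published. A source that returns nulls for 30% of records would trip the quality check on the first bad run.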

The Maintenance Tax You Didn't Budget For

Here's something that never appears in the automation sales pitch: everything breaks eventually, and everything needs tending.

I ran an automated backup system for years. Solid. Reliable. Then one day the backup drive ran out of space under the retention policy I'd set up in 2026, and the backups started failing silently. The system kept trying to write backups, failing, and moving on. We only discovered it when someone actually needed a restore and found nothing from the past three months.

The fix was simple—add a check for disk space before each backup run. But nobody wrote that check because the original developer assumed there'd always be space. Assumptions are the enemy.
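
A disk space check like that can be a few lines. This sketch uses Python's standard shutil.disk_usage; the function name and the 10% safety margin are my assumptions, and you'd have to estimate the backup size yourself.

```python
import shutil


def has_enough_space(path: str, required_bytes: int, safety_margin: float = 0.10) -> bool:
    """True if `path` can hold the backup and still keep a reserve of total capacity."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    reserve = int(usage.total * safety_margin)  # assumed: keep 10% of the drive free
    return usage.free - reserve >= required_bytes
```

Call it before the backup starts, and if it returns False, exit non-zero and alert someone. A backup that refuses to run is annoying. A backup that pretends to run is worse.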

Every piece of automation has a maintenance tax. You pay it in time, debugging, and occasional panic. Budget for it.

The API Problem Nobody Talks About

If your automation talks to an API, you have a dependency problem. APIs change. They deprecate endpoints. They add rate limits. They require new authentication methods. And the change almost never breaks your automation cleanly; more often it fails quietly.

I had a script that pulled inventory data from a vendor API. Worked for a year. Then the vendor added OAuth2 requirements and sent an email about it—buried in a newsletter, not a direct notification. The script stopped working and nobody noticed for two weeks because it was a "low priority" process.

Now I build in three things for any API-dependent automation:

Health checks that verify auth is still valid before each run
Version pings that alert if the API version changes
Fallback behavior that stops the run gracefully instead of limping along with stale data
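
Those three pieces can be sketched as a preflight step that runs before the job touches any data. Everything named here is hypothetical: check_auth and get_api_version stand in for whatever calls your vendor's API actually offers, and the version strings are placeholders.

```python
from typing import Callable


class PreflightError(RuntimeError):
    """Raised when the run should stop before touching any data."""


def preflight(
    check_auth: Callable[[], bool],      # hypothetical: True if credentials still work
    get_api_version: Callable[[], str],  # hypothetical: version the API reports
    expected_version: str,
) -> None:
    if not check_auth():
        raise PreflightError("auth check failed; credentials may have expired or changed")
    version = get_api_version()
    if version != expected_version:
        raise PreflightError(f"API version is {version!r}, expected {expected_version!r}")


def run_job(
    job: Callable[[], None],
    check_auth: Callable[[], bool],
    get_api_version: Callable[[], str],
    expected_version: str,
) -> bool:
    """Run the job only if preflight passes. Returns True if the job ran."""
    try:
        preflight(check_auth, get_api_version, expected_version)
    except PreflightError as exc:
        # Stop cleanly and loudly instead of continuing with stale data.
        print(f"ABORTED: {exc}")
        return False
    job()
    return True
```

In a real job the print would be an alert, and a False return would become a non-zero exit code so the scheduler marks the run as failed.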

Yes, this adds code. Yes, it takes longer. But it's faster than explaining to management why we published stale inventory for a week.

The Logging Lesson That Cost Me a Weekend

I once spent an entire weekend debugging a failed automation that ran on a schedule I couldn't access. The script was supposed to trigger at 6pm, process some data, and exit. It was exiting. But it wasn't processing anything.

The logs said "completed successfully." That was useless.

The problem was I was logging the start and the end, but nothing in between. I had no idea where it was failing or why. Turns out it was hitting a validation check I didn't know existed and silently skipping all the work.

Now I log aggressively. I log entering each major section. I log the count of records before and after processing. I log anything that could possibly matter, because when something breaks at 2am, the only thing you have is those logs.

The rule is simple: if you can't tell what happened from the logs, you don't have logs—you have a false sense of security.
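
Here's roughly what that looks like with Python's standard logging module. The logger name, the record fields, and the is_valid rule are invented for the example; the point is the counts going in and coming out, plus a line for every skip.

```python
import logging

logger = logging.getLogger("nightly_job")  # hypothetical job name


def is_valid(record: dict) -> bool:
    # Hypothetical validation rule: a record needs an amount.
    return record.get("amount") is not None


def process(records: list[dict]) -> list[dict]:
    logger.info("process: start, %d records in", len(records))

    kept, skipped = [], 0
    for record in records:
        if not is_valid(record):
            skipped += 1
            logger.warning("process: skipping record %r (failed validation)", record.get("id"))
            continue
        kept.append(record)

    logger.info("process: done, %d records out, %d skipped", len(kept), skipped)
    if records and not kept:
        # "Completed successfully" with zero output is a failure, so say so.
        logger.error("process: every record was skipped; treating this as a failure")
    return kept
```

With this in place, a run that skips everything leaves a trail showing how many records went in, how many came out, and which ones were dropped, instead of a bare "completed successfully."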

What I Would Do First

If you're looking at your automation setup right now and wondering what to check, here's where I'd start:

Verify your automation actually ran — Check the timestamps. Check the exit codes. Don't assume because nothing exploded that everything worked.

Check your credentials — Expiration dates, permissions, access tokens. All of it. Do this before anything else because this is the most common silent failure.

Look at your output — Not just "did it complete" but "is the output correct?" Run a spot check. Compare today's output to yesterday's. Look for volume drops, null values, or format changes.

Check disk and API limits — Disk space, API rate limits, memory usage. The boring stuff. This is what kills automation at the worst time.

Read your logs — Actually read them. Not for errors, but for gaps. Missing entries. Silent skips. Anything that looks off.

Test your recovery — Can you actually restore from your backups? Can you manually run the process if automation fails? If you don't know, you don't have a backup plan—you have a hope.

These checks take maybe 15 minutes a week. That trade-off is worth making: 15 minutes of boring verification versus hours of fixing something that broke silently.
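
The first two checks on that list are easy to script. A rough sketch, assuming you can get a last-success timestamp and a credential expiry date from somewhere (a marker file, token metadata, a certificate); both function names are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_success: datetime, now: datetime, max_age: timedelta) -> bool:
    """True if the last successful run is older than we're willing to accept."""
    return now - last_success > max_age


def days_until_expiry(expires_at: datetime, now: datetime) -> int:
    """Whole days left before a credential expires (negative if already expired)."""
    return (expires_at - now).days


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Placeholder values; in practice these come from your own systems.
    last_success = now - timedelta(hours=30)
    expires_at = now + timedelta(days=5)
    if is_stale(last_success, now, max_age=timedelta(hours=26)):
        print("WARNING: nightly job has not succeeded in over 26 hours")
    if days_until_expiry(expires_at, now) < 14:
        print("WARNING: credential expires in under 14 days")
```

The 26-hour and 14-day thresholds are arbitrary. Pick numbers that give you time to react before the 3am failure.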

Automation is useful. It's worth building. But it's not set-and-forget. Nothing is. The boring checks aren't sexy, but they're what keep you from explaining to your boss why the system that was "running itself" just served garbage to customers for six hours.

That's the real automation lesson. The script does what you tell it. You'd better be telling it the right things.