Robert Truesdale

Automation That Breaks at 2am: The Parts Nobody Talks About

You wrote the script. You set up the cron job. You added an alert. Done, right?

Wrong. The problems start three months later when the script silently fails, nobody notices for a week, and you're explaining to leadership why the data hasn't synced since Tuesday. Or worse—the alert fires at 3am, you wake up, scramble to log in, and have no idea what the hell you're looking at.

Most automation writeups cover the fun part: getting the thing to run. They skip the operational reality—the stuff that matters when things break and you're the one getting the call. Here's what actually matters.

What You're Actually Automating (And What Breaks)

Before you schedule anything, ask yourself: what does success look like, and how will I know it failed?

A script that runs is not automation. Automation is a process with inputs, outputs, failure modes, and a human who knows what to do when it breaks. The script is 20% of the work. The other 80% is everything that happens around it.

Example: I set up a Python script to pull analytics from three different APIs and dump them into a PostgreSQL database for a content site. Took maybe four hours to write. Then I spent another two days figuring out what happens when one API times out, when the database connection drops, when the API rate limits kick in, when the data format changes. The original script was the easy part.
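
A minimal sketch of the timeout case from that example, assuming each API call can be wrapped in a zero-argument function. The names call_with_retry and fetch, and the retry settings, are hypothetical and not taken from the original script. Rate limits and format changes would need their own handling.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    fetch: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(), retrying transient errors with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except retry_on:
            if attempt == attempts:
                # Out of retries: re-raise so the job fails loudly.
                raise
            # Wait base_delay, then 2x, then 4x, and so on.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Only errors listed in retry_on are retried. Anything else, such as a parsing error from a changed data format, propagates immediately, which is usually what you want: retrying will not fix a format change.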

Logging: The Thing Everyone Skips Until They Need It

If your script runs and writes nothing to a log, it might as well not have run. You have no way to debug, no way to audit, no way to prove anything happened when someone asks why the data is missing.

At minimum, every automated job needs:

  • A log file with timestamps
  • Input parameters (what did this run try to do?)
  • Output summary (how many records processed?)
  • Error messages with stack traces
  • A way to know if it ran at all

You don't need Splunk or Datadog for this. A simple log file with rotation via logrotate works fine for most stuff. The point is having something that captures what happened, when, and why it might have failed.
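
One way to cover that checklist with nothing but Python's standard library, as a rough sketch. The helper names build_job_logger and run_logged are made up for illustration, and the job is assumed to return the number of records it processed. WatchedFileHandler is used because it reopens the log file after logrotate moves it (it is intended for Unix-like systems).

```python
import logging
import logging.handlers
from typing import Callable


def build_job_logger(name: str, log_path: str) -> logging.Logger:
    """Return a logger that writes timestamped lines to log_path."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid stacking duplicate handlers
        handler = logging.handlers.WatchedFileHandler(log_path)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


def run_logged(logger: logging.Logger, job: Callable[..., int], **params) -> int:
    """Run job(**params) and log inputs, outcome, and any traceback.

    Returns a process exit code: 0 on success, 1 on failure.
    """
    logger.info("start params=%r", params)  # proof that the job ran at all
    try:
        processed = job(**params)
    except Exception:
        # logger.exception records the full stack trace.
        logger.exception("failed")
        return 1
    logger.info("done records_processed=%d", processed)
    return 0
```

In a real script you would pass the return value to sys.exit() so cron and any wrapper see a nonzero exit code on failure.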

Failure mode: I once inherited a cron job that had been "running fine" for two years. It hadn't actually worked in eight months. No errors were being reported because the script was catching every exception and logging nothing. The "success" was just the absence of complaints.

Alerts Are Easy to Create, Hard to Do Right

You can set up an alert in five minutes. Making it useful is harder.

The biggest mistake is alerting only on "job didn't run." On its own, that tells you very little. What you also need to catch is "job ran and failed" or "job ran and produced unexpected results."

Example: a backup script that runs every night. Alerting on "backup didn't run" tells you nothing about whether the backup succeeded. You need to alert on "backup completed but file size is suspiciously small" or "backup failed with error code X."
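
The size check could look something like this. It is a sketch under assumptions: check_backup and the min_bytes threshold are hypothetical, and the right threshold depends on what a normal backup looks like for you.

```python
import os
from typing import Optional


def check_backup(path: str, min_bytes: int) -> Optional[str]:
    """Return a description of the problem, or None if the backup looks plausible."""
    if not os.path.isfile(path):
        return f"backup file missing: {path}"
    size = os.path.getsize(path)
    if size < min_bytes:
        return f"backup suspiciously small: {size} bytes, expected at least {min_bytes}"
    return None
```

A non-None result is the thing worth alerting on, and the message already says what was wrong.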

This is where most monitoring setups fall apart. You get alert fatigue, start ignoring everything, and then miss the one alert that mattered.

Rule of thumb: if you can't explain what action the recipient should take when the alert fires, don't send it. "Something might be wrong" is not an alert. It's a note you should have written in a ticket, not a page that wakes someone up at 2am.
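
If you want to enforce that rule in code rather than by convention, one option is to make the action a required field. This is only an illustration; the Alert class and its fields are invented here and do not correspond to any particular alerting tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    summary: str        # what happened
    action: str         # what the recipient should do about it
    page: bool = False  # True only if it is worth waking someone up

    def __post_init__(self) -> None:
        if not self.summary.strip():
            raise ValueError("alert needs a summary")
        if not self.action.strip():
            raise ValueError("alert has no action; write a ticket instead")

    def render(self) -> str:
        prefix = "PAGE" if self.page else "TICKET"
        return f"[{prefix}] {self.summary} | Action: {self.action}"
```

An alert with an empty action cannot be constructed, so "something might be wrong" never reaches a pager through this path.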

The Maintenance Reality Nobody Mentions

Your cron job isn't a one-time setup. It's a long-term commitment.

Things that will break:

  • API credentials expire and nobody tells you
  • The third-party API changes response format and your parser chokes
  • The server runs out of disk space and logs stop being written
  • Someone updates the underlying OS and your Python version is no longer supported
  • The service account loses permissions because IT policy changed
  • Time zones—seriously, time zones will bite you
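
A couple of those can be caught before the job does any work. Here is a small preflight sketch covering disk space and credential expiry. The function name, the thresholds, and the idea that you know the token's expiry time are all assumptions. Requiring timezone-aware datetimes is one way to keep the time zone problem from creeping in.

```python
import shutil
from datetime import datetime, timedelta, timezone
from typing import List, Optional


def preflight_problems(
    log_dir: str,
    min_free_bytes: int,
    token_expires_at: datetime,
    now: Optional[datetime] = None,
    warn_within: timedelta = timedelta(days=7),
) -> List[str]:
    """Return a list of problems found before the job starts (empty if none)."""
    if token_expires_at.tzinfo is None:
        # Naive datetimes are where time zone bugs start.
        raise ValueError("token_expires_at must be timezone-aware")
    now = now or datetime.now(timezone.utc)
    problems = []

    free = shutil.disk_usage(log_dir).free
    if free < min_free_bytes:
        problems.append(f"low disk space in {log_dir}: {free} bytes free")

    if token_expires_at - now <= warn_within:
        problems.append(f"API token expires at {token_expires_at.isoformat()}")

    return problems
```

This does nothing for format changes or lost permissions. Those still show up as runtime failures, which is why the logging and alerting matter.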

You need a maintenance plan, not just a script. That means:

  • Document what the job does and who owns it
  • Schedule regular reviews (quarterly is reasonable)
  • Record where credentials are stored and who has access
  • Have a manual fallback procedure

Failure mode: I saw a critical data sync job break because its API token, which lived in a config file, was rotated per company policy and the file was never updated. Nobody knew the job existed. It took two weeks to notice the data was stale.

Testing Automation Is Different

You can't just run python script.py once and call it tested. You need to verify:

  • It handles the expected inputs without crashing
  • It fails gracefully on bad inputs
  • It produces the expected outputs
  • It works when the network is slow
  • It works when the external service is down
  • It cleans up after itself
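
The "external service is down" case is easy to test if the job takes its dependencies as arguments. A rough sketch using the standard unittest module follows. The sync_once function is a stand-in for your job body, not a real one.

```python
import unittest
from typing import Callable, Iterable


def sync_once(fetch: Callable[[], Iterable[dict]], store: Callable[[dict], None]) -> bool:
    """Hypothetical job body: fetch rows and store them. Returns True on success."""
    try:
        rows = list(fetch())
    except ConnectionError:
        # Upstream is down: store nothing and report failure.
        return False
    for row in rows:
        store(row)
    return True


class SyncOnceTests(unittest.TestCase):
    def test_stores_rows_on_success(self):
        stored = []
        self.assertTrue(sync_once(lambda: [{"id": 1}], stored.append))
        self.assertEqual(stored, [{"id": 1}])

    def test_service_down_fails_without_partial_writes(self):
        def down():
            raise ConnectionError("service unavailable")

        stored = []
        self.assertFalse(sync_once(down, stored.append))
        self.assertEqual(stored, [])
```

Saved in a test file, this runs with python -m unittest. The same pattern extends to slow networks and bad inputs by swapping in a different fake fetch.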

For anything running in production, I usually run it manually at least once with real credentials before trusting it to cron. Then I monitor the first few runs closely.

The trap is treating "it didn't error" as success. A script can run without errors and still produce garbage output. Check the output. Verify the side effects. Confirm the downstream process saw the data.
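
Checking the output can be as simple as a few sanity rules applied to the rows the job produced. A sketch, assuming the output is a list of dictionaries; output_problems, required_fields, and min_rows are illustrative names.

```python
from typing import Iterable, List, Mapping, Sequence


def output_problems(
    rows: Iterable[Mapping[str, object]],
    required_fields: Sequence[str],
    min_rows: int = 1,
) -> List[str]:
    """Return a list of reasons the output looks wrong (empty if it looks fine)."""
    rows = list(rows)
    problems = []
    if len(rows) < min_rows:
        problems.append(f"expected at least {min_rows} rows, got {len(rows)}")
    for i, row in enumerate(rows):
        # A field counts as missing if it is absent, None, or an empty string.
        missing = [f for f in required_fields if row.get(f) in (None, "")]
        if missing:
            problems.append(f"row {i} missing fields: {missing}")
    return problems
```

A run that exits cleanly but returns a non-empty list here is exactly the "no errors, garbage output" case.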

What I Would Do First

If you're setting up a new cron job or automation:

  • Write the script, but put logging in from the start—don't add it later
  • Run it manually and verify the output is what you expect
  • Add it to cron with a schedule
  • Set up an alert for "job failed" that actually fires when it fails—not just when it doesn't run
  • Document what it does, where it logs, what credentials it uses, and who to wake up if it breaks
  • Put a calendar reminder to review it in three months

If you're auditing existing jobs (and you should be):

  • Check the logs. Are they being written? Are they being rotated?
  • Check the last run time. Is it actually running?
  • Check the outputs. Does the data look right?
  • Check the alerts. Do they still make sense?
  • Find the owner. If there's no owner, that's your first problem
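
For the "is it actually running" check, a quick first pass is to look at when the job's log file was last modified. This is a sketch with made-up names, and it assumes the job writes to its log on every run.

```python
import os
import time
from typing import Optional


def seconds_since_modified(path: str, now: Optional[float] = None) -> Optional[float]:
    """Return the age of the file in seconds, or None if it does not exist."""
    try:
        mtime = os.path.getmtime(path)
    except FileNotFoundError:
        return None
    current = time.time() if now is None else now
    return current - mtime


def is_stale(path: str, max_age_seconds: float, now: Optional[float] = None) -> bool:
    """True if the file is missing or has not been touched within max_age_seconds."""
    age = seconds_since_modified(path, now)
    return age is None or age > max_age_seconds
```

For a nightly job, a max_age_seconds a bit over 24 hours is a reasonable starting point. A stale or missing log only tells you where to start looking, not what broke.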

The part everyone forgets is that automation is an operational commitment, not a one-time project. The script is the easy part. The rest is where you earn your keep.