
Runbooks rot.
Here's how to keep them alive.

A runbook that was accurate eight months ago but now references a deprecated tool and an engineer who has left is worse than no runbook, because it gives false confidence. Here are three practices that keep runbooks from going stale.

The runbook said to contact Sarah in the escalation path. Sarah left eight months ago. The runbook said to check the Airflow 1.x UI at a URL that no longer exists. The runbook said the pipeline ran on a schedule that changed in Q2.

The engineer following the runbook at 2 AM trusted it. They spent 40 minutes trying to reach Sarah before calling the right person. They spent 20 minutes on a URL that returned a 404.

A stale runbook doesn't fail loudly. It fails by consuming time and trust at the worst possible moment.

Why runbooks rot

Runbooks decay for the same reason all documentation decays: systems change continuously, but documentation updates require deliberate effort that competes with everything else.

The three fastest decay vectors:

Personnel changes. Escalation paths reference people by name. When those people change roles, leave the company, or change their contact information, the runbook becomes a liability. This is the most common cause of runbook rot because team composition changes are frequent and rarely trigger a documentation update.

Tool and infrastructure changes. Runbooks reference specific URLs, tools, and commands. A cloud migration, a tool upgrade, or an infrastructure change can invalidate dozens of procedural steps at once. These changes are high-impact, but the runbooks they affect are often left untouched.

Pipeline changes. Schedules change. Schemas change. Upstream sources change. Dependencies change. A runbook written for a pipeline at v1 may be misleading for a pipeline at v4.

Three practices that actually work

1. Incident-driven updates. After every incident, update the runbook as part of the post-incident review. Not the entire runbook, just the sections that were used, and specifically the parts that were wrong, missing, or unclear. This is the most sustainable approach because it ties documentation effort to demonstrated gaps.

2. Quarterly personnel audits. Once a quarter, do a five-minute scan of every escalation path in every runbook. Verify that the people named are still in the right roles, their contact information is current, and their responsibilities haven't changed. This is low effort and prevents the most common failure mode.

3. Change-triggered reviews. Any time a significant infrastructure change is made — a migration, a tool upgrade, a schema change — include a runbook review in the change checklist. The person making the change is the one who knows what the runbook needs to reflect. Attach it to the ticket.
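
The quarterly audit in practice 2 is mechanical enough to script. Here is a minimal sketch in Python, assuming you can export each runbook's escalation path as (name, expected role) pairs and that you have some staff directory to compare against. The function name `audit_escalation_paths`, the data shapes, and the sample names are all hypothetical and not tied to any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StaleContact:
    runbook: str
    name: str
    reason: str


def audit_escalation_paths(runbooks, directory):
    """Return every escalation contact that no longer matches the directory.

    runbooks:  dict of runbook name -> list of (person, expected_role) pairs
               (assumed export format)
    directory: dict of person -> current role; people who left are absent
    """
    findings = []
    for runbook, contacts in sorted(runbooks.items()):
        for name, expected_role in contacts:
            current_role = directory.get(name)
            if current_role is None:
                # The person is gone entirely: the "contact Sarah" failure.
                findings.append(StaleContact(runbook, name, "not in directory"))
            elif current_role != expected_role:
                # Still at the company, but no longer the right person to page.
                findings.append(
                    StaleContact(
                        runbook,
                        name,
                        f"role is now {current_role!r}, "
                        f"runbook expects {expected_role!r}",
                    )
                )
    return findings


if __name__ == "__main__":
    # Hypothetical sample data for illustration only.
    runbooks = {
        "nightly-etl": [("Sarah", "pipeline owner"), ("Omar", "on-call lead")],
    }
    directory = {"Omar": "on-call lead"}
    for finding in audit_escalation_paths(runbooks, directory):
        print(f"{finding.runbook}: {finding.name} ({finding.reason})")
```

An empty result means every named contact still matches the directory. It does not prove the phone numbers work, so a quick human check is still worth the five minutes.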

The version history signal

One of the most reliable signals that a runbook is stale is its version history. If a runbook hasn't been updated in six months but its pipeline has been modified three times, something is wrong. The version history tells you when to be skeptical.
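
That rule of thumb can be written down directly. The sketch below assumes you can get the runbook's last-update date and the dates of pipeline changes from somewhere, such as version control or a deploy log. The name `assess_staleness` and the default thresholds (180 days, three changes) are illustrative assumptions taken from the example above, not fixed rules.

```python
from datetime import date
from typing import NamedTuple


class Staleness(NamedTuple):
    age_days: int
    changes_since_update: int
    suspect: bool


def assess_staleness(runbook_updated, pipeline_changes, today,
                     max_age_days=180, max_changes=3):
    """Flag a runbook that is old AND has missed several pipeline changes."""
    age_days = (today - runbook_updated).days
    # Only changes made after the last runbook update can be unreflected.
    changes = sum(1 for changed in pipeline_changes if changed > runbook_updated)
    suspect = age_days >= max_age_days and changes >= max_changes
    return Staleness(age_days, changes, suspect)


if __name__ == "__main__":
    # Hypothetical dates for illustration only.
    result = assess_staleness(
        runbook_updated=date(2024, 1, 10),
        pipeline_changes=[date(2024, 2, 1), date(2024, 4, 15), date(2024, 6, 3)],
        today=date(2024, 8, 1),
    )
    print(result)
```

Requiring both conditions is a deliberate choice in this sketch: an old runbook for a pipeline that never changes may still be accurate, and a recently updated one has had a chance to absorb recent changes.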

This is why ShieldSet tracks every runbook update with a version number, a timestamp, and a change note. When you open a runbook and see that it was last updated two weeks ago by the engineer who owns the pipeline, you can trust it. When you see it was last updated fourteen months ago, you know to verify before you act.

Documentation that tells you how old it is and who wrote it is documentation you can calibrate against. Documentation with no history is documentation you can't trust.
