If your new hire can't resolve a P0 incident using only the documentation in their first month, that's not a problem with the hire. It's a problem with the documentation. Here's how to fix it.
There's a test I give every data team I work with: hand your runbooks to a new hire on day three and ask them to walk through a P1 incident. Not fix it — walk through it. Describe what they'd check, who they'd notify, what commands they'd run.
Most teams fail this test. Not because their engineers are bad. Because their documentation assumes too much.
The senior bottleneck
In most data teams, there are one or two engineers who are the institutional memory. They know which pipelines are fragile, which upstream sources are unreliable, which credentials expire and when. When something breaks, the page goes to them, even at 2 AM, even when they're on vacation.
This isn't a people problem. It's a documentation problem.
The senior engineer knows all of this because they built the system or they've fixed it before. That knowledge lives in their head because writing it down takes time and there's always something more urgent.
The result is a single point of failure, and that single point is a human being.
What "new engineer can fix it" actually requires
For a new engineer to resolve an incident without escalation, the runbook needs to cover:
Context they don't have. What does this pipeline do? What's its business purpose? Who cares about it? This seems obvious to the team but is opaque to a new hire.
Credentials and access. Where are the credentials? What tool do they use? What permissions do they need? A new engineer shouldn't have to ask three people where the Snowflake service account is stored.
Diagnostic steps. Not "check the logs" — which logs? In which tool? What are you looking for? What does a healthy result look like versus a broken one?
Decision criteria. When do you escalate? At what point does this become a P0? Who do you call and what context do they need?
Commands, not concepts. The runbook should include the actual command, not a description of what the command does. Approximate knowledge isn't enough at 2 AM.
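One way to make these requirements checkable is to treat the runbook as structured data and look for empty fields. The sketch below is a minimal illustration, not a prescribed format: `Runbook`, `DiagnosticStep`, `find_gaps`, the `orders_daily` pipeline, and the log path are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class DiagnosticStep:
    description: str          # what to check, and in which tool
    command: str = ""         # the exact command to run, ready to paste
    healthy_output: str = ""  # what a healthy result looks like


@dataclass
class Runbook:
    pipeline: str
    business_context: str = ""      # what the pipeline does and who cares about it
    credentials_location: str = ""  # where the credentials live (never the secret itself)
    escalation_criteria: str = ""   # when to escalate, who to call, what context they need
    steps: list[DiagnosticStep] = field(default_factory=list)


def find_gaps(runbook: Runbook) -> list[str]:
    """Return a human-readable list of what a new hire would have to ask about."""
    gaps = []
    if not runbook.business_context.strip():
        gaps.append("missing business context")
    if not runbook.credentials_location.strip():
        gaps.append("missing credentials location")
    if not runbook.escalation_criteria.strip():
        gaps.append("missing escalation criteria")
    if not runbook.steps:
        gaps.append("no diagnostic steps")
    for number, step in enumerate(runbook.steps, start=1):
        if not step.command.strip():
            gaps.append(f"step {number} has no command: {step.description!r}")
        if not step.healthy_output.strip():
            gaps.append(f"step {number} does not say what healthy looks like")
    return gaps


if __name__ == "__main__":
    # Hypothetical runbook: one concrete step, one vague "check the logs" step.
    draft = Runbook(
        pipeline="orders_daily",
        business_context="Feeds the finance dashboard each morning.",
        steps=[
            DiagnosticStep(
                description="Count errors in the latest run log",
                command="grep -c ERROR /var/log/orders_daily/latest.log",  # hypothetical path
                healthy_output="0",
            ),
            DiagnosticStep(description="check the logs"),
        ],
    )
    for gap in find_gaps(draft):
        print(gap)
```

A check like this cannot judge whether a command is correct or a description is clear; it only catches the sections that were never written. The day-three walkthrough with a real new hire is still the actual test.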
The documentation payoff
A runbook that a new engineer can follow on day three has two effects:
First, it compresses onboarding time. New engineers become productive on-call contributors in weeks instead of months. The senior engineer stops being woken up for incidents that thirty minutes of documentation would have let someone else handle.
Second, it surfaces gaps. Writing a runbook that a new engineer can follow forces you to make implicit knowledge explicit. Gaps in the runbook are gaps in the system's operability. That's useful information.
If your new hire needs to ask a senior to resolve a P0, the runbook isn't finished yet.