P0 through P3 aren't just labels. They're a contract with your stakeholders about response time and escalation. Most teams skip classification entirely and go straight to debugging. That's how P1s turn into P0s.
When a pipeline fails at 11 PM, the first instinct is to fix it. Open the logs, find the error, restart the job. Skip the preamble and get to the work.
That instinct is understandable. It's also how P1 incidents become P0 incidents.
What severity actually means
Severity is not about how hard the problem is to fix. It's about what the impact is while it's broken.
A P0 is an incident that is actively costing revenue, breaching a customer SLA, or exposing a data quality failure to end users. It requires an immediate response — everyone drops what they're doing.
A P1 is an incident that will breach an internal SLA if not resolved within 30 minutes. It requires a rapid response but not a fire drill.
A P2 is a non-critical delay — a pipeline is late, but no external commitment is at risk. It needs to be fixed, but it can wait for the morning.
A P3 is a low-priority issue. A deprecated pipeline failed. A report nobody reads is stale. Handle it during business hours.
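The four definitions above can be written down as an ordered set of checks. What follows is a minimal sketch, not a standard: the `Impact` fields and the `classify` function are hypothetical names made up for illustration, and the 30-minute cutoff is taken straight from the P1 definition. Treating an internal SLA that is more than 30 minutes from breaching as a P2 is one reading of the definitions, so adjust it to match your own.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    P0 = 0  # actively costing revenue, breaching a customer SLA, or showing bad data to users
    P1 = 1  # internal SLA will breach within 30 minutes
    P2 = 2  # late, but no external commitment at risk
    P3 = 3  # low priority, handle during business hours


@dataclass
class Impact:
    # Hypothetical fields: what the on-call engineer knows about the impact so far.
    costing_revenue: bool = False
    customer_sla_breached: bool = False
    bad_data_visible_to_users: bool = False
    # Minutes until an internal SLA breaches; None means no internal SLA is at risk.
    minutes_to_internal_sla_breach: Optional[float] = None
    pipeline_delayed: bool = False


def classify(impact: Impact) -> Severity:
    """Return the most severe level whose definition matches the impact."""
    if (
        impact.costing_revenue
        or impact.customer_sla_breached
        or impact.bad_data_visible_to_users
    ):
        return Severity.P0
    minutes = impact.minutes_to_internal_sla_breach
    if minutes is not None and minutes <= 30:
        return Severity.P1
    if impact.pipeline_delayed:
        return Severity.P2
    return Severity.P3
```

Note that nothing in `classify` asks how hard the fix is. The inputs describe impact only, which is the point of the definitions.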
Why teams skip it
Classification feels like bureaucracy when you're staring at a broken pipeline. The instinct is to skip directly to debugging and figure out the severity after the fact.
The problem is that severity determines everything that comes next:
- Who else needs to know. A P0 means stakeholders should be notified immediately. A P3 can wait for the next standup.
- Who to escalate to. P0 wakes up the data platform lead. P3 goes in the backlog.
- How much time you have. P1 means you have 30 minutes before an internal SLA is breached. P2 means you have time to think.
- Whether to backfill. P0 almost always requires backfill. P3 rarely does.
When you skip classification, you make all of these decisions implicitly — often incorrectly. A P1 gets treated like a P3 because the on-call engineer didn't want to escalate at midnight. The SLA breaches. Now it's a P0.
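One way to make those four decisions explicit is a small policy table keyed by severity. The sketch below is illustrative only: `ResponsePolicy`, `POLICY`, and `response_for` are hypothetical names, and any value marked "(assumed)" is a placeholder this article does not specify, so replace it with your team's real policy.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ResponsePolicy:
    notify: str                    # who else needs to know, and when
    escalate_to: str               # who gets pulled in
    minutes_to_act: Optional[int]  # None means no hard deadline
    backfill: str                  # how often a backfill is expected


# Values marked "(assumed)" are placeholders, not taken from the article.
POLICY = {
    "P0": ResponsePolicy("stakeholders, immediately", "data platform lead", 0, "almost always"),
    "P1": ResponsePolicy("team channel (assumed)", "secondary on-call (assumed)", 30, "case by case (assumed)"),
    "P2": ResponsePolicy("team channel, next morning (assumed)", "pipeline owner (assumed)", None, "case by case (assumed)"),
    "P3": ResponsePolicy("next standup", "backlog", None, "rarely"),
}


def response_for(severity: str) -> ResponsePolicy:
    """Look up the response for a severity; refuse to proceed without a valid one."""
    try:
        return POLICY[severity]
    except KeyError:
        raise ValueError(
            f"unknown severity {severity!r}: classify the incident before responding"
        ) from None
```

Because `response_for` raises on anything other than P0 through P3, the engineer has to pick a severity before the rest of the runbook applies, rather than deciding implicitly at midnight.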
Building classification into your runbooks
The fix is making severity classification the first step of every runbook — not a checkbox at the end.
Your runbook's priority level section should define exactly what makes something a P0 versus a P1 for that specific pipeline. Generic definitions don't work. "Revenue-impacting" means something different for a pipeline feeding the billing dashboard than for one feeding an internal analytics tool.
Specific, pipeline-level severity definitions mean that classification is a lookup, not a judgment call. That's how you get consistent responses at 2 AM when the engineer on call is half-asleep and hasn't seen this pipeline before.
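As a rough illustration of classification as a lookup, the table below maps a pipeline and an observed symptom to a severity. Everything in it is hypothetical: the pipeline names, the symptom labels, and the severities assigned to them are invented for the example and would come from each pipeline's own runbook in practice.

```python
# Hypothetical per-pipeline severity table, as it might appear at the top of a runbook.
RUNBOOK_SEVERITY = {
    "billing_dashboard": {"failed": "P0", "late": "P1"},
    "internal_analytics": {"failed": "P2", "late": "P3"},
}


def lookup_severity(pipeline: str, symptom: str) -> str:
    """Return the predefined severity for a pipeline symptom, with no judgment call."""
    rules = RUNBOOK_SEVERITY.get(pipeline)
    if rules is None:
        raise KeyError(f"no severity table defined for pipeline {pipeline!r}")
    if symptom not in rules:
        raise KeyError(f"pipeline {pipeline!r} has no rule for symptom {symptom!r}")
    return rules[symptom]
```

The same symptom resolves differently per pipeline: `lookup_severity("billing_dashboard", "failed")` returns `"P0"`, while `lookup_severity("internal_analytics", "failed")` returns `"P2"`. A missing entry raises an error instead of defaulting to a guess, which surfaces gaps in the runbook before an incident does.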