Before you start debugging, you need to know what's downstream. Most engineers jump straight to root cause, which is how they end up fixing the pipeline but missing the backfill that downstream dashboards needed.
The pipeline broke at 3 AM. The on-call engineer found the bug, fixed the code, restarted the job, and went back to sleep. Incident resolved.
Except it wasn't. Three dashboards that depended on that pipeline had been serving stale data for six hours. Two of them were used in the 8 AM executive review. Nobody backfilled. Nobody notified anyone.
The fix took 45 minutes. The trust damage took months.
The blast radius question
The first question in any data incident isn't "what broke?" — it's "what's downstream?"
Blast radius is the set of systems, reports, and stakeholders affected by a pipeline failure. Understanding it before you start debugging determines how urgently you need to fix this, who you need to notify, and what cleanup is required after the fix.
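If you have lineage in any machine-readable form, "what's downstream?" becomes a graph traversal. Here is a minimal sketch, assuming a hand-written dependency map; the asset names and the `DOWNSTREAM` structure are hypothetical stand-ins for whatever your catalog or orchestrator exposes.

```python
from collections import deque

# Hypothetical lineage: each key feeds the assets listed as its value.
DOWNSTREAM = {
    "orders_pipeline": ["orders_clean"],
    "orders_clean": ["revenue_daily", "churn_features"],
    "revenue_daily": ["exec_dashboard", "finance_report"],
    "churn_features": ["churn_model"],
}

def blast_radius(failed, edges):
    """Return every asset reachable downstream of `failed`, breadth-first."""
    seen = {failed}          # guards against cycles and duplicate visits
    affected = []
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child not in seen:
                seen.add(child)
                affected.append(child)
                queue.append(child)
    return affected

affected = blast_radius("orders_pipeline", DOWNSTREAM)
```

Breadth-first order lists the closest consumers first, which is roughly the order in which stale data spreads.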
Most engineers skip this question. Root cause is more interesting than impact mapping. The instinct is to find the bug first and figure out the consequences later. That's backwards.
Why it matters
Severity depends on it. A pipeline failure that affects a deprecated internal tool is a P3. The same failure affecting the dashboard used in the board presentation tomorrow is a P0. You can't classify severity without knowing what's downstream.
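To make that concrete, here is one way the rule could be encoded. Treat it as a sketch: the `Consumer` fields, the 24-hour threshold, and the P2 default are assumptions for illustration, not a standard severity scheme.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Consumer:
    name: str
    deprecated: bool = False
    # Hours until the next executive or board use; None means none is scheduled.
    hours_until_exec_use: Optional[float] = None

def classify_severity(consumers: List[Consumer]) -> str:
    """Return the most severe level implied by any affected consumer."""
    levels = []
    for c in consumers:
        if c.hours_until_exec_use is not None and c.hours_until_exec_use <= 24:
            levels.append(0)   # assumed rule: executive use within a day is a P0
        elif c.deprecated:
            levels.append(3)   # deprecated tools are a P3
        else:
            levels.append(2)   # assumed default for everything else
    return f"P{min(levels)}" if levels else "P3"

legacy_only = [Consumer("legacy_tool", deprecated=True)]
with_board = legacy_only + [Consumer("board_dashboard", hours_until_exec_use=20)]
```

The same failure lands at a different severity depending only on the consumer list, which is why the list has to come first.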
Notification depends on it. Stakeholders who depend on affected data need to know while you're working on the fix, not after it. Hearing "we're aware and working on it" is a completely different experience from discovering stale data in a meeting.
Backfill depends on it. Fixing the pipeline doesn't fix the data. If downstream tables, dashboards, or ML features were built on incomplete data during the outage window, they need to be recomputed. This is frequently missed.
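The outage window tells you what to recompute. A minimal sketch, assuming downstream tables are partitioned by hour; the function name and the hourly grain are assumptions, so adjust both to match your own partitioning.

```python
from datetime import datetime, timedelta

def partitions_to_backfill(outage_start, outage_end):
    """List the hourly partitions that overlap [outage_start, outage_end)."""
    # Start from the top of the hour in which the outage began.
    current = outage_start.replace(minute=0, second=0, microsecond=0)
    partitions = []
    while current < outage_end:
        partitions.append(current)
        current += timedelta(hours=1)
    return partitions

parts = partitions_to_backfill(
    datetime(2024, 1, 1, 3, 10),   # example outage start
    datetime(2024, 1, 1, 9, 5),    # example outage end
)
```

Run this list against every table in the blast radius, not only the pipeline's direct output.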
Building blast radius into your runbooks
Your runbook's impact scope section should answer these questions specifically for each pipeline:
- Which downstream tables depend on this pipeline's output?
- Which dashboards or reports consume those tables?
- Which teams or stakeholders use those dashboards?
- Is backfill required when this pipeline fails? How far back?
- What's the data freshness SLA for each downstream consumer?
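One way to keep those answers honest is to store them as structured data rather than free text, so gaps are visible. The sketch below is an assumption about shape, not a prescribed schema; every field name is illustrative.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class ImpactScope:
    """One pipeline's impact scope; field names are illustrative."""
    pipeline: str
    downstream_tables: List[str] = field(default_factory=list)
    dashboards: List[str] = field(default_factory=list)
    stakeholders: List[str] = field(default_factory=list)
    backfill_required: Optional[bool] = None   # None means not yet answered
    backfill_lookback_days: Optional[int] = None
    freshness_sla_hours: Dict[str, float] = field(default_factory=dict)

def open_questions(scope: ImpactScope) -> List[str]:
    """Return the runbook questions this scope still leaves unanswered."""
    gaps = []
    if scope.backfill_required is None:
        gaps.append("backfill_required")
    if scope.backfill_required and scope.backfill_lookback_days is None:
        gaps.append("backfill_lookback_days")
    for consumer in scope.downstream_tables + scope.dashboards:
        if consumer not in scope.freshness_sla_hours:
            gaps.append(f"freshness_sla:{consumer}")
    return gaps

scope = ImpactScope(
    pipeline="orders_pipeline",
    downstream_tables=["revenue_daily"],
    dashboards=["exec_dashboard"],
    stakeholders=["finance team"],
    backfill_required=True,
    freshness_sla_hours={"revenue_daily": 6},
)
```

Here `open_questions(scope)` flags the missing backfill lookback and the dashboard with no freshness SLA, which is the kind of gap you want to find before 3 AM.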
This isn't a document you write once — it needs to be updated when the pipeline's consumers change. A runbook with an outdated impact scope is worse than no runbook because it gives false confidence about who's affected.
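A periodic drift check can catch an outdated scope before an incident does. This sketch assumes you can list the consumers the runbook documents and the consumers your lineage currently reports; both inputs and the function name are hypothetical.

```python
def scope_drift(documented, actual):
    """Compare the runbook's consumers with the consumers lineage reports today."""
    documented, actual = set(documented), set(actual)
    return {
        "missing_from_runbook": sorted(actual - documented),
        "no_longer_downstream": sorted(documented - actual),
    }

drift = scope_drift(
    documented=["exec_dashboard", "old_report"],
    actual=["exec_dashboard", "churn_model"],
)
```

Anything under `missing_from_runbook` is a consumer nobody would think to notify.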
The engineers who fix incidents fastest aren't the ones who find root cause fastest. They're the ones who understand blast radius and can coordinate response and communication in parallel with the technical fix. That's the skill that separates reactive firefighting from structured incident response.