Runbooks

Not a wiki page.
A decision tree any
engineer can follow.

An effective runbook tells an on-call engineer exactly how to diagnose and resolve a specific failure — from “pipeline X failed” to “rerun step 3 after clearing the staging table.” ShieldSet makes every runbook that specific.

Runbook anatomy

Every ShieldSet runbook
covers six critical sections.

The structure is designed so that any engineer — or any AI agent — can execute the steps without needing additional context from the person who wrote it.

Priority LevelCategorize severity by business impact, not technical complexity. A P0 is revenue-impacting or customer-facing data that is stale or incorrect — regardless of how simple the underlying fix might be. Severity determines response time and escalation path, so classification happens first.
Impact ScopeBefore debugging, understand the downstream impact. Which dashboards, models, reports, and data consumers depend on the failing pipeline? This step prevents the common mistake of fixing the pipeline while ignoring that downstream transformations also need a backfill.
Fault FinderAn ordered checklist that narrows the root cause systematically. Source system availability → credential and permission check → schema change detection → data volume anomaly → infrastructure resource limits → code regression. Work through each check in sequence; each either resolves the issue or eliminates a category of cause.
Fix PlaybookStep-by-step instructions for each known failure mode — specific enough to execute under pressure at 2 AM. For credential expiry: rotate in secrets manager, update the Airflow connection, trigger the backfill. For schema drift: identify the changed columns, update the schema mapping, reprocess affected partitions.
Escalation PathWho to contact and when — time-bound and severity-based, not dependent on who happens to be online. Every escalation should include the context already gathered so that a senior engineer doesn't have to start the diagnostic process from scratch.
Incident DebriefA mandatory close-out template: root cause, incident timeline, action items, and a specific prompt to update the runbook. A runbook that was accurate six months ago but hasn't been updated since is worse than no runbook — it gives false confidence. This section keeps runbooks alive.
Common incident types

The scenarios that cause
80% of pipeline incidents.

ShieldSet generates tailored Fix Playbooks for each of these failure modes based on your specific stack and pipeline configuration.

Incident type
Freq.
Typical root cause
Resolution pattern
Credential expiry
25%
OAuth token or service account key rotation without pipeline update
Rotate credentials in secrets manager, restart pipeline, verify token freshness
Source API rate limiting
20%
Increased data volume or concurrent consumers hitting API caps
Implement exponential backoff, request rate limit increase from upstream team
Schema drift
15%
Upstream team altered table structure — renamed or removed columns
Update pipeline schema mapping, backfill affected partitions from last good state
Data quality regression
15%
Null values, duplicates, or business logic violation in source data
Identify bad records, quarantine, reprocess from last known good checkpoint
Infrastructure OOM
10%
Data volume spike exceeding allocated memory or compute resources
Increase executor memory, optimize query partitioning, add data range filter
Orchestrator failure
10%
Airflow scheduler crash, DAG parsing error, or stuck task queue
Restart scheduler, fix DAG syntax, clear stuck tasks from queue
Network / connectivity
5%
VPN, firewall rule change, or DNS resolution failure
Verify network path, check security group rules, escalate to infrastructure
Runbook culture

Writing the runbook
is the easy part.

The hard part is keeping it alive. ShieldSet builds the practices that make runbooks useful six months after they're written.

Post-incident updates are mandatory.

Every incident review ends with a specific action item: update the runbook or create a new one. ShieldSet prompts this automatically at the close of every incident, so gaps close immediately — not three months later when the same failure recurs.

Quarterly validation cadence.

Schedule a recurring team review where each runbook is validated against current infrastructure. If a runbook references a deprecated tool or an outdated escalation contact, ShieldSet flags it. A runbook that hasn't been validated in 90 days shows a staleness warning.

Ready to turn your tribal knowledge
into institutional infrastructure?

Generate your first AI-powered runbook in under five minutes. Describe your pipeline in plain language — ShieldSet handles the rest.