Runbooks — ShieldSet

Runbook anatomy

Every ShieldSet runbook covers six critical sections.

The structure is designed so that any engineer — or any AI agent — can execute the steps without needing additional context from the person who wrote it.

Priority Level	Categorize severity by business impact, not technical complexity. A P0 is any incident in which revenue-impacting or customer-facing data is stale or incorrect, regardless of how simple the underlying fix might be. Severity determines response time and escalation path, so classification happens first.
Impact Scope	Before debugging, understand the downstream impact. Which dashboards, models, reports, and data consumers depend on the failing pipeline? This step prevents the common mistake of fixing the pipeline while ignoring that downstream transformations also need a backfill.
Fault Finder	An ordered checklist that narrows the root cause systematically: source system availability → credential and permission check → schema change detection → data volume anomaly → infrastructure resource limits → code regression. Work through the checks in sequence; each one either pinpoints the issue or eliminates a category of cause.
Fix Playbook	Step-by-step instructions for each known failure mode — specific enough to execute under pressure at 2 AM. For credential expiry: rotate in secrets manager, update the Airflow connection, trigger the backfill. For schema drift: identify the changed columns, update the schema mapping, reprocess affected partitions.
Escalation Path	Who to contact and when — time-bound and severity-based, not dependent on who happens to be online. Every escalation should include the context already gathered so that a senior engineer doesn't have to start the diagnostic process from scratch.
Incident Debrief	A mandatory close-out template: root cause, incident timeline, action items, and a specific prompt to update the runbook. A runbook that was accurate six months ago but hasn't been updated since is worse than no runbook — it gives false confidence. This section keeps runbooks alive.
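
The Fault Finder sequence above can be sketched as a small ordered checklist runner. This is a minimal illustration, not ShieldSet's implementation: the `Check` type, the `find_fault` function, and the idea that each probe returns `True` when it finds a problem are all assumptions made for the example. Only the order of the checks comes from the text.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Order taken from the Fault Finder description; everything else is hypothetical.
FAULT_FINDER_ORDER = [
    "source system availability",
    "credential and permission check",
    "schema change detection",
    "data volume anomaly",
    "infrastructure resource limits",
    "code regression",
]


@dataclass
class Check:
    name: str
    run: Callable[[], bool]  # assumed contract: True means "this check found a problem"


def find_fault(checks: list[Check]) -> tuple[Optional[str], list[str]]:
    """Run checks in order and stop at the first one that finds a problem.

    Returns (name of the failing check or None, categories already eliminated).
    The eliminated list is the context to pass along if the incident escalates.
    """
    eliminated: list[str] = []
    for check in checks:
        if check.run():
            return check.name, eliminated
        eliminated.append(check.name)
    return None, eliminated
```

Stopping at the first failing check keeps the sequence cheap under pressure, and the list of eliminated categories doubles as the context an escalation should carry.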

Common incident types

The scenarios that cause 80% of pipeline incidents.

ShieldSet generates tailored Fix Playbooks for each of these failure modes based on your specific stack and pipeline configuration.

Credential expiry	25%	OAuth token or service account key rotation without pipeline update	Rotate credentials in secrets manager, restart pipeline, verify token freshness
Source API rate limiting	20%	Increased data volume or concurrent consumers hitting API caps	Implement exponential backoff, request rate limit increase from upstream team
Schema drift	15%	Upstream team altered table structure: renamed or removed columns	Update pipeline schema mapping, backfill affected partitions from last good state
Data quality regression	15%	Null values, duplicates, or business logic violation in source data	Identify bad records, quarantine, reprocess from last known good checkpoint
Infrastructure OOM	10%	Data volume spike exceeding allocated memory or compute resources	Increase executor memory, optimize query partitioning, add data range filter
Orchestrator failure	10%	Airflow scheduler crash, DAG parsing error, or stuck task queue	Restart scheduler, fix DAG syntax, clear stuck tasks from queue
Network / connectivity		VPN, firewall rule change, or DNS resolution failure	Verify network path, check security group rules, escalate to infrastructure
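
For the rate-limiting case, "implement exponential backoff" can look roughly like the sketch below. It assumes the pipeline's API client raises some exception when it is throttled; `RateLimitError`, `call_with_backoff`, and the default delays are hypothetical names and values chosen for illustration, not part of any specific client library. The random jitter is a common addition that keeps concurrent consumers from retrying in lockstep.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical: raised by the caller's client when the source API throttles a request."""


def backoff_delays(max_retries: int, base: float = 1.0, cap: float = 60.0) -> list[float]:
    """Upper bound in seconds for each retry delay: base * 2**attempt, capped at `cap`."""
    return [min(cap, base * 2 ** attempt) for attempt in range(max_retries)]


def call_with_backoff(
    fn: Callable[[], T],
    max_retries: int = 5,
    base: float = 1.0,
    cap: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on RateLimitError with exponentially growing, jittered waits."""
    for ceiling in backoff_delays(max_retries, base, cap):
        try:
            return fn()
        except RateLimitError:
            # Full jitter: wait a random time between 0 and the current ceiling.
            sleep(random.uniform(0, ceiling))
    # Final attempt; if it is still throttled, the error propagates to the caller.
    return fn()
```

With the defaults, the delay ceilings are 1, 2, 4, 8, and 16 seconds. If the final attempt still fails, that is the signal to follow the second half of the fix and request a rate limit increase from the upstream team.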

Runbook culture

Writing the runbook is the easy part.

The hard part is keeping it alive. ShieldSet builds the practices that make runbooks useful six months after they're written.

Post-incident updates are mandatory.

Every incident review ends with a specific action item: update the runbook or create a new one. ShieldSet prompts this automatically at the close of every incident, so gaps close immediately — not three months later when the same failure recurs.

Quarterly validation cadence.

Schedule a recurring team review where each runbook is validated against current infrastructure. If a runbook references a deprecated tool or an outdated escalation contact, ShieldSet flags it. A runbook that hasn't been validated in 90 days shows a staleness warning.
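
The 90-day staleness rule is simple enough to sketch directly. The example below is an illustration under stated assumptions, not ShieldSet's actual logic: the function names and the runbook-to-date mapping are hypothetical, and it assumes a runbook counts as stale once 90 or more days have passed since its last validation.

```python
from datetime import date, timedelta

# Threshold from the text; treating exactly 90 days as stale is an assumption.
STALENESS_THRESHOLD = timedelta(days=90)


def is_stale(last_validated: date, today: date) -> bool:
    """True when the runbook has gone 90 or more days without validation."""
    return today - last_validated >= STALENESS_THRESHOLD


def stale_runbooks(last_validated_by_name: dict[str, date], today: date) -> list[str]:
    """Names of runbooks that should show a staleness warning, in alphabetical order."""
    return sorted(
        name
        for name, validated in last_validated_by_name.items()
        if is_stale(validated, today)
    )
```

Passing `today` explicitly rather than reading the clock inside the function keeps the check easy to test and easy to run against a planned review date.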

Ready to turn your tribal knowledge into institutional infrastructure?

Generate your first AI-powered runbook in under five minutes. Describe your pipeline in plain language — ShieldSet handles the rest.


Not a wiki page. A decision tree any engineer can follow.
