
Schema drift kills more pipelines than any other incident.

Your upstream team renamed a column. Your pipeline doesn't know yet. This is the story of how schema drift comes to account for roughly 15% of data incidents, and how to build a runbook that handles it gracefully when it happens.

The upstream team shipped a breaking change on a Friday afternoon. They renamed user_id to userId in a Kafka event schema. No announcement. No migration guide. The change was "obvious" to them.

Your pipeline broke on Saturday morning. The on-call engineer spent four hours diagnosing it because the error message wasn't "column not found" — it was a silent null propagation that corrupted three downstream aggregations before anyone noticed.

Schema drift is the silent killer of data pipelines.

Why schema drift is so destructive

Other incident types fail loudly. A credential expiry throws a clear error. A network partition is obvious. A memory overflow has a stack trace.

Schema drift often fails silently. A renamed column produces nulls. A changed type coerces quietly. An added field gets ignored. The pipeline runs successfully — it just produces wrong data. By the time someone notices, the corruption has propagated downstream, backfill is complex, and stakeholder trust is damaged.
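To make the silent failure concrete, here is a minimal sketch of the renamed-column case from the story above. The field list and both function names are hypothetical, chosen only for illustration; the point is the difference between a lenient lookup that turns a rename into nulls and a strict one that stops the pipeline.

```python
from typing import Any, Dict, Mapping

# Hypothetical contract for this example: the fields the pipeline expects.
REQUIRED_FIELDS = ("user_id", "event_type")


def extract_lenient(event: Mapping[str, Any]) -> Dict[str, Any]:
    """Lenient extraction: a renamed or missing field silently becomes None."""
    return {name: event.get(name) for name in REQUIRED_FIELDS}


def extract_strict(event: Mapping[str, Any]) -> Dict[str, Any]:
    """Strict extraction: a missing field raises before bad data is written."""
    missing = [name for name in REQUIRED_FIELDS if name not in event]
    if missing:
        raise KeyError(f"event is missing required fields: {missing}")
    return {name: event[name] for name in REQUIRED_FIELDS}


# The upstream team renamed user_id to userId.
drifted_event = {"userId": 42, "event_type": "click"}

# No error is raised here; user_id simply comes back as None.
lenient_row = extract_lenient(drifted_event)
```

Calling extract_strict on the same event raises a KeyError instead, which is the loud failure you want on a Saturday morning.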

Industry estimates put schema drift at 10–20% of all data quality incidents. In teams with heavy event-driven pipelines from multiple upstream sources, the number is higher.

The three types of schema drift

Breaking changes: renamed columns, removed fields, changed types. These usually fail loudly if you have schema validation. If you don't, they fail silently.

Additive changes: new fields, new enum values, new event types. These rarely break pipelines immediately but can break downstream logic if the pipeline was written to expect a fixed schema.

Semantic changes: the field name didn't change, but what it means did. conversion_value used to be in USD. Now it's in the user's local currency. This is the most dangerous type because the schema itself looks identical, so no schema-level check catches it.
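The first two types can be told apart mechanically by comparing two versions of a schema. The sketch below assumes a schema simplified to a mapping of field name to type name; classify_drift and the sample schemas are hypothetical, not the API of any particular schema registry.

```python
from typing import Dict, List

# Simplified stand-in for a real schema: field name -> type name.
Schema = Dict[str, str]


def classify_drift(old: Schema, new: Schema) -> Dict[str, List[str]]:
    """Split the differences between two schema versions into breaking and additive."""
    removed = sorted(set(old) - set(new))
    added = sorted(set(new) - set(old))
    retyped = sorted(f for f in set(old) & set(new) if old[f] != new[f])
    return {
        "breaking": [f"removed: {f}" for f in removed]
        + [f"type changed: {f} ({old[f]} -> {new[f]})" for f in retyped],
        "additive": [f"added: {f}" for f in added],
    }


old_schema = {"user_id": "long", "conversion_value": "double"}
new_schema = {"userId": "long", "conversion_value": "double"}

report = classify_drift(old_schema, new_schema)
```

Two things are worth noticing. A rename shows up as a removal plus an addition, so the removal is what flags it as breaking. And conversion_value appears in neither list even if its currency changed, which is exactly why semantic changes slip past a comparison like this one.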

Building a schema drift runbook

A good schema drift runbook has three phases:

Detection. What are your signals that schema drift has occurred? Check your schema registry if you have one. Check the upstream changelog or Slack channels. Look for unexpected nulls in key fields. Look for cardinality changes in categorical columns.
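The "unexpected nulls" signal can be sketched as a comparison between a batch's null rate and a known baseline. Everything here is an assumption for illustration: the function names, the 5% tolerance, and the idea that you keep a per-field baseline at all. A cardinality check on categorical columns could follow the same pattern with distinct counts.

```python
from typing import Any, Dict, List, Sequence


def null_rate(rows: Sequence[Dict[str, Any]], field: str) -> float:
    """Fraction of rows where the field is missing or None."""
    if not rows:
        return 0.0
    nulls = sum(1 for row in rows if row.get(field) is None)
    return nulls / len(rows)


def drift_signals(
    rows: Sequence[Dict[str, Any]],
    baseline_null_rates: Dict[str, float],
    tolerance: float = 0.05,  # assumed threshold; tune it per field
) -> List[str]:
    """Return the fields whose null rate exceeds the baseline by more than the tolerance."""
    return [
        field
        for field, expected in baseline_null_rates.items()
        if null_rate(rows, field) - expected > tolerance
    ]


# A batch that arrived after user_id was renamed to userId.
batch = [
    {"userId": 1, "event_type": "click"},
    {"userId": 2, "event_type": "view"},
]
baseline = {"user_id": 0.0, "event_type": 0.0}

suspect_fields = drift_signals(batch, baseline)
```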

Impact assessment. Which downstream tables, reports, or dashboards depend on the affected field? Is the corruption already in your warehouse? How far back does it go?
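If you have lineage metadata, the first question reduces to a graph walk. The sketch below assumes lineage is available as a mapping from each asset to the assets that read from it; the asset names and the downstream_of helper are hypothetical.

```python
from collections import deque
from typing import Dict, List

# Hypothetical lineage: each asset maps to the assets that read from it.
LINEAGE: Dict[str, List[str]] = {
    "events.raw": ["staging.events"],
    "staging.events": ["marts.daily_active_users", "marts.conversions"],
    "marts.daily_active_users": ["dashboard.growth"],
    "marts.conversions": ["dashboard.revenue"],
}


def downstream_of(asset: str, lineage: Dict[str, List[str]]) -> List[str]:
    """Breadth-first walk returning every asset that depends on `asset`."""
    seen = {asset}  # guards against cycles and duplicate visits
    affected: List[str] = []
    queue = deque([asset])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:
                seen.add(child)
                affected.append(child)
                queue.append(child)
    return affected


affected = downstream_of("events.raw", LINEAGE)
```

Breadth-first order lists the nearest dependents first, which is a reasonable order for checking where the corruption has already landed.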

Resolution. For breaking changes, coordinate with the upstream team on rollback or migration. For silent corruption, identify the backfill window, build the correction query, and communicate the timeline to stakeholders.
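Identifying the backfill window can be as simple as finding the first and last partitions where the affected field went bad. This sketch assumes you can compute a null rate per daily partition; the backfill_window helper, the threshold, and the dates are illustrative only.

```python
from datetime import date
from typing import Dict, Optional, Tuple


def backfill_window(
    daily_null_rates: Dict[date, float],
    threshold: float = 0.05,  # assumed cutoff for "this partition is corrupted"
) -> Optional[Tuple[date, date]]:
    """Return the first and last partition dates whose null rate exceeds the threshold."""
    bad_days = sorted(day for day, rate in daily_null_rates.items() if rate > threshold)
    if not bad_days:
        return None
    return bad_days[0], bad_days[-1]


# Illustrative per-partition null rates for the affected field.
rates = {
    date(2024, 1, 5): 0.0,
    date(2024, 1, 6): 1.0,
    date(2024, 1, 7): 1.0,
    date(2024, 1, 8): 0.0,
}

window = backfill_window(rates)
```

The returned pair bounds the range to reprocess. If bad partitions are not contiguous, the window will include some clean days in between, which is usually an acceptable cost for a simpler correction query.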

The most important part of a schema drift runbook is the impact assessment. Most teams skip it and fix the pipeline before understanding what data is already corrupted. That's how a two-hour fix turns into a three-day data quality incident.
