
Airflow at 2 AM: a diagnostic field guide

The scheduler crashed. Or the DAG is stuck. Or the executor ran out of memory. Here's the ordered checklist for diagnosing Airflow failures fast — drawn from a decade of late-night incidents.

It's 2 AM. The alert fired. The DAG that feeds the morning executive dashboard is stuck. You have until 6 AM before anyone notices.

Here's how to diagnose it fast.

Step 1: Identify the failure mode (2 min)

Before you touch anything, identify which of the four common Airflow failure modes you're dealing with:

Scheduler failure: the scheduler process has died or stopped processing. Signs: no new task instances are being created even though DAGs are scheduled. Check: run airflow jobs check --job-type SchedulerJob (available in recent Airflow 2.x releases), or look for the stale-heartbeat warning banner in the Airflow UI. Recent scheduler jobs are also listed under Browse > Jobs.

Task failure: one or more tasks in the DAG failed. Signs: the DAG run shows a failed state with specific task failures. Check: click into the DAG run in the UI and look for red task instances.

DAG import error: the DAG file has a Python syntax error or import failure. Signs: the DAG disappeared from the UI or shows an import error. Check: airflow dags list-import-errors.

Executor resource exhaustion: the executor (Celery, Kubernetes) ran out of workers or resources. Signs: tasks sit in the queued state longer than expected without starting. Check: your executor-specific monitoring (Celery Flower, Kubernetes pod status).
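The triage above can be written down as a small decision function. This is a minimal sketch, not part of Airflow: the Signals fields, the classify name, and both thresholds are hypothetical, and you would fill the values in by hand or from your own monitoring.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Signals:
    """Observations gathered during triage (all field names are hypothetical)."""
    scheduler_heartbeat_age_s: Optional[float]  # None means no heartbeat was found
    import_error_count: int                     # e.g. rows from `airflow dags list-import-errors`
    failed_task_count: int                      # red task instances in the DAG run
    oldest_queued_age_s: float                  # how long the oldest queued task has waited


def classify(s: Signals, heartbeat_limit_s: float = 120.0,
             queued_limit_s: float = 600.0) -> str:
    """Return the first failure mode whose sign is present.

    Modes are checked in the order the post lists them. The default
    thresholds are assumptions; tune them to your deployment.
    """
    if s.scheduler_heartbeat_age_s is None or s.scheduler_heartbeat_age_s > heartbeat_limit_s:
        return "scheduler_failure"
    if s.failed_task_count > 0:
        return "task_failure"
    if s.import_error_count > 0:
        return "dag_import_error"
    if s.oldest_queued_age_s > queued_limit_s:
        return "executor_exhaustion"
    return "unknown"
```

Checking the scheduler first is deliberate: if it is down, the other signals are likely stale and not worth reading yet.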

Step 2: Scheduler failure path

If the scheduler is down:

  1. Check the scheduler logs: journalctl -u airflow-scheduler -n 100 (systemd) or your log aggregation tool.
  2. Look for OOM kills: dmesg | grep -i "killed process"
  3. Restart the scheduler: systemctl restart airflow-scheduler or your equivalent.
  4. Monitor for 5 minutes to confirm it stays up and processes DAG runs.

If it keeps dying, the root cause is usually memory pressure (increase the scheduler's memory limit), a corrupted DAG file that crashes the scheduler during parsing, or database connection pool exhaustion.
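If you would rather not eyeball dmesg output, the OOM check can be scripted. The sketch below assumes the kernel logs kills in the common "Killed process <pid> (<name>)" form. The function names and the "airflow" name hint are hypothetical, and on many hosts the scheduler appears under a different process name such as python.

```python
import re
from typing import List, Tuple

# Matches kernel log lines such as:
#   "Out of memory: Killed process 4242 (airflow) total-vm:..."
OOM_LINE = re.compile(r"killed process (\d+) \(([^)]+)\)", re.IGNORECASE)


def find_oom_kills(dmesg_text: str) -> List[Tuple[int, str]]:
    """Return (pid, process_name) for every OOM kill found in dmesg output."""
    return [(int(m.group(1)), m.group(2)) for m in OOM_LINE.finditer(dmesg_text)]


def scheduler_was_oom_killed(dmesg_text: str, name_hint: str = "airflow") -> bool:
    """True if any killed process name contains name_hint (an assumed naming convention)."""
    return any(name_hint in name for _, name in find_oom_kills(dmesg_text))
```

Feed it the captured text of dmesg. An empty result only means no kill was logged in the kernel ring buffer that is still available, not that memory is healthy.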

Step 3: Task failure path

If specific tasks are failing:

  1. Click into the failed task in the UI and open the log.
  2. The last 20 lines of the log contain the error in 90% of cases.
  3. Common causes: expired credentials (look for 401/403 errors or "authentication failed"), upstream data unavailability (look for "table not found" or "file not found"), resource limits (look for OOM or timeout errors).
  4. For credential failures: rotate the credential in your secret store, update the Airflow connection, clear the failed task and retry.
  5. For upstream failures: check the upstream source status, assess whether to wait or backfill.
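The log-reading steps above can be approximated with a small scanner over the tail of a task log. This is a rough sketch: the CAUSES table and the likely_causes name are hypothetical, and the patterns cover only the phrases mentioned in the steps, so expect to extend them for your own stack.

```python
import re
from typing import List

# Patterns mirror the phrases named in the steps above; they are a starting point only.
CAUSES = {
    "credentials": re.compile(r"\b(401|403)\b|authentication failed", re.IGNORECASE),
    "upstream": re.compile(r"table not found|file not found", re.IGNORECASE),
    "resources": re.compile(r"\bOOM\b|out of memory|timeout|timed out", re.IGNORECASE),
}


def likely_causes(log_text: str, tail_lines: int = 20) -> List[str]:
    """Scan only the last tail_lines lines and return every cause whose pattern matches."""
    tail = "\n".join(log_text.splitlines()[-tail_lines:])
    return [cause for cause, pattern in CAUSES.items() if pattern.search(tail)]
```

An empty list means none of the common patterns appeared in the tail. That is a cue to read the full log, not proof that the task is healthy.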

Step 4: Executor resource exhaustion path

If tasks are stuck in queued state:

Celery executor: open Flower (port 5555 by default, or /flower on your Airflow host if you configured a URL prefix). Are workers online? Are they processing tasks? If workers are offline, restart them. If they're online but not processing, check for task serialization errors in the worker logs.

Kubernetes executor: run kubectl get pods -n airflow --field-selector=status.phase=Pending (substitute your own namespace) to list task pods that were created but never started. Are pod creation requests pending? Check node capacity with kubectl describe nodes. If nodes are under pressure, you may need to scale the node group or clear stuck pods manually.
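To script the pending-pod check, you can parse the JSON form of the pod listing instead of grepping text. A minimal sketch, assuming the input is the output of kubectl get pods -n airflow -o json; the pending_pods helper name is hypothetical.

```python
import json
from typing import List


def pending_pods(kubectl_json: str) -> List[str]:
    """Return names of pods whose phase is Pending, given `kubectl get pods -o json` output."""
    items = json.loads(kubectl_json).get("items", [])
    return [
        pod["metadata"]["name"]
        for pod in items
        if pod.get("status", {}).get("phase") == "Pending"
    ]
```

A list that is non-empty and keeps growing usually points at node capacity rather than at Airflow itself, which is when kubectl describe nodes becomes the next stop.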

The 80% rule

In practice, 80% of 2 AM Airflow incidents fall into three categories: expired credentials, an upstream source being down, or a scheduler process that ran out of memory. Building your diagnostic runbook around these three cases first will resolve most incidents before you need to go deeper.

The other 20% — DAG bugs, executor deadlocks, database corruption — require more investigation. But by the time you've ruled out the 80% cases, you have enough information to escalate with context rather than speculation.
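One way to encode the "common cases first" ordering is a runbook that runs its checks in sequence and stops at the first hit. This is a sketch under assumptions: run_runbook and the check names in the usage note are hypothetical, and each check is a callable you supply that returns True when it finds a problem.

```python
from typing import Callable, List, Optional, Tuple

# A check is a (name, callable) pair; the callable returns True when it finds a problem.
Check = Tuple[str, Callable[[], bool]]


def run_runbook(checks: List[Check]) -> Optional[str]:
    """Run checks in order and return the name of the first one that reports a problem.

    Returns None when every check passes, which is the signal to dig deeper or escalate.
    """
    for name, has_problem in checks:
        if has_problem():
            return name
    return None
```

With checks named, say, expired_credentials, upstream_down, and scheduler_oom, a None result means the three common cases are ruled out, which is exactly the context worth including when you escalate.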
