Skip to content

CeleryExecutor worker health check does not detect catatonic state after Redis broker restart #63580

@antonio-mello-ai

Description

@antonio-mello-ai

Apache Airflow version

3.1.7

If "Other Airflow 2 version" selected, which one?

No response

What happened?

After a Redis broker restart (e.g., VM reboot), the Celery worker reconnects at the transport level but silently loses its consumer registration on the task queue. The worker process stays alive, celery inspect ping returns OK, but inspect.active_queues() returns None — the worker is in a catatonic state where it accepts no new tasks.

This is a known upstream Celery bug (celery/celery#8030, celery/celery#9054, celery/celery#8990) that has persisted across Celery 5.2.x through 5.5.x with Redis broker. The partial fix in celery/celery#8796 did not fully resolve it.

The problem on Airflow's side: The current worker liveness/health check mechanism does not detect this state. Since celery inspect ping responds normally, Docker/Kubernetes health probes pass, and the worker is never restarted. Tasks accumulate in the Redis queue with state queued, hit the scheduler's requeue limit, and are marked failed.

In our case, 302 tasks piled up in the default queue over ~14 hours before we noticed. 7 DAGs failed, all with Task requeue attempts exceeded max; marking failed.

What you think should happen instead?

The Airflow Celery worker health check should verify that the worker has active queue consumers, not just that it responds to ping. Specifically:

# Current check (insufficient):
result = app.control.inspect().ping()
# Returns OK even in catatonic state

# Proposed additional check:
queues = app.control.inspect().active_queues()
if queues is None or worker_name not in queues:
    # Worker is alive but not consuming — health check should FAIL
    return False

If the worker is alive but has no registered queues, the health check should fail, triggering a container restart via the orchestrator (Docker, Kubernetes, systemd).

This would go in the airflow-providers-celery package, likely in the worker CLI health check logic.

How to reproduce

  1. Start Airflow with CeleryExecutor and Redis broker
  2. Confirm worker is consuming tasks normally
  3. Restart Redis (e.g., docker restart redis or reboot the Redis host)
  4. Observe: worker process stays alive, celery inspect ping returns OK
  5. celery inspect active_queues returns None or empty for the worker
  6. Schedule a DAG — task goes to queued state and is never picked up
  7. After scheduler requeue attempts, task is marked failed

Operating System

Debian 12 (Proxmox VM)

Versions of Apache Airflow Providers

apache-airflow-providers-celery (installed with Airflow 3.1.7)

Deployment

Docker Compose

Deployment details

  • Airflow 3.1.7 (Docker image apache/airflow:3.1.7-python3.11)
  • Celery worker with 16 fork pool workers
  • Redis 7.2.10 as broker
  • PostgreSQL as result backend
  • worker_prefetch_multiplier = 1

Anything else?

Related upstream Celery issues:

Previous Airflow issues (closed as upstream):

Those issues were rightfully closed as upstream Celery bugs. This issue proposes a defensive fix on Airflow's side — improving the health check to detect and recover from the catatonic state, regardless of when Celery fixes the root cause.

Workarounds:

  • --without-heartbeat --without-gossip --without-mingle avoids the code path but loses cluster features
  • broker_connection_retry = False makes the worker crash on broker loss (requires restart policy)
  • Custom health check script checking active_queues() instead of ping

I'm willing to work on a PR for this.

🤖 Generated with Claude Code

Metadata

Metadata

Assignees

No one assigned

    Labels

    kind:bugThis is a clearly a bugpriority:highHigh priority bug that should be patched quickly but does not require immediate new releaseprovider:celery

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions