Description
Apache Airflow version
3.1.7
If "Other Airflow 2 version" selected, which one?
No response
What happened?
After a Redis broker restart (e.g., VM reboot), the Celery worker reconnects at the transport level but silently loses its consumer registration on the task queue. The worker process stays alive, `celery inspect ping` returns OK, but `inspect.active_queues()` returns `None` — the worker is in a catatonic state where it accepts no new tasks.
This is a known upstream Celery bug (celery/celery#8030, celery/celery#9054, celery/celery#8990) that has persisted across Celery 5.2.x through 5.5.x with Redis broker. The partial fix in celery/celery#8796 did not fully resolve it.
The problem on Airflow's side: the current worker liveness/health check mechanism does not detect this state. Since `celery inspect ping` responds normally, Docker/Kubernetes health probes pass, and the worker is never restarted. Tasks accumulate in the Redis queue in the `queued` state, hit the scheduler's requeue limit, and are marked failed.
In our case, 302 tasks piled up in the `default` queue over ~14 hours before we noticed. 7 DAGs failed, all with `Task requeue attempts exceeded max; marking failed`.
What you think should happen instead?
The Airflow Celery worker health check should verify that the worker has active queue consumers, not just that it responds to ping. Specifically:
```python
# Current check (insufficient):
result = app.control.inspect().ping()
# Returns OK even in the catatonic state

# Proposed additional check:
queues = app.control.inspect().active_queues()
if queues is None or worker_name not in queues:
    # Worker is alive but not consuming — health check should FAIL
    return False
```

If the worker is alive but has no registered queues, the health check should fail, triggering a container restart via the orchestrator (Docker, Kubernetes, systemd).
This would go in the airflow-providers-celery package, likely in the worker CLI health check logic.
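To make the proposal concrete, here is a minimal sketch of the decision logic operating on the reply of `app.control.inspect().active_queues()`. The function name `worker_consumes_queues` is hypothetical (not an existing Airflow or Celery API); only the shape of the `active_queues()` reply — a dict of worker hostname to a list of queue descriptions, or `None` when no worker replied — is taken from Celery's behavior.

```python
def worker_consumes_queues(active_queues_reply, worker_name):
    """Return True only if the worker has at least one registered queue consumer.

    active_queues_reply: the value returned by app.control.inspect().active_queues(),
    i.e. a dict mapping worker hostname to a list of queue descriptions,
    or None when no worker replied. (Hypothetical helper, not Airflow API.)
    """
    if not active_queues_reply:  # None or {}: nobody replied at all
        return False
    queues = active_queues_reply.get(worker_name)
    return bool(queues)  # an empty list also means "not consuming"


# The three states seen in this issue:
healthy = {"celery@worker1": [{"name": "default"}]}
catatonic = None  # transport reconnected, consumer registration lost
registered_but_empty = {"celery@worker1": []}

assert worker_consumes_queues(healthy, "celery@worker1") is True
assert worker_consumes_queues(catatonic, "celery@worker1") is False
assert worker_consumes_queues(registered_but_empty, "celery@worker1") is False
```

The key design point is treating both `None` and an empty queue list as failures, since a worker that replies to the broadcast but lists no queues is just as stuck as one that does not reply.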
How to reproduce
- Start Airflow with CeleryExecutor and Redis broker
- Confirm worker is consuming tasks normally
- Restart Redis (e.g., `docker restart redis` or reboot the Redis host)
- Observe: worker process stays alive, `celery inspect ping` returns OK, `celery inspect active_queues` returns `None` or empty for the worker
- Schedule a DAG — task goes to `queued` state and is never picked up
- After scheduler requeue attempts, task is marked `failed`
Operating System
Debian 12 (Proxmox VM)
Versions of Apache Airflow Providers
apache-airflow-providers-celery (installed with Airflow 3.1.7)
Deployment
Docker Compose
Deployment details
- Airflow 3.1.7 (Docker image `apache/airflow:3.1.7-python3.11`)
- Celery worker with 16 fork pool workers
- Redis 7.2.10 as broker
- PostgreSQL as result backend
- `worker_prefetch_multiplier = 1`
Anything else?
Related upstream Celery issues:
- Worker stops consuming tasks after redis reconnection on celery 5.2.3 celery/celery#8030
- Worker stops consuming tasks after Redis re-connection on celery 5 celery/celery#8091
- #8796: Worker not consuming tasks after Redis broker restart, maybe not fixed in version 5.4.0 celery/celery#8990 (confirms not fixed in 5.4.0)
- Celery worker stops consuming after Redis restart celery/celery#9054
- Worker stops consuming tasks after redis reconnection (gevent) celery/celery#9191
Previous Airflow issues (closed as upstream):
- Celery worker enters a catatonic state after redis restart #26542
- Celery worker enters a catatonic state after redis restart and the tasks are getting queued #32484
- Celery worker tasks in queued status when airflow-redis-master restarted #24498
- Worker sometimes does not reconnect to redis/celery queue after crash #27032
Those issues were rightfully closed as upstream Celery bugs. This issue proposes a defensive fix on Airflow's side — improving the health check to detect and recover from the catatonic state, regardless of when Celery fixes the root cause.
Workarounds:
- `--without-heartbeat --without-gossip --without-mingle` avoids the code path but loses cluster features
- `broker_connection_retry = False` makes the worker crash on broker loss (requires a restart policy)
- Custom health check script checking `active_queues()` instead of `ping`
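For the custom health check workaround under Docker Compose, the wiring could look like the sketch below. The script path `/opt/airflow/scripts/check_consumer.py` and the probe timings are placeholders for illustration, not defaults shipped with the official image; the script would exit non-zero when `active_queues()` shows no consumers, and the `restart` policy then recycles the container.

```yaml
services:
  airflow-worker:
    image: apache/airflow:3.1.7-python3.11
    healthcheck:
      # Hypothetical script: exits 1 when active_queues() is None/empty
      test: ["CMD", "python", "/opt/airflow/scripts/check_consumer.py"]
      interval: 60s
      timeout: 30s
      retries: 3
    restart: unless-stopped
```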
I'm willing to work on a PR for this.
🤖 Generated with Claude Code