
Retry Kubernetes 410 Gone errors in watch streams #218

Draft
Copilot wants to merge 3 commits into master from copilot/fix-workflow-error-resource-version



Copilot AI commented Feb 4, 2026

Long-running jobs fail when the Kubernetes watch API returns 410 Gone (resource version expired). The existing retry logic treats every 4xx error as non-retryable, but 410 is a recoverable condition that should trigger a watch restart.
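To illustrate what "trigger a watch restart" means, here is a minimal sketch that uses only the standard library. `FakeApiException`, `make_fake_stream`, and `wait_for_completion` are hypothetical stand-ins written for this example; they are not calrissian's code or the Kubernetes client API, and the actual change relies on the existing tenacity retry decorator rather than a hand-written loop.

```python
class FakeApiException(Exception):
    """Stand-in for the Kubernetes client's ApiException (assumption: only .status matters)."""

    def __init__(self, status):
        super().__init__(f"({status})")
        self.status = status


def make_fake_stream(failures):
    """Build a stream function that raises 410 `failures` times, then yields pod events."""
    state = {"calls": 0}

    def stream():
        state["calls"] += 1
        if state["calls"] <= failures:
            # Simulates "too old resource version": this watch is no longer usable.
            raise FakeApiException(410)
        yield {"type": "MODIFIED", "phase": "Running"}
        yield {"type": "MODIFIED", "phase": "Succeeded"}

    return stream, state


def wait_for_completion(stream, max_restarts=3):
    """Consume events until a terminal phase, opening a fresh watch after each 410."""
    for attempt in range(max_restarts + 1):
        try:
            for event in stream():
                if event["phase"] in ("Succeeded", "Failed"):
                    return event["phase"], attempt
        except FakeApiException as err:
            if err.status != 410:
                raise  # other API errors are not handled by this sketch
            # 410: fall through to the next loop iteration, which re-opens the watch
    raise RuntimeError("watch did not complete within the restart budget")


stream, state = make_fake_stream(failures=2)
phase, restarts = wait_for_completion(stream)
print(phase, restarts)
```

The point of the sketch is that a 410 invalidates only the current watch, not the pod being observed, so starting a new watch is enough to recover.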

Changes

  • calrissian/retry.py: Added _is_retryable_4xx() to classify 410 as retryable. Updated retry condition from retry_on_type & retry_not_4xx to retry_on_type & (retry_not_4xx | retry_on_410).
  • Tests: Added coverage for 410 retry behavior in test_retry.py and test_k8s.py.

Behavior

# Before: 410 errors blocked by 4xx filter
retry(retry=retry_on_type & retry_not_4xx)  # 410 → immediate failure

# After: 410 errors explicitly allowed
retry(retry=retry_on_type & (retry_not_4xx | retry_on_410))  # 410 → retry

Other 4xx errors (400, 403, 404) remain non-retryable. 5xx and network errors continue to retry as before.
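The resulting classification can be sketched with plain boolean functions. This is an illustration under stated assumptions, not the code in calrissian/retry.py: the names `retry_on_type`, `retry_not_4xx`, and `retry_on_410` are borrowed from the description above, but here they are ordinary functions rather than tenacity retry objects, `FakeApiException` is a hypothetical stand-in for the Kubernetes client exception, and the set of exception types accepted by `retry_on_type` is a guess.

```python
class FakeApiException(Exception):
    """Stand-in for the Kubernetes client's ApiException (assumption: only .status matters)."""

    def __init__(self, status):
        super().__init__(f"({status})")
        self.status = status


def retry_on_type(exc):
    # Assumption: API errors and network errors are the retry candidates.
    return isinstance(exc, (FakeApiException, ConnectionError))


def retry_not_4xx(exc):
    # True unless the exception carries a 4xx status code.
    status = getattr(exc, "status", None)
    return not (isinstance(status, int) and 400 <= status < 500)


def retry_on_410(exc):
    # The single 4xx status that is treated as recoverable.
    return getattr(exc, "status", None) == 410


def should_retry(exc):
    # Mirrors: retry_on_type & (retry_not_4xx | retry_on_410)
    return retry_on_type(exc) and (retry_not_4xx(exc) or retry_on_410(exc))


decisions = {s: should_retry(FakeApiException(s)) for s in (400, 403, 404, 410, 500, 503)}
print(decisions)
print(should_retry(ConnectionError("connection reset")))
```

In tenacity, the `&` and `|` operators on retry conditions combine them in the same all-of and any-of sense that `and` and `or` express here.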

Original prompt

This section details the original issue you should resolve

<issue_title>Got workflow error: "too old resource version: 986858921 (987192637)"</issue_title>
<issue_description>We got this error in a long running job.
I could not reproduce yet. The stacktrace came out scrambled:

```
ERROR -  cwltool.errors.WorkflowException: (410)
INFO  -  (410)
INFO  -  Reason: Expired: too old resource version: 986858921 (987192637)
INFO  -  Traceback (most recent call last):
INFO  -      raise exceptions[0]
INFO  -      result = self.fn(*self.args, **self.kwargs)
INFO  -      completion_result = self.wait_for_kubernetes_pod()
INFO  -    File "/app/envs/calrissian/lib/python3.12/site-packages/tenacity/__init__.py", line 289, in wrapped_f
INFO  -      return self(f, *args, **kw)
INFO  -            ^^^^^^^^^^^^^^^^^^^
INFO  -      do = self.iter(retry_state=retry_state)
INFO  -           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
INFO  -      raise self.last_attempt.result()
INFO  -            ^^^^^^^^^^^^^^^^^^^^^^^^^^
INFO  -    File "/app/envs/calrissian/lib/python3.12/site-packages/tenacity/__init__.py", line 382, in __call__
INFO  -               ^^^^^^^^^^^^^^^^^^^
INFO  -      for event in w.stream(self.core_api_instance.list_namespaced_pod, self.namespace, field_selector=self._get_pod_field_selector()):
INFO  -                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
INFO  -      raise client.rest.ApiException(
INFO  -  Reason: Expired: too old resource version: 986858921 (987192637)
INFO  -      (out, status) = real_executor(
INFO  -                      ^^^^^^^^^^^^^^
INFO  -    File "/app/envs/calrissian/lib/python3.12/site-packages/cwltool/executors.py", line 146, in execute
INFO  -      self.run_jobs(process, job_order_object, logger, runtime_context)
INFO  -             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
INFO  -      futures = self.enqueue_jobs_from_iterator(job_iterator, logger, runtime_context, pool_executor)
INFO  -                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
INFO  -    File "/app/calrissian/executor.py", line 385, in enqueue_jobs_from_iterator
INFO  -      raise WorkflowException(str(err)) from err
INFO  -  Starting Cleanup
INFO  -    File "/app/calrissian/executor.py", line 265, in raise_if_exception_queued
INFO  -    File "/home/calrissian/.local/share/hatch/env/virtual/.pythons/3.12/python/lib/python3.12/concurrent/futures/thread.py", line 58, in run
INFO  -               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
INFO  -    File "/app/calrissian/job.py", line 719, in run
INFO  -                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
INFO  -    File "/app/calrissian/job.py", line 418, in wait_for_kubernetes_pod
INFO  -      return self.client.wait_for_completion()
INFO  -             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
INFO  -    File "/app/envs/calrissian/lib/python3.12/site-packages/tenacity/__init__.py", line 379, in __call__
INFO  -    File "/app/envs/calrissian/lib/python3.12/site-packages/tenacity/__init__.py", line 325, in iter
INFO  -      raise retry_exc.reraise()
INFO  -    File "/app/envs/calrissian/lib/python3.12/site-packages/tenacity/__init__.py", line 158, in reraise
INFO  -    File "/home/calrissian/.local/share/hatch/env/virtual/.pythons/3.12/python/lib/python3.12/concurrent/futures/_base.py", line 449, in result
INFO  -      return self.__get_result()
INFO  -             ^^^^^^^^^^^^^^^^^^^
INFO  -    File "/home/calrissian/.local/share/hatch/env/virtual/.pythons/3.12/python/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
INFO  -      raise self._exception
INFO  -      result = fn(*args, **kwargs)
INFO  -             ^^^^^^^^^^^^^^^^^^^^
INFO  -    File "/app/calrissian/k8s.py", line 163, in wait_for_completion
INFO  -    File "/app/envs/calrissian/lib/python3.12/site-packages/kubernetes/watch/watch.py", line 182, in stream
INFO  -  kubernetes.client.exceptions.ApiException: (410)
INFO  -  (empty)
INFO  -  The above exception was the direct cause of the following exception:
INFO  -    File "/app/envs/calrissian/lib/python3.12/site-packages/cwltool/main.py", line 1314, in main
INFO  -    File "/app/envs/calrissian/lib/python3.12/site-packages/cwltool/executors.py", line 63, in __call__
INFO  -      return self.execute(process, job_order_object, runtime_context, logger)
INFO  -    File "/app/calrissian/executor.py", line 427, in run_jobs
INFO  -      self.raise_if_exception_queued(futures, logger)
INFO  -    File "/app/calrissian/executor.py", line 270, in raise_if_exception_queued
```</issue_description>

<agent_instructions>manage 410 in kubernetes</agent_instructions>

## Comments on the Issue (you are @copilot in this section)

<comments>
</comments>


</details>




- Fixes Duke-GCB/calrissian#217


Copilot AI and others added 2 commits February 4, 2026 16:11
Co-authored-by: fabricebrito <1178901+fabricebrito@users.noreply.github.com>
Co-authored-by: fabricebrito <1178901+fabricebrito@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix workflow error due to old resource version" to "Retry Kubernetes 410 Gone errors in watch streams" Feb 4, 2026
Copilot AI requested a review from fabricebrito February 4, 2026 16:15

2 participants