fix: combine k8s proxy test images to eliminate duplicate interpreter layers #1138

Open
agentydragon wants to merge 1 commit into devel from claude/combine-k8s-proxy-test-images

Conversation

@agentydragon
Owner

Summary

Combine mock_k8s_server and k8s_test_client into a single container image to eliminate duplicate Docker layer loads.

Root cause

py_image_layer embeds the binary name in runfiles paths (mock_k8s_server_bin.runfiles/... vs k8s_test_client_bin.runfiles/...), so the two images get different layer digests for the identical Python 3.13 interpreter (~112 MB each). Docker therefore loads the same content twice. This was confirmed by comparing diffIDs in the OCI image configs: the interpreter layers do not match.
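The effect can be reproduced outside Bazel. The following is a minimal sketch, not the py_image_layer implementation: it builds two single-file tar layers with identical bytes under different path prefixes and compares their sha256 digests, then intersects the `rootfs.diff_ids` lists of two OCI-style image configs. The path suffix `python/bin/python3`, the helper names, and the `sha256:base` placeholder are assumptions made for illustration.

```python
import hashlib
import io
import tarfile


def layer_digest(path_in_layer: str, content: bytes) -> str:
    """Return the sha256 of an uncompressed tar holding one file at the given path."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        info = tarfile.TarInfo(name=path_in_layer)
        info.size = len(content)
        info.mtime = 0  # fixed timestamp so that only the path varies
        tar.addfile(info, io.BytesIO(content))
    return "sha256:" + hashlib.sha256(buf.getvalue()).hexdigest()


def shared_diff_ids(config_a: dict, config_b: dict) -> set[str]:
    """Return the diffIDs that appear in both OCI image configs (rootfs.diff_ids)."""
    return set(config_a["rootfs"]["diff_ids"]) & set(config_b["rootfs"]["diff_ids"])


# Identical bytes, different runfiles prefix (the path suffix is hypothetical).
interpreter = b"stand-in for identical interpreter bytes"
server_layer = layer_digest("mock_k8s_server_bin.runfiles/python/bin/python3", interpreter)
client_layer = layer_digest("k8s_test_client_bin.runfiles/python/bin/python3", interpreter)

# Hypothetical configs: a shared base layer plus one interpreter layer each.
server_config = {"rootfs": {"type": "layers", "diff_ids": ["sha256:base", server_layer]}}
client_config = {"rootfs": {"type": "layers", "diff_ids": ["sha256:base", client_layer]}}

print(server_layer != client_layer)  # the path alone changes the digest
print(shared_diff_ids(server_config, client_config))  # only the base layer is shared
```

Because the file path is part of the tar header, the digest changes even though the file contents are byte-for-byte the same, so a content-addressed store cannot deduplicate the two layers.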

Changes

  • New k8s_proxy_test_tools.py dispatcher: routes to mock_k8s_server or k8s_proxy_test_client based on K8S_PROXY_TEST_ROLE env var
  • Single aspect_py_binary with both scripts + all deps
  • Set size="medium" (300s timeout): the default size="small" (60s timeout) was causing intermittent timeouts (passing runs at 56s, a timed-out run at 60.1s)
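For readers who have not opened the diff, the dispatcher described in the list above might look roughly like this. It is a sketch under stated assumptions, not the contents of k8s_proxy_test_tools.py: only the K8S_PROXY_TEST_ROLE variable name comes from this PR, while the role values `"server"` and `"client"`, the stub entry points, and the `dispatch` function are hypothetical.

```python
import os
from typing import Callable, Mapping

ROLE_ENV_VAR = "K8S_PROXY_TEST_ROLE"  # name taken from the PR description


def run_mock_server() -> str:
    # Stand-in for the real mock_k8s_server entry point (hypothetical).
    return "mock_k8s_server"


def run_test_client() -> str:
    # Stand-in for the real test client entry point (hypothetical).
    return "k8s_test_client"


# Role values are assumptions; the PR does not state which strings are accepted.
ROLES: dict[str, Callable[[], str]] = {
    "server": run_mock_server,
    "client": run_test_client,
}


def dispatch(environ: Mapping[str, str] = os.environ) -> str:
    """Run the entry point selected by the role environment variable."""
    role = environ.get(ROLE_ENV_VAR)
    if role not in ROLES:
        raise ValueError(f"{ROLE_ENV_VAR} must be one of {sorted(ROLES)}, got {role!r}")
    return ROLES[role]()


print(dispatch({ROLE_ENV_VAR: "server"}))
print(dispatch({ROLE_ENV_VAR: "client"}))
```

Failing loudly on a missing or unknown role keeps a misconfigured container from silently starting the wrong tool. With one binary, both containers share the same image and only the environment variable differs.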

Measurements

| Metric | Before | After |
|---|---|---|
| Docker loads | 3 (99 + 145 + 144 = 388 MB) | 2 (99 + 155 = 254 MB) |
| Test time (RBE) | 45.3s | 35.8s |
| Duplicate data eliminated | | 134 MB |
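The totals above can be re-derived from the per-image sizes. This is a quick arithmetic check using only the figures reported in this PR; the relative test-time reduction is computed here and was not stated in the original measurements.

```python
# Per-image sizes in MB, as reported in the measurements.
before_mb = [99, 145, 144]
after_mb = [99, 155]

saved_mb = sum(before_mb) - sum(after_mb)
print(sum(before_mb), sum(after_mb), saved_mb)  # 388 254 134

# Relative test-time reduction on RBE (derived, not stated in the PR).
time_before_s, time_after_s = 45.3, 35.8
reduction = (time_before_s - time_after_s) / time_before_s
print(f"{reduction:.1%}")  # 21.0%
```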

Test plan

  • bazel test --nocache_test_results //devinfra/claude/hook_daemon/session_start:test_k8s_proxy_integration passes on RBE (35.8s)
  • Pre-commit hooks pass
  • CI on this branch

https://claude.ai/code/session_01ANqoTWWCxF71H5Aq2DqwnT

… layers

py_image_layer embeds the binary name in runfiles paths, so separate
binaries produce different layer digests for identical Python content.
mock_k8s_server and k8s_test_client each had a 112 MB interpreter
layer with different hashes — loaded twice by docker, never deduplicated.

Combine both scripts into a single aspect_py_binary with a dispatcher
(k8s_proxy_test_tools.py) that routes via K8S_PROXY_TEST_ROLE env var.
One image, one docker load, one interpreter layer.

Before: 3 docker loads (99 + 145 + 144 = 388 MB), test time 45.3s
After:  2 docker loads (99 + 155 = 254 MB), test time 35.8s

Also set size="medium" (300s) for safe headroom — the default
size="small" (60s) was causing intermittent timeouts on slower RBE
workers (passing runs at 56s, timeout at 60.1s).

https://claude.ai/code/session_01ANqoTWWCxF71H5Aq2DqwnT