fix(#5063): deterministically order env vars in DeployWorkload#5064
fix(#5063): deterministically order env vars in DeployWorkload#5064nalditopr wants to merge 2 commits intostacklok:mainfrom
Conversation
DeployWorkload built the StatefulSet pod's env list by iterating a Go map[string]string directly. Map iteration order is randomized by the runtime spec, so the resulting []EnvVar ordering changed between calls even when the contents were identical. Kubernetes hashes env list order into the pod template hash, so each reorder triggered a no-op StatefulSet rollout. In a deployed MCPServer this surfaced as repeated pod recreations after operator pod restarts, with secondary fallout: during the bridge recreation window the proxy runner's /health pinger blocks for its 5s timeout, racing the kubelet liveness probe (also 5s by default) and killing the proxy container. Each restart triggered another reconcile which re-iterated the env map and produced yet another order. Sort the keys before iterating so the env list is stable. Add a regression test that runs DeployWorkload 10+ times and asserts the StatefulSet's mcp container env order matches a fixed sorted slice; without the fix it fails reliably within a few iterations. Same class of bug as stacklok#4783 (vMCP discovery) and stacklok#3450 (vMCP server ordering), on the MCPServer code path.
ChrisJBurns
left a comment
There was a problem hiding this comment.
Multi-Agent Consensus Review
Agents consulted: orchestrator only (single-agent review — 13-line fix didn't warrant parallel specialist review)
Consensus Summary
| # | Finding | Consensus | Severity | Action |
|---|---|---|---|---|
| 1 | PR title uses conventional-commit prefix | n/a | INFO | Optional |
Overall
Tight, correct fix. The root-cause analysis is precise: Go map iteration is randomized, env list ordering feeds the K8s pod template hash, and any reorder triggers a StatefulSet rollout. Sorting keys before iteration is the right remedy and matches prior fixes (#4783, #3450).
I verified the other map iteration in this file (containerLabels at client.go:1301) writes into a destination map and is correctly left alone. The regression test asserts against an explicit sorted slice — documenting the contract rather than just stability — and runs 11 iterations to cover Go's map randomization variance.
The only nit is the title: .claude/rules/pr-creation.md says PR titles should not use conventional-commit prefixes (fix(...)). Recent history is mixed and this isn't strictly enforced, so non-blocking.
Generated with Claude Code
|
@nalditopr Thanks for this! I approved but the linter is having issues because of the complexity of the method, you able to put the env var sorting into a separate function? |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #5064 +/- ##
==========================================
- Coverage 67.53% 67.53% -0.01%
==========================================
Files 601 601
Lines 61093 61097 +4
==========================================
- Hits 41262 41259 -3
- Misses 16714 16719 +5
- Partials 3117 3119 +2 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Fixes #5063.
Summary
DeployWorkloadbuilt the StatefulSet pod's env list by iterating a Gomap[string]stringdirectly. Go map iteration order is randomized per spec, so the resulting[]EnvVarorder shifted between calls even when contents were identical. Kubernetes folds env list order into the pod template hash, so each reorder produced a "different" template and triggered a no-op StatefulSet rollout.In a real deployment this surfaced as repeated MCPServer pod recreations after the operator pod was rescheduled, with a nasty secondary effect: during the bridge recreation window, the proxy runner's
/healthpinger blocks waiting for its 5 s HTTP timeout, racing the kubelet liveness probe (also 5 s by default) — the kubelet wins, kills the proxy, the operator reconciles, the env map is iterated again with yet another order, and the cycle repeats.Fix
Sort the keys before iterating so the env list is stable across reconciles, regardless of map iteration order:
Same class of fix as #4783 (vMCP discovery) and #3450 (vMCP server ordering), on the
MCPServercode path that goes throughpkg/container/kubernetes/client.go.Regression test
Added
TestDeployWorkloadEnvVarOrderingDeterministicwhich:DeployWorkload11 times with the same env mapI confirmed the test fails reliably within a few iterations against the unfixed code (Go's map iteration is randomized aggressively enough that ordering shifts well within 10 calls) and passes on every run with the fix.
Notes / risk
containerLabelsat line 1310) writes into a target map, so order is irrelevant — left as is.corev1.EnvVarsemantics don't depend on position.