Skip to content

fix(#5063): deterministically order env vars in DeployWorkload#5064

Open
nalditopr wants to merge 2 commits intostacklok:mainfrom
nalditopr:fix/mcpserver-env-deterministic-ordering
Open

fix(#5063): deterministically order env vars in DeployWorkload#5064
nalditopr wants to merge 2 commits intostacklok:mainfrom
nalditopr:fix/mcpserver-env-deterministic-ordering

Conversation

@nalditopr
Copy link
Copy Markdown

Fixes #5063.

Summary

DeployWorkload built the StatefulSet pod's env list by iterating a Go map[string]string directly. Go map iteration order is randomized per spec, so the resulting []EnvVar order shifted between calls even when contents were identical. Kubernetes folds env list order into the pod template hash, so each reorder produced a "different" template and triggered a no-op StatefulSet rollout.

In a real deployment this surfaced as repeated MCPServer pod recreations after the operator pod was rescheduled, with a nasty secondary effect: during the bridge recreation window, the proxy runner's /health pinger blocks waiting for its 5 s HTTP timeout, racing the kubelet liveness probe (also 5 s by default) — the kubelet wins, kills the proxy, the operator reconciles, the env map is iterated again with yet another order, and the cycle repeats.

Fix

Sort the keys before iterating so the env list is stable across reconciles, regardless of map iteration order:

envKeys := make([]string, 0, len(envVars))
for k := range envVars {
    envKeys = append(envKeys, k)
}
sort.Strings(envKeys)
envVarList := make([]*corev1apply.EnvVarApplyConfiguration, 0, len(envKeys))
for _, k := range envKeys {
    envVarList = append(envVarList, corev1apply.EnvVar().WithName(k).WithValue(envVars[k]))
}

Same class of fix as #4783 (vMCP discovery) and #3450 (vMCP server ordering), on the MCPServer code path that goes through pkg/container/kubernetes/client.go.

Regression test

Added TestDeployWorkloadEnvVarOrderingDeterministic which:

  • Calls DeployWorkload 11 times with the same env map
  • Reads back the StatefulSet's mcp container env from the fake clientset
  • Asserts the env list matches a fixed sorted slice on every call

I confirmed the test fails reliably within a few iterations against the unfixed code (Go's map iteration is randomized aggressively enough that ordering shifts well within 10 calls) and passes on every run with the fix.

Notes / risk

  • The other map iteration in this file (containerLabels at line 1310) writes into a target map, so order is irrelevant — left as is.
  • No behavior change other than ordering; corev1.EnvVar semantics don't depend on position.

DeployWorkload built the StatefulSet pod's env list by iterating a Go
map[string]string directly. Map iteration order is randomized by the
runtime spec, so the resulting []EnvVar ordering changed between calls
even when the contents were identical. Kubernetes hashes env list order
into the pod template hash, so each reorder triggered a no-op
StatefulSet rollout.

In a deployed MCPServer this surfaced as repeated pod recreations after
operator pod restarts, with secondary fallout: during the bridge
recreation window the proxy runner's /health pinger blocks for its 5s
timeout, racing the kubelet liveness probe (also 5s by default) and
killing the proxy container. Each restart triggered another reconcile
which re-iterated the env map and produced yet another order.

Sort the keys before iterating so the env list is stable. Add a
regression test that runs DeployWorkload 10+ times and asserts the
StatefulSet's mcp container env order matches a fixed sorted slice;
without the fix it fails reliably within a few iterations.

Same class of bug as stacklok#4783 (vMCP discovery) and stacklok#3450 (vMCP server
ordering), on the MCPServer code path.
Copy link
Copy Markdown
Collaborator

@ChrisJBurns ChrisJBurns left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multi-Agent Consensus Review

Agents consulted: orchestrator only (single-agent review — 13-line fix didn't warrant parallel specialist review)

Consensus Summary

# Finding Consensus Severity Action
1 PR title uses conventional-commit prefix n/a INFO Optional

Overall

Tight, correct fix. The root-cause analysis is precise: Go map iteration is randomized, env list ordering feeds the K8s pod template hash, and any reorder triggers a StatefulSet rollout. Sorting keys before iteration is the right remedy and matches prior fixes (#4783, #3450).

I verified the other map iteration in this file (containerLabels at client.go:1301) writes into a destination map and is correctly left alone. The regression test asserts against an explicit sorted slice — documenting the contract rather than just stability — and runs 11 iterations to cover Go's map randomization variance.

The only nit is the title: .claude/rules/pr-creation.md says PR titles should not use conventional-commit prefixes (fix(...)). Recent history is mixed and this isn't strictly enforced, so non-blocking.


Generated with Claude Code

@github-actions github-actions Bot added size/S Small PR: 100-299 lines changed and removed size/S Small PR: 100-299 lines changed labels May 3, 2026
@ChrisJBurns
Copy link
Copy Markdown
Collaborator

@nalditopr Thanks for this! I approved but the linter is having issues because of the complexity of the method, you able to put the env var sorting into a separate function?

@codecov
Copy link
Copy Markdown

codecov Bot commented May 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 67.53%. Comparing base (8c90184) to head (22875ee).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5064      +/-   ##
==========================================
- Coverage   67.53%   67.53%   -0.01%     
==========================================
  Files         601      601              
  Lines       61093    61097       +4     
==========================================
- Hits        41262    41259       -3     
- Misses      16714    16719       +5     
- Partials     3117     3119       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/S Small PR: 100-299 lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

MCPServer StatefulSet template flaps due to non-deterministic env var ordering

2 participants