Skip to content

Preserve runconfig-checksum on pod template overrides#5149

Open
jhrozek wants to merge 2 commits intomainfrom
worktree-operator-annotations-fix
Open

Preserve runconfig-checksum on pod template overrides#5149
jhrozek wants to merge 2 commits intomainfrom
worktree-operator-annotations-fix

Conversation

@jhrozek
Copy link
Copy Markdown
Contributor

@jhrozek jhrozek commented May 1, 2026

Summary

When a user populates Spec.ResourceOverrides.ProxyDeployment.PodTemplateMetadataOverrides.Annotations on an MCPServer (e.g. for Vault Agent injection), deploymentForMCPServer was passing the deployment-level annotations map (typically empty) instead of the pod-template annotations map (which already carried the runconfig-checksum) as the precedence base when merging in the user overrides. Because ctrlutil.MergeAnnotations is first-arg-wins, the checksum was silently dropped from the pod template, breaking the rolling-restart-on-RunConfig-change mechanism for any setup using pod-template overrides. The same asymmetry caused the drift-checker to disagree with the constructor on every reconcile, leaving the operator in a perpetual r.Update loop.

This change flips one variable so the merge starts from the pod-template annotations map. The MCPRemoteProxy sibling controller already merged correctly; this aligns MCPServer with it and with its own drift-checker. Three layers of regression tests are added so the bug class cannot return.

Fixes #5148

Type of change

  • Bug fix

Test plan

  • Unit tests (task test) — cmd/thv-operator/controllers is green, including the new tests. There is one pre-existing, unrelated failure in pkg/workloads/manager_test.go (TestDefaultManager_ListWorkloadsInGroup/workloads_with_empty_group_names) that reproduces with this PR's diff reverted, so it is not introduced here.
  • Linting (task lint-fix) — 0 issues.

API Compatibility

  • This PR does not break the v1beta1 API.

Changes

File Change
cmd/thv-operator/controllers/mcpserver_controller.go One-token fix in deploymentForMCPServer: pass deploymentTemplateAnnotations (which already carries the runconfig-checksum) instead of deploymentAnnotations (deployment-level, typically empty) as the precedence base for MergeAnnotations.
cmd/thv-operator/controllers/mcpserver_resource_overrides_test.go Add expectedPodTemplateAnns column to the table-driven test; add TestDeploymentForMCPServer_PodTemplateOverridesPreserveRunConfigChecksum (focused regression) and TestDeploymentNeedsUpdate_StableAfterBuildWithPodTemplateOverrides (constructor/comparator parity guardrail).

Does this introduce a user-facing change?

Yes — bug fix:

  • Users with PodTemplateMetadataOverrides.Annotations set will now correctly see proxy pods roll on RunConfig changes (telemetry, inline OIDC, external auth refs). Previously these edits silently failed to reach the running proxy.
  • Deployment-level annotations supplied via ProxyDeployment.Annotations no longer leak into the pod template. No documentation, example, or test in this repository placed sidecar-injection annotations (Vault Agent, Linkerd, Istio) at the deployment level, so no documented user pattern is regressed. Users who need the same key on both levels can set both fields independently.

Implementation plan

Approved implementation plan (AI-assisted)

Plan summary:

  1. One-token source fix in deploymentForMCPServer to align with the existing comparator and the MCPRemoteProxy sibling pattern.
  2. Extend the table-driven test so every case asserts on deployment.Spec.Template.Annotations. Catches missing-checksum and deployment-level-leak regressions in one place.
  3. Focused regression test named after the contract it pins (checksum survives, user override survives, no extras).
  4. Constructor/comparator parity test that calls deploymentForMCPServer and immediately calls deploymentNeedsUpdate with the same checksum, asserting no drift. Long-term guardrail against the asymmetry that caused the perpetual r.Update loop.

Each step was implemented separately and reviewed in parallel by go-expert and kubernetes-go-expert agents (Opus). Reviewers verified:

  • The pre-fix code had a real, demonstrable bug (verified by tracing MergeAnnotations first-arg-wins semantics through the constructor and the comparator).
  • Post-fix behavior aligns with the existing comparator, the MCPRemoteProxy sibling, and Kubernetes convention (Deployment-level vs pod-template annotations are intentionally separate scopes).
  • The Vault integration is unaffected: examples/operator/vault/mcpserver-github-with-vault.yaml and docs/arch/04-secrets-management.md both place / recommend Vault annotations under podTemplateMetadataOverrides.annotations — the field unaffected by the precedence-base swap.

The four implementation commits were squashed for merge.

Special notes for reviewers

The fix is one token; the bulk of the diff is regression coverage. The drift-checker deploymentNeedsUpdate (~line 1797 in the same file) was already merging in the correct order — the constructor was the outlier. The MCPRemoteProxyReconciler.buildPodTemplateMetadata sibling at cmd/thv-operator/controllers/mcpremoteproxy_deployment.go:328 is the existing correct model.

Optional follow-up out of scope for this PR: ProxyDeploymentOverrides.PodTemplateMetadataOverrides in cmd/thv-operator/api/v1beta1/mcpserver_types.go:373 has no godoc comment (the equivalent in embeddingserver_types.go:155 does). Adding a comment that explicitly directs users to put sidecar-injection annotations here, not in the embedded Annotations field, would close a documentation gap surfaced by the analysis.

Generated with Claude Code

deploymentForMCPServer merged user-supplied
PodTemplateMetadataOverrides.Annotations onto the wrong base
map: it passed deploymentAnnotations (the Deployment-level
overrides, typically empty) instead of
deploymentTemplateAnnotations (which already had the
runconfig-checksum stamped on it). Because
ctrlutil.MergeAnnotations is "first-arg-wins on conflict", the
checksum was silently dropped from the pod template and any
Deployment-level annotations leaked onto pods.

Two user-visible symptoms followed:

1. Proxy pods stopped rolling on RunConfig changes whenever
   PodTemplateMetadataOverrides.Annotations was set. Without
   the checksum on the pod template, Kubernetes had no signal
   to recreate pods, so telemetry / inline-OIDC / external-
   auth-ref edits landed in the ConfigMap on disk but never
   reached the running proxy. Vault Agent users were the
   canonical population affected, since vault.hashicorp.com/*
   keys live in this override field.

2. The drift-checker deploymentNeedsUpdate built the expected
   pod-template annotations correctly, so it disagreed with
   the constructor on every reconcile. The operator hot-looped
   on r.Update without converging.

The fix flips one variable so the merge starts from
deploymentTemplateAnnotations. Add three layers of test
coverage in mcpserver_resource_overrides_test.go:

  - an expectedPodTemplateAnns column on the table-driven
    resource-overrides test so every case asserts on
    deployment.Spec.Template.Annotations
  - a focused regression test pinning the no-leakage contract
    (checksum survives, user override survives, no extras)
  - a parity test asserting deploymentNeedsUpdate reports no
    drift immediately after deploymentForMCPServer with the
    same checksum and overrides

Why this is safe:

  - The comparator deploymentNeedsUpdate in the same file was
    already merging in this order, and the sibling controller
    MCPRemoteProxyReconciler.buildPodTemplateMetadata uses
    this exact pattern. The fix brings the buggy constructor
    in line with both.
  - Kubernetes convention treats Deployment-level and
    pod-template annotations as separate scopes; the CRD
    exposes them as separate user-facing fields
    (ProxyDeployment.Annotations vs
    ProxyDeployment.PodTemplateMetadataOverrides.Annotations).
    No documentation, example (including
    examples/operator/vault/mcpserver-github-with-vault.yaml),
    or test in this repository places sidecar-injection
    annotations at the deployment level, so no documented
    user is regressed.
  - MergeAnnotations keeps the first-argument map winning on
    conflict, so a user cannot accidentally overwrite the
    runconfig-checksum from their override map.

Fixes #5148
@github-actions github-actions Bot added the size/S Small PR: 100-299 lines changed label May 1, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented May 1, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 67.58%. Comparing base (8c90184) to head (2ed1649).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5149      +/-   ##
==========================================
+ Coverage   67.53%   67.58%   +0.04%     
==========================================
  Files         601      601              
  Lines       61093    61093              
==========================================
+ Hits        41262    41288      +26     
+ Misses      16714    16686      -28     
- Partials     3117     3119       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@github-actions github-actions Bot added size/S Small PR: 100-299 lines changed and removed size/S Small PR: 100-299 lines changed labels May 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/S Small PR: 100-299 lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

MCPServer: runconfig-checksum dropped from proxy pod template when PodTemplateMetadataOverrides.Annotations is set

2 participants