Add StorageVersionMigrator controller to clean up storedVersions #5011

ChrisJBurns wants to merge 9 commits into `main`
Conversation
Prepares the operator manager for an upcoming controller that watches CustomResourceDefinition objects. No runtime behaviour change yet. Part of #4969. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the kubebuilder marker
+kubebuilder:metadata:labels=toolhive.stacklok.dev/auto-migrate-storage-version=true
to every v1beta1 root type and regenerates the CRD manifests. The label
is the opt-in signal for an upcoming StorageVersionMigrator controller
that will reconcile each CRD's status.storedVersions down to the current
storage version. List types are intentionally not labelled — the label
applies to the CRD itself, which is keyed on the root type.
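As an illustration (the struct body is elided and the type shown is just one of the labelled roots), the new marker sits alongside the existing root markers on each v1beta1 type:

```go
// +kubebuilder:object:root=true
// +kubebuilder:storageversion
// +kubebuilder:metadata:labels=toolhive.stacklok.dev/auto-migrate-storage-version=true
type MCPServer struct {
	// ...
}
```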
Part of #4969.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Introduces a new operator controller that watches opted-in toolhive.stacklok.dev CRDs and reconciles each one's status.storedVersions down to the current storage version, so a future release can drop deprecated versions (e.g. v1alpha1) from spec.versions without orphaning etcd objects.

Mechanism: for each CR of a managed kind, issue a metadata-only Server-Side Apply against the /status subresource with a dedicated field manager (thv-storage-version-migrator). SSA on /status bypasses the admission chain (ToolHive has validating webhooks on several root types) while still forcing the API server to re-encode each object at the current storage version. Once every CR succeeds, the CRD's status.storedVersions is patched to [<currentStorageVersion>] with an optimistic-lock merge.

Scoping: CRDs must carry both the auto-migrate label (set via kubebuilder marker in a previous commit) and a name under toolhive.stacklok.dev. Wired into main.go behind the ENABLE_STORAGE_VERSION_MIGRATOR feature flag (default on). Helm chart surface + envtest coverage land in subsequent commits.

Part of #4969.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Seven table-driven scenarios exercising the reconciler against a real
kube-apiserver + etcd via envtest:
- noop when storedVersions is already [storageVersion]
- happy path: [v1alpha1,v1beta1] -> [v1beta1] with CRs re-stored
- foreign-group CRDs are skipped
- unlabelled toolhive CRDs are skipped
- fallback to main-resource SSA when /status is absent
- pagination across the continue-token loop (verified via a counting
APIReader wrapper since metadata-only SSAs leave no managedFields
fingerprint to assert on)
- partial failure leaves storedVersions untouched (failure injected
via a client wrapper that fails Apply for one specific CR key)
Each test builds its own CRD fixture and calls Reconcile directly; the
suite does not start a manager so tests are deterministic and free from
races with a background controller. Runs via `task operator-test-integration`
alongside the existing virtualmcp suite.
Part of #4969.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Guards against the main footgun of the opt-in-label design for storage-version migration: if a new CRD root type is added to cmd/thv-operator/api/v1beta1/ without the migrate marker, the StorageVersionMigrator controller silently excludes it. The problem surfaces only when a future release tries to drop a deprecated version — at which point it is far too late to fix.

The test scans every root type (marker block contains both +kubebuilder:object:root=true and +kubebuilder:storageversion) and fails unless the block also contains either the migrate marker or an explicit +thv:storage-version-migrator:exclude sibling marker.

Part of #4969.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
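A stdlib-only sketch of the core check, assuming the test first splits each `_types.go` file into per-type marker blocks (splitting logic omitted; the marker strings are the ones from this PR, the function name is illustrative):

```go
package main

import "strings"

const (
	rootMarker    = "+kubebuilder:object:root=true"
	storageMarker = "+kubebuilder:storageversion"
	migrateMarker = "+kubebuilder:metadata:labels=toolhive.stacklok.dev/auto-migrate-storage-version=true"
	excludeMarker = "+thv:storage-version-migrator:exclude"
)

// markerBlockOK reports whether a root type's marker block either opts in to
// storage-version migration or explicitly excludes itself. Blocks that are
// not storage-version root types are vacuously fine.
func markerBlockOK(block string) bool {
	isRoot := strings.Contains(block, rootMarker) && strings.Contains(block, storageMarker)
	if !isRoot {
		return true
	}
	return strings.Contains(block, migrateMarker) || strings.Contains(block, excludeMarker)
}
```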
Adds operator.features.storageVersionMigrator (default true) plus the ENABLE_STORAGE_VERSION_MIGRATOR env var wiring in the deployment template, mirroring the existing server/registry/virtualMCP pattern. Helm docs regenerated. Admins who run kube-storage-version-migrator externally can disable the in-operator controller by setting this flag to false. Part of #4969. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
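For the chart side, a hedged sketch of the values surface (the deployment template then maps it to the env var following the existing feature-flag pattern; exact template structure may differ from the chart's):

```yaml
# deploy/charts/operator/values.yaml (excerpt)
operator:
  features:
    storageVersionMigrator: true   # set false to run kube-storage-version-migrator externally
```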
Covers what the controller does, the opt-in label contract, how to disable the controller operator-wide, the per-CRD escape hatch and its interaction with GitOps, how the migrator interacts with future version-removal releases, and the required RBAC. Part of #4969. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jhrozek left a comment:
A few things I'd like to flag. The two bigger ones are about the SSA body being empty and about swallowed conflicts before we trim storedVersions, curious whether those are intentional. The rest are smaller.
```go
patch.SetUID(original.GetUID())                         // UID mismatch → typed conflict on races
patch.SetResourceVersion(original.GetResourceVersion()) // RV mismatch → typed conflict on races

applyConfig := client.ApplyConfigurationFromUnstructured(patch)
```
The SSA body here only carries name/namespace/uid/resourceVersion, no fields inside spec or status. On /status SSA that's effectively an empty apply, which the apiserver typically short-circuits without rewriting etcd. If that's what's happening then we trim storedVersions while the underlying bytes haven't actually been re-encoded, which is the exact footgun this controller is meant to prevent.
For reference, kube-storage-version-migrator does an Update with the full re-decoded object precisely to force the storage write. Another option is to include a single managed field here (e.g. an annotation like toolhive.stacklok.dev/last-restored-at) so the apply is non-empty.
Worth checking: is the metadata-only SSA actually forcing a re-encode in practice? The existing envtest asserts storedVersions converges but can't distinguish "bytes rewritten" from "storedVersions lied".
Empirically confirmed — wrote a probe that runs the exact metadata-only /status SSA against envtest and snapshots resourceVersion before/after. Result: rvBefore=201 rvAfter=201 changed=false. The apiserver short-circuits the empty apply, etcd is never re-encoded, and the controller's correctness claim was hollow.
Fixed in dad30c9 by switching to Get + Update on the main resource, mutating the toolhive.stacklok.dev/storage-version-migrated-at annotation to force a real diff. This matches the upstream kube-storage-version-migrator pattern. Plain Status().Update(ctx, live) was tried first and is also insufficient — Update with no diff is elided too, so the annotation toggle is load-bearing.
Trade-off: the Update goes through the main resource so admission webhooks see the write. Acknowledged in the controller docstring and docs/operator/storage-version-migration.md. Webhooks that need to ignore migration-only writes can branch on the annotation in the diff. The original /status webhook-bypass benefit was always notional since the SSA wasn't writing etcd at all.
The HappyPath envtest now snapshots each CR's resourceVersion before reconcile and asserts it has bumped — empirical proof the bug is fixed.
```go
if err := r.restoreOne(ctx, gvk, u, hasStatusSubresource); err != nil {
	if apierrors.IsNotFound(err) || apierrors.IsConflict(err) {
		logger.V(1).Info("skip CR — deleted or updated elsewhere",
			"object", client.ObjectKeyFromObject(u), "err", err)
```
IsNotFound is fine to swallow, but I'm not sure IsConflict is. The comment says "storage is already fresh", but a concurrent writer using the deprecated API version (still served during the deprecation window) produces a write encoded at the old version. If our SSA races with that writer and comes back Conflict, the CR is still stored at the old version, we silently skip it, and then patchStoredVersions trims storedVersions to [storageVersion]. That leaves storedVersions claiming we're clean while there's still an old-version-encoded object in etcd.
Options: retry the apply on Conflict (bounded), or track a "swallowed-conflict" flag and refuse to trim storedVersions on the pass where any occurred. Either way I don't think we should treat Conflict the same as NotFound here.
Agreed — fixed in dad30c9. restoreCRs now tracks per-pass conflict count in a function-local counter. IsConflict is logged at V(1) and skipped per-CR (still treated as transient), but if any conflicts occurred and no other errors did, the function returns a sentinel errMigrationRetriedDueToConflicts. Reconcile propagates it without calling patchStoredVersions, so storedVersions stays at [v1alpha1, v1beta1] until a conflict-free pass succeeds.
Added a new test ConflictDuringRestoreLeavesStoredVersionsUntouched that injects an apierrors.NewConflict(...) for one CR via a wrapping client, asserts the sentinel error is returned, asserts storedVersions is untouched, then drops the injection and verifies the next reconcile completes cleanly.
IsNotFound is still a clean swallow with no error — the object is gone, it can't be at the old version anymore.
```go
	}
}

func (c *migrationCache) has(crdName string, uid apitypes.UID, resourceVersion string) bool {
```
The cache only evicts expired entries when has() is called with the same key. Once a CR is deleted from the cluster its UID never comes up in a list again, so its entry stays resident until the operator restarts. For an operator that runs for weeks with CR churn (ephemeral MCPServers, GitOps recreations) the map grows unbounded, each entry is small but never released.
Is this an accepted tradeoff given the 1h TTL and typical restart cadence? If so, happy to drop this. Otherwise a background sweep via mgr.Add(RunnableFunc(...)) or an expirable LRU would bound it.
Right, the lazy eviction never frees entries for deleted CRs. Fixed in dad30c9 by adding a periodic background sweep: a manager.RunnableFunc registered in SetupWithManager ticks every 10m (configurable via the new CacheGCInterval field) and calls cache.gc() to evict expired entries unconditionally. Ticker is Stop()'ed via defer; the runnable returns on ctx.Done() so it shuts down cleanly with the manager.
Went with hand-rolled GC over k8s.io/utils/lru to keep the public API surface small. Easy to swap later if we want capacity-bounding too.
```go
		servedCount++
	}
}
return servedCount > 1
```
I think this branch produces the opposite of what the doc comment describes. When storedVersions is already [storageVersion] and more than one version is still served (the common long-lived state during a multi-release deprecation window), this returns true, which forces a full list pass on every CRD event. The migrationCache suppresses the per-CR SSAs but the list call itself still fires, and a restart clears the cache.
If we want to re-scan periodically during deprecation as a defensive measure, worth stating so and acknowledging the cost. If not, `return len(stored) != 1 || stored[0] != storageVersion` on its own should be sufficient; new writes at the old version would append to storedVersions and re-trigger migration via the first branch.
Good catch — you're right that under None-strategy with identical schemas, normal writers cannot reintroduce stale versions to storedVersions, so the defensive re-scan has no scenario to defend against. Simplified in dad30c9 to your suggested form: `return len(stored) != 1 || stored[0] != storageVersion`. Doc comment updated to be honest about the (lack of) defensive value.
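The simplified check is small enough to state as a pure function (a sketch; the real `isMigrationNeeded` is a method on the reconciler that reads the CRD object):

```go
package main

// migrationNeeded reports whether a CRD's status.storedVersions still needs
// trimming: anything other than exactly [storageVersion] triggers a pass.
func migrationNeeded(stored []string, storageVersion string) bool {
	return len(stored) != 1 || stored[0] != storageVersion
}
```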
```go
//+kubebuilder:rbac:groups=apiextensions.k8s.io,resources=customresourcedefinitions,verbs=get;list;watch
//+kubebuilder:rbac:groups=apiextensions.k8s.io,resources=customresourcedefinitions/status,verbs=update;patch
//+kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=*,verbs=get;list;patch
```
Minor, just wanted to surface this for sign-off. The wildcard grants get/list/patch across every current and future kind in the toolhive group, not just the opted-in ones. I get that the opted-in set isn't known at codegen time, so the alternative (explicit per-kind rules) is awkward. Probably fine, but worth a brief comment on the marker noting the intentional wildcard plus the isManagedCRD gate as defense-in-depth.
Added the comment in dad30c9 — explicit note above the kubebuilder marker block explaining the wildcard is intentional (opt-in set isn't known at codegen time), with isManagedCRD documented as the runtime gate forming defence in depth.
While I was there: the verbs themselves changed too — patch → update for both the main resource and /status to match the new write path (Update + annotation toggle, see your other comment). RBAC regenerated via task operator-manifests.
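The resulting marker block might read roughly as follows (a sketch of the fixed markers; the `update` verbs reflect the revised Get+Update write path described above, and the comment wording is illustrative):

```go
// The toolhive.stacklok.dev wildcard is intentional: the opted-in set of kinds
// is not known at codegen time. isManagedCRD gates at runtime as defence in depth.
//+kubebuilder:rbac:groups=apiextensions.k8s.io,resources=customresourcedefinitions,verbs=get;list;watch
//+kubebuilder:rbac:groups=apiextensions.k8s.io,resources=customresourcedefinitions/status,verbs=update;patch
//+kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=*,verbs=get;list;update
//+kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=*/status,verbs=update
```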
```go
// Per-kind escape hatch: remove the label from the CRD (emergency only — will
// be re-applied by GitOps / helm upgrade).
type StorageVersionMigratorReconciler struct {
	client.Client // cached reads for CRs (unused in this reconciler, kept for kubebuilder convention)
```
Nit: the comment says "unused in this reconciler, kept for kubebuilder convention", but the field is used on lines 280 (r.Client.Status().Apply), 285 (r.Apply), and 301 (r.Client.Status().Patch). A maintainer reading the comment might assume they can drop it. Maybe change to something like "used for SSA writes and status patch; reads go through APIReader".
Yep, embarrassing comment drift — the field is used at three call sites. Fixed in dad30c9:
```go
client.Client // used for /status writes (CR re-stores, CRD storedVersions patch); reads go through APIReader
```

(Slightly tweaked from your suggestion to mention CRD storedVersions, since the reconciler also uses Client.Status().Patch on the CRD itself.)
```go
var _ = AfterSuite(func() {
	By("tearing down envtest")
	cancel()
	time.Sleep(100 * time.Millisecond)
```
The suite doesn't start a controller manager per the docstring, so I'm not sure what this sleep is waiting for. cancel() doesn't block and there are no background goroutines that need to drain. If it turns out envtest.Stop() flakes without it, there's usually a proper fix (draining a specific channel) rather than a fixed delay. Otherwise it can probably just go.
Right, dead code copy-pasted from virtualmcp/suite_test.go (which actually has a manager goroutine). Removed in dad30c9 along with the now-unused time import.
…n toggle

Empirical verification (probe test against envtest) confirmed the reviewer's hypothesis: metadata-only Server-Side Apply on /status is short-circuited by the apiserver — resourceVersion does not bump and etcd is not re-encoded. The "force a re-encode" claim the original implementation rested on was hollow.

This commit replaces the SSA path with the kube-storage-version-migrator pattern: Get the live object, toggle a timestamp annotation to force a real diff, then Update.

Trade-off acknowledged in the controller docstring and user docs: the Update goes through the main resource, so admission webhooks see it. Webhooks that need to ignore migration-only writes can branch on the presence of toolhive.stacklok.dev/storage-version-migrated-at in the diff. The original /status webhook-bypass benefit was always notional — the SSA wasn't writing etcd at all.

Other fixes from the same review pass:
- Track per-pass IsConflict count in restoreCRs; if any conflicts were swallowed and no other errors occurred, return a sentinel error so Reconcile leaves storedVersions untouched (the post-conflict state of the affected object is unverified). Next reconcile retries cleanly. New ConflictDuringRestoreLeavesStoredVersionsUntouched test covers this.
- Bound migrationCache memory: register a manager.RunnableFunc that walks the cache every 10m and evicts expired entries. Prior code only evicted lazily on key re-query, which leaked entries for deleted CRs.
- Simplify isMigrationNeeded to a single line. The servedCount > 1 branch was forcing list passes during deprecation windows for no actual benefit (None conversion + identical schemas means writers cannot reintroduce stale storedVersions).
- Document the wildcard CR RBAC as defence-in-depth alongside the isManagedCRD runtime gate.
- Fix the stale "(unused in this reconciler)" comment on the embedded Client field — it is used for CR Updates and the CRD status patch.
- Drop a 100ms time.Sleep in AfterSuite that was dead code copied from virtualmcp's suite (which has a manager goroutine to drain; we don't).

HappyPath now snapshots each CR's resourceVersion before reconcile and asserts it has bumped after — empirical proof the underlying re-encode actually happened.

All 8 envtest cases pass; lint clean.

Refs PR #5011 review by @jhrozek.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The annotation toggle was a workaround for a misdiagnosed problem.
Earlier in this branch we observed empty SSAs being elided and
generalised it to "all unmodified writes elide" — and added a
toolhive.stacklok.dev/storage-version-migrated-at annotation toggle
to force a real diff per write. A second-opinion investigation
pointed out the elision check happens AFTER storage encoding: the
apiserver re-encodes the request body at the current storage version,
then byte-compares against etcd. For the actual migration scenario
(CR stamped at an older apiVersion in etcd, storage version now newer)
the encoded apiVersion differs, the comparison fails, and the write
proceeds. Verified empirically: a CR stored at v1alpha1 with storage
flipped to v1beta1 has plain Get+Update bump RV from 202 to 204; a CR
already at v1beta1 has plain Get+Update elide cleanly. Both behaviours
are correct — the elision is exactly the no-migration-needed case.
Changes:
- restoreOne: plain Get + Update of the unmodified live object. No
annotation, no timestamp, no MigrationTimestampAnnotation constant.
- Controller docstring rewritten to describe the actual mechanism
(encode-then-compare elision) and cite ksvm#65 for posterity.
- HappyPath envtest loosened: per-CR resourceVersion bump assertions
are dropped (already-clean CRs correctly elide; checking RV bumps
on them was an artefact of the annotation-induced writes). The
storedVersions convergence assertion remains as the contract test.
- New cross-version envtest "re-encodes CRs that are stored at a
prior storage version" exercises the real migration scenario:
install CRD with v1alpha1 storage, create CR, flip storage to
v1beta1, run reconciler, assert storedVersions trims and the CR's
RV bumped (proving the apiserver actually rewrote etcd).
- User docs updated to drop annotation references and describe the
real mechanism. Reference implementation citation switched from
cluster-api/crdmigrator to kube-storage-version-migrator (the
actual upstream SIG tool we share semantics with).
- FallsBackWhenNoStatusSubresource test removed: the SSA path it
exercised no longer exists. Plain Update on a CRD without /status
is just a normal Update; covered by the happy-path spec.
8/8 envtest cases pass. Lint clean. No RBAC drift (verbs already
correct).
Refs PR #5011 review by @jhrozek.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Design correction — withdrawing the "upstream is broken" claim

I owe a correction on the design pivot. The corrected understanding (verified against ksvm#65 and confirmed empirically against our envtest): the apiserver's no-op elision check happens *after* storage encoding. The request body is re-encoded at the current storage version and byte-compared against etcd, so a plain Get + Update of an object stored at an older apiVersion does produce a real write; only an object already at the current storage version is elided — which is exactly the no-migration-needed case.

My earlier probes only tested the second case (CR already at storage version), saw the elision, and overgeneralised. Practical consequence for this PR: the annotation toggle (toolhive.stacklok.dev/storage-version-migrated-at) was a workaround for a misdiagnosed problem and has been removed. Apologies to the upstream maintainers of kube-storage-version-migrator for the earlier mischaracterisation.
Summary
Adds a new `StorageVersionMigrator` controller to the ToolHive operator that watches opted-in `toolhive.stacklok.dev` CRDs and reconciles each one's `status.storedVersions` down to the current storage version, so a future release can drop deprecated API versions (e.g. `v1alpha1`) from `spec.versions` without orphaning etcd objects.

Closes #4969
Medium level
- `cmd/thv-operator/controllers/storageversionmigrator_controller.go` — the new controller. Watches `apiextensions.k8s.io/v1/CustomResourceDefinition` (metadata-only), filtered by opt-in label + group. Per-CRD: reads `spec.versions` to find the storage version, lists every CR via paginated unstructured list, then for each CR does a plain `Get` + `Update` of the unmodified live object. The apiserver re-encodes the request body at the current storage version and byte-compares against etcd; when the CR was stored at a different `apiVersion` (the migration scenario) the comparison fails and the write proceeds, re-encoding the object. When the CR is already at the current storage version, the comparison matches and the apiserver harmlessly elides the write — there was nothing to migrate. This matches upstream `kube-storage-version-migrator`'s mechanism. Once every CR has been processed, patches `status.storedVersions` via optimistic-lock merge. A 1h migration cache (with periodic background GC) dedupes Get+Update RPCs across reconciles.
- `+kubebuilder:metadata:labels=toolhive.stacklok.dev/auto-migrate-storage-version=true` marker, baked into the generated CRD YAML by `task operator-manifests`. New CRDs opt in by adding the same marker; CRDs that shouldn't be migrated omit it. A CI test (`TestV1beta1TypesMarkerCoverage`) fails the build if a root type is added without either this marker or an explicit `+thv:storage-version-migrator:exclude` sibling.
- `operator.features.storageVersionMigrator` in the helm chart (default `true`) wires to `ENABLE_STORAGE_VERSION_MIGRATOR` following the existing `ENABLE_SERVER`/`ENABLE_REGISTRY`/`ENABLE_VMCP` pattern. Admins who prefer to run `kube-storage-version-migrator` externally can disable it.
- If any CR write comes back `IsConflict` (concurrent writer), the controller refuses to trim `storedVersions` on that pass and returns a sentinel error so the next reconcile retries cleanly. The post-conflict state of the affected object is unverified, so trimming would be reasoning from an unproven premise.
- `cmd/thv-operator/test-integration/storageversionmigrator/` with 8 scenarios: noop-when-clean, happy-path-trims-storedVersions, re-encodes CRs that are stored at a prior storage version (the load-bearing test that proves the migration mechanism actually rewrites etcd — installs a CRD with `v1alpha1` storage, creates a CR, flips storage to `v1beta1`, asserts the CR's `resourceVersion` bumped after reconcile), partial-failure leaves storedVersions untouched, conflict-during-restore leaves storedVersions untouched, skips-foreign-CRDs, skips-unlabelled-CRDs, pagination (continue-token loop verified via a counting APIReader wrapper). Each test installs its own CRD fixture and calls Reconcile directly — no background manager, no timing races.
- `docs/operator/storage-version-migration.md` covers the actual write mechanism, webhook interaction, the opt-in label contract, feature-flag opt-out, the admin per-CRD escape hatch, and coordination with the future `v1alpha1`-removal release.

Low level
- `cmd/thv-operator/controllers/storageversionmigrator_controller.go` — the new controller
- `cmd/thv-operator/controllers/marker_coverage_test.go` — scans `api/v1beta1/*_types.go` and fails if a root type lacks either the migrate marker or an explicit exclude marker
- `cmd/thv-operator/test-integration/storageversionmigrator/suite_test.go`
- `cmd/thv-operator/test-integration/storageversionmigrator/controller_test.go`
- `cmd/thv-operator/main.go` — register the `apiextensions/v1` scheme, add the `featureStorageVersionMigrator` const, `setupStorageVersionMigrator()` wiring
- `cmd/thv-operator/api/v1beta1/*_types.go` (12 files) — `+kubebuilder:metadata:labels=...` marker on each root type
- `deploy/charts/operator-crds/files/crds/*.yaml` + `templates/*.yaml` (12 × 2) — regenerated `metadata.labels`
- `deploy/charts/operator/templates/clusterrole/role.yaml` — `customresourcedefinitions` get/list/watch + `/status` update;patch + `toolhive.stacklok.dev/*` get/list/update + `*/status` update
- `deploy/charts/operator/values.yaml` — `operator.features.storageVersionMigrator: true`
- `deploy/charts/operator/templates/deployment.yaml` — `ENABLE_STORAGE_VERSION_MIGRATOR` env var
- `deploy/charts/operator/README.md` — regenerated via `task helm-docs`
- `docs/operator/storage-version-migration.md` — new user docs

Type of change
Test plan
- `task lint-fix` — 0 issues
- `task build` — clean
- unit tests (`go test $(go list ./cmd/thv-operator/... | grep -v /test-integration)`) — all green
- envtest suite via the `task operator-test-integration` path, or directly with `KUBEBUILDER_ASSETS=$(setup-envtest use 1.31.0 -p path) go test ./cmd/thv-operator/test-integration/storageversionmigrator/`
- `task operator-manifests` + `task helm-docs` produce no drift against committed state
- `helm template deploy/charts/operator` renders the new env var correctly
The write mechanism in this PR has been through three iterations during review — final state described below; the journey is captured in the inline comment threads.
The current implementation is a plain `Get` + `Update` of the unmodified CR — exactly what upstream `kube-storage-version-migrator` does. The Kubernetes apiserver's storage layer re-encodes the request body at the current storage version, then byte-compares against etcd. When the CR was originally stored at an older `apiVersion` (the actual migration scenario), the encoded apiVersion stamp differs from etcd's, the comparison fails, and the write proceeds — re-encoding the object. When the CR is already at the current storage version, the bytes match and the apiserver elides the write harmlessly. This is the upstream maintainers' explicit reasoning, captured at kube-storage-version-migrator#65.

A new envtest case (`re-encodes CRs that are stored at a prior storage version`) proves the migration mechanism empirically: install a CRD with `v1alpha1` as storage, create a CR (etcd writes `apiVersion: v1alpha1` bytes), flip storage to `v1beta1`, run the reconciler, and assert the CR's `resourceVersion` bumped. Observed bump 202→204; the elision sanity check on a same-storage-version CR shows the RV stays at 204, as expected.
api/v1beta1/*_types.go; the substantive Go work is 3 files (controller, main.go wiring, helm values/deployment). I did not split because the controller + opt-in labels + marker-coverage CI test are one cohesive feature — splitting would mean landing unused labels or an unused controller mid-stack.Why this PR must ship at least one release before
v1alpha1is removed: the migrator needs time to walk every cluster'sstoredVersionsduring the deprecation window. Shipping it in the same release as the version removal provides zero benefit because clusters jumping straight to that release won't have run the migrator yet. Documented indocs/operator/storage-version-migration.md.Follow-ups not in scope for this PR:
🤖 Generated with Claude Code