Add StorageVersionMigrator controller to clean up storedVersions #5011

ChrisJBurns wants to merge 9 commits into `main`
Conversation
Prepares the operator manager for an upcoming controller that watches CustomResourceDefinition objects. No runtime behaviour change yet. Part of #4969. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the kubebuilder marker
+kubebuilder:metadata:labels=toolhive.stacklok.dev/auto-migrate-storage-version=true
to every v1beta1 root type and regenerates the CRD manifests. The label
is the opt-in signal for an upcoming StorageVersionMigrator controller
that will reconcile each CRD's status.storedVersions down to the current
storage version. List types are intentionally not labelled — the label
applies to the CRD itself, which is keyed on the root type.
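As an illustration (the struct body is elided and the type shown is just one of the labelled roots), the new marker sits alongside the existing root markers on each v1beta1 type:

```go
// +kubebuilder:object:root=true
// +kubebuilder:storageversion
// +kubebuilder:metadata:labels=toolhive.stacklok.dev/auto-migrate-storage-version=true
type MCPServer struct {
	// ...
}
```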
Part of #4969.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Introduces a new operator controller that watches opted-in toolhive.stacklok.dev CRDs and reconciles each one's status.storedVersions down to the current storage version, so a future release can drop deprecated versions (e.g. v1alpha1) from spec.versions without orphaning etcd objects.

Mechanism: for each CR of a managed kind, issue a metadata-only Server-Side Apply against the /status subresource with a dedicated field manager (thv-storage-version-migrator). SSA on /status bypasses the admission chain (ToolHive has validating webhooks on several root types) while still forcing the API server to re-encode each object at the current storage version. Once every CR succeeds, the CRD's status.storedVersions is patched to [<currentStorageVersion>] with an optimistic-lock merge.

Scoping: CRDs must carry both the auto-migrate label (set via kubebuilder marker in a previous commit) and a name under toolhive.stacklok.dev. Wired into main.go behind the ENABLE_STORAGE_VERSION_MIGRATOR feature flag (default on). Helm chart surface + envtest coverage land in subsequent commits.

Part of #4969.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Seven table-driven scenarios exercising the reconciler against a real
kube-apiserver + etcd via envtest:
- noop when storedVersions is already [storageVersion]
- happy path: [v1alpha1,v1beta1] -> [v1beta1] with CRs re-stored
- foreign-group CRDs are skipped
- unlabelled toolhive CRDs are skipped
- fallback to main-resource SSA when /status is absent
- pagination across the continue-token loop (verified via a counting
APIReader wrapper since metadata-only SSAs leave no managedFields
fingerprint to assert on)
- partial failure leaves storedVersions untouched (failure injected
via a client wrapper that fails Apply for one specific CR key)
Each test builds its own CRD fixture and calls Reconcile directly; the
suite does not start a manager so tests are deterministic and free from
races with a background controller. Runs via `task operator-test-integration`
alongside the existing virtualmcp suite.
Part of #4969.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Guards against the main footgun of the opt-in-label design for storage-version migration: if a new CRD root type is added to cmd/thv-operator/api/v1beta1/ without the migrate marker, the StorageVersionMigrator controller silently excludes it. The problem surfaces only when a future release tries to drop a deprecated version — at which point it is far too late to fix.

The test scans every root type (marker block contains both +kubebuilder:object:root=true and +kubebuilder:storageversion) and fails unless the block also contains either the migrate marker or an explicit +thv:storage-version-migrator:exclude sibling marker.

Part of #4969.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
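A stdlib-only sketch of the core check, assuming the test first splits each `_types.go` file into per-type marker blocks (splitting logic omitted; the marker strings are the ones from this PR, the function name is illustrative):

```go
package main

import "strings"

const (
	rootMarker    = "+kubebuilder:object:root=true"
	storageMarker = "+kubebuilder:storageversion"
	migrateMarker = "+kubebuilder:metadata:labels=toolhive.stacklok.dev/auto-migrate-storage-version=true"
	excludeMarker = "+thv:storage-version-migrator:exclude"
)

// markerBlockOK reports whether a root type's marker block either opts in to
// storage-version migration or explicitly excludes itself. Blocks that are
// not storage-version root types are vacuously fine.
func markerBlockOK(block string) bool {
	isRoot := strings.Contains(block, rootMarker) && strings.Contains(block, storageMarker)
	if !isRoot {
		return true
	}
	return strings.Contains(block, migrateMarker) || strings.Contains(block, excludeMarker)
}
```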
Adds operator.features.storageVersionMigrator (default true) plus the ENABLE_STORAGE_VERSION_MIGRATOR env var wiring in the deployment template, mirroring the existing server/registry/virtualMCP pattern. Helm docs regenerated. Admins who run kube-storage-version-migrator externally can disable the in-operator controller by setting this flag to false. Part of #4969. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
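For the chart side, a hedged sketch of the values surface (the deployment template then maps it to the env var following the existing feature-flag pattern; exact template structure may differ from the chart's):

```yaml
# deploy/charts/operator/values.yaml (excerpt)
operator:
  features:
    storageVersionMigrator: true   # set false to run kube-storage-version-migrator externally
```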
Covers what the controller does, the opt-in label contract, how to disable the controller operator-wide, the per-CRD escape hatch and its interaction with GitOps, how the migrator interacts with future version-removal releases, and the required RBAC. Part of #4969. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jhrozek left a comment:
A few things I'd like to flag. The two bigger ones are about the SSA body being empty and about swallowed conflicts before we trim storedVersions, curious whether those are intentional. The rest are smaller.
```go
patch.SetUID(original.GetUID())                         // UID mismatch → typed conflict on races
patch.SetResourceVersion(original.GetResourceVersion()) // RV mismatch → typed conflict on races

applyConfig := client.ApplyConfigurationFromUnstructured(patch)
```
The SSA body here only carries name/namespace/uid/resourceVersion, no fields inside spec or status. On /status SSA that's effectively an empty apply, which the apiserver typically short-circuits without rewriting etcd. If that's what's happening then we trim storedVersions while the underlying bytes haven't actually been re-encoded, which is the exact footgun this controller is meant to prevent.
For reference, kube-storage-version-migrator does an Update with the full re-decoded object precisely to force the storage write. Another option is to include a single managed field here (e.g. an annotation like toolhive.stacklok.dev/last-restored-at) so the apply is non-empty.
Worth checking: is the metadata-only SSA actually forcing a re-encode in practice? The existing envtest asserts storedVersions converges but can't distinguish "bytes rewritten" from "storedVersions lied".
Empirically confirmed — wrote a probe that runs the exact metadata-only /status SSA against envtest and snapshots resourceVersion before/after. Result: rvBefore=201 rvAfter=201 changed=false. The apiserver short-circuits the empty apply, etcd is never re-encoded, and the controller's correctness claim was hollow.
Fixed in dad30c9 by switching to Get + Update on the main resource, mutating the toolhive.stacklok.dev/storage-version-migrated-at annotation to force a real diff. This matches the upstream kube-storage-version-migrator pattern. Plain Status().Update(ctx, live) was tried first and is also insufficient — Update with no diff is elided too, so the annotation toggle is load-bearing.
Trade-off: the Update goes through the main resource so admission webhooks see the write. Acknowledged in the controller docstring and docs/operator/storage-version-migration.md. Webhooks that need to ignore migration-only writes can branch on the annotation in the diff. The original /status webhook-bypass benefit was always notional since the SSA wasn't writing etcd at all.
The HappyPath envtest now snapshots each CR's resourceVersion before reconcile and asserts it has bumped — empirical proof the bug is fixed.
```go
if err := r.restoreOne(ctx, gvk, u, hasStatusSubresource); err != nil {
	if apierrors.IsNotFound(err) || apierrors.IsConflict(err) {
		logger.V(1).Info("skip CR — deleted or updated elsewhere",
			"object", client.ObjectKeyFromObject(u), "err", err)
```
IsNotFound is fine to swallow, but I'm not sure IsConflict is. The comment says "storage is already fresh", but a concurrent writer using the deprecated API version (still served during the deprecation window) produces a write encoded at the old version. If our SSA races with that writer and comes back Conflict, the CR is still stored at the old version, we silently skip it, and then patchStoredVersions trims storedVersions to [storageVersion]. That leaves storedVersions claiming we're clean while there's still an old-version-encoded object in etcd.
Options: retry the apply on Conflict (bounded), or track a "swallowed-conflict" flag and refuse to trim storedVersions on the pass where any occurred. Either way I don't think we should treat Conflict the same as NotFound here.
Agreed — fixed in dad30c9. restoreCRs now tracks per-pass conflict count in a function-local counter. IsConflict is logged at V(1) and skipped per-CR (still treated as transient), but if any conflicts occurred and no other errors did, the function returns a sentinel errMigrationRetriedDueToConflicts. Reconcile propagates it without calling patchStoredVersions, so storedVersions stays at [v1alpha1, v1beta1] until a conflict-free pass succeeds.
Added a new test ConflictDuringRestoreLeavesStoredVersionsUntouched that injects an apierrors.NewConflict(...) for one CR via a wrapping client, asserts the sentinel error is returned, asserts storedVersions is untouched, then drops the injection and verifies the next reconcile completes cleanly.
IsNotFound is still a clean swallow with no error — the object is gone, it can't be at the old version anymore.
```go
	}
}

func (c *migrationCache) has(crdName string, uid apitypes.UID, resourceVersion string) bool {
```
The cache only evicts expired entries when has() is called with the same key. Once a CR is deleted from the cluster its UID never comes up in a list again, so its entry stays resident until the operator restarts. For an operator that runs for weeks with CR churn (ephemeral MCPServers, GitOps recreations) the map grows unbounded, each entry is small but never released.
Is this an accepted tradeoff given the 1h TTL and typical restart cadence? If so, happy to drop this. Otherwise a background sweep via mgr.Add(RunnableFunc(...)) or an expirable LRU would bound it.
Right, the lazy eviction never frees entries for deleted CRs. Fixed in dad30c9 by adding a periodic background sweep: a manager.RunnableFunc registered in SetupWithManager ticks every 10m (configurable via the new CacheGCInterval field) and calls cache.gc() to evict expired entries unconditionally. Ticker is Stop()'ed via defer; the runnable returns on ctx.Done() so it shuts down cleanly with the manager.
Went with hand-rolled GC over k8s.io/utils/lru to keep the public API surface small. Easy to swap later if we want capacity-bounding too.
```go
		servedCount++
	}
}
return servedCount > 1
```
I think this branch produces the opposite of what the doc comment describes. When storedVersions is already [storageVersion] and more than one version is still served (the common long-lived state during a multi-release deprecation window), this returns true, which forces a full list pass on every CRD event. The migrationCache suppresses the per-CR SSAs but the list call itself still fires, and a restart clears the cache.
If we want to re-scan periodically during deprecation as a defensive measure, worth stating so and acknowledging the cost. If not, `return len(stored) != 1 || stored[0] != storageVersion` on its own should be sufficient; new writes at the old version would append to storedVersions and re-trigger migration via the first branch.
Good catch — you're right that under None-strategy with identical schemas, normal writers cannot reintroduce stale versions to storedVersions, so the defensive re-scan has no scenario to defend against. Simplified in dad30c9 to your suggested form: `return len(stored) != 1 || stored[0] != storageVersion`. Doc comment updated to be honest about the (lack of) defensive value.
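The simplified check is small enough to state as a pure function (a sketch; the real `isMigrationNeeded` is a method on the reconciler that reads the CRD object):

```go
package main

// migrationNeeded reports whether a CRD's status.storedVersions still needs
// trimming: anything other than exactly [storageVersion] triggers a pass.
func migrationNeeded(stored []string, storageVersion string) bool {
	return len(stored) != 1 || stored[0] != storageVersion
}
```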
```go
//+kubebuilder:rbac:groups=apiextensions.k8s.io,resources=customresourcedefinitions,verbs=get;list;watch
//+kubebuilder:rbac:groups=apiextensions.k8s.io,resources=customresourcedefinitions/status,verbs=update;patch
//+kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=*,verbs=get;list;patch
```
Minor, just wanted to surface this for sign-off. The wildcard grants get/list/patch across every current and future kind in the toolhive group, not just the opted-in ones. I get that the opted-in set isn't known at codegen time, so the alternative (explicit per-kind rules) is awkward. Probably fine, but worth a brief comment on the marker noting the intentional wildcard plus the isManagedCRD gate as defense-in-depth.
Added the comment in dad30c9 — explicit note above the kubebuilder marker block explaining the wildcard is intentional (opt-in set isn't known at codegen time), with isManagedCRD documented as the runtime gate forming defence in depth.
While I was there: the verbs themselves changed too — patch → update for both the main resource and /status to match the new write path (Update + annotation toggle, see your other comment). RBAC regenerated via task operator-manifests.
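The resulting marker block might read roughly as follows (a sketch of the fixed markers; the `update` verbs reflect the revised Get+Update write path described above, and the comment wording is illustrative):

```go
// The toolhive.stacklok.dev wildcard is intentional: the opted-in set of kinds
// is not known at codegen time. isManagedCRD gates at runtime as defence in depth.
//+kubebuilder:rbac:groups=apiextensions.k8s.io,resources=customresourcedefinitions,verbs=get;list;watch
//+kubebuilder:rbac:groups=apiextensions.k8s.io,resources=customresourcedefinitions/status,verbs=update;patch
//+kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=*,verbs=get;list;update
//+kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=*/status,verbs=update
```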
```go
// Per-kind escape hatch: remove the label from the CRD (emergency only — will
// be re-applied by GitOps / helm upgrade).
type StorageVersionMigratorReconciler struct {
	client.Client // cached reads for CRs (unused in this reconciler, kept for kubebuilder convention)
```
Nit: the comment says "unused in this reconciler, kept for kubebuilder convention", but the field is used on lines 280 (r.Client.Status().Apply), 285 (r.Apply), and 301 (r.Client.Status().Patch). A maintainer reading the comment might assume they can drop it. Maybe change to something like "used for SSA writes and status patch; reads go through APIReader".
Yep, embarrassing comment drift — the field is used at three call sites. Fixed in dad30c9:
```go
client.Client // used for /status writes (CR re-stores, CRD storedVersions patch); reads go through APIReader
```

(Slightly tweaked from your suggestion to mention CRD storedVersions, since the reconciler also uses Client.Status().Patch on the CRD itself.)
```go
var _ = AfterSuite(func() {
	By("tearing down envtest")
	cancel()
	time.Sleep(100 * time.Millisecond)
```
The suite doesn't start a controller manager per the docstring, so I'm not sure what this sleep is waiting for. cancel() doesn't block and there are no background goroutines that need to drain. If it turns out envtest.Stop() flakes without it, there's usually a proper fix (draining a specific channel) rather than a fixed delay. Otherwise it can probably just go.
Right, dead code copy-pasted from virtualmcp/suite_test.go (which actually has a manager goroutine). Removed in dad30c9 along with the now-unused time import.
…n toggle

Empirical verification (probe test against envtest) confirmed the reviewer's hypothesis: metadata-only Server-Side Apply on /status is short-circuited by the apiserver — resourceVersion does not bump and etcd is not re-encoded. The "force a re-encode" claim the original implementation rested on was hollow.

This commit replaces the SSA path with the kube-storage-version-migrator pattern: Get the live object, toggle a timestamp annotation to force a real diff, then Update.

Trade-off acknowledged in the controller docstring and user docs: the Update goes through the main resource, so admission webhooks see it. Webhooks that need to ignore migration-only writes can branch on the presence of toolhive.stacklok.dev/storage-version-migrated-at in the diff. The original /status webhook-bypass benefit was always notional — the SSA wasn't writing etcd at all.

Other fixes from the same review pass:
- Track per-pass IsConflict count in restoreCRs; if any conflicts were swallowed and no other errors occurred, return a sentinel error so Reconcile leaves storedVersions untouched (the post-conflict state of the affected object is unverified). Next reconcile retries cleanly. New ConflictDuringRestoreLeavesStoredVersionsUntouched test covers this.
- Bound migrationCache memory: register a manager.RunnableFunc that walks the cache every 10m and evicts expired entries. Prior code only evicted lazily on key re-query, which leaked entries for deleted CRs.
- Simplify isMigrationNeeded to a single line. The servedCount > 1 branch was forcing list passes during deprecation windows for no actual benefit (None conversion + identical schemas means writers cannot reintroduce stale storedVersions).
- Document the wildcard CR RBAC as defence-in-depth alongside the isManagedCRD runtime gate.
- Fix the stale "(unused in this reconciler)" comment on the embedded Client field — it is used for CR Updates and the CRD status patch.
- Drop a 100ms time.Sleep in AfterSuite that was dead code copied from virtualmcp's suite (which has a manager goroutine to drain; we don't).

HappyPath now snapshots each CR's resourceVersion before reconcile and asserts it has bumped after — empirical proof the underlying re-encode actually happened.

All 8 envtest cases pass; lint clean.

Refs PR #5011 review by @jhrozek.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The annotation toggle was a workaround for a misdiagnosed problem.
Earlier in this branch we observed empty SSAs being elided and
generalised it to "all unmodified writes elide" — and added a
toolhive.stacklok.dev/storage-version-migrated-at annotation toggle
to force a real diff per write. A second-opinion investigation
pointed out the elision check happens AFTER storage encoding: the
apiserver re-encodes the request body at the current storage version,
then byte-compares against etcd. For the actual migration scenario
(CR stamped at an older apiVersion in etcd, storage version now newer)
the encoded apiVersion differs, the comparison fails, and the write
proceeds. Verified empirically: a CR stored at v1alpha1 with storage
flipped to v1beta1 has plain Get+Update bump RV from 202 to 204; a CR
already at v1beta1 has plain Get+Update elide cleanly. Both behaviours
are correct — the elision is exactly the no-migration-needed case.
Changes:
- restoreOne: plain Get + Update of the unmodified live object. No
annotation, no timestamp, no MigrationTimestampAnnotation constant.
- Controller docstring rewritten to describe the actual mechanism
(encode-then-compare elision) and cite ksvm#65 for posterity.
- HappyPath envtest loosened: per-CR resourceVersion bump assertions
are dropped (already-clean CRs correctly elide; checking RV bumps
on them was an artefact of the annotation-induced writes). The
storedVersions convergence assertion remains as the contract test.
- New cross-version envtest "re-encodes CRs that are stored at a
prior storage version" exercises the real migration scenario:
install CRD with v1alpha1 storage, create CR, flip storage to
v1beta1, run reconciler, assert storedVersions trims and the CR's
RV bumped (proving the apiserver actually rewrote etcd).
- User docs updated to drop annotation references and describe the
real mechanism. Reference implementation citation switched from
cluster-api/crdmigrator to kube-storage-version-migrator (the
actual upstream SIG tool we share semantics with).
- FallsBackWhenNoStatusSubresource test removed: the SSA path it
exercised no longer exists. Plain Update on a CRD without /status
is just a normal Update; covered by the happy-path spec.
8/8 envtest cases pass. Lint clean. No RBAC drift (verbs already
correct).
Refs PR #5011 review by @jhrozek.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Design correction — withdrawing the "upstream is broken" claim

I owe a correction on the design pivot. The corrected understanding (verified against ksvm#65 and confirmed empirically against our envtest): the apiserver's no-op elision check happens *after* storage encoding. The request body is re-encoded at the current storage version and byte-compared against etcd, so a plain Get + Update of an object stored at an older apiVersion does produce a real write; only an object already at the current storage version is elided — which is exactly the no-migration-needed case.

My earlier probes only tested the second case (CR already at storage version), saw the elision, and overgeneralised. Practical consequence for this PR: the annotation toggle (toolhive.stacklok.dev/storage-version-migrated-at) was a workaround for a misdiagnosed problem and has been removed. Apologies to the upstream maintainers of kube-storage-version-migrator for the earlier mischaracterisation.
Summary
Adds a new `StorageVersionMigrator` controller to the ToolHive operator that watches opted-in `toolhive.stacklok.dev` CRDs and reconciles each one's `status.storedVersions` down to the current storage version, so a future release can drop deprecated API versions (e.g. `v1alpha1`) from `spec.versions` without orphaning etcd objects.

Closes #4969
Medium level
- `cmd/thv-operator/controllers/storageversionmigrator_controller.go` — the new controller. Watches `apiextensions.k8s.io/v1/CustomResourceDefinition` (metadata-only), filtered by opt-in label + group. Per-CRD: reads `spec.versions` to find the storage version, lists every CR via paginated unstructured list, then for each CR does a plain `Get` + `Update` of the unmodified live object. The apiserver re-encodes the request body at the current storage version and byte-compares against etcd; when the CR was stored at a different `apiVersion` (the migration scenario) the comparison fails and the write proceeds, re-encoding the object. When the CR is already at the current storage version, the comparison matches and the apiserver harmlessly elides the write — there was nothing to migrate. This matches upstream `kube-storage-version-migrator`'s mechanism. Once every CR has been processed, patches `status.storedVersions` via optimistic-lock merge. A 1h migration cache (with periodic background GC) dedupes Get+Update RPCs across reconciles.
- `+kubebuilder:metadata:labels=toolhive.stacklok.dev/auto-migrate-storage-version=true` marker, baked into the generated CRD YAML by `task operator-manifests`. New CRDs opt in by adding the same marker; CRDs that shouldn't be migrated omit it. A CI test (`TestV1beta1TypesMarkerCoverage`) fails the build if a root type is added without either this marker or an explicit `+thv:storage-version-migrator:exclude` sibling.
- `operator.features.storageVersionMigrator` in the helm chart (default `true`) wires to `ENABLE_STORAGE_VERSION_MIGRATOR` following the existing `ENABLE_SERVER`/`ENABLE_REGISTRY`/`ENABLE_VMCP` pattern. Admins who prefer to run `kube-storage-version-migrator` externally can disable it.
- If any CR write comes back `IsConflict` (concurrent writer), the controller refuses to trim `storedVersions` on that pass and returns a sentinel error so the next reconcile retries cleanly. The post-conflict state of the affected object is unverified, so trimming would be reasoning from an unproven premise.
- `cmd/thv-operator/test-integration/storageversionmigrator/` with 8 scenarios: noop-when-clean, happy-path-trims-storedVersions, re-encodes CRs that are stored at a prior storage version (the load-bearing test that proves the migration mechanism actually rewrites etcd — installs a CRD with `v1alpha1` storage, creates a CR, flips storage to `v1beta1`, asserts the CR's `resourceVersion` bumped after reconcile), partial-failure leaves storedVersions untouched, conflict-during-restore leaves storedVersions untouched, skips-foreign-CRDs, skips-unlabelled-CRDs, pagination (continue-token loop verified via a counting APIReader wrapper). Each test installs its own CRD fixture and calls Reconcile directly — no background manager, no timing races.
- `docs/operator/storage-version-migration.md` covers the actual write mechanism, webhook interaction, the opt-in label contract, feature-flag opt-out, the admin per-CRD escape hatch, and coordination with the future `v1alpha1`-removal release.

Low level
- `cmd/thv-operator/controllers/storageversionmigrator_controller.go` — the new controller
- `cmd/thv-operator/controllers/marker_coverage_test.go` — scans `api/v1beta1/*_types.go` and fails if a root type lacks either the migrate marker or an explicit exclude marker
- `cmd/thv-operator/test-integration/storageversionmigrator/suite_test.go`
- `cmd/thv-operator/test-integration/storageversionmigrator/controller_test.go`
- `cmd/thv-operator/main.go` — register the `apiextensions/v1` scheme, add the `featureStorageVersionMigrator` const, `setupStorageVersionMigrator()` wiring
- `cmd/thv-operator/api/v1beta1/*_types.go` (12 files) — `+kubebuilder:metadata:labels=...` marker on each root type
- `deploy/charts/operator-crds/files/crds/*.yaml` + `templates/*.yaml` (12 × 2) — regenerated `metadata.labels`
- `deploy/charts/operator/templates/clusterrole/role.yaml` — `customresourcedefinitions` get/list/watch + `/status` update;patch + `toolhive.stacklok.dev/*` get/list/update + `*/status` update
- `deploy/charts/operator/values.yaml` — `operator.features.storageVersionMigrator: true`
- `deploy/charts/operator/templates/deployment.yaml` — `ENABLE_STORAGE_VERSION_MIGRATOR` env var
- `deploy/charts/operator/README.md` — regenerated via `task helm-docs`
- `docs/operator/storage-version-migration.md` — new user docs

Type of change
Test plan
- `task lint-fix` — 0 issues
- `task build` — clean
- unit tests (`go test $(go list ./cmd/thv-operator/... | grep -v /test-integration)`) — all green
- envtest suite via the `task operator-test-integration` path, or directly with `KUBEBUILDER_ASSETS=$(setup-envtest use 1.31.0 -p path) go test ./cmd/thv-operator/test-integration/storageversionmigrator/`
- `task operator-manifests` + `task helm-docs` produce no drift against committed state
- `helm template deploy/charts/operator` renders the new env var correctly
The write mechanism in this PR has been through three iterations during review — final state described below; the journey is captured in the inline comment threads.
The current implementation is a plain `Get` + `Update` of the unmodified CR — exactly what upstream `kube-storage-version-migrator` does. The Kubernetes apiserver's storage layer re-encodes the request body at the current storage version, then byte-compares against etcd. When the CR was originally stored at an older `apiVersion` (the actual migration scenario), the encoded apiVersion stamp differs from etcd's, the comparison fails, and the write proceeds — re-encoding the object. When the CR is already at the current storage version, the bytes match and the apiserver elides the write harmlessly. This is the upstream maintainers' explicit reasoning, captured at kube-storage-version-migrator#65.

A new envtest case (`re-encodes CRs that are stored at a prior storage version`) proves the migration mechanism empirically: install a CRD with `v1alpha1` as storage, create a CR (etcd writes `apiVersion: v1alpha1` bytes), flip storage to `v1beta1`, run the reconciler, and assert the CR's `resourceVersion` bumped. Observed bump 202→204; the elision sanity check on a same-storage-version CR shows the RV stays at 204, as expected.
api/v1beta1/*_types.go; the substantive Go work is 3 files (controller, main.go wiring, helm values/deployment). I did not split because the controller + opt-in labels + marker-coverage CI test are one cohesive feature — splitting would mean landing unused labels or an unused controller mid-stack.Why this PR must ship at least one release before
v1alpha1is removed: the migrator needs time to walk every cluster'sstoredVersionsduring the deprecation window. Shipping it in the same release as the version removal provides zero benefit because clusters jumping straight to that release won't have run the migrator yet. Documented indocs/operator/storage-version-migration.md.Follow-ups not in scope for this PR:
🤖 Generated with Claude Code