[rpk k8s-multicluster] operator debug bundle #1504

Open
hidalgopl wants to merge 13 commits into main from pb/multicluster-operator-debug-bundle

Conversation

@hidalgopl
Contributor

Summary

Adds rpk k8s multicluster bundle, a diagnostics tool for the multicluster operator analogous to rpk debug bundle for Redpanda. Given a single kubeconfig pointing at one of the operator's host clusters, it discovers the remaining peers via labeled cache Secrets, then collects per-cluster artifacts (operator pod manifest, deployment, logs, metrics, raft status, TLS material) and runs the existing multicluster checks pipeline as both per-cluster and cross-cluster checks. Output is a single zip archive.

This is the operator-side complement to rpk debug remote-bundle — useful for triage of stretch clusters where you need a snapshot of all three operator instances in one shot.

What's in the bundle

stretch-bundle/
├── manifest.json                          # schema, generation timestamp, namespace, redaction + sampling flags
├── status.txt                             # human-readable summary (peers, errors, sizes)
├── errors.txt                             # non-fatal collection errors (omitted when none)
├── cross-cluster/
│   └── checks.json                        # output of cross-cluster checks (e.g. raft quorum)
└── clusters/
    ├── peer-0/
    │   ├── checks.json                    # per-cluster checks (pod, deployment, raft, TLS, ...)
    │   ├── deploy-args.txt                # operator container args (flags as configured)
    │   ├── deployment.yaml                # the multicluster operator Deployment
    │   ├── pod.yaml                       # the active operator Pod
    │   ├── raft-status.json               # raft node status (term, leader, progress)
    │   ├── logs/
    │   │   └── manager.log                # current container log; previous instance saved as manager.previous.log after a restart
    │   ├── metrics/
    │   │   ├── t0_metrics.txt             # first /metrics scrape
    │   │   └── t1_metrics.txt             # second /metrics scrape (default --metrics-samples=2 at 10s interval)
    │   └── tls/
    │       ├── ca.crt
    │       ├── tls.crt
    │       ├── tls-key-match.txt          # whether tls.key on disk matches tls.crt
    │       └── tls-secret.yaml            # full Secret manifest (tls.key redacted by default)
    ├── peer-1/
    │   └── ...                            # same layout as peer-0
    └── peer-2/
        └── ...                            # same layout as peer-0

(Sub-directory names under clusters/ are the kube context names from the discovered kubeconfigs — peer-0 / peer-1 / peer-2 shown above for readability.)

managedFields are stripped from every Kubernetes object. tls.key and kubeconfig.yaml Secret payloads are redacted unless --include-private-keys is passed. Per-artifact failures are written to errors.txt rather than aborting the bundle.
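For readers skimming the layout, a minimal sketch of what that redaction amounts to — the function name and signature are mine, not the PR's; only the field names and the REDACTED placeholder come from the description:

```go
package bundle

import corev1 "k8s.io/api/core/v1"

// Sketch of the default redaction pass (illustrative names): strip
// managedFields unconditionally, blank private payloads unless
// --include-private-keys was passed.
func redactSecret(s *corev1.Secret, includePrivateKeys bool) {
	s.ManagedFields = nil
	if includePrivateKeys {
		return
	}
	for _, key := range []string{"tls.key", "kubeconfig.yaml"} {
		if _, ok := s.Data[key]; ok {
			s.Data[key] = []byte("REDACTED")
		}
	}
}
```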

Discovery

The user passes a single kubeconfig (--kubeconfig) pointing at any one peer. The tool lists Secrets in the operator namespace matching app.kubernetes.io/component=multicluster-kubeconfig-cache (a new label added in this PR), parses each cached peer kubeconfig, and proceeds against all peers. No pre-merging required.
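Roughly, the discovery step amounts to a labeled Secret list plus a parse of each payload. A sketch with client-go follows; the helper name is mine, the label selector is the one added in this PR, and I'm assuming the payload key is kubeconfig.yaml per the redaction note above:

```go
package bundle

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// discoverPeerKubeconfigs lists the labelled cache Secrets in the operator
// namespace and returns peer name -> raw kubeconfig bytes.
func discoverPeerKubeconfigs(ctx context.Context, cs kubernetes.Interface, namespace string) (map[string][]byte, error) {
	secrets, err := cs.CoreV1().Secrets(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: "app.kubernetes.io/component=multicluster-kubeconfig-cache",
	})
	if err != nil {
		return nil, err
	}
	peers := map[string][]byte{}
	for _, s := range secrets.Items {
		peer := s.Labels["cluster.redpanda.com/multicluster-peer"]
		peers[peer] = s.Data["kubeconfig.yaml"]
	}
	return peers, nil
}
```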

In environments where the cached peer kubeconfigs have server URLs that aren't reachable from the operator (e.g. local stretch dev envs running vClusters inside a k3d host), pass all contexts explicitly with repeated --context — the bundle command sees len(starting) > 1, skips discovery, and uses only what was given.

Metrics scraping

/metrics is gated by controller-runtime's filters.WithAuthenticationAndAuthorization, so requests through the apiserver pod-proxy arrive unauthenticated and are rejected. The bundle instead port-forwards into the operator pod and presents a Bearer token minted via TokenRequest for the operator ServiceAccount. The chart's existing *-metrics-reader ClusterRole (nonResourceURLs: ["/metrics"]) is now bound to that SA so the scrape returns 200 instead of 403 — applies to both operator/chart (single-cluster) and operator/multicluster (multicluster).
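A hedged sketch of that scrape path, assuming the metrics server is HTTPS and the pod's metrics port is already port-forwarded to a local port — helper names, timeouts, and wiring are mine; the TokenRequest and Bearer mechanics are as described above:

```go
package bundle

import (
	"context"
	"crypto/tls"
	"fmt"
	"io"
	"net/http"

	authenticationv1 "k8s.io/api/authentication/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// scrapeMetrics mints a short-lived token for the operator ServiceAccount
// and GETs /metrics over the local end of a port-forward.
func scrapeMetrics(ctx context.Context, cs kubernetes.Interface, ns, sa string, localPort int) ([]byte, error) {
	exp := int64(600) // 10 minutes
	tr, err := cs.CoreV1().ServiceAccounts(ns).CreateToken(ctx, sa, &authenticationv1.TokenRequest{
		Spec: authenticationv1.TokenRequestSpec{ExpirationSeconds: &exp},
	}, metav1.CreateOptions{})
	if err != nil {
		return nil, err
	}
	// Self-signed metrics cert over a loopback port-forward: skip verification.
	client := &http.Client{Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}}}
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, fmt.Sprintf("https://127.0.0.1:%d/metrics", localPort), nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+tr.Status.Token)
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		body, _ := io.ReadAll(io.LimitReader(resp.Body, 512))
		return nil, fmt.Errorf("metrics scrape returned %s: %s", resp.Status, body)
	}
	return io.ReadAll(resp.Body)
}
```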

By default, two samples are taken 10s apart per cluster (matches rpk debug bundle's default). Multiple samples let an investigator compute counter rate-of-change post-hoc without needing a live Prometheus — particularly useful for rate-style raft signals (leader changes, send errors, drops, RTT). Samples land at metrics/t0_metrics.txt, t1_metrics.txt, … and the manifest records metricsSamples + metricsIntervalSeconds so a bundle reader can interpret the layout.
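As an illustration of what the second sample buys, here is a post-hoc delta computation over two exposition files, using the Prometheus expfmt text parser. The file paths match the bundle layout; everything else is illustrative, and a real reader would key on the full label set rather than summing per metric name:

```go
package main

import (
	"fmt"
	"os"

	"github.com/prometheus/common/expfmt"
)

// parseCounters sums counter values per metric name from one exposition file.
func parseCounters(path string) (map[string]float64, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(f)
	if err != nil {
		return nil, err
	}
	vals := map[string]float64{}
	for name, mf := range families {
		for _, m := range mf.GetMetric() {
			if c := m.GetCounter(); c != nil {
				vals[name] += c.GetValue()
			}
		}
	}
	return vals, nil
}

func main() {
	t0, err := parseCounters("clusters/peer-0/metrics/t0_metrics.txt")
	if err != nil {
		panic(err)
	}
	t1, err := parseCounters("clusters/peer-0/metrics/t1_metrics.txt")
	if err != nil {
		panic(err)
	}
	for name, v1 := range t1 {
		if d := v1 - t0[name]; d != 0 {
			fmt.Printf("%s +%g\n", name, d) // counter increase over the sample interval
		}
	}
}
```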

TLSCheck behavior

TLSCheck is a snapshot read (Get) rather than WaitFor-style polling. A debug bundle should never block on cert-manager finishing — if the TLS secret is missing or mid-rotation, that is itself a diagnostic and surfaces in checks.json. The check now falls straight from deployment-volume matching to scan-by-suffix listing, dropping the brittle synthesis of <SecretPrefix>-multicluster-certificates (which produced invalid Secret names when SecretPrefix was a kube context name like vcluster_..._k3d-harpoon).
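The shape of the change, sketched with controller-runtime's client — names are illustrative, and the real check records its result on the CheckContext rather than returning a bool:

```go
package bundle

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// getTLSSecret is a snapshot read: a missing Secret is a diagnostic result,
// not something to wait out. A WaitFor-style poll here would block the
// bundle until the Secret appears or the context is cancelled.
func getTLSSecret(ctx context.Context, c client.Client, namespace, name string) (*corev1.Secret, bool, error) {
	var s corev1.Secret
	if err := c.Get(ctx, types.NamespacedName{Namespace: namespace, Name: name}, &s); err != nil {
		if apierrors.IsNotFound(err) {
			return nil, false, nil // surfaces as a Fail() in checks.json
		}
		return nil, false, err
	}
	return &s, true, nil
}
```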

Flags

  • --kubeconfig / --context / --namespace / --service-name (shared connection flags; --service-name is the app.kubernetes.io/name label value, default operator)
  • -o, --output — output path (defaults to ./operator-bundle-<unix-ts>.zip)
  • --include-private-keys — keep tls.key and kubeconfig.yaml payloads
  • --skip-logs / --logs-size-limit (default 5M, supports MiB / GiB) / --logs-tail-lines (default 5000)
  • --skip-metrics / --metrics-samples (default 2, must be ≥ 2) / --metrics-interval (default 10s)
  • -v, --verbose — stderr progress trace with timestamps for diagnosing hangs

hidalgopl and others added 12 commits May 7, 2026 16:46
Each multicluster operator caches its peers' kubeconfigs as
Secrets in the local cluster (named "<KubeconfigName>-<peerName>"
in KubeconfigNamespace) so a new raft leader can re-engage peers
without a live gRPC call. Today these Secrets carry no labels, so
any external tool that wants to enumerate them has to know the
operator's --kubeconfig-name flag to build the right name prefix.
That couples the consumer to a per-deploy CLI flag and makes
mixed-version deploys awkward.

Apply a stable label set on every CreateOrUpdate of a cache
Secret:

  app.kubernetes.io/component  = multicluster-kubeconfig-cache
  app.kubernetes.io/managed-by = redpanda-multicluster-operator
  cluster.redpanda.com/multicluster-peer = <peerName>

The component label is the discovery selector. The managed-by
label is informational only -- it surfaces ownership in
"kubectl get secret -L". The peer label records the peer name as
data so a reader doesn't have to derive it from the Secret's
name. Labels are re-applied on every reconcile so they self-heal
when an older operator version, or a user, leaves unlabelled
Secrets behind.

writeCachedKubeconfig now takes peerName as an explicit parameter
instead of trying to recover it from the existing secret name, so
the same value lands in both the data path and the label.
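Roughly what the label application looks like inside a CreateOrUpdate mutate function — a sketch only; the function name, Secret shape, and payload key are illustrative, while the label keys and values are the ones listed above:

```go
package multicluster

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

func writeCachedKubeconfigSketch(ctx context.Context, c client.Client, namespace, name, peerName string, kubeconfig []byte) error {
	secret := &corev1.Secret{ObjectMeta: metav1.ObjectMeta{Namespace: namespace, Name: name}}
	_, err := controllerutil.CreateOrUpdate(ctx, c, secret, func() error {
		if secret.Labels == nil {
			secret.Labels = map[string]string{}
		}
		// Re-applied on every reconcile so pre-existing unlabelled Secrets self-heal.
		secret.Labels["app.kubernetes.io/component"] = "multicluster-kubeconfig-cache"
		secret.Labels["app.kubernetes.io/managed-by"] = "redpanda-multicluster-operator"
		secret.Labels["cluster.redpanda.com/multicluster-peer"] = peerName
		secret.Data = map[string][]byte{"kubeconfig.yaml": kubeconfig}
		return nil
	})
	return err
}
```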

Consumed by an upcoming "rpk k8s multicluster bundle" subcommand
to enumerate peer kubeconfigs from a single starting cluster
without needing to parse the operator's deployment args. Constants
are exported so the writer side here and the future reader side
can't drift.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The bundle subcommand introduced in a follow-up commit needs to
look up cache Secrets by name and parse the kubeconfig bytes they
hold. Today these helpers (kubeconfigCacheSecretName,
loadKubeconfig, loadKubeconfigFromBytes) live as package-private
functions inside pkg/multicluster, so any reader outside the
package would have to re-implement them and risk drifting from
the writer side.

Promote them to exported names with no behavior change:

  kubeconfigCacheSecretName -> KubeconfigCacheSecretName
  loadKubeconfig            -> LoadKubeconfig
  loadKubeconfigFromBytes   -> LoadKubeconfigFromBytes

In-package callers updated, plus a stale comment reference in
raft_bootstrap_test.go.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The status subcommand prints a snapshot of multicluster operator
health to stdout. For Support tickets we want the same data
captured to a ZIP file so it can be attached and inspected offline,
analogous to how rpk debug bundle relates to the live broker admin
API. This is the operator-side counterpart.

The new "rpk k8s multicluster bundle" subcommand reuses the
existing per-cluster and cross-cluster check pipeline (so anything
status surfaces also lands in the bundle) and serialises the state
each check accumulates on its CheckContext into a structured zip:

  operator-bundle-<unix-ts>.zip
  ├── manifest.json          # schema version, timestamp, namespace,
  │                          # serviceName, includePrivateKeys, clusters
  ├── status.txt             # the same human-readable table the
  │                          # status command prints
  ├── errors.txt             # non-fatal collection failures, omitted
  │                          # when nothing went wrong
  ├── clusters/<peer>/
  │   ├── checks.json        # []checks.Result from RunClusterChecks
  │   ├── pod.yaml           # operator Pod, managedFields stripped
  │   ├── deployment.yaml    # operator Deployment, managedFields stripped
  │   ├── deploy-args.txt    # container args, one per line
  │   ├── tls/
  │   │   ├── ca.crt         # PEM, public material
  │   │   ├── tls.crt        # PEM, public material
  │   │   ├── tls-secret.yaml          # full Secret, tls.key redacted
  │   │   └── tls-key-match.txt
  │   └── raft-status.json
  └── cross-cluster/checks.json

By default only one --kubeconfig (or --context) is required. The
command discovers peer clusters from the labelled cache Secrets
that the operator's raft-bootstrap flow writes onto each cluster,
which means the tool keeps working when one peer is down: pick any
reachable starting cluster and the rest of the roster falls out of
its apiserver. Multi-context input is still honoured and bypasses
discovery, useful when the cache is empty or a cluster is excluded
on purpose.

Per-cluster collection failures (a peer apiserver unreachable, a
malformed cache Secret) are recorded in errors.txt and the bundle
proceeds with whatever it could collect, mirroring rpk debug
bundle's "best-effort" semantics. Failures that prevent any output
(can't open the zip, can't resolve the starting connection) are
returned to the caller.
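The best-effort contract, sketched against the standard-library zip writer — an assumed shape; the real code accumulates errors on its own config/state type:

```go
package bundle

import (
	"archive/zip"
	"fmt"
	"strings"
)

// writeErrorsFile records non-fatal collection errors; when nothing failed
// the file is simply not written, matching the layout above.
func writeErrorsFile(zw *zip.Writer, collected []error) error {
	if len(collected) == 0 {
		return nil
	}
	w, err := zw.Create("errors.txt")
	if err != nil {
		return err // can't write into the archive itself: this one is fatal
	}
	var b strings.Builder
	for _, e := range collected {
		fmt.Fprintf(&b, "%v\n", e)
	}
	_, err = w.Write([]byte(b.String()))
	return err
}
```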

Sensitive material is redacted by default. Both tls.key and any
cached peer kubeconfig.yaml in serialised Secrets become "REDACTED"
unless --include-private-keys is passed; managedFields are
stripped unconditionally. The kubeconfig redaction in particular
matters because the starting cluster's apiserver holds credentials
to every peer; the bundle must not leak them.

Tests:

  - TestRedactSecret pins the redaction defaults (pure unit test,
    no envtest).
  - TestIntegrationBundleRun_RoundTrip writes a complete bundle
    against a single envtest with no operator deployed, and
    asserts file presence, manifest fields, and that errors.txt
    is omitted when no errors occurred.
  - TestIntegrationBundleDiscoverPeers seeds two cache Secrets on
    a starting envtest (one well-formed pointing at a real second
    envtest, one malformed missing the peer label) and verifies
    discovery returns the well-formed peer as a working roster
    entry while recording the malformed Secret by name in
    errors.txt.

The integration tests use a deliberate subset of cluster checks
(PodCheck + DeploymentCheck) via the new BundleConfig.ClusterChecks
override; the omitted checks (TLSCheck, RaftCheck, etc.) use
WaitFor / port-forward and would dominate test wall-clock time
against a vanilla envtest with no operator pod. Production
defaults are unchanged.

Phase 2 (operator pod logs) and Phase 3 (operator metrics) are
deferred to follow-up commits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 1 captured the operator's check state but not the actual log
output of the operator container, which is by far the most useful
artifact when debugging a deployment that's misbehaving in flight.
This change adds per-container log collection on top of the same
roster the bundle already builds.

For every pod recorded on a CheckContext (the operator pod found
by PodCheck), the bundle now fetches:

  - the current log of every init container and main container
  - the *previous* log of any container whose RestartCount > 0,
    so a freshly-CrashLooping pod surfaces both its current
    pre-crash output and the prior instance's tail

and writes them to clusters/<peer>/logs/<container>.log (and
.previous.log when applicable). Per-container fetches are
independent — a single failing container is recorded in
errors.txt and the next container is attempted, so a permission
gap on one pod can't lose the whole bundle.

The implementation goes through a small logFetcher abstraction
rather than calling kubernetes.Interface directly. Production
builds a fetcher from each connection's REST config; tests
inject a fake because envtest's apiserver doesn't run kubelets
and cannot return real container output. BundleConfig exposes
LogFetcherFor for that injection point.

Three new flags, naming aligned with `rpk debug bundle`:

  --skip-logs                  disable log collection entirely
  --logs-size-limit  string    per-container byte cap (default
                               "5M"; "0" disables the cap)
  --logs-tail-lines  int64     per-container line cap (default
                               5000; 0 disables the cap)

Size parsing accepts both decimal SI ("5M", "1G") and binary
IEC ("5Mi", "1GiB") suffixes — the dispatcher at parseLogsSize
routes binary forms through units.RAMInBytes (1024-based) and
decimal forms through units.FromHumanSize (1000-based), since
RAMInBytes silently treats "5M" as 5 MiB. This matters because
Kubernetes resource limits and `rpk debug bundle` use opposite
conventions and users will mix them.
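A sketch of that dispatcher — the function name comes from the commit, the body is my reading of the described behaviour, not the exact code:

```go
package bundle

import (
	"strings"

	units "github.com/docker/go-units"
)

// parseLogsSize routes binary IEC suffixes (Ki/Mi/Gi...) through the
// 1024-based parser and everything else through the 1000-based one, so
// "5M" means 5,000,000 bytes while "5Mi" means 5,242,880.
func parseLogsSize(s string) (int64, error) {
	if strings.Contains(strings.ToLower(s), "i") { // "5Mi", "1GiB", ...
		return units.RAMInBytes(s)
	}
	return units.FromHumanSize(s)
}
```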

manifest.json gains three fields recording the policy that was
applied (logsCollected, logsLimitBytes, logsTailLines) so a
reader can tell at a glance whether logs were skipped, capped,
or truncated. SchemaVersion stays at 1 because the additions
are backward-compatible (older readers ignore unknown fields).

Tests:

  - TestCollectClusterLogs_HappyPath: fake logFetcher with init +
    two main containers (one with RestartCount > 0); asserts the
    expected file set, that LimitBytes / TailLines propagate, and
    that current vs previous fetches map to the right filenames.
  - TestCollectClusterLogs_PerContainerErrorIsRecorded: one
    container fails, the other succeeds; the failing container's
    log is absent, the succeeding container's is present, and the
    failure is in the returned []error.
  - TestCollectClusterLogs_NoPodNoOp: cc.Pod == nil short-circuits
    cleanly (no panic, no error).
  - TestResolveLogsOptions: table-driven; covers default, decimal
    suffix ("10MB"), binary suffix ("10MiB"), explicit zero,
    negative-tail clamping, and unparseable input.
  - TestIntegrationBundleRun_RoundTrip extended to assert the
    new manifest fields and that no logs/ entries appear when
    PodCheck found no pod (cc.Pod is nil).

operator/go.mod: github.com/docker/go-units moves from indirect
to direct (parser dependency).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The bundle now scrapes the multicluster operator's /metrics
endpoint via the apiserver pod-proxy and writes the raw
Prometheus exposition into clusters/<peer>/metrics/metrics.txt.

Why apiserver pod-proxy rather than port-forward + direct GET:
the operator's metrics server runs with controller-runtime's
filters.WithAuthenticationAndAuthorization filter, which
performs a TokenReview / SubjectAccessReview on every request.
Going through the apiserver proxy lets the bundle command
inherit the kubeconfig identity (the bearer token or client
cert is what the apiserver sees) without rebuilding the TLS
trust chain or replicating the auth dance. It also handles
both the plain-HTTP and HTTPS variants of the metrics server
uniformly: the apiserver terminates TLS on our behalf and the
bundle just GETs /metrics on the proxy URL.

The proxy URL needs the right scheme prefix though, because
when the metrics server is HTTPS the apiserver has to know
to handshake outbound. We detect this by parsing
--metrics-cert-path out of CheckContext.DeployArgs (already
populated by DeploymentCheck): when present, the prefix is
"https:<pod>:<port>"; otherwise "http:<pod>:<port>". The port
itself comes from --metrics-bind-address; absence of that flag
means metrics are disabled and we silently skip the cluster
rather than producing an error.
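Sketch of the flag extraction from DeployArgs — illustrative helper names; it handles both `--flag=value` and `--flag value` forms since deployments may use either:

```go
package bundle

import "strings"

// deployArgValue returns the value of a flag from the operator container's
// args, or "" and false when the flag is absent.
func deployArgValue(args []string, flag string) (string, bool) {
	for i, a := range args {
		if a == flag && i+1 < len(args) {
			return args[i+1], true
		}
		if strings.HasPrefix(a, flag+"=") {
			return strings.TrimPrefix(a, flag+"="), true
		}
	}
	return "", false
}

// metricsScheme picks https when --metrics-cert-path is configured, so the
// proxy (or, later, the port-forward client) knows to handshake TLS outbound.
func metricsScheme(args []string) string {
	if _, ok := deployArgValue(args, "--metrics-cert-path"); ok {
		return "https"
	}
	return "http"
}
```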

Per-cluster scrape failures (RBAC denial on pods/proxy, the
operator pod not exposing the metrics port on the expected
container, transient apiserver errors) are recorded in
errors.txt and don't fail the bundle. metrics.txt is omitted
for that cluster but the rest of the bundle is still produced.

New flag:

  --skip-metrics    skip /metrics scraping entirely

manifest.json gains a metricsCollected field recording
whether scraping was enabled on this run.

Tests:

  - TestCollectClusterMetrics_HappyPath: fake fetcher; asserts
    the canned exposition lands at clusters/<ctx>/metrics/metrics.txt
    and that the namespace/pod/scheme/port arguments propagate.
  - TestCollectClusterMetrics_HTTPSWhenCertPathSet:
    --metrics-cert-path in DeployArgs flips the scheme to
    "https".
  - TestCollectClusterMetrics_MissingFlagSkipsCleanly: deploy
    args without --metrics-bind-address must not call the
    fetcher and must not produce an error.
  - TestCollectClusterMetrics_ErrorIsRecorded: fetcher errors
    surface in the returned []error and metrics.txt is absent
    from the zip.
  - TestParseMetricsPort: table-driven; default ":8443",
    explicit host:port, missing flag, empty value, garbage
    value, port out of range.
  - TestMetricsScheme: HTTP without --metrics-cert-path,
    HTTPS with it.
  - TestIntegrationBundleRun_RoundTrip extended to assert the
    new manifest field and that no metrics/ entries appear
    when PodCheck found no pod.

The multicluster operator's RBAC must include the "pods/proxy"
verb in the operator's namespace for this to work in
production. The chart's existing rpk-debug-bundle role doesn't
cover it, so deployments may need an additional Role; tracking
that as a separate change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3 scraped /metrics through the apiserver pod-proxy. That
turned out not to work against the multicluster operator: the
operator's metrics server runs with controller-runtime's
filters.WithAuthenticationAndAuthorization, and the apiserver
pod-proxy opens a new HTTP connection to the pod *without*
propagating the user's auth headers. The metrics server sees an
unauthenticated request and returns 401, surfaced to client-go as
"the server has asked for the client to provide credentials".

This is a fundamental incompatibility between the apiserver
pod-proxy (which doesn't carry per-request auth) and any endpoint
fronted by an authn/authz filter — not something the operator did
wrong, and not something we can fix from rpk's side without a
different transport.

Replace the apiserver-proxy fetcher with a port-forward + Bearer
token approach:

  1. cs.CoreV1().Pods(ns).Get to find the pod (we need its
     ServiceAccountName).
  2. kube.Ctl.PortForward to map the metrics containerPort to a
     local port. The chart's multicluster deployment template
     already declares port 8443 as containerPort:8443/name:https,
     so PortForward picks it up automatically.
  3. cs.CoreV1().ServiceAccounts(ns).CreateToken to mint a
     short-lived (10 minute) Bearer token for the pod's SA via
     the TokenRequest API.
  4. http.Client GET <scheme>://127.0.0.1:<localPort>/metrics
     with Authorization: Bearer <token>. TLSClientConfig
     InsecureSkipVerify because the metrics server typically
     uses a self-signed cert and we're hitting 127.0.0.1 over a
     port-forward — there is no MITM concern.

The metrics server's filter then TokenReview's the Bearer token
against the apiserver, so the SubjectAccessReview happens
against the SA's identity. As long as the SA has nonResourceURLs:
/metrics get (granted by the chart's metrics-reader ClusterRole +
binding, in a follow-up commit), the scrape returns 200 with the
Prometheus exposition. If the SA lacks the binding the scrape
returns 403 and the error in errors.txt now includes the HTTP
status + the first 512 bytes of the body, so users can see
exactly what the metrics server denied.

The metricsFetcher interface is unchanged, so existing
fake-fetcher unit tests continue to apply. The kubeMetricsFetcher
production implementation is the only thing that moved; tests
that exercise its surface (port lookup, error wrapping) would
need a real apiserver and are deferred to a multicluster
integration test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Today bundle Run is a black box: it goes off, does network I/O
across N clusters, and returns either silently or with the final
"bundle written to ..." line. When something hangs or feels slow,
you have nothing to look at — even a goroutine dump from SIGQUIT
requires fishing the PID and running another tool.

Add a --verbose / -v flag that prints a one-line progress message
to stderr at every major step:

  - connection resolution
  - peer discovery (with per-peer "discovered" lines and per-
    Secret discovery warnings inline)
  - per-cluster check pipeline, with each check name and elapsed
    duration as it completes — invaluable for "which check
    hung?"
  - cross-cluster checks (single line; they're typically fast)
  - per-cluster log collection and metrics scrape, each with
    elapsed time
  - manifest / status.txt writes

Each line is prefixed with "[bundle HH:MM:SS.mmm]" so you can
eyeball gaps between two log lines and pinpoint where the wall
clock got stuck.
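Sketch of such a progress logger, matching the prefix format described here (names follow the implementation notes below; not the literal code):

```go
package bundle

import (
	"fmt"
	"io"
	"time"
)

// progressLogger returns a no-op when verbose is off; otherwise each call
// writes one "[bundle HH:MM:SS.mmm] ..." line to out (normally os.Stderr).
func progressLogger(verbose bool, out io.Writer) func(format string, args ...any) {
	if !verbose {
		return func(string, ...any) {}
	}
	return func(format string, args ...any) {
		fmt.Fprintf(out, "[bundle %s] %s\n", time.Now().Format("15:04:05.000"), fmt.Sprintf(format, args...))
	}
}
```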

Off by default — clean machine-readable invocations are
unchanged.

Implementation:

  - BundleConfig gains Verbose (bool) and ProgressOut (io.Writer)
    fields. Verbose is bound to the flag; ProgressOut is a test
    hook (defaults to os.Stderr).
  - progressLogger() returns a no-op closure when Verbose is
    false (no per-call cost beyond format-arg evaluation), and
    a stderr writer with timestamp prefix when Verbose is true.
  - The cluster-check loop now inlines what RunClusterChecks did
    so we can wrap each check call in a logf timing pair. The
    per-check Results aggregation behaviour is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The bundle subcommand's --logs-size-limit flag promoted
github.com/docker/go-units from indirect to direct in Phase 2
(67f3422), but task generate wasn't run as part of that
commit so third_party.md missed the new entry. Catch it up now;
no other licenses-affecting dependency changes in flight.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The chart already emits a metrics-reader ClusterRole granting
nonResourceURLs:["/metrics"] verbs:["get"] (rbac.go:107-123) but
deliberately leaves it unbound — the comment was "We skip over
making a binding for the metrics viewer role." The intent was
that consumers (Prometheus, etc.) bind it from their own
ServiceAccount.

That intent missed two cases that actually exist in the chart
itself:

  1. The chart ships a ServiceMonitor (_servicemonitor.go.tpl)
     that scrapes /metrics over HTTPS using the operator pod's
     projected SA token — i.e. it scrapes *as the operator's
     own SA*. controller-runtime's metrics server runs with
     filters.WithAuthenticationAndAuthorization, which
     TokenReview's the bearer token and then SubjectAccessReview's
     the resulting identity. Without a binding, that SAR returns
     "no" for the operator SA and the scrape is silently denied.
  2. `rpk k8s multicluster bundle` (in development on this
     branch) authenticates the same way and hits the same 403.

The minimum-privilege fix is a single ClusterRoleBinding from
the operator's own SA to the existing metrics-reader ClusterRole.
The role's only rule is `nonResourceURLs:["/metrics"], verbs:
["get"]` — the smallest possible privilege bump. External
consumers (Prometheus etc.) still bind to the same ClusterRole
from their own SA, so existing deployments are unaffected
beyond the new binding.
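In terms of rendered objects, the addition is just one binding. Sketched with the rbac/v1 API types; object names are placeholders for whatever the chart's naming helpers actually produce:

```go
package chart

import (
	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// metricsReaderBinding grants the operator's own ServiceAccount the existing
// metrics-reader ClusterRole (get on nonResourceURL /metrics).
func metricsReaderBinding(releaseName, namespace, serviceAccount string) *rbacv1.ClusterRoleBinding {
	return &rbacv1.ClusterRoleBinding{
		ObjectMeta: metav1.ObjectMeta{Name: releaseName + "-metrics-reader"},
		RoleRef: rbacv1.RoleRef{
			APIGroup: rbacv1.GroupName,
			Kind:     "ClusterRole",
			Name:     releaseName + "-metrics-reader",
		},
		Subjects: []rbacv1.Subject{{
			Kind:      rbacv1.ServiceAccountKind,
			Name:      serviceAccount,
			Namespace: namespace,
		}},
	}
}
```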

The "skip" comment is replaced with a comment explaining the
new behaviour and the fact that other consumers still bind
separately.

Regenerated _rbac.go.tpl via `task generate` and the chart
TestTemplate golden files via `go test ... -run TestTemplate
-update-golden`. (Note: CLAUDE.md says these tests use `-update`,
but the actual flag is `-update-golden`. Worth a doc fix in a
follow-up.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Parallel to the chart-side fix in 5ab86fd: the multicluster
operator's per-StretchCluster RBAC scaffolding (rbac.go) didn't
emit the self-metrics ClusterRole or its binding either. Same
two consumers hit the same 403:

  - the chart's bundled ServiceMonitor scrapes /metrics over
    HTTPS using the operator pod's projected SA token,
  - `rpk k8s multicluster bundle` mints a TokenRequest token for
    the operator SA and scrapes /metrics over a port-forward.

Both authenticate as the operator SA and need
nonResourceURLs:["/metrics"] verbs:["get"] in a ClusterRole bound
to that SA.

Add the ClusterRole to clusterRoles() (gated on the existing
RBAC.IsEnabled() guard, like rack-awareness). The binding is
emitted automatically because clusterRoleBindings() iterates
clusterRoles() and binds each to GetServiceAccountName(). One
extra rule, smallest possible privilege bump
("get /metrics", nothing else).

Test goldens regenerated:

  - operator/multicluster/testdata/render-cases.resources.golden.txtar
    via `go test ./operator/multicluster/... -update-golden`
  - operator/internal/lifecycle/testdata/stretch-cluster-cases.resources.golden.txtar
    via `go test ./operator/internal/lifecycle/... -update-golden`
    (the stretch-cluster lifecycle test exercises the same RBAC
    rendering through its render pipeline)

After this lands, fresh deployments via `task
dev:setup-multicluster-dev-env` give every operator pod metrics
access to its own /metrics out of the box. Existing deployments
need a chart upgrade for the binding to materialise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirror `rpk debug bundle`'s default of two /metrics samples 10s
apart per cluster. A single Prometheus snapshot loses the
rate-of-change information that's most useful for raft
debugging — leader changes, send errors, drops, RTT histograms
are all rate-style signals where two samples let an investigator
compute deltas post-hoc without needing a live Prometheus.

Layout change inside the bundle:

  clusters/<context>/metrics/metrics.txt
      ↓
  clusters/<context>/metrics/t0_metrics.txt
  clusters/<context>/metrics/t1_metrics.txt

Same file naming convention as `rpk debug bundle`, so anyone
already familiar with that bundle format reads ours the same
way.

New flags:

  --metrics-samples int           # default 2, must be >= 2
  --metrics-interval duration     # default 10s, must be > 0

Validation matches `rpk debug bundle`: a single sample defeats
the point (no delta possible), so anything below 2 is rejected
at flag-parse time before any work is done. Per-sample failures
are recorded in errors.txt and the next sample is still
attempted; the `--metrics-interval` sleep respects context
cancellation so a SIGINT'd bundle exits promptly.
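The inter-sample wait, sketched as the standard context-aware pattern (not the literal code):

```go
package bundle

import (
	"context"
	"time"
)

// sleepBetweenSamples returns early when the bundle is interrupted, so a
// SIGINT doesn't sit out the remaining --metrics-interval.
func sleepBetweenSamples(ctx context.Context, interval time.Duration) error {
	t := time.NewTimer(interval)
	defer t.Stop()
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-t.C:
		return nil
	}
}
```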

manifest.json grows two fields recording the policy:

  metricsSamples: 2
  metricsIntervalSeconds: 10

so a bundle reader can interpret the t<i>_metrics.txt filenames
without guessing.

Verbose mode (`-v`) now also logs each per-sample scrape
("[<context>] /metrics sample 1/2") which is the difference
between "is the bundle hung" and "we're in the inter-sample
sleep, hold on 10s".

Tests cover the new behaviour: multi-sample happy path with
distinct per-sample bodies landing in distinct files;
per-sample failure that doesn't abort the loop; the existing
single-sample paths updated to assert against `t0_metrics.txt`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two issues hit at once when running the bundle against a stretch
dev env where the operator pods are deployed under non-default
labels:

  1. `findSecretName` returned "" because the deployment volume
     it expected (`<SecretPrefix>-multicluster-certificates`)
     wasn't on the deployment. The fallback branch then
     synthesised a secret name as
     `<SecretPrefix>-multicluster-certificates` and tried it
     verbatim. With `SecretPrefix` derived from a kube context
     name (e.g. `vcluster_vc-...-0_..._k3d-harpoon`) the
     synthesised name contains underscores — invalid as a
     Kubernetes Secret name and definitely not present in the
     cluster.
  2. The check then called `WaitFor` with a "return nil error"
     condition, which polls indefinitely until the secret
     appears. Against a synthesised, never-present name that
     means the bundle hangs at the first peer's `tls` check
     and never advances.

A debug bundle is a snapshot, not a probe. If the secret is
missing or mid-rotation, that is itself the diagnostic and
belongs in `checks.json` as a Fail() — not as a hang. Replace
`WaitFor` with `Get` so a missing secret fails fast.

Also drop the brittle name synthesis. Fall straight from
deployment-volume matching (the precise path) to
scan-by-suffix listing of all Secrets in the namespace
(the robust fallback). The previous middle step — synthesising
from `SecretPrefix` — only worked when the user happened to
pass the chart's release name as `--service-name`'s
counterpart, and produced obviously-wrong names otherwise.

The existing test commentary in `bundle_test.go:49`
("`TLSCheck` (uses WaitFor which polls until the secret exists
or the context is cancelled — hangs against a vanilla envtest)")
acknowledged the broken behaviour as a known limitation. With
this change the comment's premise no longer holds — TLSCheck
can be re-included in `fastTestChecks` in a follow-up if we want
to expand integration coverage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hidalgopl hidalgopl marked this pull request as draft May 8, 2026 12:05
@hidalgopl hidalgopl changed the title from "Pb/multicluster operator debug bundle" to "[rpk k8s-multicluster] operator debug bundle" May 8, 2026
@hidalgopl hidalgopl marked this pull request as ready for review May 8, 2026 15:50