[rpk k8s-multicluster] operator debug bundle #1504

Open
hidalgopl wants to merge 13 commits into main from pb/multicluster-operator-debug-bundle

Conversation

@hidalgopl
Contributor

Summary

Adds rpk k8s multicluster bundle, a diagnostics tool for the multicluster operator analogous to rpk debug bundle for Redpanda. Given a single kubeconfig pointing at one of the operator's host clusters, it discovers the remaining peers via labeled cache Secrets, then collects per-cluster artifacts (operator pod manifest, deployment, logs, metrics, raft status, TLS material) and runs the existing multicluster checks pipeline as both per-cluster and cross-cluster checks. Output is a single zip archive.

This is the operator-side complement to rpk debug remote-bundle — useful for triage of stretch clusters where you need a snapshot of all three operator instances in one shot.

What's in the bundle

stretch-bundle/
├── manifest.json                          # schema, generation timestamp, namespace, redaction + sampling flags
├── status.txt                             # human-readable summary (peers, errors, sizes)
├── errors.txt                             # non-fatal collection errors (omitted when none)
├── cross-cluster/
│   └── checks.json                        # output of cross-cluster checks (e.g. raft quorum)
└── clusters/
    ├── peer-0/
    │   ├── checks.json                    # per-cluster checks (pod, deployment, raft, TLS, ...)
    │   ├── deploy-args.txt                # operator container args (flags as configured)
    │   ├── deployment.yaml                # the multicluster operator Deployment
    │   ├── pod.yaml                       # the active operator Pod
    │   ├── raft-status.json               # raft node status (term, leader, progress)
    │   ├── logs/
    │   │   └── manager.log                # current container log; previous instance saved as manager.previous.log after a restart
    │   ├── metrics/
    │   │   ├── t0_metrics.txt             # first /metrics scrape
    │   │   └── t1_metrics.txt             # second /metrics scrape (default --metrics-samples=2 at 10s interval)
    │   └── tls/
    │       ├── ca.crt
    │       ├── tls.crt
    │       ├── tls-key-match.txt          # whether tls.key on disk matches tls.crt
    │       └── tls-secret.yaml            # full Secret manifest (tls.key redacted by default)
    ├── peer-1/
    │   └── ...                            # same layout as peer-0
    └── peer-2/
        └── ...                            # same layout as peer-0

(Sub-directory names under clusters/ are the kube context names from the discovered kubeconfigs — peer-0 / peer-1 / peer-2 shown above for readability.)

managedFields are stripped from every Kubernetes object. tls.key and kubeconfig.yaml Secret payloads are redacted unless --include-private-keys is passed. Per-artifact failures are written to errors.txt rather than aborting the bundle.
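For readers skimming the layout, a minimal sketch of what that redaction amounts to — the function name and signature are mine, not the PR's; only the field names and the REDACTED placeholder come from the description:

```go
package bundle

import corev1 "k8s.io/api/core/v1"

// Sketch of the default redaction pass (illustrative names): strip
// managedFields unconditionally, blank private payloads unless
// --include-private-keys was passed.
func redactSecret(s *corev1.Secret, includePrivateKeys bool) {
	s.ManagedFields = nil
	if includePrivateKeys {
		return
	}
	for _, key := range []string{"tls.key", "kubeconfig.yaml"} {
		if _, ok := s.Data[key]; ok {
			s.Data[key] = []byte("REDACTED")
		}
	}
}
```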

Discovery

The user passes a single kubeconfig (--kubeconfig) pointing at any one peer. The tool lists Secrets in the operator namespace matching app.kubernetes.io/component=multicluster-kubeconfig-cache (a new label added in this PR), parses each cached peer kubeconfig, and proceeds against all peers. No pre-merging required.
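Roughly, the discovery step amounts to a labeled Secret list plus a parse of each payload. A sketch with client-go follows; the helper name is mine, the label selector is the one added in this PR, and I'm assuming the payload key is kubeconfig.yaml per the redaction note above:

```go
package bundle

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// discoverPeerKubeconfigs lists the labelled cache Secrets in the operator
// namespace and returns peer name -> raw kubeconfig bytes.
func discoverPeerKubeconfigs(ctx context.Context, cs kubernetes.Interface, namespace string) (map[string][]byte, error) {
	secrets, err := cs.CoreV1().Secrets(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: "app.kubernetes.io/component=multicluster-kubeconfig-cache",
	})
	if err != nil {
		return nil, err
	}
	peers := map[string][]byte{}
	for _, s := range secrets.Items {
		peer := s.Labels["cluster.redpanda.com/multicluster-peer"]
		peers[peer] = s.Data["kubeconfig.yaml"]
	}
	return peers, nil
}
```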

In environments where the cached peer kubeconfigs have server URLs that aren't reachable from the operator (e.g. local stretch dev envs running vClusters inside a k3d host), pass all contexts explicitly with repeated --context — the bundle command sees len(starting) > 1, skips discovery, and uses only what was given.

Metrics scraping

/metrics is gated by controller-runtime's filters.WithAuthenticationAndAuthorization, so requests through the apiserver pod-proxy arrive unauthenticated and are rejected. The bundle instead port-forwards into the operator pod and presents a Bearer token minted via TokenRequest for the operator ServiceAccount. The chart's existing *-metrics-reader ClusterRole (nonResourceURLs: ["/metrics"]) is now bound to that SA so the scrape returns 200 instead of 403 — applies to both operator/chart (single-cluster) and operator/multicluster (multicluster).
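A hedged sketch of that scrape path, assuming the metrics server is HTTPS and the pod's metrics port is already port-forwarded to a local port — helper names, timeouts, and wiring are mine; the TokenRequest and Bearer mechanics are as described above:

```go
package bundle

import (
	"context"
	"crypto/tls"
	"fmt"
	"io"
	"net/http"

	authenticationv1 "k8s.io/api/authentication/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// scrapeMetrics mints a short-lived token for the operator ServiceAccount
// and GETs /metrics over the local end of a port-forward.
func scrapeMetrics(ctx context.Context, cs kubernetes.Interface, ns, sa string, localPort int) ([]byte, error) {
	exp := int64(600) // 10 minutes
	tr, err := cs.CoreV1().ServiceAccounts(ns).CreateToken(ctx, sa, &authenticationv1.TokenRequest{
		Spec: authenticationv1.TokenRequestSpec{ExpirationSeconds: &exp},
	}, metav1.CreateOptions{})
	if err != nil {
		return nil, err
	}
	// Self-signed metrics cert over a loopback port-forward: skip verification.
	client := &http.Client{Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}}}
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, fmt.Sprintf("https://127.0.0.1:%d/metrics", localPort), nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+tr.Status.Token)
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		body, _ := io.ReadAll(io.LimitReader(resp.Body, 512))
		return nil, fmt.Errorf("metrics scrape returned %s: %s", resp.Status, body)
	}
	return io.ReadAll(resp.Body)
}
```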

By default, two samples are taken 10s apart per cluster (matches rpk debug bundle's default). Multiple samples let an investigator compute counter rate-of-change post-hoc without needing a live Prometheus — particularly useful for rate-style raft signals (leader changes, send errors, drops, RTT). Samples land at metrics/t0_metrics.txt, t1_metrics.txt, … and the manifest records metricsSamples + metricsIntervalSeconds so a bundle reader can interpret the layout.
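As an illustration of what the second sample buys, here is a post-hoc delta computation over two exposition files, using the Prometheus expfmt text parser. The file paths match the bundle layout; everything else is illustrative, and a real reader would key on the full label set rather than summing per metric name:

```go
package main

import (
	"fmt"
	"os"

	"github.com/prometheus/common/expfmt"
)

// parseCounters sums counter values per metric name from one exposition file.
func parseCounters(path string) (map[string]float64, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(f)
	if err != nil {
		return nil, err
	}
	vals := map[string]float64{}
	for name, mf := range families {
		for _, m := range mf.GetMetric() {
			if c := m.GetCounter(); c != nil {
				vals[name] += c.GetValue()
			}
		}
	}
	return vals, nil
}

func main() {
	t0, err := parseCounters("clusters/peer-0/metrics/t0_metrics.txt")
	if err != nil {
		panic(err)
	}
	t1, err := parseCounters("clusters/peer-0/metrics/t1_metrics.txt")
	if err != nil {
		panic(err)
	}
	for name, v1 := range t1 {
		if d := v1 - t0[name]; d != 0 {
			fmt.Printf("%s +%g\n", name, d) // counter increase over the sample interval
		}
	}
}
```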

TLSCheck behavior

TLSCheck is a snapshot read (Get) rather than WaitFor-style polling. A debug bundle should never block on cert-manager finishing — if the TLS secret is missing or mid-rotation, that is itself a diagnostic and surfaces in checks.json. The check now falls straight from deployment-volume matching to scan-by-suffix listing, dropping the brittle synthesis of <SecretPrefix>-multicluster-certificates (which produced invalid Secret names when SecretPrefix was a kube context name like vcluster_..._k3d-harpoon).
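The shape of the change, sketched with controller-runtime's client — names are illustrative, and the real check records its result on the CheckContext rather than returning a bool:

```go
package bundle

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// getTLSSecret is a snapshot read: a missing Secret is a diagnostic result,
// not something to wait out. A WaitFor-style poll here would block the
// bundle until the Secret appears or the context is cancelled.
func getTLSSecret(ctx context.Context, c client.Client, namespace, name string) (*corev1.Secret, bool, error) {
	var s corev1.Secret
	if err := c.Get(ctx, types.NamespacedName{Namespace: namespace, Name: name}, &s); err != nil {
		if apierrors.IsNotFound(err) {
			return nil, false, nil // surfaces as a Fail() in checks.json
		}
		return nil, false, err
	}
	return &s, true, nil
}
```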

Flags

  • --kubeconfig / --context / --namespace / --service-name (shared connection flags; --service-name is the app.kubernetes.io/name label value, default operator)
  • -o, --output — output path (defaults to ./operator-bundle-<unix-ts>.zip)
  • --include-private-keys — keep tls.key and kubeconfig.yaml payloads
  • --skip-logs / --logs-size-limit (default 5M, supports MiB / GiB) / --logs-tail-lines (default 5000)
  • --skip-metrics / --metrics-samples (default 2, must be ≥ 2) / --metrics-interval (default 10s)
  • -v, --verbose — stderr progress trace with timestamps for diagnosing hangs

hidalgopl and others added 12 commits May 7, 2026 16:46
Each multicluster operator caches its peers' kubeconfigs as
Secrets in the local cluster (named "<KubeconfigName>-<peerName>"
in KubeconfigNamespace) so a new raft leader can re-engage peers
without a live gRPC call. Today these Secrets carry no labels, so
any external tool that wants to enumerate them has to know the
operator's --kubeconfig-name flag to build the right name prefix.
That couples the consumer to a per-deploy CLI flag and makes
mixed-version deploys awkward.

Apply a stable label set on every CreateOrUpdate of a cache
Secret:

  app.kubernetes.io/component  = multicluster-kubeconfig-cache
  app.kubernetes.io/managed-by = redpanda-multicluster-operator
  cluster.redpanda.com/multicluster-peer = <peerName>

The component label is the discovery selector. The managed-by
label is informational only -- it surfaces ownership in
"kubectl get secret -L". The peer label records the peer name as
data so a reader doesn't have to derive it from the Secret's
name. Labels are re-applied on every reconcile so they self-heal
when an older operator version, or a user, leaves unlabelled
Secrets behind.

writeCachedKubeconfig now takes peerName as an explicit parameter
instead of trying to recover it from the existing secret name, so
the same value lands in both the data path and the label.
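Roughly what the label application looks like inside a CreateOrUpdate mutate function — a sketch only; the function name, Secret shape, and payload key are illustrative, while the label keys and values are the ones listed above:

```go
package multicluster

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

func writeCachedKubeconfigSketch(ctx context.Context, c client.Client, namespace, name, peerName string, kubeconfig []byte) error {
	secret := &corev1.Secret{ObjectMeta: metav1.ObjectMeta{Namespace: namespace, Name: name}}
	_, err := controllerutil.CreateOrUpdate(ctx, c, secret, func() error {
		if secret.Labels == nil {
			secret.Labels = map[string]string{}
		}
		// Re-applied on every reconcile so pre-existing unlabelled Secrets self-heal.
		secret.Labels["app.kubernetes.io/component"] = "multicluster-kubeconfig-cache"
		secret.Labels["app.kubernetes.io/managed-by"] = "redpanda-multicluster-operator"
		secret.Labels["cluster.redpanda.com/multicluster-peer"] = peerName
		secret.Data = map[string][]byte{"kubeconfig.yaml": kubeconfig}
		return nil
	})
	return err
}
```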

Consumed by an upcoming "rpk k8s multicluster bundle" subcommand
to enumerate peer kubeconfigs from a single starting cluster
without needing to parse the operator's deployment args. Constants
are exported so the writer side here and the future reader side
can't drift.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The bundle subcommand introduced in a follow-up commit needs to
look up cache Secrets by name and parse the kubeconfig bytes they
hold. Today these helpers (kubeconfigCacheSecretName,
loadKubeconfig, loadKubeconfigFromBytes) live as package-private
functions inside pkg/multicluster, so any reader outside the
package would have to re-implement them and risk drifting from
the writer side.

Promote them to exported names with no behavior change:

  kubeconfigCacheSecretName -> KubeconfigCacheSecretName
  loadKubeconfig            -> LoadKubeconfig
  loadKubeconfigFromBytes   -> LoadKubeconfigFromBytes

In-package callers updated, plus a stale comment reference in
raft_bootstrap_test.go.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The status subcommand prints a snapshot of multicluster operator
health to stdout. For Support tickets we want the same data
captured to a ZIP file so it can be attached and inspected offline,
analogous to how rpk debug bundle relates to the live broker admin
API. This is the operator-side counterpart.

The new "rpk k8s multicluster bundle" subcommand reuses the
existing per-cluster and cross-cluster check pipeline (so anything
status surfaces also lands in the bundle) and serialises the state
each check accumulates on its CheckContext into a structured zip:

  operator-bundle-<unix-ts>.zip
  ├── manifest.json          # schema version, timestamp, namespace,
  │                          # serviceName, includePrivateKeys, clusters
  ├── status.txt             # the same human-readable table the
  │                          # status command prints
  ├── errors.txt             # non-fatal collection failures, omitted
  │                          # when nothing went wrong
  ├── clusters/<peer>/
  │   ├── checks.json        # []checks.Result from RunClusterChecks
  │   ├── pod.yaml           # operator Pod, managedFields stripped
  │   ├── deployment.yaml    # operator Deployment, managedFields stripped
  │   ├── deploy-args.txt    # container args, one per line
  │   ├── tls/
  │   │   ├── ca.crt         # PEM, public material
  │   │   ├── tls.crt        # PEM, public material
  │   │   ├── tls-secret.yaml          # full Secret, tls.key redacted
  │   │   └── tls-key-match.txt
  │   └── raft-status.json
  └── cross-cluster/checks.json

By default only one --kubeconfig (or --context) is required. The
command discovers peer clusters from the labelled cache Secrets
that the operator's raft-bootstrap flow writes onto each cluster,
which means the tool keeps working when one peer is down: pick any
reachable starting cluster and the rest of the roster falls out of
its apiserver. Multi-context input is still honoured and bypasses
discovery, useful when the cache is empty or a cluster is excluded
on purpose.

Per-cluster collection failures (a peer apiserver unreachable, a
malformed cache Secret) are recorded in errors.txt and the bundle
proceeds with whatever it could collect, mirroring rpk debug
bundle's "best-effort" semantics. Failures that prevent any output
(can't open the zip, can't resolve the starting connection) are
returned to the caller.
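The best-effort contract, sketched against the standard-library zip writer — an assumed shape; the real code accumulates errors on its own config/state type:

```go
package bundle

import (
	"archive/zip"
	"fmt"
	"strings"
)

// writeErrorsFile records non-fatal collection errors; when nothing failed
// the file is simply not written, matching the layout above.
func writeErrorsFile(zw *zip.Writer, collected []error) error {
	if len(collected) == 0 {
		return nil
	}
	w, err := zw.Create("errors.txt")
	if err != nil {
		return err // can't write into the archive itself: this one is fatal
	}
	var b strings.Builder
	for _, e := range collected {
		fmt.Fprintf(&b, "%v\n", e)
	}
	_, err = w.Write([]byte(b.String()))
	return err
}
```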

Sensitive material is redacted by default. Both tls.key and any
cached peer kubeconfig.yaml in serialised Secrets become "REDACTED"
unless --include-private-keys is passed; managedFields are
stripped unconditionally. The kubeconfig redaction in particular
matters because the starting cluster's apiserver holds credentials
to every peer; the bundle must not leak them.

Tests:

  - TestRedactSecret pins the redaction defaults (pure unit test,
    no envtest).
  - TestIntegrationBundleRun_RoundTrip writes a complete bundle
    against a single envtest with no operator deployed, and
    asserts file presence, manifest fields, and that errors.txt
    is omitted when no errors occurred.
  - TestIntegrationBundleDiscoverPeers seeds two cache Secrets on
    a starting envtest (one well-formed pointing at a real second
    envtest, one malformed missing the peer label) and verifies
    discovery returns the well-formed peer as a working roster
    entry while recording the malformed Secret by name in
    errors.txt.

The integration tests use a deliberate subset of cluster checks
(PodCheck + DeploymentCheck) via the new BundleConfig.ClusterChecks
override; the omitted checks (TLSCheck, RaftCheck, etc.) use
WaitFor / port-forward and would dominate test wall-clock time
against a vanilla envtest with no operator pod. Production
defaults are unchanged.

Phase 2 (operator pod logs) and Phase 3 (operator metrics) are
deferred to follow-up commits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 1 captured the operator's check state but not the actual log
output of the operator container, which is by far the most useful
artifact when debugging a deployment that's misbehaving in flight.
This change adds per-container log collection on top of the same
roster the bundle already builds.

For every pod recorded on a CheckContext (the operator pod found
by PodCheck), the bundle now fetches:

  - the current log of every init container and main container
  - the *previous* log of any container whose RestartCount > 0,
    so a freshly-CrashLooping pod surfaces both its current
    pre-crash output and the prior instance's tail

and writes them to clusters/<peer>/logs/<container>.log (and
.previous.log when applicable). Per-container fetches are
independent — a single failing container is recorded in
errors.txt and the next container is attempted, so a permission
gap on one pod can't lose the whole bundle.

The implementation goes through a small logFetcher abstraction
rather than calling kubernetes.Interface directly. Production
builds a fetcher from each connection's REST config; tests
inject a fake because envtest's apiserver doesn't run kubelets
and cannot return real container output. BundleConfig exposes
LogFetcherFor for that injection point.

Three new flags, naming aligned with `rpk debug bundle`:

  --skip-logs                  disable log collection entirely
  --logs-size-limit  string    per-container byte cap (default
                               "5M"; "0" disables the cap)
  --logs-tail-lines  int64     per-container line cap (default
                               5000; 0 disables the cap)

Size parsing accepts both decimal SI ("5M", "1G") and binary
IEC ("5Mi", "1GiB") suffixes — the dispatcher at parseLogsSize
routes binary forms through units.RAMInBytes (1024-based) and
decimal forms through units.FromHumanSize (1000-based), since
RAMInBytes silently treats "5M" as 5 MiB. This matters because
Kubernetes resource limits and `rpk debug bundle` use opposite
conventions and users will mix them.
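A sketch of that dispatcher — the function name comes from the commit, the body is my reading of the described behaviour, not the exact code:

```go
package bundle

import (
	"strings"

	units "github.com/docker/go-units"
)

// parseLogsSize routes binary IEC suffixes (Ki/Mi/Gi...) through the
// 1024-based parser and everything else through the 1000-based one, so
// "5M" means 5,000,000 bytes while "5Mi" means 5,242,880.
func parseLogsSize(s string) (int64, error) {
	if strings.Contains(strings.ToLower(s), "i") { // "5Mi", "1GiB", ...
		return units.RAMInBytes(s)
	}
	return units.FromHumanSize(s)
}
```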

manifest.json gains three fields recording the policy that was
applied (logsCollected, logsLimitBytes, logsTailLines) so a
reader can tell at a glance whether logs were skipped, capped,
or truncated. SchemaVersion stays at 1 because the additions
are backward-compatible (older readers ignore unknown fields).

Tests:

  - TestCollectClusterLogs_HappyPath: fake logFetcher with init +
    two main containers (one with RestartCount > 0); asserts the
    expected file set, that LimitBytes / TailLines propagate, and
    that current vs previous fetches map to the right filenames.
  - TestCollectClusterLogs_PerContainerErrorIsRecorded: one
    container fails, the other succeeds; the failing container's
    log is absent, the succeeding container's is present, and the
    failure is in the returned []error.
  - TestCollectClusterLogs_NoPodNoOp: cc.Pod == nil short-circuits
    cleanly (no panic, no error).
  - TestResolveLogsOptions: table-driven; covers default, decimal
    suffix ("10MB"), binary suffix ("10MiB"), explicit zero,
    negative-tail clamping, and unparseable input.
  - TestIntegrationBundleRun_RoundTrip extended to assert the
    new manifest fields and that no logs/ entries appear when
    PodCheck found no pod (cc.Pod is nil).

operator/go.mod: github.com/docker/go-units moves from indirect
to direct (parser dependency).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The bundle now scrapes the multicluster operator's /metrics
endpoint via the apiserver pod-proxy and writes the raw
Prometheus exposition into clusters/<peer>/metrics/metrics.txt.

Why apiserver pod-proxy rather than port-forward + direct GET:
the operator's metrics server runs with controller-runtime's
filters.WithAuthenticationAndAuthorization filter, which
performs a TokenReview / SubjectAccessReview on every request.
Going through the apiserver proxy lets the bundle command
inherit the kubeconfig identity (the bearer token or client
cert is what the apiserver sees) without rebuilding the TLS
trust chain or replicating the auth dance. It also handles
both the plain-HTTP and HTTPS variants of the metrics server
uniformly: the apiserver terminates TLS on our behalf and the
bundle just GETs /metrics on the proxy URL.

The proxy URL needs the right scheme prefix though, because
when the metrics server is HTTPS the apiserver has to know
to handshake outbound. We detect this by parsing
--metrics-cert-path out of CheckContext.DeployArgs (already
populated by DeploymentCheck): when present, the prefix is
"https:<pod>:<port>"; otherwise "http:<pod>:<port>". The port
itself comes from --metrics-bind-address; absence of that flag
means metrics are disabled and we silently skip the cluster
rather than producing an error.
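Sketch of the flag extraction from DeployArgs — illustrative helper names; it handles both `--flag=value` and `--flag value` forms since deployments may use either:

```go
package bundle

import "strings"

// deployArgValue returns the value of a flag from the operator container's
// args, or "" and false when the flag is absent.
func deployArgValue(args []string, flag string) (string, bool) {
	for i, a := range args {
		if a == flag && i+1 < len(args) {
			return args[i+1], true
		}
		if strings.HasPrefix(a, flag+"=") {
			return strings.TrimPrefix(a, flag+"="), true
		}
	}
	return "", false
}

// metricsScheme picks https when --metrics-cert-path is configured, so the
// proxy (or, later, the port-forward client) knows to handshake TLS outbound.
func metricsScheme(args []string) string {
	if _, ok := deployArgValue(args, "--metrics-cert-path"); ok {
		return "https"
	}
	return "http"
}
```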

Per-cluster scrape failures (RBAC denial on pods/proxy, the
operator pod not exposing the metrics port on the expected
container, transient apiserver errors) are recorded in
errors.txt and don't fail the bundle. metrics.txt is omitted
for that cluster but the rest of the bundle is still produced.

New flag:

  --skip-metrics    skip /metrics scraping entirely

manifest.json gains a metricsCollected field recording
whether scraping was enabled on this run.

Tests:

  - TestCollectClusterMetrics_HappyPath: fake fetcher; asserts
    the canned exposition lands at clusters/<ctx>/metrics/metrics.txt
    and that the namespace/pod/scheme/port arguments propagate.
  - TestCollectClusterMetrics_HTTPSWhenCertPathSet:
    --metrics-cert-path in DeployArgs flips the scheme to
    "https".
  - TestCollectClusterMetrics_MissingFlagSkipsCleanly: deploy
    args without --metrics-bind-address must not call the
    fetcher and must not produce an error.
  - TestCollectClusterMetrics_ErrorIsRecorded: fetcher errors
    surface in the returned []error and metrics.txt is absent
    from the zip.
  - TestParseMetricsPort: table-driven; default ":8443",
    explicit host:port, missing flag, empty value, garbage
    value, port out of range.
  - TestMetricsScheme: HTTP without --metrics-cert-path,
    HTTPS with it.
  - TestIntegrationBundleRun_RoundTrip extended to assert the
    new manifest field and that no metrics/ entries appear
    when PodCheck found no pod.

The multicluster operator's RBAC must include the "pods/proxy"
verb in the operator's namespace for this to work in
production. The chart's existing rpk-debug-bundle role doesn't
cover it, so deployments may need an additional Role; tracking
that as a separate change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3 scraped /metrics through the apiserver pod-proxy. That
turned out not to work against the multicluster operator: the
operator's metrics server runs with controller-runtime's
filters.WithAuthenticationAndAuthorization, and the apiserver
pod-proxy opens a new HTTP connection to the pod *without*
propagating the user's auth headers. The metrics server sees an
unauthenticated request and returns 401, surfaced to client-go as
"the server has asked for the client to provide credentials".

This is a fundamental incompatibility between the apiserver
pod-proxy (which doesn't carry per-request auth) and any endpoint
fronted by an authn/authz filter — not something the operator did
wrong, and not something we can fix from rpk's side without a
different transport.

Replace the apiserver-proxy fetcher with a port-forward + Bearer
token approach:

  1. cs.CoreV1().Pods(ns).Get to find the pod (we need its
     ServiceAccountName).
  2. kube.Ctl.PortForward to map the metrics containerPort to a
     local port. The chart's multicluster deployment template
     already declares port 8443 as containerPort:8443/name:https,
     so PortForward picks it up automatically.
  3. cs.CoreV1().ServiceAccounts(ns).CreateToken to mint a
     short-lived (10 minute) Bearer token for the pod's SA via
     the TokenRequest API.
  4. http.Client GET <scheme>://127.0.0.1:<localPort>/metrics
     with Authorization: Bearer <token>. TLSClientConfig
     InsecureSkipVerify because the metrics server typically
     uses a self-signed cert and we're hitting 127.0.0.1 over a
     port-forward — there is no MITM concern.

The metrics server's filter then TokenReview's the Bearer token
against the apiserver, so the SubjectAccessReview happens
against the SA's identity. As long as the SA has nonResourceURLs:
/metrics get (granted by the chart's metrics-reader ClusterRole +
binding, in a follow-up commit), the scrape returns 200 with the
Prometheus exposition. If the SA lacks the binding the scrape
returns 403 and the error in errors.txt now includes the HTTP
status + the first 512 bytes of the body, so users can see
exactly what the metrics server denied.

The metricsFetcher interface is unchanged, so existing
fake-fetcher unit tests continue to apply. The kubeMetricsFetcher
production implementation is the only thing that moved; tests
that exercise its surface (port lookup, error wrapping) would
need a real apiserver and are deferred to a multicluster
integration test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Today bundle Run is a black box: it goes off, does network I/O
across N clusters, and returns either silently or with the final
"bundle written to ..." line. When something hangs or feels slow,
you have nothing to look at — even a goroutine dump from SIGQUIT
requires fishing the PID and running another tool.

Add a --verbose / -v flag that prints a one-line progress message
to stderr at every major step:

  - connection resolution
  - peer discovery (with per-peer "discovered" lines and per-
    Secret discovery warnings inline)
  - per-cluster check pipeline, with each check name and elapsed
    duration as it completes — invaluable for "which check
    hung?"
  - cross-cluster checks (single line; they're typically fast)
  - per-cluster log collection and metrics scrape, each with
    elapsed time
  - manifest / status.txt writes

Each line is prefixed with "[bundle HH:MM:SS.mmm]" so you can
eyeball gaps between two log lines and pinpoint where the wall
clock got stuck.
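Sketch of such a progress logger, matching the prefix format described here (names follow the implementation notes below; not the literal code):

```go
package bundle

import (
	"fmt"
	"io"
	"time"
)

// progressLogger returns a no-op when verbose is off; otherwise each call
// writes one "[bundle HH:MM:SS.mmm] ..." line to out (normally os.Stderr).
func progressLogger(verbose bool, out io.Writer) func(format string, args ...any) {
	if !verbose {
		return func(string, ...any) {}
	}
	return func(format string, args ...any) {
		fmt.Fprintf(out, "[bundle %s] %s\n", time.Now().Format("15:04:05.000"), fmt.Sprintf(format, args...))
	}
}
```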

Off by default — clean machine-readable invocations are
unchanged.

Implementation:

  - BundleConfig gains Verbose (bool) and ProgressOut (io.Writer)
    fields. Verbose is bound to the flag; ProgressOut is a test
    hook (defaults to os.Stderr).
  - progressLogger() returns a no-op closure when Verbose is
    false (no per-call cost beyond format-arg evaluation), and
    a stderr writer with timestamp prefix when Verbose is true.
  - The cluster-check loop now inlines what RunClusterChecks did
    so we can wrap each check call in a logf timing pair. The
    per-check Results aggregation behaviour is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The bundle subcommand's --logs-size-limit flag promoted
github.com/docker/go-units from indirect to direct in Phase 2
(67f3422), but task generate wasn't run as part of that
commit so third_party.md missed the new entry. Catch it up now;
no other licenses-affecting dependency changes in flight.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The chart already emits a metrics-reader ClusterRole granting
nonResourceURLs:["/metrics"] verbs:["get"] (rbac.go:107-123) but
deliberately leaves it unbound — the comment was "We skip over
making a binding for the metrics viewer role." The intent was
that consumers (Prometheus, etc.) bind it from their own
ServiceAccount.

That intent missed two cases that actually exist in the chart
itself:

  1. The chart ships a ServiceMonitor (_servicemonitor.go.tpl)
     that scrapes /metrics over HTTPS using the operator pod's
     projected SA token — i.e. it scrapes *as the operator's
     own SA*. controller-runtime's metrics server runs with
     filters.WithAuthenticationAndAuthorization, which
     TokenReview's the bearer token and then SubjectAccessReview's
     the resulting identity. Without a binding, that SAR returns
     "no" for the operator SA and the scrape is silently denied.
  2. `rpk k8s multicluster bundle` (in development on this
     branch) authenticates the same way and hits the same 403.

The minimum-privilege fix is a single ClusterRoleBinding from
the operator's own SA to the existing metrics-reader ClusterRole.
The role's only rule is `nonResourceURLs:["/metrics"], verbs:
["get"]` — the smallest possible privilege bump. External
consumers (Prometheus etc.) still bind to the same ClusterRole
from their own SA, so existing deployments are unaffected
beyond the new binding.
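In terms of rendered objects, the addition is just one binding. Sketched with the rbac/v1 API types; object names are placeholders for whatever the chart's naming helpers actually produce:

```go
package chart

import (
	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// metricsReaderBinding grants the operator's own ServiceAccount the existing
// metrics-reader ClusterRole (get on nonResourceURL /metrics).
func metricsReaderBinding(releaseName, namespace, serviceAccount string) *rbacv1.ClusterRoleBinding {
	return &rbacv1.ClusterRoleBinding{
		ObjectMeta: metav1.ObjectMeta{Name: releaseName + "-metrics-reader"},
		RoleRef: rbacv1.RoleRef{
			APIGroup: rbacv1.GroupName,
			Kind:     "ClusterRole",
			Name:     releaseName + "-metrics-reader",
		},
		Subjects: []rbacv1.Subject{{
			Kind:      rbacv1.ServiceAccountKind,
			Name:      serviceAccount,
			Namespace: namespace,
		}},
	}
}
```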

The "skip" comment is replaced with a comment explaining the
new behaviour and the fact that other consumers still bind
separately.

Regenerated _rbac.go.tpl via `task generate` and the chart
TestTemplate golden files via `go test ... -run TestTemplate
-update-golden`. (Note: CLAUDE.md says these tests use `-update`,
but the actual flag is `-update-golden`. Worth a doc fix in a
follow-up.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Parallel to the chart-side fix in 5ab86fd: the multicluster
operator's per-StretchCluster RBAC scaffolding (rbac.go) didn't
emit the self-metrics ClusterRole or its binding either. Same
two consumers hit the same 403:

  - the chart's bundled ServiceMonitor scrapes /metrics over
    HTTPS using the operator pod's projected SA token,
  - `rpk k8s multicluster bundle` mints a TokenRequest token for
    the operator SA and scrapes /metrics over a port-forward.

Both authenticate as the operator SA and need
nonResourceURLs:["/metrics"] verbs:["get"] in a ClusterRole bound
to that SA.

Add the ClusterRole to clusterRoles() (gated on the existing
RBAC.IsEnabled() guard, like rack-awareness). The binding is
emitted automatically because clusterRoleBindings() iterates
clusterRoles() and binds each to GetServiceAccountName(). One
extra rule, smallest possible privilege bump
("get /metrics", nothing else).

Test goldens regenerated:

  - operator/multicluster/testdata/render-cases.resources.golden.txtar
    via `go test ./operator/multicluster/... -update-golden`
  - operator/internal/lifecycle/testdata/stretch-cluster-cases.resources.golden.txtar
    via `go test ./operator/internal/lifecycle/... -update-golden`
    (the stretch-cluster lifecycle test exercises the same RBAC
    rendering through its render pipeline)

After this lands, fresh deployments via `task
dev:setup-multicluster-dev-env` give every operator pod metrics
access to its own /metrics out of the box. Existing deployments
need a chart upgrade for the binding to materialise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirror `rpk debug bundle`'s default of two /metrics samples 10s
apart per cluster. A single Prometheus snapshot loses the
rate-of-change information that's most useful for raft
debugging — leader changes, send errors, drops, RTT histograms
are all rate-style signals where two samples let an investigator
compute deltas post-hoc without needing a live Prometheus.

Layout change inside the bundle:

  clusters/<context>/metrics/metrics.txt
      ↓
  clusters/<context>/metrics/t0_metrics.txt
  clusters/<context>/metrics/t1_metrics.txt

Same file naming convention as `rpk debug bundle`, so anyone
already familiar with that bundle format reads ours the same
way.

New flags:

  --metrics-samples int           # default 2, must be >= 2
  --metrics-interval duration     # default 10s, must be > 0

Validation matches `rpk debug bundle`: a single sample defeats
the point (no delta possible), so anything below 2 is rejected
at flag-parse time before any work is done. Per-sample failures
are recorded in errors.txt and the next sample is still
attempted; the `--metrics-interval` sleep respects context
cancellation so a SIGINT'd bundle exits promptly.
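The inter-sample wait, sketched as the standard context-aware pattern (not the literal code):

```go
package bundle

import (
	"context"
	"time"
)

// sleepBetweenSamples returns early when the bundle is interrupted, so a
// SIGINT doesn't sit out the remaining --metrics-interval.
func sleepBetweenSamples(ctx context.Context, interval time.Duration) error {
	t := time.NewTimer(interval)
	defer t.Stop()
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-t.C:
		return nil
	}
}
```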

manifest.json grows two fields recording the policy:

  metricsSamples: 2
  metricsIntervalSeconds: 10

so a bundle reader can interpret the t<i>_metrics.txt filenames
without guessing.

Verbose mode (`-v`) now also logs each per-sample scrape
("[<context>] /metrics sample 1/2") which is the difference
between "is the bundle hung" and "we're in the inter-sample
sleep, hold on 10s".

Tests cover the new behaviour: multi-sample happy path with
distinct per-sample bodies landing in distinct files;
per-sample failure that doesn't abort the loop; the existing
single-sample paths updated to assert against `t0_metrics.txt`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two issues hit at once when running the bundle against a stretch
dev env where the operator pods are deployed under non-default
labels:

  1. `findSecretName` returned "" because the deployment volume
     it expected (`<SecretPrefix>-multicluster-certificates`)
     wasn't on the deployment. The fallback branch then
     synthesised a secret name as
     `<SecretPrefix>-multicluster-certificates` and tried it
     verbatim. With `SecretPrefix` derived from a kube context
     name (e.g. `vcluster_vc-...-0_..._k3d-harpoon`) the
     synthesised name contains underscores — invalid as a
     Kubernetes Secret name and definitely not present in the
     cluster.
  2. The check then called `WaitFor` with a "return nil error"
     condition, which polls indefinitely until the secret
     appears. Against a synthesised, never-present name that
     means the bundle hangs at the first peer's `tls` check
     and never advances.

A debug bundle is a snapshot, not a probe. If the secret is
missing or mid-rotation, that is itself the diagnostic and
belongs in `checks.json` as a Fail() — not as a hang. Replace
`WaitFor` with `Get` so a missing secret fails fast.

Also drop the brittle name synthesis. Fall straight from
deployment-volume matching (the precise path) to
scan-by-suffix listing of all Secrets in the namespace
(the robust fallback). The previous middle step — synthesising
from `SecretPrefix` — only worked when the user happened to
pass the chart's release name as `--service-name`'s
counterpart, and produced obviously-wrong names otherwise.

The existing test commentary in `bundle_test.go:49`
("`TLSCheck` (uses WaitFor which polls until the secret exists
or the context is cancelled — hangs against a vanilla envtest)")
acknowledged the broken behaviour as a known limitation. With
this change the comment's premise no longer holds — TLSCheck
can be re-included in `fastTestChecks` in a follow-up if we want
to expand integration coverage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hidalgopl hidalgopl marked this pull request as draft May 8, 2026 12:05
@hidalgopl hidalgopl changed the title from "Pb/multicluster operator debug bundle" to "[rpk k8s-multicluster] operator debug bundle" May 8, 2026
@hidalgopl hidalgopl marked this pull request as ready for review May 8, 2026 15:50