[rpk k8s-multicluster] operator debug bundle #1504
Each multicluster operator caches its peers' kubeconfigs as Secrets in the
local cluster (named "<KubeconfigName>-<peerName>" in KubeconfigNamespace) so
a new raft leader can re-engage peers without a live gRPC call. Today these
Secrets carry no labels, so any external tool that wants to enumerate them has
to know the operator's --kubeconfig-name flag to build the right name prefix.
That couples the consumer to a per-deploy CLI flag and makes mixed-version
deploys awkward.
Apply a stable label set on every CreateOrUpdate of a cache Secret:
  app.kubernetes.io/component = multicluster-kubeconfig-cache
  app.kubernetes.io/managed-by = redpanda-multicluster-operator
  cluster.redpanda.com/multicluster-peer = <peerName>
The component label is the discovery selector. The managed-by label is
informational only -- it surfaces ownership in "kubectl get secret -L". The
peer label records the peer name as data so a reader doesn't have to derive
it from the Secret's name. Labels are re-applied on every reconcile so they
self-heal when an older operator version, or a user, leaves unlabelled
Secrets behind.
writeCachedKubeconfig now takes peerName as an explicit parameter instead of
trying to recover it from the existing secret name, so the same value lands
in both the data path and the label.
Consumed by an upcoming "rpk k8s multicluster bundle" subcommand to enumerate
peer kubeconfigs from a single starting cluster without needing to parse the
operator's deployment args. Constants are exported so the writer side here
and the future reader side can't drift.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
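A minimal sketch of the label application described above, assuming controller-runtime's controllerutil.CreateOrUpdate. The exported constant names and the exact writeCachedKubeconfig signature are illustrative (the commit only states that the constants are exported and that peerName becomes an explicit parameter); the label keys/values, the "<KubeconfigName>-<peerName>" naming, and the kubeconfig.yaml data key follow the text.

```go
package multicluster

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// Exported so the writer side here and the future reader side (rpk) can't
// drift. Constant names are illustrative; keys and values come from the PR.
const (
	ComponentLabelKey   = "app.kubernetes.io/component"
	ComponentLabelValue = "multicluster-kubeconfig-cache"
	ManagedByLabelKey   = "app.kubernetes.io/managed-by"
	ManagedByLabelValue = "redpanda-multicluster-operator"
	PeerLabelKey        = "cluster.redpanda.com/multicluster-peer"
)

// writeCachedKubeconfig upserts the per-peer cache Secret and re-applies the
// labels on every reconcile so unlabelled Secrets left by older operator
// versions (or by users) self-heal. peerName is passed explicitly so the same
// value lands in both the Secret name and the peer label.
func writeCachedKubeconfig(ctx context.Context, c client.Client, namespace, kubeconfigName, peerName string, kubeconfig []byte) error {
	secret := &corev1.Secret{ObjectMeta: metav1.ObjectMeta{
		Name:      kubeconfigName + "-" + peerName,
		Namespace: namespace,
	}}
	_, err := controllerutil.CreateOrUpdate(ctx, c, secret, func() error {
		if secret.Labels == nil {
			secret.Labels = map[string]string{}
		}
		secret.Labels[ComponentLabelKey] = ComponentLabelValue
		secret.Labels[ManagedByLabelKey] = ManagedByLabelValue
		secret.Labels[PeerLabelKey] = peerName
		secret.Data = map[string][]byte{"kubeconfig.yaml": kubeconfig}
		return nil
	})
	return err
}
```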
The bundle subcommand introduced in a follow-up commit needs to look up cache
Secrets by name and parse the kubeconfig bytes they hold. Today these helpers
(kubeconfigCacheSecretName, loadKubeconfig, loadKubeconfigFromBytes) live as
package-private functions inside pkg/multicluster, so any reader outside the
package would have to re-implement them and risk drifting from the writer
side.
Promote them to exported names with no behavior change:
  kubeconfigCacheSecretName -> KubeconfigCacheSecretName
  loadKubeconfig            -> LoadKubeconfig
  loadKubeconfigFromBytes   -> LoadKubeconfigFromBytes
In-package callers updated, plus a stale comment reference in
raft_bootstrap_test.go.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
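For context, a sketch of what the reader-facing surface could look like after the rename; the signatures here are assumptions (the commit guarantees only the names and "no behavior change"), and the hypothetical LoadKubeconfigFromBytes body simply defers to client-go.

```go
package multicluster

import (
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
)

// KubeconfigCacheSecretName builds the "<KubeconfigName>-<peerName>" Secret
// name, so a reader never has to re-derive the writer's naming scheme.
func KubeconfigCacheSecretName(kubeconfigName, peerName string) string {
	return kubeconfigName + "-" + peerName
}

// LoadKubeconfigFromBytes parses a cached kubeconfig payload into a
// client-go rest.Config. (Return type is an assumption for this sketch.)
func LoadKubeconfigFromBytes(data []byte) (*rest.Config, error) {
	return clientcmd.RESTConfigFromKubeConfig(data)
}
```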
The status subcommand prints a snapshot of multicluster operator
health to stdout. For Support tickets we want the same data
captured to a ZIP file so it can be attached and inspected offline,
analogous to how rpk debug bundle relates to the live broker admin
API. This is the operator-side counterpart.
The new "rpk k8s multicluster bundle" subcommand reuses the
existing per-cluster and cross-cluster check pipeline (so anything
status surfaces also lands in the bundle) and serialises the state
each check accumulates on its CheckContext into a structured zip:
operator-bundle-<unix-ts>.zip
├── manifest.json # schema version, timestamp, namespace,
│ # serviceName, includePrivateKeys, clusters
├── status.txt # the same human-readable table the
│ # status command prints
├── errors.txt # non-fatal collection failures, omitted
│ # when nothing went wrong
├── clusters/<peer>/
│ ├── checks.json # []checks.Result from RunClusterChecks
│ ├── pod.yaml # operator Pod, managedFields stripped
│ ├── deployment.yaml # operator Deployment, managedFields stripped
│ ├── deploy-args.txt # container args, one per line
│ ├── tls/
│ │ ├── ca.crt # PEM, public material
│ │ ├── tls.crt # PEM, public material
│ │ ├── tls-secret.yaml # full Secret, tls.key redacted
│ │ └── tls-key-match.txt
│ └── raft-status.json
└── cross-cluster/checks.json
By default only one --kubeconfig (or --context) is required. The
command discovers peer clusters from the labelled cache Secrets
that the operator's raft-bootstrap flow writes onto each cluster,
which means the tool keeps working when one peer is down: pick any
reachable starting cluster and the rest of the roster falls out of
its apiserver. Multi-context input is still honoured and bypasses
discovery, useful when the cache is empty or a cluster is excluded
on purpose.
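A sketch of that discovery step under the label set introduced earlier in this PR, assuming a client-go clientset; the roster shape, function name, and error-collection style are illustrative.

```go
import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
)

// discoverPeers lists the labelled cache Secrets on the starting cluster and
// turns each well-formed one into a rest.Config for that peer. Malformed
// Secrets are recorded as non-fatal errors (destined for errors.txt).
func discoverPeers(ctx context.Context, cs kubernetes.Interface, namespace string) (map[string]*rest.Config, []error) {
	roster := map[string]*rest.Config{}
	var errs []error
	secrets, err := cs.CoreV1().Secrets(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: "app.kubernetes.io/component=multicluster-kubeconfig-cache",
	})
	if err != nil {
		return nil, []error{err}
	}
	for _, s := range secrets.Items {
		peer := s.Labels["cluster.redpanda.com/multicluster-peer"]
		if peer == "" {
			errs = append(errs, fmt.Errorf("cache secret %s: missing peer label", s.Name))
			continue
		}
		cfg, err := clientcmd.RESTConfigFromKubeConfig(s.Data["kubeconfig.yaml"])
		if err != nil {
			errs = append(errs, fmt.Errorf("cache secret %s: %w", s.Name, err))
			continue
		}
		roster[peer] = cfg
	}
	return roster, errs
}
```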
Per-cluster collection failures (a peer apiserver unreachable, a
malformed cache Secret) are recorded in errors.txt and the bundle
proceeds with whatever it could collect, mirroring rpk debug
bundle's "best-effort" semantics. Failures that prevent any output
(can't open the zip, can't resolve the starting connection) are
returned to the caller.
Sensitive material is redacted by default. Both tls.key and any
cached peer kubeconfig.yaml in serialised Secrets become "REDACTED"
unless --include-private-keys is passed; managedFields are
stripped unconditionally. The kubeconfig redaction in particular
matters because the starting cluster's apiserver holds credentials
to every peer; the bundle must not leak them.
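A sketch of the redaction rule that TestRedactSecret pins, per the description above; the function name matches the test name, everything else is illustrative.

```go
import corev1 "k8s.io/api/core/v1"

// redactSecret strips managedFields unconditionally and blanks the sensitive
// payloads unless the user opted in with --include-private-keys.
func redactSecret(s *corev1.Secret, includePrivateKeys bool) {
	s.ManagedFields = nil
	if includePrivateKeys {
		return
	}
	for _, key := range []string{"tls.key", "kubeconfig.yaml"} {
		if _, ok := s.Data[key]; ok {
			s.Data[key] = []byte("REDACTED")
		}
	}
}
```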
Tests:
- TestRedactSecret pins the redaction defaults (pure unit test,
no envtest).
- TestIntegrationBundleRun_RoundTrip writes a complete bundle
against a single envtest with no operator deployed, and
asserts file presence, manifest fields, and that errors.txt
is omitted when no errors occurred.
- TestIntegrationBundleDiscoverPeers seeds two cache Secrets on
a starting envtest (one well-formed pointing at a real second
envtest, one malformed missing the peer label) and verifies
discovery returns the well-formed peer as a working roster
entry while recording the malformed Secret by name in
errors.txt.
The integration tests use a deliberate subset of cluster checks
(PodCheck + DeploymentCheck) via the new BundleConfig.ClusterChecks
override; the omitted checks (TLSCheck, RaftCheck, etc.) use
WaitFor / port-forward and would dominate test wall-clock time
against a vanilla envtest with no operator pod. Production
defaults are unchanged.
Phase 2 (operator pod logs) and Phase 3 (operator metrics) are
deferred to follow-up commits.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 1 captured the operator's check state but not the actual log
output of the operator container, which is by far the most useful
artifact when debugging a deployment that's misbehaving in flight.
This change adds per-container log collection on top of the same
roster the bundle already builds.
For every pod recorded on a CheckContext (the operator pod found
by PodCheck), the bundle now fetches:
- the current log of every init container and main container
- the *previous* log of any container whose RestartCount > 0,
so a freshly CrashLooping pod surfaces both its current
output and the prior instance's pre-crash tail
and writes them to clusters/<peer>/logs/<container>.log (and
.previous.log when applicable). Per-container fetches are
independent — a single failing container is recorded in
errors.txt and the next container is attempted, so a permission
gap on one pod can't lose the whole bundle.
The implementation goes through a small logFetcher abstraction
rather than calling kubernetes.Interface directly. Production
builds a fetcher from each connection's REST config; tests
inject a fake because envtest's apiserver doesn't run kubelets
and cannot return real container output. BundleConfig exposes
LogFetcherFor for that injection point.
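A sketch of the production fetcher behind that abstraction, assuming client-go's GetLogs subresource; the interface shape is illustrative (the commit names logFetcher and LogFetcherFor but not their signatures).

```go
import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
)

// logFetcher is the seam tests replace with a fake, since envtest has no
// kubelet to serve real container logs.
type logFetcher interface {
	Fetch(ctx context.Context, namespace, pod, container string, previous bool, limitBytes, tailLines int64) ([]byte, error)
}

type kubeLogFetcher struct{ cs kubernetes.Interface }

func (f kubeLogFetcher) Fetch(ctx context.Context, namespace, pod, container string, previous bool, limitBytes, tailLines int64) ([]byte, error) {
	opts := &corev1.PodLogOptions{Container: container, Previous: previous}
	if limitBytes > 0 {
		opts.LimitBytes = &limitBytes
	}
	if tailLines > 0 {
		opts.TailLines = &tailLines
	}
	return f.cs.CoreV1().Pods(namespace).GetLogs(pod, opts).DoRaw(ctx)
}
```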
Three new flags, naming aligned with `rpk debug bundle`:
--skip-logs disable log collection entirely
--logs-size-limit string per-container byte cap (default
"5M"; "0" disables the cap)
--logs-tail-lines int64 per-container line cap (default
5000; 0 disables the cap)
Size parsing accepts both decimal SI ("5M", "1G") and binary
IEC ("5Mi", "1GiB") suffixes — the dispatcher at parseLogsSize
routes binary forms through units.RAMInBytes (1024-based) and
decimal forms through units.FromHumanSize (1000-based), since
RAMInBytes silently treats "5M" as 5 MiB. This matters because
Kubernetes resource limits and `rpk debug bundle` use opposite
conventions and users will mix them.
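A sketch of that dispatch; parseLogsSize is named in the text, while the suffix heuristic here is illustrative.

```go
import (
	"strings"

	units "github.com/docker/go-units"
)

// parseLogsSize routes IEC suffixes ("5Mi", "1GiB") through the 1024-based
// parser and SI suffixes ("5M", "1GB") through the 1000-based one, because
// units.RAMInBytes would silently read "5M" as 5 MiB.
func parseLogsSize(s string) (int64, error) {
	if strings.Contains(strings.ToLower(s), "i") {
		return units.RAMInBytes(s)
	}
	return units.FromHumanSize(s)
}
```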
manifest.json gains three fields recording the policy that was
applied (logsCollected, logsLimitBytes, logsTailLines) so a
reader can tell at a glance whether logs were skipped, capped,
or truncated. SchemaVersion stays at 1 because the additions
are backward-compatible (older readers ignore unknown fields).
Tests:
- TestCollectClusterLogs_HappyPath: fake logFetcher with init +
two main containers (one with RestartCount > 0); asserts the
expected file set, that LimitBytes / TailLines propagate, and
that current vs previous fetches map to the right filenames.
- TestCollectClusterLogs_PerContainerErrorIsRecorded: one
container fails, the other succeeds; the failing container's
log is absent, the succeeding container's is present, and the
failure is in the returned []error.
- TestCollectClusterLogs_NoPodNoOp: cc.Pod == nil short-circuits
cleanly (no panic, no error).
- TestResolveLogsOptions: table-driven; covers default, decimal
suffix ("10MB"), binary suffix ("10MiB"), explicit zero,
negative-tail clamping, and unparseable input.
- TestIntegrationBundleRun_RoundTrip extended to assert the
new manifest fields and that no logs/ entries appear when
PodCheck found no pod (cc.Pod is nil).
operator/go.mod: github.com/docker/go-units moves from indirect
to direct (parser dependency).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The bundle now scrapes the multicluster operator's /metrics
endpoint via the apiserver pod-proxy and writes the raw
Prometheus exposition into clusters/<peer>/metrics/metrics.txt.
Why apiserver pod-proxy rather than port-forward + direct GET:
the operator's metrics server runs with controller-runtime's
filters.WithAuthenticationAndAuthorization filter, which
performs a TokenReview / SubjectAccessReview on every request.
Going through the apiserver proxy lets the bundle command
inherit the kubeconfig identity (the bearer token or client
cert is what the apiserver sees) without rebuilding the TLS
trust chain or replicating the auth dance. It also handles
both the plain-HTTP and HTTPS variants of the metrics server
uniformly: the apiserver terminates TLS on our behalf and the
bundle just GETs /metrics on the proxy URL.
The proxy URL needs the right scheme prefix though, because
when the metrics server is HTTPS the apiserver has to know
to handshake outbound. We detect this by parsing
--metrics-cert-path out of CheckContext.DeployArgs (already
populated by DeploymentCheck): when present, the prefix is
"https:<pod>:<port>"; otherwise "http:<pod>:<port>". The port
itself comes from --metrics-bind-address; absence of that flag
means metrics are disabled and we silently skip the cluster
rather than producing an error.
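A sketch of that derivation; metricsScheme and parseMetricsPort echo the test names further down, but these bodies are illustrative and the skip-when-flag-absent behaviour stays in the caller.

```go
import (
	"fmt"
	"net"
	"strconv"
	"strings"
)

// metricsScheme is "https" when the deployment passes --metrics-cert-path,
// otherwise "http".
func metricsScheme(deployArgs []string) string {
	for _, a := range deployArgs {
		if strings.HasPrefix(a, "--metrics-cert-path") {
			return "https"
		}
	}
	return "http"
}

// parseMetricsPort extracts the port from a --metrics-bind-address value
// such as ":8443" or "0.0.0.0:8443".
func parseMetricsPort(bindAddress string) (int, error) {
	_, portStr, err := net.SplitHostPort(bindAddress)
	if err != nil {
		return 0, err
	}
	port, err := strconv.Atoi(portStr)
	if err != nil || port < 1 || port > 65535 {
		return 0, fmt.Errorf("invalid metrics port %q", portStr)
	}
	return port, nil
}
```

The scheme and port then form the pod-proxy subresource path, e.g. pods/https:<pod>:8443/proxy/metrics.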
Per-cluster scrape failures (RBAC denial on pods/proxy, the
operator pod not exposing the metrics port on the expected
container, transient apiserver errors) are recorded in
errors.txt and don't fail the bundle. metrics.txt is omitted
for that cluster but the rest of the bundle is still produced.
New flag:
--skip-metrics skip /metrics scraping entirely
manifest.json gains a metricsCollected field recording
whether scraping was enabled on this run.
Tests:
- TestCollectClusterMetrics_HappyPath: fake fetcher; asserts
the canned exposition lands at clusters/<ctx>/metrics/metrics.txt
and that the namespace/pod/scheme/port arguments propagate.
- TestCollectClusterMetrics_HTTPSWhenCertPathSet:
--metrics-cert-path in DeployArgs flips the scheme to
"https".
- TestCollectClusterMetrics_MissingFlagSkipsCleanly: deploy
args without --metrics-bind-address must not call the
fetcher and must not produce an error.
- TestCollectClusterMetrics_ErrorIsRecorded: fetcher errors
surface in the returned []error and metrics.txt is absent
from the zip.
- TestParseMetricsPort: table-driven; default ":8443",
explicit host:port, missing flag, empty value, garbage
value, port out of range.
- TestMetricsScheme: HTTP without --metrics-cert-path,
HTTPS with it.
- TestIntegrationBundleRun_RoundTrip extended to assert the
new manifest field and that no metrics/ entries appear
when PodCheck found no pod.
The multicluster operator's RBAC must include the "pods/proxy"
verb in the operator's namespace for this to work in
production. The chart's existing rpk-debug-bundle role doesn't
cover it, so deployments may need an additional Role; tracking
that as a separate change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3 scraped /metrics through the apiserver pod-proxy. That
turned out not to work against the multicluster operator: the
operator's metrics server runs with controller-runtime's
filters.WithAuthenticationAndAuthorization, and the apiserver
pod-proxy opens a new HTTP connection to the pod *without*
propagating the user's auth headers. The metrics server sees an
unauthenticated request and returns 401, surfaced to client-go as
"the server has asked for the client to provide credentials".
This is a fundamental incompatibility between the apiserver
pod-proxy (which doesn't carry per-request auth) and any endpoint
fronted by an authn/authz filter — not something the operator did
wrong, and not something we can fix from rpk's side without a
different transport.
Replace the apiserver-proxy fetcher with a port-forward + Bearer
token approach:
1. cs.CoreV1().Pods(ns).Get to find the pod (we need its
ServiceAccountName).
2. kube.Ctl.PortForward to map the metrics containerPort to a
local port. The chart's multicluster deployment template
already declares port 8443 as containerPort:8443/name:https,
so PortForward picks it up automatically.
3. cs.CoreV1().ServiceAccounts(ns).CreateToken to mint a
short-lived (10 minute) Bearer token for the pod's SA via
the TokenRequest API.
4. http.Client GET <scheme>://127.0.0.1:<localPort>/metrics
with Authorization: Bearer <token>. TLSClientConfig
InsecureSkipVerify because the metrics server typically
uses a self-signed cert and we're hitting 127.0.0.1 over a
port-forward — there is no MITM concern.
The metrics server's filter then TokenReview's the Bearer token
against the apiserver, so the SubjectAccessReview happens
against the SA's identity. As long as the SA has nonResourceURLs:
/metrics get (granted by the chart's metrics-reader ClusterRole +
binding, in a follow-up commit), the scrape returns 200 with the
Prometheus exposition. If the SA lacks the binding the scrape
returns 403 and the error in errors.txt now includes the HTTP
status + the first 512 bytes of the body, so users can see
exactly what the metrics server denied.
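A condensed sketch of steps 3 and 4 (TokenRequest plus the authenticated GET over the port-forward); the function name is illustrative, kube.Ctl.PortForward from step 2 is elided, and localPort is assumed to already be mapped to the metrics containerPort.

```go
import (
	"context"
	"crypto/tls"
	"fmt"
	"io"
	"net/http"

	authenticationv1 "k8s.io/api/authentication/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func scrapeMetrics(ctx context.Context, cs kubernetes.Interface, namespace, serviceAccount, scheme string, localPort int) ([]byte, error) {
	// Step 3: mint a short-lived (10 minute) token for the operator's SA.
	expiry := int64(600)
	tok, err := cs.CoreV1().ServiceAccounts(namespace).CreateToken(ctx, serviceAccount,
		&authenticationv1.TokenRequest{Spec: authenticationv1.TokenRequestSpec{ExpirationSeconds: &expiry}},
		metav1.CreateOptions{})
	if err != nil {
		return nil, err
	}
	// Step 4: GET /metrics on 127.0.0.1 over the port-forward. The metrics
	// server's cert is typically self-signed, so verification is skipped.
	httpClient := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}
	req, err := http.NewRequestWithContext(ctx, http.MethodGet,
		fmt.Sprintf("%s://127.0.0.1:%d/metrics", scheme, localPort), nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+tok.Status.Token)
	resp, err := httpClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	if resp.StatusCode != http.StatusOK {
		if len(body) > 512 {
			body = body[:512]
		}
		// Include the status and the body head so an RBAC denial is legible
		// in errors.txt.
		return nil, fmt.Errorf("metrics scrape: %s: %s", resp.Status, body)
	}
	return body, nil
}
```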
The metricsFetcher interface is unchanged, so existing
fake-fetcher unit tests continue to apply. The kubeMetricsFetcher
production implementation is the only thing that moved; tests
that exercise its surface (port lookup, error wrapping) would
need a real apiserver and are deferred to a multicluster
integration test.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Today bundle Run is a black box: it goes off, does network I/O
across N clusters, and returns either silently or with the final
"bundle written to ..." line. When something hangs or feels slow,
you have nothing to look at — even a goroutine dump from SIGQUIT
requires fishing the PID and running another tool.
Add a --verbose / -v flag that prints a one-line progress message
to stderr at every major step:
- connection resolution
- peer discovery (with per-peer "discovered" lines and per-
Secret discovery warnings inline)
- per-cluster check pipeline, with each check name and elapsed
duration as it completes — invaluable for "which check
hung?"
- cross-cluster checks (single line; they're typically fast)
- per-cluster log collection and metrics scrape, each with
elapsed time
- manifest / status.txt writes
Each line is prefixed with "[bundle HH:MM:SS.mmm]" so you can
eyeball gaps between two log lines and pinpoint where the wall
clock got stuck.
Off by default — clean machine-readable invocations are
unchanged.
Implementation:
- BundleConfig gains Verbose (bool) and ProgressOut (io.Writer)
fields. Verbose is bound to the flag; ProgressOut is a test
hook (defaults to os.Stderr).
- progressLogger() returns a no-op closure when Verbose is
false (no per-call cost beyond format-arg evaluation), and
a stderr writer with timestamp prefix when Verbose is true
(see the sketch after this list).
- The cluster-check loop now inlines what RunClusterChecks did
so we can wrap each check call in a logf timing pair. The
per-check Results aggregation behaviour is unchanged.
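A minimal sketch of that logger, assuming the BundleConfig fields named above; the timestamp format mirrors the "[bundle HH:MM:SS.mmm]" prefix, the rest is illustrative.

```go
import (
	"fmt"
	"io"
	"os"
	"time"
)

type BundleConfig struct {
	Verbose     bool
	ProgressOut io.Writer // test hook; nil means os.Stderr
	// ... other fields elided for this sketch
}

func (c *BundleConfig) progressLogger() func(format string, args ...any) {
	if !c.Verbose {
		return func(string, ...any) {} // no-op when not verbose
	}
	out := c.ProgressOut
	if out == nil {
		out = os.Stderr
	}
	return func(format string, args ...any) {
		fmt.Fprintf(out, "[bundle %s] %s\n",
			time.Now().Format("15:04:05.000"), fmt.Sprintf(format, args...))
	}
}
```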
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The bundle subcommand's --logs-size-limit flag promoted
github.com/docker/go-units from indirect to direct in Phase 2 (67f3422), but
task generate wasn't run as part of that commit so third_party.md missed the
new entry. Catch it up now; no other licenses-affecting dependency changes in
flight.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The chart already emits a metrics-reader ClusterRole granting
nonResourceURLs:["/metrics"] verbs:["get"] (rbac.go:107-123) but
deliberately leaves it unbound — the comment was "We skip over
making a binding for the metrics viewer role." The intent was
that consumers (Prometheus, etc.) bind it from their own
ServiceAccount.
That intent missed two cases that actually exist in the chart
itself:
1. The chart ships a ServiceMonitor (_servicemonitor.go.tpl)
that scrapes /metrics over HTTPS using the operator pod's
projected SA token — i.e. it scrapes *as the operator's
own SA*. controller-runtime's metrics server runs with
filters.WithAuthenticationAndAuthorization, which
TokenReview's the bearer token and then SubjectAccessReview's
the resulting identity. Without a binding, that SAR returns
"no" for the operator SA and the scrape is silently denied.
2. `rpk k8s multicluster bundle` (in development on this
branch) authenticates the same way and hits the same 403.
The minimum-privilege fix is a single ClusterRoleBinding from
the operator's own SA to the existing metrics-reader ClusterRole.
The role's only rule is `nonResourceURLs:["/metrics"], verbs:
["get"]` — the smallest possible privilege bump. External
consumers (Prometheus etc.) still bind to the same ClusterRole
from their own SA, so existing deployments are unaffected
beyond the new binding.
The "skip" comment is replaced with a comment explaining the
new behaviour and the fact that other consumers still bind
separately.
Regenerated _rbac.go.tpl via `task generate` and the chart
TestTemplate golden files via `go test ... -run TestTemplate
-update-golden`. (Note: CLAUDE.md says these tests use `-update`,
but the actual flag is `-update-golden`. Worth a doc fix in a
follow-up.)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Parallel to the chart-side fix in 5ab86fd: the multicluster operator's
per-StretchCluster RBAC scaffolding (rbac.go) didn't emit the self-metrics
ClusterRole or its binding either. Same two consumers hit the same 403:
- the chart's bundled ServiceMonitor scrapes /metrics over HTTPS using the
operator pod's projected SA token,
- `rpk k8s multicluster bundle` mints a TokenRequest token for the operator
SA and scrapes /metrics over a port-forward.
Both authenticate as the operator SA and need nonResourceURLs:["/metrics"]
verbs:["get"] in a ClusterRole bound to that SA.
Add the ClusterRole to clusterRoles() (gated on the existing RBAC.IsEnabled()
guard, like rack-awareness). The binding is emitted automatically because
clusterRoleBindings() iterates clusterRoles() and binds each to
GetServiceAccountName(). One extra rule, smallest possible privilege bump
("get /metrics", nothing else).
Test goldens regenerated:
- operator/multicluster/testdata/render-cases.resources.golden.txtar via
`go test ./operator/multicluster/... -update-golden`
- operator/internal/lifecycle/testdata/stretch-cluster-cases.resources.golden.txtar
via `go test ./operator/internal/lifecycle/... -update-golden` (the
stretch-cluster lifecycle test exercises the same RBAC rendering through its
render pipeline)
After this lands, fresh deployments via `task dev:setup-multicluster-dev-env`
give every operator pod metrics access to its own /metrics out of the box.
Existing deployments need a chart upgrade for the binding to materialise.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirror `rpk debug bundle`'s default of two /metrics samples 10s
apart per cluster. A single Prometheus snapshot loses the
rate-of-change information that's most useful for raft
debugging — leader changes, send errors, drops, RTT histograms
are all rate-style signals where two samples let an investigator
compute deltas post-hoc without needing a live Prometheus.
Layout change inside the bundle:
clusters/<context>/metrics/metrics.txt
↓
clusters/<context>/metrics/t0_metrics.txt
clusters/<context>/metrics/t1_metrics.txt
Same file naming convention as `rpk debug bundle`, so anyone
already familiar with that bundle format reads ours the same
way.
New flags:
--metrics-samples int # default 2, must be >= 2
--metrics-interval duration # default 10s, must be > 0
Validation matches `rpk debug bundle`: a single sample defeats
the point (no delta possible), so anything below 2 is rejected
at flag-parse time before any work is done. Per-sample failures
are recorded in errors.txt and the next sample is still
attempted; the `--metrics-interval` sleep respects context
cancellation so a SIGINT'd bundle exits promptly.
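A sketch of that sample loop, showing the context-aware sleep; fetch, writeFile, peer, and the error slice are hypothetical stand-ins for the real collection plumbing.

```go
for i := 0; i < samples; i++ {
	if i > 0 {
		// Inter-sample sleep that honours cancellation, so Ctrl-C exits promptly.
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(interval):
		}
	}
	body, err := fetch(ctx) // one /metrics scrape
	if err != nil {
		// A per-sample failure is recorded; the next sample is still attempted.
		errs = append(errs, fmt.Errorf("metrics sample %d/%d: %w", i+1, samples, err))
		continue
	}
	writeFile(fmt.Sprintf("clusters/%s/metrics/t%d_metrics.txt", peer, i), body)
}
```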
manifest.json grows two fields recording the policy:
metricsSamples: 2
metricsIntervalSeconds: 10
so a bundle reader can interpret the t<i>_metrics.txt filenames
without guessing.
Verbose mode (`-v`) now also logs each per-sample scrape
("[<context>] /metrics sample 1/2") which is the difference
between "is the bundle hung" and "we're in the inter-sample
sleep, hold on 10s".
Tests cover the new behaviour: multi-sample happy path with
distinct per-sample bodies landing in distinct files;
per-sample failure that doesn't abort the loop; the existing
single-sample paths updated to assert against `t0_metrics.txt`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two issues hit at once when running the bundle against a stretch
dev env where the operator pods are deployed under non-default
labels:
1. `findSecretName` returned "" because the deployment volume
it expected (`<SecretPrefix>-multicluster-certificates`)
wasn't on the deployment. The fallback branch then
synthesised a secret name as
`<SecretPrefix>-multicluster-certificates` and tried it
verbatim. With `SecretPrefix` derived from a kube context
name (e.g. `vcluster_vc-...-0_..._k3d-harpoon`) the
synthesised name contains underscores — invalid as a
Kubernetes Secret name and definitely not present in the
cluster.
2. The check then called `WaitFor` with a "return nil error"
condition, which polls indefinitely until the secret
appears. Against a synthesised, never-present name that
means the bundle hangs at the first peer's `tls` check
and never advances.
A debug bundle is a snapshot, not a probe. If the secret is
missing or mid-rotation, that is itself the diagnostic and
belongs in `checks.json` as a Fail() — not as a hang. Replace
`WaitFor` with `Get` so a missing secret fails fast.
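A sketch of the snapshot-style lookup that replaces the polling; the cc.Fail helper and the surrounding check structure are hypothetical, the fail-fast shape follows the text.

```go
// apierrors is k8s.io/apimachinery/pkg/api/errors.
secret, err := cs.CoreV1().Secrets(namespace).Get(ctx, secretName, metav1.GetOptions{})
if apierrors.IsNotFound(err) {
	// The missing secret is itself the diagnostic: fail the check fast and
	// record it in checks.json instead of polling until it appears.
	cc.Fail(fmt.Errorf("tls secret %s/%s not found", namespace, secretName))
	return nil
}
if err != nil {
	return err
}
// ...continue with cert extraction and the tls-key-match comparison.
_ = secret
```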
Also drop the brittle name synthesis. Fall straight from
deployment-volume matching (the precise path) to
scan-by-suffix listing of all Secrets in the namespace
(the robust fallback). The previous middle step — synthesising
from `SecretPrefix` — only worked when the user happened to
pass the chart's release name as `--service-name`'s
counterpart, and produced obviously-wrong names otherwise.
The existing test commentary in `bundle_test.go:49`
("`TLSCheck` (uses WaitFor which polls until the secret exists
or the context is cancelled — hangs against a vanilla envtest)")
acknowledged the broken behaviour as a known limitation. With
this change the comment's premise no longer holds — TLSCheck
can be re-included in `fastTestChecks` in a follow-up if we want
to expand integration coverage.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Adds `rpk k8s multicluster bundle`, a diagnostics tool for the multicluster
operator analogous to `rpk debug bundle` for Redpanda. Given a single
kubeconfig pointing at one of the operator's host clusters, it discovers the
remaining peers via labeled cache Secrets, then collects per-cluster artifacts
(operator pod manifest, deployment, logs, metrics, raft status, TLS material)
and runs the existing multicluster checks pipeline as both per-cluster and
cross-cluster checks. Output is a single zip archive.

This is the operator-side complement to `rpk debug remote-bundle` — useful for
triage of stretch clusters where you need a snapshot of all three operator
instances in one shot.

What's in the bundle

(Sub-directory names under `clusters/` are the kube context names from the
discovered kubeconfigs — `peer-0`/`peer-1`/`peer-2` shown above for
readability.) `managedFields` are stripped from every Kubernetes object.
`tls.key` and `kubeconfig.yaml` Secret payloads are redacted unless
`--include-private-keys` is passed. Per-artifact failures are written to
`errors.txt` rather than aborting the bundle.

Discovery

The user passes a single kubeconfig (`--kubeconfig`) pointing at any one peer.
The tool lists Secrets in the operator namespace matching
`app.kubernetes.io/component=multicluster-kubeconfig-cache` (a new label added
in this PR), parses each cached peer kubeconfig, and proceeds against all
peers. No pre-merging required.

In environments where the cached peer kubeconfigs have server URLs that aren't
reachable from the operator (e.g. local stretch dev envs running vClusters
inside a k3d host), pass all contexts explicitly with repeated `--context` —
the bundle command sees `len(starting) > 1`, skips discovery, and uses only
what was given.

Metrics scraping

`/metrics` is gated by controller-runtime's
`filters.WithAuthenticationAndAuthorization`, so the apiserver pod-proxy
doesn't authenticate. The bundle instead port-forwards into the operator pod
and presents a Bearer token minted via TokenRequest for the operator
ServiceAccount. The chart's existing `*-metrics-reader` ClusterRole
(`nonResourceURLs: ["/metrics"]`) is now bound to that SA so the scrape
returns 200 instead of 403 — applies to both `operator/chart` (single-cluster)
and `operator/multicluster` (multicluster).

By default, two samples are taken 10s apart per cluster (matches
`rpk debug bundle`'s default). Multiple samples let an investigator compute
counter rate-of-change post-hoc without needing a live Prometheus —
particularly useful for rate-style raft signals (leader changes, send errors,
drops, RTT). Samples land at `metrics/t0_metrics.txt`, `t1_metrics.txt`, … and
the manifest records `metricsSamples` + `metricsIntervalSeconds` so a bundle
reader can interpret the layout.

TLSCheck behavior

`TLSCheck` is a snapshot read (Get) rather than WaitFor-style polling. A debug
bundle should never block on cert-manager finishing — if the TLS secret is
missing or mid-rotation, that is itself a diagnostic and surfaces in
`checks.json`. The check now falls straight from deployment-volume matching to
scan-by-suffix listing, dropping the brittle synthesis of
`<SecretPrefix>-multicluster-certificates` (which produced invalid Secret
names when `SecretPrefix` was a kube context name like
`vcluster_..._k3d-harpoon`).

Flags

- `--kubeconfig` / `--context` / `--namespace` / `--service-name` (shared
connection flags; `--service-name` is the `app.kubernetes.io/name` label
value, default `operator`)
- `-o, --output` — output path (defaults to `./operator-bundle-<unix-ts>.zip`)
- `--include-private-keys` — keep `tls.key` and `kubeconfig.yaml` payloads
- `--skip-logs` / `--logs-size-limit` (default `5M`, supports `MiB`/`GiB`) /
`--logs-tail-lines` (default `5000`)
- `--skip-metrics` / `--metrics-samples` (default `2`, must be ≥ 2) /
`--metrics-interval` (default `10s`)
- `-v, --verbose` — stderr progress trace with timestamps for diagnosing hangs