[CORE-13707] redpanda: the great config bootstrap fix #30396
Open
WillemKauf wants to merge 17 commits into redpanda-data:dev
Conversation
Force-pushed from 4a97a52 to 4755fa7
Contributor
Pull request overview
This PR reworks Redpanda’s startup/bootstrap sequencing to prevent early reads of stale cluster configuration during node join/restart. It introduces a “config readiness” gate for config::shard_local_cfg(), splits kvstore recovery from writer startup, and adds an RPC path to fetch a leader-authoritative controller snapshot for restarting nodes.
Changes:
- Add a readiness check to `config::shard_local_cfg()` plus an `*_unsafe()` escape hatch for early bootstrap (see the sketch below).
- Split `kvstore` recovery into an explicit phase, and restructure application bootstrap to recover/apply snapshots before marking config "ready".
- Add a `fetch_controller_snapshot` RPC and controller snapshot plumbing to support leader-authoritative bootstrap for restarting nodes.
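For concreteness, a minimal sketch of the shape such a readiness gate could take (plain C++; names like `cfg_ready` and `mark_cfg_ready()` are illustrative assumptions, and the PR uses `vassert` rather than `assert`):

```cpp
#include <cassert>

namespace config {

struct configuration { /* cluster properties live here */ };

namespace {
configuration global_cfg;
// Flipped exactly once, after bootstrap has hydrated the config from the
// local cache and (if possible) a leader-authoritative snapshot.
bool cfg_ready = false;
} // namespace

// Guarded accessor: crashes loudly if used before the config is hydrated.
configuration& shard_local_cfg() {
    assert(cfg_ready && "shard_local_cfg() called before config was ready");
    return global_cfg;
}

// Escape hatch for audited early-bootstrap callers that can tolerate
// potentially stale values.
configuration& shard_local_cfg_unsafe() { return global_cfg; }

void mark_cfg_ready() { cfg_ready = true; }

} // namespace config
```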
Reviewed changes
Copilot reviewed 29 out of 29 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/v/storage/storage_resources.h | Extend ctor signature and add API to update append chunk size during bootstrap. |
| src/v/storage/storage_resources.cc | Plumb append chunk size into storage_resources initialization and add updater method. |
| src/v/storage/kvstore.h | Make recover() a public entry point and add start(recover_t) mode switch. |
| src/v/storage/kvstore.cc | Implement split recovery/start flow; adjust read APIs to require recovery rather than full start. |
| src/v/storage/api.h | Construct kvstore earlier and add recover_kvstore() bootstrap hook before start(). |
| src/v/redpanda/BUILD | Add Bazel deps needed by updated application/bootstrap compilation units. |
| src/v/redpanda/application.h | Reshape bootstrap APIs into async helpers for applying snapshots/config and discovery. |
| src/v/redpanda/application.cc | Move cluster-config hydration earlier and ensure feature-table metrics are set up during metrics init. |
| src/v/redpanda/application_start.cc | Use stored cluster_discovery instance and remove passing discovery by reference into runtime start. |
| src/v/redpanda/application_config.cc | Use shard_local_cfg_unsafe() for bootstrap-time kvstore config extraction. |
| src/v/redpanda/application_bootstrap.cc | Rework bootstrap sequence: recover kvstore, apply snapshots, discovery/join, then mark config ready. |
| src/v/features/feature_table.h | Add setup_metrics() API. |
| src/v/features/feature_table.cc | Defer probe/metrics setup from ctor to explicit setup_metrics(). |
| src/v/config/node_config.cc | Use shard_local_cfg_unsafe() during node-config YAML loading/ignores. |
| src/v/config/configuration.h | Add readiness flag + document shard_local_cfg() vs shard_local_cfg_unsafe(). |
| src/v/config/configuration.cc | Implement shard_local_cfg_unsafe() and guard shard_local_cfg() with readiness vassert. |
| src/v/cluster/types.h | Add serde RPC request/reply types for fetching a controller snapshot. |
| src/v/cluster/service.h | Add controller service RPC endpoint for fetch_controller_snapshot. |
| src/v/cluster/service.cc | Implement RPC handler to delegate snapshot fetch logic to members manager. |
| src/v/cluster/members_manager.h | Declare handler to produce/forward leader-authoritative controller snapshot replies. |
| src/v/cluster/members_manager.cc | Implement snapshot fetch path (forward to leader if needed; leader builds partial join snapshot). |
| src/v/cluster/controller.json | Register new fetch_controller_snapshot RPC in controller protocol schema. |
| src/v/cluster/controller.cc | Adjust controller start wiring after config_manager ctor signature change. |
| src/v/cluster/controller_stm.h | Generalize join snapshot construction via templated backend selection. |
| src/v/cluster/controller_stm.cc | Reorder snapshot application so config manager applies earlier; remove old join snapshot impl. |
| src/v/cluster/config_manager.h | Remove cluster recovery table dependency from config manager constructor/state. |
| src/v/cluster/config_manager.cc | Switch more bootstrap-time reads to shard_local_cfg_unsafe() and adjust replay/bootstrap flow. |
| src/v/cluster/cluster_discovery.h | Update discovery ctor signature and add controller snapshot fetch API (with stale-comment mismatch). |
| src/v/cluster/cluster_discovery.cc | Implement new controller snapshot fetch RPC path and decouple discovery from storage::api. |
Comments suppressed due to low confidence (1)
src/v/cluster/config_manager.cc:193
`config_manager::apply_local()` still stages updates to `needs_restart` properties via `set_pending_value(...)`, but `config_manager::start()` no longer promotes pending values after STM replay. This means nodes may never apply `needs_restart` config updates even after a restart (pending values remain invisible indefinitely). Reintroduce a promotion point after replay (as before), or otherwise ensure pending values are promoted at least once during startup after replay completes.
ss::future<> config_manager::start() {
if (_seen_version == config_version_unset) {
vlog(clusterlog.trace, "Starting config_manager... (initial)");
// Background: wait till we have a leader. If this node is leader
Comment on lines +104 to +108:

    // Flushing background fiber
    ssx::spawn_with_gate(_gate, [this] {
        return ss::do_until(
          [this] { return _gate.is_closed(); },
          [this] {
Contributor
Author
Meh probably unnecessary
    // Accessors for the shard local cluster configuration.
    // shard_local_cfg() contains a vassert which strictly enforces that the
    // configuration has been marked as ready by the bootstrap process after we have
    // successfully preloaded from our node local state, and furthermore, received
Comment on lines +142 to +144:

    // Static because the call doesn't need any cluster_discovery
    // instance state — it dispatches via a one-shot RPC client and
    // reads the seed-server list from global node config.
Force-pushed from 72cd608 to 0140ac7
This comment was marked as outdated.
Force-pushed from 3ec0835 to 2058bd9
This comment was marked as outdated.
Collaborator
Force-pushed from 2058bd9 to 625236c
We will want to be able to use the `kvstore` for read-only purposes early in the bootstrap process to recover important persisted state. Expose `kvstore::recover()` as a way to allow read-only users access to its state. Also expose an accessor higher up in `storage::api`.
The `kvstore` should be available after its constructor is called, for ease of access.
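A sketch of what such a split lifecycle can look like (a hypothetical simplification; only `recover()`, `start()`, and the `recover_t` switch are named by the PR, and the real kvstore is asynchronous and Seastar-based):

```cpp
#include <map>
#include <optional>
#include <string>

class kvstore {
public:
    enum class recover_t { yes, no };

    // Phase 1: replay persisted segments into the in-memory map so that
    // read-only users can query state early in bootstrap, before any
    // writer machinery exists.
    void recover() {
        if (_recovered) {
            return; // idempotent: start() may follow an early recover()
        }
        // ... replay on-disk segments into _db ...
        _recovered = true;
    }

    // Phase 2: start the writer (flush fiber, etc.). Recovery is skipped
    // when the bootstrap path already performed it.
    void start(recover_t r) {
        if (r == recover_t::yes) {
            recover();
        }
        _started = true;
    }

    // Reads require only recovery, not a full start.
    std::optional<std::string> get(const std::string& key) const {
        if (!_recovered) {
            return std::nullopt;
        }
        auto it = _db.find(key);
        return it == _db.end() ? std::nullopt
                               : std::optional<std::string>(it->second);
    }

private:
    std::map<std::string, std::string> _db;
    bool _recovered = false;
    bool _started = false;
};
```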
…covery`

We were only using this in order to access the `node_uuid` and `cluster_uuid`, which are constant after they are first set. Simply pass these values directly instead of passing the entire `storage::api` into the constructor.
And rework some of the metrics registration logic. This is in order to avoid initial calls to `shard_local_cfg()` in the bootstrap process when (potentially) setting up a TLS probe for the RPC server.
Force-pushed from 625236c to d9c2e30
Will be used in the future as a static member function without a `members_manager` instance.
We are going to want control over what backends we need to include in a `controller_stm` snapshot in a future commit. Move this function to the header and use a variadic template for this purpose.
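A sketch of the variadic-template shape this enables (hypothetical backend and function names; the real `controller_stm` code differs):

```cpp
#include <iostream>

struct snapshot { /* accumulated controller state */ };

// Hypothetical backends; each knows how to write its slice of a snapshot.
struct config_backend {
    void fill(snapshot&) const { std::cout << "config state\n"; }
};
struct feature_backend {
    void fill(snapshot&) const { std::cout << "feature state\n"; }
};

// Variadic selection: callers choose exactly which backends contribute.
template<typename... Backends>
snapshot make_join_snapshot(const Backends&... backends) {
    snapshot s;
    (backends.fill(s), ...); // C++17 fold over the selected backends
    return s;
}

int main() {
    config_backend cfg;
    feature_backend features;
    // Only the backends named here end up in the snapshot.
    snapshot s = make_join_snapshot(cfg, features);
    (void)s;
}
```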
Pull config_manager::apply_snapshot out of the ss::when_all parallel block and run it serially after feature_backend and before members_manager. Downstream backends (members, topics, plugin, recovery, security, quota, migrations, cluster_link) may consult shard_local_cfg during their own apply_snapshot work; sequencing config_manager first guarantees they see fresh values.
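A runnable outline of the resulting ordering (plain, synchronous stand-ins for what are Seastar futures and a `ss::when_all` block in the real code; all names here are illustrative):

```cpp
#include <cstdio>

struct controller_snapshot {};

void config_apply(const controller_snapshot&) {
    std::puts("config_manager: hydrate shard_local_cfg()");
}
void members_apply(const controller_snapshot&) {
    std::puts("members_manager: may read shard_local_cfg()");
}
void topics_apply(const controller_snapshot&) {
    std::puts("topics: may read shard_local_cfg()");
}

void apply_controller_snapshot(const controller_snapshot& snap) {
    // Serial first: every later backend is guaranteed to see fresh
    // configuration values.
    config_apply(snap);
    // These remain independent of each other and may run concurrently in
    // the real code; shown sequentially here for simplicity.
    members_apply(snap);
    topics_apply(snap);
}

int main() { apply_controller_snapshot({}); }
```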
Removes the promote_all_pending() call from config_manager::start(). Pending values that accumulated during the post-bootstrap controller log replay (i.e. cluster_config_delta_cmd records applied above the hydrated snapshot offset) now stay pending instead of being promoted when config_manager::start runs. This is the strict reading of needs_restart=yes semantics: a property change visible to a node only after it restarts. shard_local_cfg() is already hydrated from the controller snapshot before bootstrap proceeds (see application::bootstrap_cluster_config_view), so any delta replayed afterward is a change the cluster made *after* the hydration point and should not retroactively affect the running node's active values for needs_restart properties.
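A minimal model of this strict `needs_restart=yes` semantics (a hypothetical `restart_property` type; redpanda's `config::property` is far richer):

```cpp
#include <cassert>
#include <optional>
#include <utility>

// A delta replayed after the hydration point only sets the pending value;
// the active value changes on the next process start, when we re-hydrate.
template<typename T>
class restart_property {
public:
    explicit restart_property(T initial) : _active(std::move(initial)) {}

    void set_pending_value(T v) { _pending = std::move(v); }

    // No runtime promote_all_pending() anymore: _pending is only folded
    // into _active when the node restarts and re-hydrates from a snapshot.
    const T& value() const { return _active; }

private:
    T _active;
    std::optional<T> _pending;
};

int main() {
    restart_property<int> chunk_size(4096);
    chunk_size.set_pending_value(8192); // delta replayed above the snapshot
    assert(chunk_size.value() == 4096); // stays pending until restart
}
```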
A bit of code factored out. Currently used for gating `join_node_request`s; it will be used in a future commit for handling `fetch_controller_snapshot` requests as well.
A new RPC to fetch a controller snapshot, containing the latest in-memory state of the controller leader's `config_manager` and `feature_backend`. This RPC will be called unconditionally during bootstrap of a `redpanda` broker, whether that node is simply restarting or is joining for the first time. The reason is that a node that has been down for a long time will have only stale configurations and features in its local cache/snapshots/log, and the configuration and feature managers are two global pieces of state that should be brought to a consistent view early in a `redpanda` node's lifetime.
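For orientation, a shape-only sketch of such a request/reply pair in redpanda's usual serde-envelope style (the field names and types below are guesses, not the PR's actual definitions in `src/v/cluster/types.h`):

```cpp
struct fetch_controller_snapshot_request
  : serde::envelope<
      fetch_controller_snapshot_request,
      serde::version<0>,
      serde::compat_version<0>> {
    // e.g. the requester's id, so a non-leader can forward appropriately
    model::node_id node;

    auto serde_fields() { return std::tie(node); }
};

struct fetch_controller_snapshot_reply
  : serde::envelope<
      fetch_controller_snapshot_reply,
      serde::version<0>,
      serde::compat_version<0>> {
    // Leader-authoritative slices of controller state: the cluster config
    // and the feature table, per the commit message above.
    std::optional<controller_join_snapshot> snapshot;
    errc error;

    auto serde_fields() { return std::tie(snapshot, error); }
};
```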
This comment was marked as outdated.
Force-pushed from d9c2e30 to 7ac0443
redpanda: the great config bootstrap fix
This comment was marked as outdated.
The start-up process for `redpanda` now looks like this (a runnable outline follows the list):
1. Cluster configs are hydrated from the local `config_cache` and legacy `node_cfg_yaml`.
2. `storage::api` is constructed (but not started), and the local `kvstore` is recovered.
3. We bootstrap from the `kvstore`:
   1. Apply the local feature table snapshot (potentially stale) from the `kvstore`.
   2. Attempt to load any existing cluster/node UUIDs from the `kvstore`.
4. We perform cluster discovery using the RPC `bootstrap_service`:
   1. If we are a joiner, we perform our RPC to register with the cluster, and collect the `controller_join_snapshot` this way.
   2. If we are simply a restarter, we have our snapshot of the cluster members loaded, and we use this list to issue an RPC (which will eventually hit the controller leader) to fetch the most up-to-date version of the controller state (specifically, the cluster config and the feature table).
5. We now have a consistent view, shared with the rest of the cluster, of the feature table and cluster configuration for the rest of the bootstrapping process.

raft0 replay still happens _after_ the rest of the system is started.
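A runnable outline of that ordering (function names are illustrative, not the PR's actual symbols):

```cpp
#include <cstdio>

void hydrate_local_cluster_config() { std::puts("1. hydrate from config_cache / legacy yaml"); }
void recover_kvstore()              { std::puts("2. construct storage::api, recover kvstore"); }
void bootstrap_from_kvstore()       { std::puts("3. apply feature snapshot, load UUIDs"); }
void discover_cluster()             { std::puts("4. join, or fetch_controller_snapshot from leader"); }
void mark_config_ready()            { std::puts("5. shard_local_cfg() is now safe to use"); }
void start_runtime_services()       { std::puts("   ...start; raft0 replay happens after this"); }

int main() {
    hydrate_local_cluster_config();
    recover_kvstore();
    bootstrap_from_kvstore();
    discover_cluster();
    mark_config_ready();
    start_runtime_services();
}
```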
A big caveat with cluster recovery: fetched `needs_restart` properties can no longer be applied until the recovered node has restarted. There is no clear way to make this _not_ the behavior.
And swap any uses of `shard_local_cfg()` which occur in the bootstrap
process (before it is marked as `ready`) over to the unsafe accessor.
These uses of potentially stale values are all audited and deemed safe,
due to minimal or no impact on the behavior of the system post bootstrap.
Two other relevant changes in this commit:
* Pulling out `feature_table::setup_metrics()` into its own function,
called after the bootstrap process (in order to properly respect
`shard_local_cfg().disable_{public}_metrics`)
* Updating `storage_resources` to _not_ invoke the `internal::chunks()`
singleton constructor, which would in turn invoke `shard_local_cfg()`.
We use `storage_resources` for an early replay of the `kvstore` segments,
so it is fine to use stale configuration values here. However, after
bootstrapping is complete, it is *probably* important to set the
`_append_chunk_size` back to the same value as what `internal::chunks()`
is operating with. We now do so when initializing the storage system in
`wire_up_bootstrap_services()`.
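A sketch of that chunk-size handoff (names are assumptions layered on the PR's description):

```cpp
#include <cstddef>

// During early kvstore replay we run with a possibly stale default; once
// the config is ready, bootstrap re-aligns us with the size the real
// internal::chunks() cache is operating with.
class storage_resources_sketch {
public:
    explicit storage_resources_sketch(size_t append_chunk_size)
      : _append_chunk_size(append_chunk_size) {}

    // Called once the chunk cache (and shard_local_cfg()) is available,
    // analogous to the update performed in wire_up_bootstrap_services().
    void update_append_chunk_size(size_t sz) { _append_chunk_size = sz; }

    size_t append_chunk_size() const { return _append_chunk_size; }

private:
    size_t _append_chunk_size;
};
```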
Force-pushed from 7ac0443 to 07a3683
redpanda: the great config bootstrap fix
BLUF
Re-arrangement of the bootstrap process in order to allow restarting nodes to see an up-to-date view of cluster-wide state before initializing systems that depend on it.
BLUF Continued
Two important cluster-wide states are the cluster configuration and feature table views, which we now attempt to fetch via a new `fetch_controller_snapshot` RPC to the controller leader, similar to a node joining the cluster for the first time.

This is mostly why the config cache exists, as well as why we snapshot member tables and feature tables into the `kvstore`: to allow a view of these properties BEFORE bootstrapping. However, these local snapshots can always be stale, and it is important to bring the node up to date with the controller leader before the system is started.

This prevents bugs like:

- `cloud_topics_enabled` is set on, a cloud topic is created, the node comes back, replays the controller log after bootstrapping itself, doesn't have cloud topics capabilities, but creates the topic anyway in an undefined state.
- `log_compaction_use_sliding_window` is set on, the node comes back, replays the controller log after bootstrapping itself, sets this property on, but no memory reservation has been properly allocated for the sliding window map: a potential segfault.

These are just two examples of the undefined behaviors that can happen when we bootstrap before establishing a cluster-wide view of these configurations and/or features.
We can also no longer promote cluster configs during cluster recovery or during controller log replay after the bootstrapping process - that would be invalid behavior.
Commits
Commits 1-10 are mostly mechanical changes made in advance of a large bootstrap re-ordering in the main `redpanda` application. They are all necessary to enable certain behaviors we need for that bootstrap refactoring. These could be pulled out into separate PR(s) for reviewing purposes.

Commit 11 is important because when applying controller snapshots, we need to ensure that configuration updates are applied before other commands which may depend on them.

Commit 12 removes the promotion of `needs_restart::yes` properties with pending values to active. This was done in a few places, and it seems to be bug-prone everywhere. The only place we can safely promote pending properties is at the beginning of the bootstrap process, when we are reading from our local `config_cache` or a controller snapshot fetched from the leader.

Commits 13-14 add the new RPC for fetching the controller snapshot using the snapshot of the list of members local to a node. This ultimately should be directed to the controller leader, but we permit this request to fail (unlike a join request, which cannot fail, or else a Redpanda node will refuse to start). Failing means we might start with a stale view of the config, but the node will be made eventually consistent after establishing `raft0` and replaying the controller log (the application of which might require the node to restart again).

Commit 15 is the big one which refactors the bootstrap process to use this new RPC, and re-orders a bunch of dependencies in order to do so. It is hard to split this commit up for the sake of reviewing, and the diff is also difficult to look at, so I would encourage trying to read the code directly.

Commit 16 adds `shard_local_cfg_unsafe()` and a new assert to `shard_local_cfg()` which will pick up on bugs caused by use of `shard_local_cfg()` before the cluster view has been established and the configuration has been marked as ready. `shard_local_cfg_unsafe()` use should be limited to configs that have little to no effect on the system if a stale value is used; unfortunately there are a few places in the code where this is currently unavoidable. This commit is somewhat optional in the long run, but it has been a great source of finding bugs and ironing out the bootstrap refactor.

Commit 17 adds some tests which would fail previously due to our mishandling of stale cluster configs (the joining-node test is mostly for regression; it would pass before).
Backports Required
Release Notes
Bug Fixes
- Fixed the use of stale cluster configurations and feature table state by restarting `redpanda` nodes during the bootstrapping process.