
[CORE-13707] redpanda: the great config bootstrap fix #30396

Open
WillemKauf wants to merge 17 commits into redpanda-data:dev from WillemKauf:config_bootstrap_fix


@WillemKauf WillemKauf commented May 6, 2026

BLUF

Re-arranges the bootstrap process so that restarting nodes see an up-to-date view of the cluster-wide state before initializing the systems that depend on it.

BLUF Continued

Two important pieces of cluster-wide state are the cluster configuration and the feature table, which we now attempt to fetch via a new fetch_controller_snapshot RPC to the controller leader, similar to what happens when a node joins the cluster for the first time.

The need for such a view is mostly why the config cache exists, and why we snapshot the members table and feature table into the kvstore: to allow a view of these properties BEFORE bootstrapping. However, these local snapshots can be stale, so it is important to bring the node up to date with the controller leader before the system is started.

This prevents bugs like:

  • Node goes down for a while, cloud_topics_enabled is set on, a cloud topic is created, and the node comes back. It replays the controller log after bootstrapping itself, so it doesn't have cloud topics capabilities but creates the topic anyway, in an undefined state.
  • Node goes down for a while, log_compaction_use_sliding_window is set on, and the node comes back. It replays the controller log after bootstrapping itself and sets this property on, but no memory reservation has been allocated for the sliding window map, which can lead to a segfault.

These are just two examples of the undefined behaviors that can happen when we bootstrap before establishing a cluster wide view of these configurations and/or features.

We can also no longer promote cluster configs during cluster recovery or during controller log replay after the bootstrapping process - that would be invalid behavior.

Commits

  • Commits 1-10 are mostly mechanical changes made in advance of a large bootstrap re-ordering in the main redpanda application. They are all necessary to enable certain behaviors we need for that bootstrap refactoring. These could be pulled out into separate PR(s) for reviewing purposes.

  • Commit 11 is important because when applying controller snapshots, we need to ensure that configuration updates are applied before other commands which may depend on them.

  • Commit 12 removes the promotion of needs_restart::yes properties with pending values to active. This was done in a few places, and it seems to be bug-prone everywhere. The only place we can safely promote pending properties is at the beginning of the bootstrap process, when we are reading from our local config_cache or a controller snapshot fetched from the leader.

  • Commits 13-14 add the new RPC for fetching the controller snapshot using the snapshot of the list of members local to a node - this ultimately should be directed to the controller leader, but we permit this request to fail (unlike a join request, which cannot fail, or else a Redpanda node will refuse to start). Failing means we might start with a stale view of the config, but the node will be made eventually consistent after establishing raft0 and replaying the controller log (the application of which might require the node to restart again).

  • Commit 15 is the big one which refactors the bootstrap process to use this new RPC, and re-order a bunch of dependencies which are required to do so. It is hard to split this commit up for the sake of reviewing, and the diff is also difficult to look at, so I would encourage trying to read the code directly.

  • Commit 16 adds shard_local_cfg_unsafe() and a new assert to shard_local_cfg(), which will catch bugs caused by using shard_local_cfg() before the cluster view has been established and the configuration has been marked as ready. Use of shard_local_cfg_unsafe() should be limited to configs that have little to no effect on the system if a stale value is used; unfortunately, there are a few places in the code where this is currently unavoidable. This commit is somewhat optional in the long run, but has been a great help in finding bugs and ironing out the bootstrap refactor.

  • Commit 17 adds some tests which would fail previously due to our mishandling of stale cluster configs (joining node test is mostly for regression, it would pass before).

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v26.1.x
  • v25.3.x
  • v25.2.x

Release Notes

Bug Fixes

  • Fixes a bug in which stale configuration values could be read by joining or restarting redpanda nodes during the bootstrapping process.

Copilot AI review requested due to automatic review settings May 6, 2026 22:03
@WillemKauf WillemKauf requested a review from a team as a code owner May 6, 2026 22:03
@WillemKauf WillemKauf force-pushed the config_bootstrap_fix branch from 4a97a52 to 4755fa7 Compare May 6, 2026 22:04

Copilot AI left a comment


Pull request overview

This PR reworks Redpanda’s startup/bootstrap sequencing to prevent early reads of stale cluster configuration during node join/restart. It introduces a “config readiness” gate for config::shard_local_cfg(), splits kvstore recovery from writer startup, and adds an RPC path to fetch a leader-authoritative controller snapshot for restarting nodes.

Changes:

  • Add a readiness check to config::shard_local_cfg() plus an *_unsafe() escape hatch for early bootstrap.
  • Split kvstore recovery into an explicit phase, and restructure application bootstrap to recover/apply snapshots before marking config “ready”.
  • Add a fetch_controller_snapshot RPC and controller snapshot plumbing to support leader-authoritative bootstrap for restarting nodes.

Reviewed changes

Copilot reviewed 29 out of 29 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
src/v/storage/storage_resources.h Extend ctor signature and add API to update append chunk size during bootstrap.
src/v/storage/storage_resources.cc Plumb append chunk size into storage_resources initialization and add updater method.
src/v/storage/kvstore.h Make recover() a public entry point and add start(recover_t) mode switch.
src/v/storage/kvstore.cc Implement split recovery/start flow; adjust read APIs to require recovery rather than full start.
src/v/storage/api.h Construct kvstore earlier and add recover_kvstore() bootstrap hook before start().
src/v/redpanda/BUILD Add Bazel deps needed by updated application/bootstrap compilation units.
src/v/redpanda/application.h Reshape bootstrap APIs into async helpers for applying snapshots/config and discovery.
src/v/redpanda/application.cc Move cluster-config hydration earlier and ensure feature-table metrics are set up during metrics init.
src/v/redpanda/application_start.cc Use stored cluster_discovery instance and remove passing discovery by reference into runtime start.
src/v/redpanda/application_config.cc Use shard_local_cfg_unsafe() for bootstrap-time kvstore config extraction.
src/v/redpanda/application_bootstrap.cc Rework bootstrap sequence: recover kvstore, apply snapshots, discovery/join, then mark config ready.
src/v/features/feature_table.h Add setup_metrics() API.
src/v/features/feature_table.cc Defer probe/metrics setup from ctor to explicit setup_metrics().
src/v/config/node_config.cc Use shard_local_cfg_unsafe() during node-config YAML loading/ignores.
src/v/config/configuration.h Add readiness flag + document shard_local_cfg() vs shard_local_cfg_unsafe().
src/v/config/configuration.cc Implement shard_local_cfg_unsafe() and guard shard_local_cfg() with readiness vassert.
src/v/cluster/types.h Add serde RPC request/reply types for fetching a controller snapshot.
src/v/cluster/service.h Add controller service RPC endpoint for fetch_controller_snapshot.
src/v/cluster/service.cc Implement RPC handler to delegate snapshot fetch logic to members manager.
src/v/cluster/members_manager.h Declare handler to produce/forward leader-authoritative controller snapshot replies.
src/v/cluster/members_manager.cc Implement snapshot fetch path (forward to leader if needed; leader builds partial join snapshot).
src/v/cluster/controller.json Register new fetch_controller_snapshot RPC in controller protocol schema.
src/v/cluster/controller.cc Adjust controller start wiring after config_manager ctor signature change.
src/v/cluster/controller_stm.h Generalize join snapshot construction via templated backend selection.
src/v/cluster/controller_stm.cc Reorder snapshot application so config manager applies earlier; remove old join snapshot impl.
src/v/cluster/config_manager.h Remove cluster recovery table dependency from config manager constructor/state.
src/v/cluster/config_manager.cc Switch more bootstrap-time reads to shard_local_cfg_unsafe() and adjust replay/bootstrap flow.
src/v/cluster/cluster_discovery.h Update discovery ctor signature and add controller snapshot fetch API (with stale-comment mismatch).
src/v/cluster/cluster_discovery.cc Implement new controller snapshot fetch RPC path and decouple discovery from storage::api.
Comments suppressed due to low confidence (1)

src/v/cluster/config_manager.cc:193

  • config_manager::apply_local() still stages updates to needs_restart properties via set_pending_value(...), but config_manager::start() no longer promotes pending values after STM replay. This means nodes may never apply needs_restart config updates even after a restart (pending values remain invisible indefinitely). Reintroduce a promotion point after replay (as before), or otherwise ensure pending values are promoted at least once during startup after replay completes.
ss::future<> config_manager::start() {
    if (_seen_version == config_version_unset) {
        vlog(clusterlog.trace, "Starting config_manager... (initial)");

        // Background: wait till we have a leader. If this node is leader

Comment thread src/v/storage/api.h Outdated
Comment thread src/v/storage/kvstore.cc
Comment on lines +104 to +108
// Flushing background fiber
ssx::spawn_with_gate(_gate, [this] {
    return ss::do_until(
      [this] { return _gate.is_closed(); },
      [this] {
Contributor Author


Meh probably unnecessary

// Accessors for the shard local cluster configuration.
// shard_local_cfg() contains a vassert which strictly enforces that the
// configuration has been marked as ready by the bootstrap process after we have
// successfully preloaded from our node local state, and furthermore, received
Comment thread src/v/cluster/cluster_discovery.h Outdated
Comment on lines +142 to +144
// Static because the call doesn't need any cluster_discovery
// instance state — it dispatches via a one-shot RPC client and
// reads the seed-server list from global node config.
@WillemKauf WillemKauf force-pushed the config_bootstrap_fix branch 4 times, most recently from 72cd608 to 0140ac7 Compare May 7, 2026 16:10

@WillemKauf WillemKauf force-pushed the config_bootstrap_fix branch 6 times, most recently from 3ec0835 to 2058bd9 Compare May 8, 2026 02:09

@vbotbuildovich

vbotbuildovich commented May 8, 2026

CI test results

test results on build#84202
test_status test_class test_method test_arguments test_kind job_url passed reason test_history
FLAKY(FAIL) ClusterConfigTest test_rpk_force_reset null integration https://buildkite.com/redpanda/redpanda/builds/84202#019e0572-3ea1-49dd-bb8e-0a64ae99c8fe 8/11 Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterConfigTest&test_method=test_rpk_force_reset
FAIL FeaturesNodeJoinTest test_synthetic_too_new_node_join null integration https://buildkite.com/redpanda/redpanda/builds/84202#019e0572-3e9c-4669-89a1-36bb9d6fa5c0 0/11 Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=FeaturesNodeJoinTest&test_method=test_synthetic_too_new_node_join
FAIL FeaturesNodeJoinTest test_synthetic_too_new_node_join null integration https://buildkite.com/redpanda/redpanda/builds/84202#019e0577-5260-44b8-ad12-f9c3d286da0b 0/11 Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=FeaturesNodeJoinTest&test_method=test_synthetic_too_new_node_join
FAIL CertificateRevocationTest test_rpc null integration https://buildkite.com/redpanda/redpanda/builds/84202#019e0572-3e9f-4b11-82cd-7e01454e44de 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=CertificateRevocationTest&test_method=test_rpc
FAIL CertificateRevocationTest test_rpc null integration https://buildkite.com/redpanda/redpanda/builds/84202#019e0577-5262-420d-a506-1b5098a13685 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=CertificateRevocationTest&test_method=test_rpc
FAIL RpkRedpandaStartTest test_rpc_tls_enable null integration https://buildkite.com/redpanda/redpanda/builds/84202#019e0572-3ea0-4d17-8390-9427606c8ffa 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RpkRedpandaStartTest&test_method=test_rpc_tls_enable
FAIL RpkRedpandaStartTest test_rpc_tls_enable null integration https://buildkite.com/redpanda/redpanda/builds/84202#019e0577-5263-4a19-bf77-cdc8e64969cd 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RpkRedpandaStartTest&test_method=test_rpc_tls_enable
FAIL RpkRedpandaStartTest test_rpc_tls_start null integration https://buildkite.com/redpanda/redpanda/builds/84202#019e0572-3ea1-49dd-bb8e-0a64ae99c8fe 0/1 Test is INCONCLUSIVE after retries.Inconclusive result before max retries(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=1.0000, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RpkRedpandaStartTest&test_method=test_rpc_tls_start
FAIL RpkRedpandaStartTest test_rpc_tls_start null integration https://buildkite.com/redpanda/redpanda/builds/84202#019e0577-5263-4cd8-a7a0-8ca4e4693bdc 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RpkRedpandaStartTest&test_method=test_rpc_tls_start
FAIL RpcTLSSecurityReportTest test_security_report null integration https://buildkite.com/redpanda/redpanda/builds/84202#019e0572-3e9e-40ef-8b73-ec044370e9db 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RpcTLSSecurityReportTest&test_method=test_security_report
FAIL RpcTLSSecurityReportTest test_security_report null integration https://buildkite.com/redpanda/redpanda/builds/84202#019e0577-5261-4486-a865-b030a4e0b0f9 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RpcTLSSecurityReportTest&test_method=test_security_report
FAIL TLSMetricsTest test_services null integration https://buildkite.com/redpanda/redpanda/builds/84202#019e0572-3e9c-4669-89a1-36bb9d6fa5c0 0/1 Test is INCONCLUSIVE after retries.Inconclusive result before max retries(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=1.0000, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TLSMetricsTest&test_method=test_services
FAIL TLSMetricsTest test_services null integration https://buildkite.com/redpanda/redpanda/builds/84202#019e0577-5260-44b8-ad12-f9c3d286da0b 0/1 Test is INCONCLUSIVE after retries.Inconclusive result before max retries(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=1.0000, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TLSMetricsTest&test_method=test_services
FAIL TLSVersionTestECDSA test_change_version {"version": 0} integration https://buildkite.com/redpanda/redpanda/builds/84202#019e0572-3ea0-41c4-86ab-917194614e4f 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TLSVersionTestECDSA&test_method=test_change_version
FAIL TLSVersionTestECDSA test_change_version {"version": 0} integration https://buildkite.com/redpanda/redpanda/builds/84202#019e0577-5262-49de-b7b5-8309afbf173b 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TLSVersionTestECDSA&test_method=test_change_version
FAIL TLSVersionTestECDSA test_change_version {"version": 1} integration https://buildkite.com/redpanda/redpanda/builds/84202#019e0572-3ea0-4d17-8390-9427606c8ffa 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TLSVersionTestECDSA&test_method=test_change_version
FAIL TLSVersionTestECDSA test_change_version {"version": 1} integration https://buildkite.com/redpanda/redpanda/builds/84202#019e0577-5263-4a19-bf77-cdc8e64969cd 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TLSVersionTestECDSA&test_method=test_change_version
FAIL TLSVersionTestECDSA test_change_version {"version": 2} integration https://buildkite.com/redpanda/redpanda/builds/84202#019e0572-3ea1-4663-abba-af512ac5e381 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TLSVersionTestECDSA&test_method=test_change_version
FAIL TLSVersionTestECDSA test_change_version {"version": 2} integration https://buildkite.com/redpanda/redpanda/builds/84202#019e0577-5263-42e9-af8a-82994ac7ce53 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TLSVersionTestECDSA&test_method=test_change_version
FAIL TLSVersionTestECDSA test_change_version {"version": 3} integration https://buildkite.com/redpanda/redpanda/builds/84202#019e0572-3ea1-49dd-bb8e-0a64ae99c8fe 0/1 Test is INCONCLUSIVE after retries.Inconclusive result before max retries(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=1.0000, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TLSVersionTestECDSA&test_method=test_change_version
FAIL TLSVersionTestECDSA test_change_version {"version": 3} integration https://buildkite.com/redpanda/redpanda/builds/84202#019e0577-5263-4cd8-a7a0-8ca4e4693bdc 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TLSVersionTestECDSA&test_method=test_change_version
FAIL TLSVersionTestECDSA test_ciphersuite_support null integration https://buildkite.com/redpanda/redpanda/builds/84202#019e0572-3e9c-43f6-a47d-d73c89f75998 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TLSVersionTestECDSA&test_method=test_ciphersuite_support
FAIL TLSVersionTestECDSA test_ciphersuite_support null integration https://buildkite.com/redpanda/redpanda/builds/84202#019e0577-5260-49a4-8f40-4ecb6145305b 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TLSVersionTestECDSA&test_method=test_ciphersuite_support
FAIL TLSVersionTestRSA test_change_version {"version": 0} integration https://buildkite.com/redpanda/redpanda/builds/84202#019e0572-3e9c-4669-89a1-36bb9d6fa5c0 0/1 Test is INCONCLUSIVE after retries.Inconclusive result before max retries(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=1.0000, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TLSVersionTestRSA&test_method=test_change_version
FAIL TLSVersionTestRSA test_change_version {"version": 0} integration https://buildkite.com/redpanda/redpanda/builds/84202#019e0577-5260-44b8-ad12-f9c3d286da0b 0/1 Test is INCONCLUSIVE after retries.Inconclusive result before max retries(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=1.0000, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TLSVersionTestRSA&test_method=test_change_version
FAIL TLSVersionTestRSA test_change_version {"version": 1} integration https://buildkite.com/redpanda/redpanda/builds/84202#019e0572-3e9d-4b90-9f8a-43ca2c9c719a 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TLSVersionTestRSA&test_method=test_change_version
FAIL TLSVersionTestRSA test_change_version {"version": 1} integration https://buildkite.com/redpanda/redpanda/builds/84202#019e0577-5261-484b-aca9-115be0c42599 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TLSVersionTestRSA&test_method=test_change_version
FAIL TLSVersionTestRSA test_change_version {"version": 2} integration https://buildkite.com/redpanda/redpanda/builds/84202#019e0572-3e9e-40ef-8b73-ec044370e9db 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TLSVersionTestRSA&test_method=test_change_version
FAIL TLSVersionTestRSA test_change_version {"version": 2} integration https://buildkite.com/redpanda/redpanda/builds/84202#019e0577-5261-4486-a865-b030a4e0b0f9 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TLSVersionTestRSA&test_method=test_change_version
FAIL TLSVersionTestRSA test_change_version {"version": 3} integration https://buildkite.com/redpanda/redpanda/builds/84202#019e0572-3e9e-4544-9cbc-dd8561b11383 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TLSVersionTestRSA&test_method=test_change_version
FAIL TLSVersionTestRSA test_change_version {"version": 3} integration https://buildkite.com/redpanda/redpanda/builds/84202#019e0577-5262-4d07-8d67-fe509de67342 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TLSVersionTestRSA&test_method=test_change_version
FAIL TLSVersionTestRSA test_ciphersuite_support null integration https://buildkite.com/redpanda/redpanda/builds/84202#019e0572-3e9f-4b11-82cd-7e01454e44de 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TLSVersionTestRSA&test_method=test_ciphersuite_support
FAIL TLSVersionTestRSA test_ciphersuite_support null integration https://buildkite.com/redpanda/redpanda/builds/84202#019e0577-5262-420d-a506-1b5098a13685 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TLSVersionTestRSA&test_method=test_ciphersuite_support
FAIL src/v/cloud_topics/level_zero/tests/l0_object_size_distribution_test src/v/cloud_topics/level_zero/tests/l0_object_size_distribution_test unit https://buildkite.com/redpanda/redpanda/builds/84202#019e055c-3301-4e30-83bb-4c4452377eb5 0/1
test results on build#84233
test_status test_class test_method test_arguments test_kind job_url passed reason test_history
FLAKY(PASS) WriteCachingFailureInjectionE2ETest test_crash_all {"use_transactions": false} integration https://buildkite.com/redpanda/redpanda/builds/84233#019e08af-004e-4bdb-9ce4-8884613170bb 10/11 Test PASSES after retries.No significant increase in flaky rate(baseline=0.1082, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.2909, p1=0.0322, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionE2ETest&test_method=test_crash_all
test results on build#84237
test_status test_class test_method test_arguments test_kind job_url passed reason test_history
FLAKY(PASS) DataMigrationsApiTest test_creating_and_listing_migrations null integration https://buildkite.com/redpanda/redpanda/builds/84237#019e0933-12d6-4dff-9e54-1fe31f824194 10/11 Test PASSES after retries.No significant increase in flaky rate(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DataMigrationsApiTest&test_method=test_creating_and_listing_migrations
FLAKY(PASS) InternalTopicProtectionLargeClusterTest test_consumer_offset_topic null integration https://buildkite.com/redpanda/redpanda/builds/84237#019e0935-cca3-4020-bc9c-cffda47d02e7 10/11 Test PASSES after retries.No significant increase in flaky rate(baseline=0.0043, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=InternalTopicProtectionLargeClusterTest&test_method=test_consumer_offset_topic
FAIL NodePoolMigrationTest test_migrating_redpanda_nodes_to_new_pool {"balancing_mode": "node_add", "cleanup_policy": "compact", "test_mode": "tiered_storage_fast_moves"} integration https://buildkite.com/redpanda/redpanda/builds/84237#019e0935-cca6-484d-8dfd-937352547515 0/1 https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=NodePoolMigrationTest&test_method=test_migrating_redpanda_nodes_to_new_pool
test results on build#84245
test_status test_class test_method test_arguments test_kind job_url passed reason test_history
FAIL MasterTestSuite update_test unit https://buildkite.com/redpanda/redpanda/builds/84245#019e0982-db75-4ce1-9fda-fb620407efcb 0/1
FLAKY(FAIL) ClusterConfigTest test_rpk_force_reset null integration https://buildkite.com/redpanda/redpanda/builds/84245#019e099a-209a-429a-a4f3-2f074a3d58a5 3/11 Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterConfigTest&test_method=test_rpk_force_reset
FLAKY(FAIL) ClusterConfigTest test_rpk_force_reset null integration https://buildkite.com/redpanda/redpanda/builds/84245#019e09a7-2bc4-46c7-8a9d-2cd30ba6db70 5/11 Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterConfigTest&test_method=test_rpk_force_reset
test results on build#84250
test_status test_class test_method test_arguments test_kind job_url passed reason test_history
FLAKY(PASS) WriteCachingFailureInjectionE2ETest test_crash_all {"use_transactions": false} integration https://buildkite.com/redpanda/redpanda/builds/84250#019e0b32-4c47-411e-9092-62d62d334781 9/11 Test PASSES after retries.No significant increase in flaky rate(baseline=0.0988, p0=0.6467, reject_threshold=0.0100. adj_baseline=0.2681, p1=0.2057, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionE2ETest&test_method=test_crash_all

@WillemKauf WillemKauf force-pushed the config_bootstrap_fix branch from 2058bd9 to 625236c Compare May 8, 2026 17:11
WillemKauf added 7 commits May 8, 2026 14:29
We will want to be able to use the `kvstore` for read only purposes
early in the bootstrap process to recover important persisted state.

Expose `kvstore::recover()` as a way to allow read-only users access
to its state. Also expose an accessor higher up in `storage::api`.
The `kvstore` should be available after the constructor is called, for ease
of access.
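
A minimal sketch of the recover/start split described in this commit message. `mini_kvstore` and its in-memory "persisted" map are hypothetical stand-ins, not the real `storage::kvstore` interface, and exceptions stand in for whatever checks the real code performs:

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical, simplified key-value store illustrating a two-phase
// lifecycle: recover() enables reads, start() additionally enables writes.
class mini_kvstore {
public:
    explicit mini_kvstore(std::map<std::string, std::string> persisted)
      : _persisted(std::move(persisted)) {}

    // Phase 1: replay persisted state into memory. Reads work after this.
    void recover() {
        _db = _persisted;
        _recovered = true;
    }

    // Phase 2: enable writes. Recovers first if the caller has not already.
    void start() {
        if (!_recovered) {
            recover();
        }
        _started = true;
    }

    // Reads require only recovery, not a full start.
    std::optional<std::string> get(const std::string& key) const {
        if (!_recovered) {
            throw std::logic_error("get() before recover()");
        }
        auto it = _db.find(key);
        if (it == _db.end()) {
            return std::nullopt;
        }
        return it->second;
    }

    // Writes require the store to be fully started.
    void put(const std::string& key, std::string value) {
        if (!_started) {
            throw std::logic_error("put() before start()");
        }
        _db[key] = std::move(value);
    }

private:
    std::map<std::string, std::string> _persisted;
    std::map<std::string, std::string> _db;
    bool _recovered{false};
    bool _started{false};
};
```

In this sketch, early bootstrap code would call `recover()` and then `get()` for values such as stored UUIDs, while `put()` stays unavailable until `start()` runs later in startup.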
…covery`

We were only using this in order to access the `node_uuid` and `cluster_uuid`,
which are constant after they are first set.

Simply pass these values directly instead of passing the entire `storage::api`
into the constructor.
And rework some of the metrics registration logic.

This is in order to avoid initial calls to `shard_local_cfg()` in the
bootstrap process when (potentially) setting up a TLS probe for the RPC
server.
@WillemKauf WillemKauf force-pushed the config_bootstrap_fix branch from 625236c to d9c2e30 Compare May 8, 2026 19:46
WillemKauf added 6 commits May 8, 2026 16:43
Will be used in the future as a static member function without a
`members_manager` instance.
We are going to want control over what backends we need to include
in a `controller_stm` snapshot in a future commit. Move this function
to the header and use a variadic template for this purpose.
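
As a rough illustration of selecting snapshot backends with a variadic template, here is a small sketch. The `snapshot` struct, the three backend types, and `make_partial_snapshot` are all hypothetical names invented for this example; the real `controller_stm` code will differ:

```cpp
#include <cassert>
#include <optional>

// Hypothetical snapshot: each section is present only if its backend
// was asked to contribute.
struct snapshot {
    std::optional<int> config_version;
    std::optional<int> feature_version;
    std::optional<int> member_count;
};

// Hypothetical backends; each knows how to fill in its own section.
struct config_backend {
    int version;
    void fill(snapshot& s) const { s.config_version = version; }
};
struct feature_backend {
    int version;
    void fill(snapshot& s) const { s.feature_version = version; }
};
struct members_backend {
    int count;
    void fill(snapshot& s) const { s.member_count = count; }
};

// Build a snapshot from only the backends the caller passes in.
template<typename... Backends>
snapshot make_partial_snapshot(const Backends&... backends) {
    snapshot s;
    // C++17 fold expression: calls fill() on each backend, left to right.
    (backends.fill(s), ...);
    return s;
}
```

A caller that only needs config and features would pass just those two backends, leaving `member_count` unset.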
Pull config_manager::apply_snapshot out of the ss::when_all parallel
block and run it serially after feature_backend and before
members_manager. Downstream backends (members, topics, plugin,
recovery, security, quota, migrations, cluster_link) may consult
shard_local_cfg during their own apply_snapshot work; sequencing
config_manager first guarantees they see fresh values.
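
The ordering constraint can be sketched as follows. This is a simplified, fully serial model with hypothetical types (`config_view`, `controller_snapshot`, `apply_snapshot_in_order`); the real code uses Seastar futures and still applies some backends in parallel:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical shared config that downstream backends consult.
struct config_view {
    bool cloud_topics_enabled{false};
};

// Hypothetical snapshot carrying a config value and a list of topics.
struct controller_snapshot {
    bool cloud_topics_enabled;
    std::vector<std::string> topics;
};

// Applies the snapshot in a fixed order: features, then config, then the
// backends that may read config. Returns the order for inspection.
std::vector<std::string> apply_snapshot_in_order(
  const controller_snapshot& snap,
  config_view& cfg,
  std::vector<std::string>& created_with_cloud_support) {
    std::vector<std::string> order;

    order.push_back("feature_backend");

    // Config is applied before anything that might read it.
    cfg.cloud_topics_enabled = snap.cloud_topics_enabled;
    order.push_back("config_manager");

    order.push_back("members_manager");

    // The topics backend reads the (now fresh) config while applying.
    for (const auto& t : snap.topics) {
        if (cfg.cloud_topics_enabled) {
            created_with_cloud_support.push_back(t);
        }
    }
    order.push_back("topics");
    return order;
}
```

If the config step ran after the topics step in this model, the topic would be processed with the stale `cloud_topics_enabled = false`, which is the class of bug the reordering is meant to rule out.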
Removes the promote_all_pending() call from config_manager::start().
Pending values that accumulated during the post-bootstrap controller
log replay (i.e. cluster_config_delta_cmd records applied above the
hydrated snapshot offset) now stay pending instead of being promoted
when config_manager::start runs.

This is the strict reading of needs_restart=yes semantics: a property
change visible to a node only after it restarts. shard_local_cfg() is
already hydrated from the controller snapshot before bootstrap
proceeds (see application::bootstrap_cluster_config_view), so any
delta replayed afterward is a change the cluster made *after* the
hydration point and should not retroactively affect the running
node's active values for needs_restart properties.
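
A small model of the pending/active distinction may help here. `restart_property` and `promote_pending` are hypothetical names for this sketch (only `set_pending_value` appears in the discussion above), and the real property classes are more involved:

```cpp
#include <cassert>
#include <optional>
#include <utility>

// Hypothetical model of a needs_restart property: updates are staged as
// a pending value and only become active when explicitly promoted.
template<typename T>
class restart_property {
public:
    explicit restart_property(T initial) : _active(std::move(initial)) {}

    // A change arriving while the node is running is staged, not applied.
    void set_pending_value(T v) { _pending = std::move(v); }

    // In this sketch, promotion is meant to be called only at the start of
    // bootstrap, when loading the local cache or a leader snapshot.
    void promote_pending() {
        if (_pending) {
            _active = std::move(*_pending);
            _pending.reset();
        }
    }

    const T& value() const { return _active; }
    bool has_pending() const { return _pending.has_value(); }

private:
    T _active;
    std::optional<T> _pending;
};
```

Under this model, a delta replayed from the controller log after bootstrap calls `set_pending_value()` and leaves `value()` unchanged until the next restart promotes it.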
A bit of code factored out. Currently used for gating
`join_node_request`s, will be used in a future commit for handling
`fetch_controller_snapshot` requests as well.
A new RPC to fetch a controller snapshot, containing the latest in-memory
state of the controller leader's `config_manager` and `feature_backend`.

This RPC will be called unconditionally during bootstrap of a `redpanda`
broker, whether that node is simply restarting or is joining for the first
time. The reason is that a node that has been down for a long time will have
only stale configurations and features in its local cache/snapshots/log, and
the configuration and feature managers are two global pieces of state that
should be brought to a consistent view early on in a `redpanda` node's
lifetime.
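
The "try to fetch, otherwise start with local state" behavior could look roughly like this. Everything here is an assumption for illustration: `fetch_fn` is a plain callable standing in for the RPC, and `resolve_bootstrap_view` is a hypothetical helper, not a function from the PR:

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <vector>

// Hypothetical snapshot payload; only a version is modeled.
struct config_snapshot {
    long version;
};

// Hypothetical callable standing in for the RPC to a given node.
// Returns nullopt when the request fails.
using fetch_fn = std::function<std::optional<config_snapshot>(int node_id)>;

struct bootstrap_view {
    config_snapshot snapshot;
    bool fetched_from_cluster;
};

// Try each locally known member in turn; if every request fails, fall
// back to the local (possibly stale) snapshot instead of refusing to start.
bootstrap_view resolve_bootstrap_view(
  const std::vector<int>& known_members,
  const fetch_fn& fetch,
  const config_snapshot& local) {
    for (int node : known_members) {
        if (auto remote = fetch(node)) {
            return {*remote, true};
        }
    }
    return {local, false};
}
```

When the fallback path is taken, the node starts from stale values and relies on the later controller log replay to catch up, as the PR description notes.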

@WillemKauf WillemKauf force-pushed the config_bootstrap_fix branch from d9c2e30 to 7ac0443 Compare May 8, 2026 21:33
@WillemKauf WillemKauf changed the title WIP: The great config bootstrap fix redpanda: the great config bootstrap fix May 8, 2026

WillemKauf added 3 commits May 9, 2026 01:09
The start-up process for `redpanda` now looks like this:

1. Cluster configs are hydrated from the local `config_cache` and legacy
   `node_cfg_yaml`.
2. `storage::api` is constructed (but not started), and the local `kvstore`
   is recovered.
3. We bootstrap from the `kvstore`:
   1. Apply the local feature table snapshot (potentially stale) from the `kvstore`.
   2. We attempt to load any existing cluster/node UUIDs from the `kvstore`.
4. We perform cluster discovery using the RPC `bootstrap_service`.
   1. If we are a joiner, we will perform our RPC to register with the cluster,
      and collect the `controller_join_snapshot` this way.
   2. If we are simply a restarter, we have our snapshot of the cluster members
      loaded, and we use this list to issue an RPC (which will eventually hit the
      controller leader) to fetch the most up-to-date version of the controller
      state (specifically, the cluster config and the feature table).
5. We now have a consistent view with the rest of the cluster of the feature
   table and cluster configuration for the rest of the bootstrapping process.

raft0 replay still happens _after_ the rest of the system is started.
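
The start-up order above can be modeled as a strict sequence of phases. This is only a sketch: the `phase` enum and `bootstrap_sequencer` are hypothetical and exist to show the ordering, not how `application_bootstrap.cc` is structured:

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical phases mirroring the numbered start-up steps.
enum class phase {
    not_started,
    local_config_hydrated, // step 1: config_cache / node_cfg_yaml
    kvstore_recovered,     // step 2: storage::api constructed, kvstore recovered
    kvstore_state_applied, // step 3: feature table snapshot and UUIDs loaded
    discovery_done,        // step 4: join or fetch controller snapshot
    config_ready,          // step 5: consistent view established
};

// Enforces that phases are entered strictly in order.
class bootstrap_sequencer {
public:
    // Advance to `next`, which must be the immediate successor of the
    // current phase; anything else is treated as a programming error.
    void advance(phase next) {
        if (static_cast<int>(next) != static_cast<int>(_current) + 1) {
            throw std::logic_error("bootstrap step out of order");
        }
        _current = next;
    }

    bool config_ready() const { return _current == phase::config_ready; }
    phase current() const { return _current; }

private:
    phase _current{phase::not_started};
};
```

In this model, raft0 replay and the rest of the system would only begin once `config_ready()` returns true.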

A big caveat with cluster recovery - `needs_restart` properties fetched can no
longer be applied until the recovered node has restarted. There is no clear way
to make this _not_ the behavior.
And swap any uses of `shard_local_cfg()` which occur in the bootstrap
process (before it is marked as `ready`) over to the unsafe accessor.
These uses of potentially stale values are all audited and deemed safe,
due to minimal or no impact on the behavior of the system post bootstrap.

Two other relevant changes in this commit:
* Pulling out `feature_table::setup_metrics()` into its own function,
  called after the bootstrap process (in order to properly respect
  `shard_local_cfg().disable_{public}_metrics`)
* Updating `storage_resources` to _not_ invoke the `internal::chunks()`
  singleton constructor, which would in turn invoke `shard_local_cfg()`.
  We use `storage_resources` for an early replay of the `kvstore` segments,
  so it is fine to use stale configuration values here. However, after
  bootstrapping is complete, it is *probably* important to set the
  `_append_chunk_size` back to the same value as what `internal::chunks()`
  is operating with. We now do so when initializing the storage system in
  `wire_up_bootstrap_services()`.
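
A minimal sketch of the readiness gate, under stated assumptions: `configuration` is reduced to one field, `mark_cfg_ready()` is a hypothetical name for whatever marks the config as ready, and an exception is thrown where the real code uses a `vassert`:

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical, heavily reduced configuration object.
struct configuration {
    bool disable_metrics{false};
};

namespace detail {
// One instance and one readiness flag per thread, standing in for
// per-shard state.
inline configuration& instance() {
    static thread_local configuration cfg;
    return cfg;
}
inline bool& ready_flag() {
    static thread_local bool ready = false;
    return ready;
}
} // namespace detail

// Escape hatch for early bootstrap: no readiness check, so the caller
// accepts that the value may be stale.
inline configuration& shard_local_cfg_unsafe() { return detail::instance(); }

// Hypothetical hook called once the cluster-wide view is established.
inline void mark_cfg_ready() { detail::ready_flag() = true; }

// Checked accessor: refuses to hand out config before it is ready.
// (This sketch throws so the check can be tested; the PR uses an assert.)
inline configuration& shard_local_cfg() {
    if (!detail::ready_flag()) {
        throw std::logic_error("shard_local_cfg() used before config is ready");
    }
    return detail::instance();
}
```

The design choice is that the safe name stays the default, so any new early-bootstrap read has to opt in to staleness explicitly by calling the `_unsafe` accessor.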
@WillemKauf WillemKauf force-pushed the config_bootstrap_fix branch from 7ac0443 to 07a3683 Compare May 9, 2026 05:09
@WillemKauf WillemKauf changed the title redpanda: the great config bootstrap fix [CORE-13707] redpanda: the great config bootstrap fix May 9, 2026
3 participants