[CORE-13707] redpanda: the great config bootstrap fix #30396
Open
WillemKauf wants to merge 17 commits into redpanda-data:dev
Conversation
Force-pushed from 4a97a52 to 4755fa7
Contributor
Pull request overview
This PR reworks Redpanda’s startup/bootstrap sequencing to prevent early reads of stale cluster configuration during node join/restart. It introduces a “config readiness” gate for config::shard_local_cfg(), splits kvstore recovery from writer startup, and adds an RPC path to fetch a leader-authoritative controller snapshot for restarting nodes.
Changes:
- Add a readiness check to `config::shard_local_cfg()` plus an `*_unsafe()` escape hatch for early bootstrap (see the sketch below).
- Split `kvstore` recovery into an explicit phase, and restructure application bootstrap to recover/apply snapshots before marking config "ready".
- Add a `fetch_controller_snapshot` RPC and controller snapshot plumbing to support leader-authoritative bootstrap for restarting nodes.
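For concreteness, a minimal sketch of the shape such a readiness gate could take (plain C++; names like `cfg_ready` and `mark_cfg_ready()` are illustrative assumptions, and the PR uses `vassert` rather than `assert`):

```cpp
#include <cassert>

namespace config {

struct configuration { /* cluster properties live here */ };

namespace {
configuration global_cfg;
// Flipped exactly once, after bootstrap has hydrated the config from the
// local cache and (if possible) a leader-authoritative snapshot.
bool cfg_ready = false;
} // namespace

// Guarded accessor: crashes loudly if used before the config is hydrated.
configuration& shard_local_cfg() {
    assert(cfg_ready && "shard_local_cfg() called before config was ready");
    return global_cfg;
}

// Escape hatch for audited early-bootstrap callers that can tolerate
// potentially stale values.
configuration& shard_local_cfg_unsafe() { return global_cfg; }

void mark_cfg_ready() { cfg_ready = true; }

} // namespace config
```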
Reviewed changes
Copilot reviewed 29 out of 29 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/v/storage/storage_resources.h | Extend ctor signature and add API to update append chunk size during bootstrap. |
| src/v/storage/storage_resources.cc | Plumb append chunk size into storage_resources initialization and add updater method. |
| src/v/storage/kvstore.h | Make recover() a public entry point and add start(recover_t) mode switch. |
| src/v/storage/kvstore.cc | Implement split recovery/start flow; adjust read APIs to require recovery rather than full start. |
| src/v/storage/api.h | Construct kvstore earlier and add recover_kvstore() bootstrap hook before start(). |
| src/v/redpanda/BUILD | Add Bazel deps needed by updated application/bootstrap compilation units. |
| src/v/redpanda/application.h | Reshape bootstrap APIs into async helpers for applying snapshots/config and discovery. |
| src/v/redpanda/application.cc | Move cluster-config hydration earlier and ensure feature-table metrics are set up during metrics init. |
| src/v/redpanda/application_start.cc | Use stored cluster_discovery instance and remove passing discovery by reference into runtime start. |
| src/v/redpanda/application_config.cc | Use shard_local_cfg_unsafe() for bootstrap-time kvstore config extraction. |
| src/v/redpanda/application_bootstrap.cc | Rework bootstrap sequence: recover kvstore, apply snapshots, discovery/join, then mark config ready. |
| src/v/features/feature_table.h | Add setup_metrics() API. |
| src/v/features/feature_table.cc | Defer probe/metrics setup from ctor to explicit setup_metrics(). |
| src/v/config/node_config.cc | Use shard_local_cfg_unsafe() during node-config YAML loading/ignores. |
| src/v/config/configuration.h | Add readiness flag + document shard_local_cfg() vs shard_local_cfg_unsafe(). |
| src/v/config/configuration.cc | Implement shard_local_cfg_unsafe() and guard shard_local_cfg() with readiness vassert. |
| src/v/cluster/types.h | Add serde RPC request/reply types for fetching a controller snapshot. |
| src/v/cluster/service.h | Add controller service RPC endpoint for fetch_controller_snapshot. |
| src/v/cluster/service.cc | Implement RPC handler to delegate snapshot fetch logic to members manager. |
| src/v/cluster/members_manager.h | Declare handler to produce/forward leader-authoritative controller snapshot replies. |
| src/v/cluster/members_manager.cc | Implement snapshot fetch path (forward to leader if needed; leader builds partial join snapshot). |
| src/v/cluster/controller.json | Register new fetch_controller_snapshot RPC in controller protocol schema. |
| src/v/cluster/controller.cc | Adjust controller start wiring after config_manager ctor signature change. |
| src/v/cluster/controller_stm.h | Generalize join snapshot construction via templated backend selection. |
| src/v/cluster/controller_stm.cc | Reorder snapshot application so config manager applies earlier; remove old join snapshot impl. |
| src/v/cluster/config_manager.h | Remove cluster recovery table dependency from config manager constructor/state. |
| src/v/cluster/config_manager.cc | Switch more bootstrap-time reads to shard_local_cfg_unsafe() and adjust replay/bootstrap flow. |
| src/v/cluster/cluster_discovery.h | Update discovery ctor signature and add controller snapshot fetch API (with stale-comment mismatch). |
| src/v/cluster/cluster_discovery.cc | Implement new controller snapshot fetch RPC path and decouple discovery from storage::api. |
Comments suppressed due to low confidence (1)
src/v/cluster/config_manager.cc:193
`config_manager::apply_local()` still stages updates to `needs_restart` properties via `set_pending_value(...)`, but `config_manager::start()` no longer promotes pending values after STM replay. This means nodes may never apply `needs_restart` config updates even after a restart (pending values remain invisible indefinitely). Reintroduce a promotion point after replay (as before), or otherwise ensure pending values are promoted at least once during startup after replay completes.
ss::future<> config_manager::start() {
if (_seen_version == config_version_unset) {
vlog(clusterlog.trace, "Starting config_manager... (initial)");
// Background: wait till we have a leader. If this node is leader
Comment on lines +104 to +108:

    // Flushing background fiber
    ssx::spawn_with_gate(_gate, [this] {
        return ss::do_until(
          [this] { return _gate.is_closed(); },
          [this] {
Contributor
Author
Meh probably unnecessary
    // Accessors for the shard local cluster configuration.
    // shard_local_cfg() contains a vassert which strictly enforces that the
    // configuration has been marked as ready by the bootstrap process after we have
    // successfully preloaded from our node local state, and furthermore, received
Comment on lines +142 to +144:

    // Static because the call doesn't need any cluster_discovery
    // instance state — it dispatches via a one-shot RPC client and
    // reads the seed-server list from global node config.
Force-pushed from 72cd608 to 0140ac7
This comment was marked as outdated.
Force-pushed from 3ec0835 to 2058bd9
This comment was marked as outdated.
Collaborator
Force-pushed from 2058bd9 to 625236c
We will want to be able to use the `kvstore` for read-only purposes early in the bootstrap process to recover important persisted state. Expose `kvstore::recover()` as a way to allow read-only users access to its state. Also expose an accessor higher up in `storage::api`.
The `kvstore` should be available after its constructor is called, for ease of access.
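A sketch of what such a split lifecycle can look like (a hypothetical simplification; only `recover()`, `start()`, and the `recover_t` switch are named by the PR, and the real kvstore is asynchronous and Seastar-based):

```cpp
#include <map>
#include <optional>
#include <string>

class kvstore {
public:
    enum class recover_t { yes, no };

    // Phase 1: replay persisted segments into the in-memory map so that
    // read-only users can query state early in bootstrap, before any
    // writer machinery exists.
    void recover() {
        if (_recovered) {
            return; // idempotent: start() may follow an early recover()
        }
        // ... replay on-disk segments into _db ...
        _recovered = true;
    }

    // Phase 2: start the writer (flush fiber, etc.). Recovery is skipped
    // when the bootstrap path already performed it.
    void start(recover_t r) {
        if (r == recover_t::yes) {
            recover();
        }
        _started = true;
    }

    // Reads require only recovery, not a full start.
    std::optional<std::string> get(const std::string& key) const {
        if (!_recovered) {
            return std::nullopt;
        }
        auto it = _db.find(key);
        return it == _db.end() ? std::nullopt
                               : std::optional<std::string>(it->second);
    }

private:
    std::map<std::string, std::string> _db;
    bool _recovered = false;
    bool _started = false;
};
```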
…covery`

We were only using this in order to access the `node_uuid` and `cluster_uuid`, which are constant after they are first set. Simply pass these values directly instead of passing the entire `storage::api` into the constructor.
And rework some of the metrics registration logic. This is in order to avoid initial calls to `shard_local_cfg()` in the bootstrap process when (potentially) setting up a TLS probe for the RPC server.
Force-pushed from 625236c to d9c2e30
Will be used in the future as a static member function without a `members_manager` instance.
We are going to want control over what backends we need to include in a `controller_stm` snapshot in a future commit. Move this function to the header and use a variadic template for this purpose.
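A sketch of the variadic-template shape this enables (hypothetical backend and function names; the real `controller_stm` code differs):

```cpp
#include <iostream>

struct snapshot { /* accumulated controller state */ };

// Hypothetical backends; each knows how to write its slice of a snapshot.
struct config_backend {
    void fill(snapshot&) const { std::cout << "config state\n"; }
};
struct feature_backend {
    void fill(snapshot&) const { std::cout << "feature state\n"; }
};

// Variadic selection: callers choose exactly which backends contribute.
template<typename... Backends>
snapshot make_join_snapshot(const Backends&... backends) {
    snapshot s;
    (backends.fill(s), ...); // C++17 fold over the selected backends
    return s;
}

int main() {
    config_backend cfg;
    feature_backend features;
    // Only the backends named here end up in the snapshot.
    snapshot s = make_join_snapshot(cfg, features);
    (void)s;
}
```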
Pull config_manager::apply_snapshot out of the ss::when_all parallel block and run it serially after feature_backend and before members_manager. Downstream backends (members, topics, plugin, recovery, security, quota, migrations, cluster_link) may consult shard_local_cfg during their own apply_snapshot work; sequencing config_manager first guarantees they see fresh values.
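A runnable outline of the resulting ordering (plain, synchronous stand-ins for what are Seastar futures and a `ss::when_all` block in the real code; all names here are illustrative):

```cpp
#include <cstdio>

struct controller_snapshot {};

void config_apply(const controller_snapshot&) {
    std::puts("config_manager: hydrate shard_local_cfg()");
}
void members_apply(const controller_snapshot&) {
    std::puts("members_manager: may read shard_local_cfg()");
}
void topics_apply(const controller_snapshot&) {
    std::puts("topics: may read shard_local_cfg()");
}

void apply_controller_snapshot(const controller_snapshot& snap) {
    // Serial first: every later backend is guaranteed to see fresh
    // configuration values.
    config_apply(snap);
    // These remain independent of each other and may run concurrently in
    // the real code; shown sequentially here for simplicity.
    members_apply(snap);
    topics_apply(snap);
}

int main() { apply_controller_snapshot({}); }
```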
Removes the promote_all_pending() call from config_manager::start(). Pending values that accumulated during the post-bootstrap controller log replay (i.e. cluster_config_delta_cmd records applied above the hydrated snapshot offset) now stay pending instead of being promoted when config_manager::start runs. This is the strict reading of needs_restart=yes semantics: a property change visible to a node only after it restarts. shard_local_cfg() is already hydrated from the controller snapshot before bootstrap proceeds (see application::bootstrap_cluster_config_view), so any delta replayed afterward is a change the cluster made *after* the hydration point and should not retroactively affect the running node's active values for needs_restart properties.
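A minimal model of this strict `needs_restart=yes` semantics (a hypothetical `restart_property` type; redpanda's `config::property` is far richer):

```cpp
#include <cassert>
#include <optional>
#include <utility>

// A delta replayed after the hydration point only sets the pending value;
// the active value changes on the next process start, when we re-hydrate.
template<typename T>
class restart_property {
public:
    explicit restart_property(T initial) : _active(std::move(initial)) {}

    void set_pending_value(T v) { _pending = std::move(v); }

    // No runtime promote_all_pending() anymore: _pending is only folded
    // into _active when the node restarts and re-hydrates from a snapshot.
    const T& value() const { return _active; }

private:
    T _active;
    std::optional<T> _pending;
};

int main() {
    restart_property<int> chunk_size(4096);
    chunk_size.set_pending_value(8192); // delta replayed above the snapshot
    assert(chunk_size.value() == 4096); // stays pending until restart
}
```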
A bit of code factored out. Currently used for gating `join_node_request`s; it will be used in a future commit for handling `fetch_controller_snapshot` requests as well.
A new RPC to fetch a controller snapshot, containing the latest in-memory state of the controller leader's `config_manager` and `feature_backend`. This RPC will be called unconditionally during bootstrap of a `redpanda` broker, whether that node is simply restarting or is joining for the first time. The reason is that a node that has been down for a long time will have only stale configurations and features in its local cache/snapshots/log, and the configuration and feature managers are two global pieces of state that should be brought to a consistent view early in a `redpanda` node's lifetime.
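For orientation, a shape-only sketch of such a request/reply pair in redpanda's usual serde-envelope style (the field names and types below are guesses, not the PR's actual definitions in `src/v/cluster/types.h`):

```cpp
struct fetch_controller_snapshot_request
  : serde::envelope<
      fetch_controller_snapshot_request,
      serde::version<0>,
      serde::compat_version<0>> {
    // e.g. the requester's id, so a non-leader can forward appropriately
    model::node_id node;

    auto serde_fields() { return std::tie(node); }
};

struct fetch_controller_snapshot_reply
  : serde::envelope<
      fetch_controller_snapshot_reply,
      serde::version<0>,
      serde::compat_version<0>> {
    // Leader-authoritative slices of controller state: the cluster config
    // and the feature table, per the commit message above.
    std::optional<controller_join_snapshot> snapshot;
    errc error;

    auto serde_fields() { return std::tie(snapshot, error); }
};
```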
This comment was marked as outdated.
Force-pushed from d9c2e30 to 7ac0443
redpanda: the great config bootstrap fix
This comment was marked as outdated.
The start-up process for `redpanda` now looks like this (a runnable outline follows the list):
1. Cluster configs are hydrated from the local `config_cache` and legacy `node_cfg_yaml`.
2. `storage::api` is constructed (but not started), and the local `kvstore` is recovered.
3. We bootstrap from the `kvstore`:
   1. Apply the local feature table snapshot (potentially stale) from the `kvstore`.
   2. Attempt to load any existing cluster/node UUIDs from the `kvstore`.
4. We perform cluster discovery using the RPC `bootstrap_service`:
   1. If we are a joiner, we perform our RPC to register with the cluster, and collect the `controller_join_snapshot` this way.
   2. If we are simply a restarter, we have our snapshot of the cluster members loaded, and we use this list to issue an RPC (which will eventually hit the controller leader) to fetch the most up-to-date version of the controller state (specifically, the cluster config and the feature table).
5. We now have a consistent view, shared with the rest of the cluster, of the feature table and cluster configuration for the rest of the bootstrapping process.

raft0 replay still happens _after_ the rest of the system is started.
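A runnable outline of that ordering (function names are illustrative, not the PR's actual symbols):

```cpp
#include <cstdio>

void hydrate_local_cluster_config() { std::puts("1. hydrate from config_cache / legacy yaml"); }
void recover_kvstore()              { std::puts("2. construct storage::api, recover kvstore"); }
void bootstrap_from_kvstore()       { std::puts("3. apply feature snapshot, load UUIDs"); }
void discover_cluster()             { std::puts("4. join, or fetch_controller_snapshot from leader"); }
void mark_config_ready()            { std::puts("5. shard_local_cfg() is now safe to use"); }
void start_runtime_services()       { std::puts("   ...start; raft0 replay happens after this"); }

int main() {
    hydrate_local_cluster_config();
    recover_kvstore();
    bootstrap_from_kvstore();
    discover_cluster();
    mark_config_ready();
    start_runtime_services();
}
```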
A big caveat with cluster recovery: fetched `needs_restart` properties can no longer be applied until the recovered node has restarted. There is no clear way to make this _not_ the behavior.
And swap any uses of `shard_local_cfg()` which occur in the bootstrap
process (before it is marked as `ready`) over to the unsafe accessor.
These uses of potentially stale values are all audited and deemed safe,
due to minimal or no impact on the behavior of the system post bootstrap.
Two other relevant changes in this commit:
* Pulling out `feature_table::setup_metrics()` into its own function,
called after the bootstrap process (in order to properly respect
`shard_local_cfg().disable_{public}_metrics`)
* Updating `storage_resources` to _not_ invoke the `internal::chunks()`
singleton constructor, which would in turn invoke `shard_local_cfg()`.
We use `storage_resources` for an early replay of the `kvstore` segments,
so it is fine to use stale configuration values here. However, after
bootstrapping is complete, it is *probably* important to set the
`_append_chunk_size` back to the same value as what `internal::chunks()`
is operating with. We now do so when initializing the storage system in
`wire_up_bootstrap_services()`.
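A sketch of that chunk-size handoff (names are assumptions layered on the PR's description):

```cpp
#include <cstddef>

// During early kvstore replay we run with a possibly stale default; once
// the config is ready, bootstrap re-aligns us with the size the real
// internal::chunks() cache is operating with.
class storage_resources_sketch {
public:
    explicit storage_resources_sketch(size_t append_chunk_size)
      : _append_chunk_size(append_chunk_size) {}

    // Called once the chunk cache (and shard_local_cfg()) is available,
    // analogous to the update performed in wire_up_bootstrap_services().
    void update_append_chunk_size(size_t sz) { _append_chunk_size = sz; }

    size_t append_chunk_size() const { return _append_chunk_size; }

private:
    size_t _append_chunk_size;
};
```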
Force-pushed from 7ac0443 to 07a3683
redpanda: the great config bootstrap fix
BLUF
Re-arrangement of the bootstrap process in order to allow restarting nodes to see an up-to-date view of cluster-wide state before initializing systems that depend on it.
BLUF Continued
Two important cluster-wide states are the cluster configuration and feature table views, which we now attempt to fetch via a new `fetch_controller_snapshot` RPC to the controller leader, similar to a node joining the cluster for the first time.

This is mostly why the config cache exists, as well as why we snapshot member tables and feature tables into the `kvstore`: to allow a view of these properties BEFORE bootstrapping. However, these local snapshots can always be stale, and it is important to bring the node up to date with the controller leader before the system is started.

This prevents bugs like:

- `cloud_topics_enabled` is set on, a cloud topic is created, the node comes back, replays the controller log after bootstrapping itself, doesn't have cloud topics capabilities, but creates the topic anyway in an undefined state.
- `log_compaction_use_sliding_window` is set on, the node comes back, replays the controller log after bootstrapping itself, sets this property on, but no memory reservation has been properly allocated for the sliding window map: a potential segfault.

These are just two examples of the undefined behaviors that can happen when we bootstrap before establishing a cluster-wide view of these configurations and/or features.
We can also no longer promote cluster configs during cluster recovery or during controller log replay after the bootstrapping process - that would be invalid behavior.
Commits
Commits 1-10 are mostly mechanical changes made in advance of a large bootstrap re-ordering in the main `redpanda` application. They are all necessary to enable certain behaviors we need for that bootstrap refactoring. These could be pulled out into separate PR(s) for reviewing purposes.

Commit 11 is important because when applying controller snapshots, we need to ensure that configuration updates are applied before other commands which may depend on them.

Commit 12 removes the promotion of `needs_restart::yes` properties with pending values to active. This was done in a few places, and it seems to be bug-prone everywhere. The only place we can safely promote pending properties is at the beginning of the bootstrap process, when we are reading from our local `config_cache` or a controller snapshot fetched from the leader.

Commits 13-14 add the new RPC for fetching the controller snapshot using the snapshot of the list of members local to a node. This ultimately should be directed to the controller leader, but we permit this request to fail (unlike a join request, which cannot fail, or else a Redpanda node will refuse to start). Failing means we might start with a stale view of the config, but the node will be made eventually consistent after establishing `raft0` and replaying the controller log (the application of which might require the node to restart again).

Commit 15 is the big one which refactors the bootstrap process to use this new RPC, and re-orders a bunch of dependencies in order to do so. It is hard to split this commit up for the sake of reviewing, and the diff is also difficult to look at, so I would encourage trying to read the code directly.

Commit 16 adds `shard_local_cfg_unsafe()` and a new assert to `shard_local_cfg()` which will pick up on bugs caused by use of `shard_local_cfg()` before the cluster view has been established and the configuration has been marked as ready. `shard_local_cfg_unsafe()` use should be limited to configs that have little to no effect on the system if a stale value is used; unfortunately there are a few places in the code where this is currently unavoidable. This commit is somewhat optional in the long run, but it has been a great source of finding bugs and ironing out the bootstrap refactor.

Commit 17 adds some tests which would fail previously due to our mishandling of stale cluster configs (the joining-node test is mostly for regression; it would pass before).
Backports Required
Release Notes
Bug Fixes
- Fixed the use of stale cluster configurations and feature table state by restarting `redpanda` nodes during the bootstrapping process.