Backport Unstable to 9.1 for RC2 #3519
Open
sarthakaggarwal97 wants to merge 46 commits into valkey-io:9.1 from
Conversation
…#3359) The multiStateMemOverhead() function was incorrectly calculating the memory overhead for watched keys. It used sizeof(c->mstate->watched_keys), which is the size of the list structure itself, instead of sizeof(watchedKey), which is the actual per-key overhead. This was introduced in valkey-io#1405. Signed-off-by: Binbin <binloveplay1314@qq.com>
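The fix can be modeled in a few lines. This is a minimal stand-alone sketch; the struct layouts below are illustrative stand-ins, not the real Valkey definitions:

```c
#include <stddef.h>

/* The per-key overhead must use the size of a watchedKey entry,
 * not the size of the list structure that holds the entries. */
typedef struct watchedKey { void *key; int db; } watchedKey;
typedef struct list { void *head, *tail; unsigned long len; } list;

static size_t watchedKeysOverhead(const list *watched_keys) {
    /* buggy:  watched_keys->len * sizeof(*watched_keys)  (list size)  */
    /* fixed:  watched_keys->len * sizeof(watchedKey)     (entry size) */
    return watched_keys->len * sizeof(watchedKey);
}
```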
I'm developing a module to provide luajit as a lua execution engine. See valkey-io#1229 for details. Unfortunately, said module doesn't support debugging yet, so in order to test it with the valkey tests, the scripting debug tests need to be skipped. This patch makes an anonymous test skippable by name. Signed-off-by: secwall <secwall@yandex-team.ru>
These are forked from the RFC instructions created by @zuiderkwast and @hpatro in https://github.com/valkey-io/valkey-rfc/pulls/1 and https://github.com/valkey-io/valkey-rfc/pulls/6. It also includes the Atomic Slot Migration design to bootstrap the folder. --------- Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Probably added by mistake during some merge of valkey-io#1566 Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
…olation (valkey-io#3375) The daily workflow was directly invoking the `valkey-unit-gtests` executable. The intended invocation is to use `gtest-parallel` to ensure that the tests are executed in isolation. Signed-off-by: harrylin98 <harrylin980107@gmail.com>
…ey-io#3380) When a MATCH pattern maps to a specific slot, `CLUSTERSCAN` can skip directly to that slot instead of walking through all slots one by one. - On `cursor 0`, starts directly at the matching slot - If cursor is behind the matching slot, jumps forward - If cursor is ahead of the matching slot, we conclude the scan as we cannot match keys - If both SLOT and MATCH are provided but target different slots, returns 0 immediately Signed-off-by: nmvk <r@nmvk.com>
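The cursor-versus-slot decision described above reduces to a small predicate. A hedged sketch with a hypothetical helper name (the real CLUSTERSCAN cursor encoding is more involved):

```c
/* Given the slot the cursor is currently at and the single slot the MATCH
 * pattern can hash to, return the slot to scan next, or -1 when the scan
 * can be concluded because no further key can match. */
static int nextSlotForMatch(int cursor_slot, int match_slot) {
    if (cursor_slot <= match_slot) return match_slot; /* start at / jump forward to it */
    return -1; /* cursor already past the matching slot */
}
```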
Previously, our workflow used a global concurrency group, which effectively limited execution to one running job and one pending job. Any additional requests were automatically canceled, preventing a true queue from forming. We are now shifting to a model where we remove the concurrency restriction and allow jobs to queue directly on the self-hosted runner. This enables multiple workflow runs to be accepted and queued instead of being dropped. While GitHub can accept workflow triggers at a high rate (e.g., hundreds per minute), the actual execution is still constrained by runner capacity, in our case, a single runner processing one job at a time. However, queued jobs are subject to GitHub’s 24-hour timeout policy. This means any job that waits in the queue for more than 24 hours before starting will be automatically canceled (timed out). In practical terms, this approach improves reliability by eliminating premature cancellations, but the effective queue size is still bounded by how many jobs the runner can process within a 24-hour window. We could increase the number of runners to run these in parallel. Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
…o#3276) Pin package manager dependencies in CI workflows to improve the Pinned-Dependencies score in OpenSSF Scorecard. Changes: - benchmark-on-label.yml, benchmark-release.yml: add `--require-hashes` to `pip install` adding on valkey-perf-benchmark repo: valkey-io/valkey-perf-benchmark#44 - ci.yml: pin `yamlfmt` to `v0.21.0` instead of `@latest` - reply-schemas-linter.yml: use npm ci with `package-lock.json` instead of unpinned npm install, package files in `utils/reply-schema-linter/` Signed-off-by: Roshaan Khatri <rvkhatri@amazon.com> Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
Upload the entire results directory instead of only metrics JSON files. This includes server logs which are useful for debugging benchmark failures. Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
…3209) The `valkey-cli --cluster del-node` command fails when attempting to delete unreachable or failed nodes, reporting `No such node ID` even though the node exists in the cluster topology. The root cause is that the command only loads information about reachable nodes, causing the lookup to fail. This PR adds a new function for loading all node information to solve this. ### Implementation 1. Loading all nodes from gossip: - Added `clusterManagerLoadAllInfoFromNode()` that loads both reachable and unreachable nodes from cluster gossip - Extracts common logic into `clusterManagerLoadInfoCommon()` with an `include_unreachable` flag - Keeps the original `clusterManagerLoadInfoFromNode()` unchanged to avoid affecting existing callers 2. Added a success message to be consistent with other cluster commands: `[OK] Node <id> removed from the cluster.` 3. Added test coverage for `del-node`, which previously had none. 4. Load slot information from gossip for unreachable nodes in `clusterManagerNodeLoadInfo()` 5. Skip unreachable primaries in `clusterManagerNodeWithLeastReplicas()` ### Testing
```
./runtest --single unit/cluster/cli
[ok]: del-node: Cannot delete node with slots (9 ms)
[ok]: del-node: Delete reachable node without slots (23 ms)
[ok]: del-node: Delete unreachable node without slots (1333 ms)
[ok]: del-node: Cannot delete unreachable primary with slots (3368 ms)
```
```
valkey-cli --cluster del-node 127.0.0.1:7000 eb837ea7c48908e5304eafd8b1b3ced57147c448
Could not connect to Valkey at 127.0.0.1:7002: Connection refused
>>> Removing node eb837ea7c48908e5304eafd8b1b3ced57147c448 from cluster 127.0.0.1:7000
>>> Sending CLUSTER FORGET messages to the cluster...
>>> WARNING: Could not connect to node 127.0.0.1:7002, unable to send CLUSTER RESET.
[OK] Node eb837ea7c48908e5304eafd8b1b3ced57147c448 removed from the cluster.
```
### Behavior change Before
```
$ valkey-cli --cluster del-node <entry-node-ip>:<entry-node-port> <failed-node-id>
Could not connect to Valkey at <target-node-ip>:<target-node-port>: Connection refused
>>> Removing node <id> from cluster <entry-node-ip>:<entry-node-port>
[ERR] No such node ID <id>
```
After
```
$ valkey-cli --cluster del-node <entry-node-ip>:<entry-node-port> <failed-node-id>
Could not connect to Valkey at <target-node-ip>:<target-node-port>: Connection refused
>>> Removing node <id> from cluster <entry-node-ip>:<entry-node-port>
>>> Sending CLUSTER FORGET messages to the cluster...
>>> WARNING: Could not connect to node <target-node-ip>:<target-node-port>, unable to send CLUSTER RESET.
[OK] Node <id> removed from the cluster.
```
### Related Issue Fixes valkey-io#3208 --------- Signed-off-by: Yang Zhao <zymy701@gmail.com>
Signed-off-by: harrylin98 <harrylin980107@gmail.com>
…es (valkey-io#3398) ## Problem The `EntryTest.entryUpdate` unit test fails on macOS with (mentioned in valkey-io#3200): Expected: (entryMemUsage(e10)) < (current_embedded_allocation_size * 3 / 4) actual: 48 vs 48 ## Root Cause `entryMemUsage` for embedded entries reflects the actual zmalloc allocation size, which depends on the platform allocator's bucket sizes. Valkey's bundled jemalloc is configured with `LG_QUANTUM=3` (8-byte granularity), giving size classes: 8, 16, 24, 32, 40, 48, 56, 64, ... However, macOS libc uses 16-byte aligned buckets: 16, 32, 48, 64, 80, ... The test's value10 (21 chars) produces an entryReqSize of 40 bytes. Jemalloc has a 40-byte size class, so entryMemUsage returns 40. macOS rounds up to 48, which equals 3/4 of e9's 64-byte allocation, causing the strict less-than assertion to fail. ## Fix Shrink value10 from 21 to 13 characters, reducing entryReqSize from 40 to 32 bytes. Both allocators have a 32-byte bucket, and 32 < 48 holds on both platforms. ## Test All tests pass on macOS. Signed-off-by: Alina Liu <liusalisa6363@gmail.com>
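The rounding behavior described above can be reproduced with a simple size-class model (quantum-spaced buckets, which holds in the 8..64 byte range the commit message discusses: quantum 8 for the bundled jemalloc with LG_QUANTUM=3, quantum 16 for macOS libc):

```c
#include <stddef.h>

/* Round a request up to the allocator's bucket size, assuming buckets
 * are spaced every `quantum` bytes in this size range. */
static size_t bucket_round_up(size_t req, size_t quantum) {
    return (req + quantum - 1) / quantum * quantum;
}
```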
**Title:** ARM NEON SIMD optimization for pvFind() in vset.c **Description:** This PR resolves valkey-io#2806. Thanks to @ranshid for guidance on testing methodology and workload design. ### Summary This PR adds ARM64 NEON SIMD optimization to the `pvFind()` function in vset.c, which performs linear pointer search in pVector. The pVector is used internally by vset to track expired fields in hash objects (HFE). The optimization processes 4 pointers per iteration using 128-bit NEON vector instructions. ### Implementation Details - Added `pvFindSIMD_NEON64()` static inline helper function using NEON intrinsics - Modified `pvFind()` to use SIMD path when `len >= 8` on ARM64 - Added `#include <arm_neon.h>` guarded by `HAVE_ARM_NEON` - No changes to function signatures or external behavior - Scalar fallback remains for non-ARM64 platforms and small vectors ### Benchmark Results #### Micro benchmark (Apple M4 Pro, 50M iterations) Scalar version:
```
len16_mid       | len= 16 pos=  8 |  2.8 ns
len16_last      | len= 16 pos= 15 |  5.0 ns
len32_mid       | len= 32 pos= 16 |  5.1 ns
len64_last      | len= 64 pos= 63 | 15.7 ns
len127_last     | len=127 pos=126 | 34.0 ns
len127_notfound | len=127 pos= -1 | 33.8 ns
```
NEON version:
```
len16_mid       | len= 16 pos=  8 |  2.0 ns | 1.42x
len16_last      | len= 16 pos= 15 |  2.7 ns | 1.85x
len32_mid       | len= 32 pos= 16 |  2.8 ns | 1.79x
len64_last      | len= 64 pos= 63 |  9.2 ns | 1.71x
len127_last     | len=127 pos=126 | 18.1 ns | 1.88x
len127_notfound | len=127 pos= -1 | 17.8 ns | 1.90x
```
#### Production HFE benchmark (stress profile, 120s duration) Workload profile: - 5000 hashes with 64-127 fields each - Short TTLs (2-10 seconds) to trigger expiration - 30% deletes, 60% updates (both trigger pvFind) - 4 concurrent threads
```
Platform    Mode    Avg Time   Throughput   Speedup
----------- ------  --------   ----------   -------
Apple M4    scalar  94.3 ns    10.61 M/s    baseline
Apple M4    NEON    74.4 ns    13.44 M/s    1.27x (21% faster)
```
SIMD utilization (M4 NEON):
```
- 100% of calls used SIMD path
- 99.3% of elements scanned via SIMD
- 74.7% of matches found in SIMD section
- 25.3% found in scalar tail (last 0-3 elements)
```
### Platform Support - **ARM64 (aarch64)**: SIMD enabled via `HAVE_ARM_NEON` - **x86_64 / other**: Falls back to scalar implementation, no behavior change --------- Signed-off-by: Ahmad Belbeisi <ahmadbelb@gmail.com> Signed-off-by: Ahmad Belbeisi <ahmad.belbeisi@tum.de> Co-authored-by: Ran Shidlansik <ranshid@amazon.com>
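As a rough illustration of the approach, here is a hedged, self-contained sketch of a 4-pointers-per-iteration NEON search with a scalar tail. `pvFindSketch` is a hypothetical name modeled on the description above, not the actual vset.c code:

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Search `len` pointers for `target`, 4 per iteration on AArch64.
 * Returns the index of the first match, or -1 if not found. */
static long pvFindSketch(void *const *v, size_t len, const void *target) {
    size_t i = 0;
#if defined(__aarch64__)
    uint64x2_t t = vdupq_n_u64((uint64_t)(uintptr_t)target);
    for (; i + 4 <= len; i += 4) {
        uint64x2_t a = vld1q_u64((const uint64_t *)&v[i]);
        uint64x2_t b = vld1q_u64((const uint64_t *)&v[i + 2]);
        uint64x2_t m = vorrq_u64(vceqq_u64(a, t), vceqq_u64(b, t));
        /* Any nonzero lane means a hit somewhere in this group of 4;
         * fall through to the scalar loop to find the exact index. */
        if (vmaxvq_u32(vreinterpretq_u32_u64(m))) break;
    }
#endif
    for (; i < len; i++) /* scalar tail, and full scan on non-ARM64 */
        if (v[i] == target) return (long)i;
    return -1;
}
```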
…valkey-io#2227) In valkey-io#2023 (valkey-io#2209, etc.), we are exploring ways to make failover faster, that is, to minimize the delay. When a node is marked as FAIL and before the failover starts, there is a delay of 500-1000ms. The original purpose of this delay: 1. Allow FAIL to propagate to at least a majority of the primaries. This makes sure they will vote when a replica sends a failover auth request. 2. Allow replicas to exchange their offsets, so they will have a correct view of their own rank. We want to minimize this delay while ensuring safety. It is useful for example in these cases: 1. If there is only one replica, then we don't need any delay, or 2. If there are more replicas, with a fast network, the replicas can exchange the offsets very quickly and start the failover within a few milliseconds instead of 500-1000ms. In this PR, when we can be sure that the replica is the best ranked replica, we let it initiate a failover immediately and completely remove the delay. ### How to ensure safety? 1. To make sure this replica has the best rank, it only skips the delay if it is sure that it has the best rank and that all replicas in the same shard agree that the primary is failing. A new flag `CLUSTER_NODE_MY_PRIMARY_FAIL` is introduced to indicate that each replica has marked its primary as FAIL. If all replicas say that the primary is failing, we also know that the offset is not updated, because the offset is not incrementing when the primary is failing. We can skip the delay only if we have received a message from all replicas and they all have set this flag. 2. To make sure the primaries will vote even if they didn't receive the FAIL yet, we use the `CLUSTERMSG_FLAG0_FORCEACK` to make sure they will vote. This is equivalent to broadcasting a FAIL message to all primaries before we broadcast the failover auth request (but cheaper).
The race between FAIL (broadcast by A) and AUTH REQUEST (broadcast by R) is illustrated in the following sequence diagram:
```
A      R      B      C
|      |      |      |
| FAIL |      |      |
|----->| AUTH R.|    |
|      |------->|    |
| FAIL |        |    |
|-------------->|    |
|      | AUTH R.|    |
|      |-------------->|
| FAIL |        |    |
|--------------------->|
```
### Details This is how the failover is initiated, with new steps marked with **(new)**: 1. A majority of primaries have marked another primary as PFAIL. 2. Some node counts failure reports and marks the failing primary as FAIL. The node that detects FAIL broadcasts it to all nodes in the cluster. 3. When a replica receives FAIL (or detects FAIL itself by counting PFAIL reports) it schedules a failover: a. It sets a timeout (500ms + random 0-500ms). b. It broadcasts pong to the other replicas in the same shard. c. **(new)** The pong (actually the clusterMsg header) has a new flag `CLUSTER_NODE_MY_PRIMARY_FAIL`. When the replicas broadcast pong to each other here, this flag is set. 4. **(new)** When the following conditions are met, skip the remaining delay and start the failover using AUTH REQUEST with the FORCE ACK flag set, that is if a. a PONG is received from every other replica in the same shard (broadcast within the shard) and b. all replicas have marked that their primary is FAIL in their last message (the new `CLUSTER_NODE_MY_PRIMARY_FAIL` flag is set) and c. this is the best replica (rank = 0) and d. my replication offset != 0. 5. When the delay has passed and no other replica has initiated failover, then initiate failover. Notes: * With 3(c), we don't need to wait for FAIL to propagate to all voting primaries. At this point, a FAIL has already been broadcast by some node, but there is a race so our auth request may arrive at some node before the FAIL. Using the FORCE ACK flag ensures the primaries will vote for us. (It is equivalent to broadcasting another FAIL just before broadcasting the auth request.)
* 4(b) ensures that we have received the replication offset from all other replicas and that it's up to date. If a replica says that its primary is failing, it also means that the replication from the primary to that replica has stopped. * 4(c) is to avoid a special bad case. It can happen that not all replicas know about each other. In this case, two replicas can think they are both the best replica and start the failover at the same time. This can already happen without this PR. When it happens, it usually means that a new replica has just joined and it has no data (offset = 0), and if it wins the election, there is a problem of data loss (discussed and partially mitigated for the replica migration case in valkey-io#885). To avoid this case, skip this fast failover path if the replica has offset = 0. --------- Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
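Conditions 4(a)-(d) above boil down to a simple predicate. A minimal sketch with hypothetical field names (this is not the actual cluster code):

```c
#include <stdbool.h>

/* One replica's local view when deciding whether to skip the delay. */
typedef struct {
    bool pong_from_all_siblings;  /* (a) heard from every other replica in the shard */
    bool all_marked_primary_fail; /* (b) CLUSTER_NODE_MY_PRIMARY_FAIL set by all */
    int rank;                     /* (c) 0 = best ranked replica */
    long long repl_offset;        /* (d) nonzero, to avoid the empty-replica case */
} FailoverView;

static bool canSkipFailoverDelay(const FailoverView *v) {
    return v->pong_from_all_siblings && v->all_marked_primary_fail &&
           v->rank == 0 && v->repl_offset != 0;
}
```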
… tests (valkey-io#3404) These failures seem to be attributed to a race condition in the Aborted test case. `rdb-key-save-delay` 10000 was being set after `$master exec` triggered the full resync. Since repl-diskless-sync-delay 0 was set, the master would immediately start streaming the RDB to the replica once it reconnected. On the ARM runner, when it's fast enough, the entire RDB generation and transfer could complete in ~78ms, before the delay was ever applied. This meant the replica would complete the swap and have 1010 keys instead of the expected 200, and there would be no async_loading window to observe or abort, which led to the failures. We saw this in the daily test failure logs:
```
92948:S * RDB memory usage when created 110.85 Mb
92948:S * Done loading RDB, keys loaded: 1010, keys expired: 0
```
The fix moves `rdb-key-save-delay 10000` to before `$master exec` to guarantee the delay is in effect on the master before the RDB generation begins. Closes valkey-io#3394, closes valkey-io#3395. Signed-off-by: Nikhil Manglore <nmanglor@amazon.com>
The RXE project should be kept at the same version as the CI machine. Show uname in the RDMA CI job to find out the reason for the kmod install failure. Signed-off-by: zhenwei pi <zhenwei.pi@linux.dev>
Fixes valkey-io#3299 Add brief guidance in valkey.conf explaining what the listpack thresholds control and the memory/CPU tradeoff when tuning them. --------- Signed-off-by: Tarte <emprimula@gmail.com> Signed-off-by: KimHuiSu <101166683+Tarte12@users.noreply.github.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
…io#3416) fixes: valkey-io#3200 --------- Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
Avoid freeing the cluster link when EAGAIN occurs, so that we can try again and keep the pending send messages. Signed-off-by: Binbin <binloveplay1314@qq.com>
…htable (valkey-io#3360) Previously, watchForKey() checked for duplicate watched keys by iterating through the client's entire watched_keys list with O(N) complexity, where N is the total number of keys watched by the client. So the time complexity for the WATCH command could be quite poor and it could become a slow command. This commit introduces a per-db hashtable (watched_keys_by_db) in the client's multiState structure to enable O(1) duplicate key detection. The hashtable is lazily allocated only when the client starts watching keys, minimizing memory overhead for clients that don't use WATCH. The per-db hashtable stores watchedKey* directly as the hashtable entry since it already contains the key, so no custom destructors are needed. Memory management remains centralized in the watched_keys list. This optimization is especially beneficial when a client watches many keys across different databases, as the check no longer scales with the total watched key count. This might be a minor scenario, but there's no harm in optimizing it. There is a test in multi.tcl; before this patch it took 15s, and after this patch it took only 50ms.
```
set elements {}
for {set i 0} {$i < 50000} {incr i} {
    lappend elements key-$i
}
r watch {*}$elements
r watch {*}$elements
```
Signed-off-by: Binbin <binloveplay1314@qq.com>
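The O(1) duplicate check can be illustrated with a toy open-addressing set. This is illustrative only: the real code stores watchedKey* in Valkey's hashtable and handles growth, which this fixed-capacity sketch does not:

```c
#include <string.h>

#define SLOTS 64 /* toy fixed capacity; must stay well under-filled */

typedef struct { const char *keys[SLOTS]; } WatchSet;

static unsigned watchHash(const char *s) {
    unsigned h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Returns 1 if key was already watched (O(1) on average, replacing the
 * old O(N) list scan), or 0 if it was newly added. */
static int watchSetAdd(WatchSet *ws, const char *key) {
    unsigned i = watchHash(key) % SLOTS;
    while (ws->keys[i]) {
        if (strcmp(ws->keys[i], key) == 0) return 1; /* duplicate */
        i = (i + 1) % SLOTS; /* linear probing */
    }
    ws->keys[i] = key;
    return 0;
}
```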
In Daily test runs with the `--accurate` flag, the corrupt-dump-fuzzer test runs for 10 minutes (600 seconds) with "sanitize_dump: no" and then another 10 minutes with "sanitize_dump: yes". This causes the runner to time out and the whole test job to be aborted with a SIGTERM from the runner. Example: 13697:signal-handler (1774917673) Received SIGTERM scheduling shutdown... This change reduces this hard-coded fuzzer run from 10 to 1 minute. We have many test jobs so the fuzzer gets plenty of time to run anyway. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
…#3424) The test case "The best replica can initiate an election immediately test" has been failing in CI jobs. Increase the timeout to account for slow runners. Old waiting time: 50 seconds. New waiting time: 120 seconds, with valgrind: 600 seconds. Introduced in valkey-io#2227. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
…ty primary (valkey-io#2811) ## Summary This PR handles a network race condition that would cause a replica to read a stale cluster packet and then incorrectly promote itself to an empty primary within an existing shard. ## Issue The test case "Migrated replica reports zero repl offset and rank, and fails to win election - sigstop" in `replica-migration.tcl` is a known example where the network race condition can occur. Here's the timeline: - T1: Slot migration — R7 and R3 both try to replicate from R0. Only R3 succeeds, triggering a BGSAVE on R0, while R7 blocks in `receiveSynchronousResponse()`. - T2: While R7 is blocked, R0 is SIGSTOP'd by the test. R4 wins the election and becomes the new primary. R4 sends PINGs to R7 - These PINGs land in R7's kernel TCP receive buffer but are not read by R7 yet. - T3 (5s after T1, due to receiveSynchronousResponse): R7 wakes up: - It reads from inbound links and finds out the remote nodes have already closed their end (they detected R7 as FAIL), getting "I/O error: connection closed" on each. R7 calls `accept()` for the new connections that were established during the block. These connections carry data that was sent seconds ago while R7 was dead. - R7's outbound links are still valid, so it sends PING to R4. - T4: R7 receives and processes a fresh PONG packet from R4 on the outbound link, reconfiguring itself to follow R4. - T5: R7 reads a stale PING packet of R4 via the inbound link. This packet was generated while R4 was following R0, so R7 incorrectly believes R4 is still following R0, and reconfigures itself as a replica of R0 instead of R4. - T6: R7 finds R0 is FAIL, so it starts an election and wins, and becomes an empty primary. ## Analysis In T4, R4 is the new primary and R7 is reconfiguring itself as a replica, so in R7's view, R4 is the primary and R7 itself is a replica. Then in T5, there is a stale packet from the sender (R4); the stale packet is saying: the sender (R4) is a replica and R0 is the primary.
We originally had logic for stale packets, meaning we would try to ignore stale packets that would cause problems. So in T5: - sender_claims_to_be_primary is false since R4 is saying it is a replica. - sender_last_reported_as_primary is true since in R7's view, R4 is a primary. - sender_claimed_primary is R0, and sender (R4) and sender_claimed_primary (R0) are in the same shard. - nodeEpoch(sender_claimed_primary) is R0's epoch. R0 is an old and dead (not yet) primary. - sender_claimed_config_epoch is R0's epoch since R4 is a replica, and R0 is R4's primary. - nodeEpoch(sender_claimed_primary) == sender_claimed_config_epoch, so the stale check fails to catch it and we process a stale packet. At this point, the packet is not a stale packet from R4's and R0's view. - But it is a stale packet from R7's view. In R7's local view, R4 is the new primary and it should have a bigger epoch, that is, nodeEpoch(sender) should be > sender_claimed_config_epoch. ## Fix The PR fixes the issue by enhancing the existing guardrail logic against stale packets. Previously that logic only detected `nodeEpoch(sender_claimed_primary) > sender_claimed_config_epoch` as a stale packet; now it also checks `nodeEpoch(sender) > sender_claimed_config_epoch` to make sure we have an up-to-date primary-replica chain. Signed-off-by: Zhijun <dszhijun@gmail.com> Co-authored-by: Binbin <binloveplay1314@qq.com>
Fixing multiple flaky tests.
slave buffer are counted correctly in tests/unit/maxmemory.tcl
Memory efficiency with values in range * in tests/unit/memefficiency.tcl
These tests send large numbers of pipelined commands using deferring
clients without reading replies, causing the server's client output
buffer to grow. On slow CI runners, this leads to TCP backpressure and
I/O errors that crash the test runner. Fix: Use CLIENT REPLY OFF to
suppress reply generation, matching the pattern from commit 87d2330.
---
Sub-replica reports zero repl offset and rank, and fails to win election
in tests/unit/cluster/replica-migration.tcl
New non-empty replica reports zero repl offset and rank, and fails to
win election in tests/unit/cluster/replica-migration.tcl
In the replica-migration tests, a MOVED error results in a Tcl
exception. After failover, wait_for_condition blocks issue GET commands
to cluster nodes that may not have fully updated their slot routing. An
unhandled MOVED exception crashes the test runner. Fix: Wrap the
condition in catch so MOVED errors are retried. Also wrap debug prints
in the else clause. Fixes the following tests:
---
Replica can update the config epoch when trigger the failover -
automatic in tests/unit/cluster/failover2.tcl
Increase wait timeout for failover expiry. The test waits 10 seconds for
"Failover attempt expired", but the default cluster-node-timeout in
start_cluster is 3000ms, making auth_timeout 6 seconds plus ~3
seconds for failure detection — barely fitting in 10 seconds and failing
on slow CI runners. Fix: Increase wait from 1000×10ms to 1200×50ms
(60 seconds).
---
dual-channel-replication lazyfree test
The test looks up the replica's main-channel connection id after writing
50MB of data. On slow CI runners, the replica connection may have been
disconnected by the output buffer soft limit (64MB/60s) before the
lookup, causing get_client_id_by_last_cmd to return empty. Two changes:
1. Move the connection id lookup before the write loop, while the sync
is known to be in progress.
2. Reduce writes from 50 x 1MB to 10 x 1MB. The test only needs enough
data to exceed the lazyfree threshold (64 blocks ~= 1MB). 10MB is
sufficient and avoids approaching the output buffer limit.
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Found while implementing `clusterNodeGetSlotRangeEnd` for `CLUSTERSCAN`
range bounded scanning, which uses a similar approach.
When tested on s390x via Docker and QEMU, the slot boundary came out as
5440 instead of 5461.
```
127.0.0.1:6001> clusterscan 0-{06S}-0
1) "0-{4HD}-0"
2) 1) "{06S}key1"
127.0.0.1:6001> keyslot 0-{4HD}-0
(error) ERR unknown command 'keyslot', with args beginning with: '0-{4HD}-0'
127.0.0.1:6001> cluster keyslot 0-{4HD}-0
(integer) 5440
127.0.0.1:6001> cluster nodes
08e28d7e8dcfc731ac537d0518bfa32577da6ec7 127.0.0.1:6002@16002 master - 0 1774415944347 2 connected 5461-10922
53f6d4e61eee13ea98441eec05b3c4c95c6c83a4 127.0.0.1:6003@16003 master - 0 1774415943314 3 connected 10923-16383
46664ac624f001cc0708e8ddcfcc9e45e8f444a0 127.0.0.1:6001@16001 myself,master - 0 0 1 connected 0-5460
```
Updating here as it is the same approach and does not follow `memrev64ifbe`
followed by `memcpy`.
Signed-off-by: nmvk <r@nmvk.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
Replace 'const int seqBufferMaxLength' with a #define to avoid a variable-length array warning (-Wgnu-folding-constant) in C. Also add -Werror to the linenoise Makefile so future warnings are caught at build time. Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
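The distinction can be shown in a few lines (a sketch; `SEQ_BUFFER_MAX_LENGTH` is a stand-in name, not necessarily the linenoise identifier):

```c
/* In C (unlike C++), a `const int` is not an integer constant expression,
 * so using it as an array length declares a variable-length array, and
 * clang warns with -Wgnu-folding-constant:
 *
 *     const int seqBufferMaxLength = 64;
 *     char seq[seqBufferMaxLength];        (a VLA in C, fine in C++)
 *
 * A macro is a true compile-time constant, giving an ordinary array: */
#define SEQ_BUFFER_MAX_LENGTH 64

static char seq_buffer[SEQ_BUFFER_MAX_LENGTH];

static unsigned seqBufferCapacity(void) { return (unsigned)sizeof(seq_buffer); }
```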
…3420) valkey-cli and valkey-benchmark link only a small subset of .o files and do not include cluster_migrateslots.o. At -O3 this works because the compiler either inlines or discards the unused static function — the symbol reference never reaches the linker. At -O0 (no inlining, no dead-code elimination), the compiler emits the full body of getClientType into every .o that includes server.h, producing an unresolved reference to _isImportSlotMigrationJob at link time. The fix is to avoid including server.h as part of the external applications' compilation dependencies. Fixes: valkey-io#3415 --------- Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
…r Valkey commands. (valkey-io#3309) I was fixing grammar issues in the valkey-swift client (valkey-io/valkey-swift#357), and would like to apply the same fixes - missing words, full sentence punctuation, and imperative verb forms - to the breadth of the commands. --------- Signed-off-by: Joe Heck <j_heck@apple.com> Signed-off-by: Joseph Heck <j_heck@apple.com> Co-authored-by: Sarthak Aggarwal <sarthakaggarwal97@gmail.com> Co-authored-by: Lucas Yang <lucasyonge@gmail.com>
Update Lucas Affiliation to Percona Signed-off-by: Lucas Yang <lucasyonge@gmail.com>
This PR increases the embedded string threshold to 128 bytes (2 cache lines) in order to reduce memory overhead. I also fixed the tests that pertained to embedded values. Closes valkey-io#3025 --------- Signed-off-by: Nikhil Manglore <nmanglor@amazon.com>
) Fix two things: 1. Incorrect RDB object size reported when writing hash fields with expiration, because of wrong position of parentheses. Affects the serialized size reported in DEBUG OBJECT. 2. Memory leak when loading a zipmap (only used in really old RDB versions). Signed-off-by: charsyam <charsyam@naver.com> Co-authored-by: Harkrishn Patro <bunty.hari@gmail.com>
Some test cases write thousands of commands in a pipeline and afterwards read the replies. This can lead to TCP ACKs being dropped and the connection breaking. CLIENT REPLY OFF prevents this. "Main db not affected when fail to diskless load" in cluster/diskless-load-swapdb has been observed to be flaky. The others are just defensive but they follow the same pattern. Similar fixes in the past: valkey-io#3430, valkey-io#2483. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
The faster-failover.tcl test starts 12 cluster nodes 5 times with TLS, which is CPU intensive due to the volume of TLS handshakes. Running alongside other tests on CI runners causes random I/O timeouts in unrelated tests. Tag it as `tls:skip`. Signed-off-by: nmvk <r@nmvk.com>
…y-io#3443) In valkey.conf, slot-migration-max-failover-repl-bytes allows setting to -1 to disable the limit. ``` Setting this to -1 will disable this limit ``` But slot-migration-max-failover-repl-bytes is defined as MEMORY_CONFIG and memtoull() rejects negative inputs, making it impossible to set the value to -1 via config file or CONFIG SET. ``` >>> 'slot-migration-max-failover-repl-bytes "-1"' argument must be a memory value ``` Introduce a SIGNED_MEMORY_CONFIG flag for memory configs that also accept a plain negative number. When memtoull() fails and this flag is set, fall back to string2ll() for parsing. Use ll2string() for CONFIG GET and rewriteConfigNumericalOption() for CONFIG REWRITE when the value is negative. Add a serverAssert in initConfigValues() to enforce that PERCENT_CONFIG and SIGNED_MEMORY_CONFIG are never combined on the same config, since both use negative values with different semantics. This means we have had this issue since it was introduced in valkey-io#1949. Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
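A minimal sketch of the fallback parse, using simplified stand-ins for memtoull()/string2ll() (only two units handled here; the real parser accepts more units and is case-insensitive):

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Try a memory-value parse first; if that fails, accept a plain signed
 * integer so that -1 can mean "no limit". Returns 1 on success. */
static int parseSignedMemoryConfig(const char *s, long long *out) {
    char *end;
    errno = 0;
    unsigned long long v = strtoull(s, &end, 10);
    if (errno == 0 && end != s && s[0] != '-') { /* memtoull-like path */
        if (*end == '\0') { *out = (long long)v; return 1; }
        if (strcmp(end, "mb") == 0) { *out = (long long)(v << 20); return 1; }
        if (strcmp(end, "gb") == 0) { *out = (long long)(v << 30); return 1; }
        /* (other units elided in this sketch) */
    }
    errno = 0; /* fallback: plain signed number, e.g. "-1" */
    long long sv = strtoll(s, &end, 10);
    if (errno == 0 && end != s && *end == '\0') { *out = sv; return 1; }
    return 0;
}
```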
…y-io#3461) Increase time to wait for bgsave to start. This wait has been seen failing in these test cases. The test case loops with 'no', 'slow', 'fast', 'all' and 'timeout' replicas, in tests/integration/replication.tcl Example: *** [err]: diskless fast replicas drop during rdb pipe in tests/integration/replication.tcl rdb child didn't terminate Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
…igs (valkey-io#3440) rewriteConfigFormatMemory() and rewriteConfigBytesOption() used long long parameters to represent memory bytes, but configs like maxmemory are stored as unsigned long long and allow values up to ULLONG_MAX. When the value exceeds LLONG_MAX (e.g. 9223372036854775808), it overflows to negative when passed as long long, causing config rewrite to produce incorrect output like "maxmemory -8589934592gb". Change both functions to use unsigned long long, matching the actual semantics of memory byte values. This is consistent with how numericConfigGet() already handles MEMORY_CONFIG using ull2string(). Some MEMORY_CONFIG configs are signed and can be negative: 1. maxmemory-clients is a PERCENT_CONFIG 2. slot-migration-max-failover-repl-bytes is a SIGNED_MEMORY_CONFIG (after valkey-io#3443) None of them were affected:
```
static void numericConfigRewrite(standardConfig *config, const char *name, struct rewriteConfigState *state) {
    ...
    if (config->data.numeric.flags & PERCENT_CONFIG && value < 0) {
        rewriteConfigPercentOption(state, name, -value, config->data.numeric.default_value);
    } else if (config->data.numeric.flags & SIGNED_MEMORY_CONFIG && value < 0) {
        rewriteConfigNumericalOption(state, name, value, config->data.numeric.default_value);
    } else if (config->data.numeric.flags & MEMORY_CONFIG) {
        rewriteConfigBytesOption(state, name, value, config->data.numeric.default_value);
    ...
```
Signed-off-by: Binbin <binloveplay1314@qq.com>
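The overflow can be demonstrated with the value from the commit message. A hedged stand-in for the rewrite helper keeps the value unsigned end to end (a long long parameter would receive 2^63 as a negative number):

```c
#include <stdio.h>
#include <string.h>

/* Format a memory byte count for config rewrite. Taking the parameter
 * as unsigned long long (the fix) preserves values above LLONG_MAX. */
static void formatBytes(unsigned long long bytes, char *buf, size_t len) {
    snprintf(buf, len, "%llu", bytes);
}
```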
In rdbLoadObject, when sdstrynewlen succeeds and hashtableAdd succeeds but lpSafeToAdd fails, the field will be double-freed. But lpSafeToAdd will only fail when len + add > LISTPACK_MAX_SAFETY_SIZE (1GB), so it can rarely happen and is trivial. Signed-off-by: charsyam <charsyam@naver.com>
…-io#3463) Two changes to tests/unit/cluster/faster-failover.tcl: 1. FAIL detection timeout: `wait_for_condition 1000 10` → `1000 50` (10s → 50s) 2. psync_max_retries: 1200 → 2400 normal (120s → 240s), 6000 → 12000 valgrind (600s → 1200s) The test `The best replica can initiate an election immediately in an automatic failover` in `tests/unit/cluster/faster-failover.tcl` has been flaky since it was introduced on March 27, 2026 by valkey-io#2227. **Frequency:** 8 out of 15 days (Mar 27 – Apr 8), across valgrind, sanitizer, and slow CI runners. **Common errors:** - `log message of "Successful partial resynchronization with primary" not found` (timeout waiting for psync) - `expected pattern found in srv -N log file: *best ranked replica*` (timeout waiting for FAIL propagation) The test spins up a 12-node cluster (5 primaries + 7 replicas), pauses nodes, and waits for FAIL detection to propagate across all nodes before failover + partial resync. A previous fix attempt valkey-io#3424 increased the psync timeout from 50s to 120s (600s valgrind), which reduced frequency but did not eliminate it. Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
…valkey-io#3471) Previously, `redactClientCommandArgument()` forced a full copy of `c->argv` into `original_argv` immediately, then replaced the sensitive slot with `shared.redacted`. This meant any call to redact an argument triggered an eager argv backup even when no rewriting was needed. The new approach stores the indices of arguments to be redacted in a small dynamic array (`redact_args`) on the client struct. Redaction is applied lazily when the argv is actually consumed (e.g. by the command log), avoiding the unnecessary argv copy on the hot path. This change is required for the implementation of `ValkeyModule_CallArgv` module API function (valkey-io#3122) , because CallArgv owns the `argv` array, which can't be modified by the redact function. --------- Signed-off-by: Ricardo Dias <ricardo.dias@percona.com>
…havior (valkey-io#3372) This commit introduces a new configuration option `cluster-config-save-behavior` that controls how the cluster handles nodes.conf file save failures. The option supports two modes: - `sync` (default): Synchronously save the config file. If the save fails, the process exits. This maintains backward compatibility with the old behavior (before 9.1). - `best-effort`: Synchronously save the config file. If the save fails, only log a warning and continue running. This allows the node to survive disk failures (e.g., disk full, read-only filesystem) without exiting, giving administrators time to address the issue. Note that this modifies the behavior of valkey-io#1032: whereas valkey-io#1032 was "best-effort", we have now introduced a configuration option that defaults to "sync". See the valkey-io#1032 discussion for more details. Background: When a disk becomes read-only or full, any cluster metadata change would trigger a nodes.conf save attempt. With the old behavior, the node would immediately exit via clusterSaveConfigOrDie(), potentially causing multiple nodes on the same machine to crash simultaneously, leading to cluster unavailability. The new `best-effort` mode addresses this by allowing nodes to continue operating even when disk writes fail. This is particularly useful in cloud environments where disk failures are more common due to scale. Note: Startup-time config saves (in clusterInit and verifyClusterConfigWithData) still use clusterSaveConfigOrDie() since disk issues at startup should cause immediate failure. Signed-off-by: Binbin <binloveplay1314@qq.com>
This PR introduces `ValkeyModule_CallArgv`, a new low-level module API
for calling server commands. It is designed as a complement to the
existing `ValkeyModule_Call` covering two scenarios where
`ValkeyModule_Call` is suboptimal:
1. **The module already holds the arguments as an argv array.**
`ValkeyModule_Call` always allocates a new argv array, creates a new
string object for the command name, and calls
`incrRefCount`/`decrRefCount` on every argument. `ValkeyModule_CallArgv`
borrows the caller's argv array directly — no allocation, no refcount
manipulation. Ownership of the array stays with the caller.
2. **The module wants to forward the reply to its own client without
inspecting it.** `ValkeyModule_Call` always parses the raw RESP bytes
into an intermediate `ValkeyModuleCallReply` tree and then re-serializes
it back to RESP to send to the caller. `ValkeyModule_CallArgv` exposes
the raw RESP bytes directly through a callback, allowing a pass-through
module command to copy bytes from the inner command's output buffer
straight into the outer client's output buffer.
---
## New API
### `ValkeyModule_CallArgv`
```c
int ValkeyModule_CallArgv(ValkeyModuleCtx *ctx,
                          ValkeyModuleString **argv,
                          int argc,
                          int flags,
                          ValkeyModuleReplyHandlers *reply_handlers,
                          void *reply_ctx);
```
Calls a server command using an existing argv array. The caller retains
ownership of `argv` and its elements — they are never freed or
ref-counted by the function.
Returns `VALKEYMODULE_OK` on success or `VALKEYMODULE_ERR` on error,
with `errno` set to the same values as `ValkeyModule_Call`. The function
does **not** return a `ValkeyModuleCallReply`; the reply is delivered
through `reply_handlers` instead.
`flags` is a bitwise OR of one or more `VALKEYMODULE_CALL_ARGV_*`
constants, which correspond directly to the format-string specifiers of
`ValkeyModule_Call`:
| Flag | Equivalent format char | Meaning |
|---|---|---|
| `VALKEYMODULE_CALL_ARGV_REPLICATE` | `!` | Propagate to replicas and AOF |
| `VALKEYMODULE_CALL_ARGV_NO_AOF` | `A` | Skip AOF propagation |
| `VALKEYMODULE_CALL_ARGV_NO_REPLICAS` | `R` | Skip replica propagation |
| `VALKEYMODULE_CALL_ARGV_RESP_3` | `3` | Force RESP3 for the inner call |
| `VALKEYMODULE_CALL_ARGV_RESP_AUTO` | `0` | Match the calling client's protocol (RESP2 or RESP3) |
| `VALKEYMODULE_CALL_ARGV_RUN_AS_USER` | `C` | Apply ACL checks for the context user |
| `VALKEYMODULE_CALL_ARGV_SCRIPT_MODE` | `S` | Mark as script execution |
| `VALKEYMODULE_CALL_ARGV_NO_WRITES` | `W` | Disallow write commands |
| `VALKEYMODULE_CALL_ARGV_ERRORS_AS_REPLIES` | `E` | Return error replies instead of raising errno |
| `VALKEYMODULE_CALL_ARGV_RESPECT_DENY_OOM` | `M` | Honour deny-oom policy |
| `VALKEYMODULE_CALL_ARGV_DRY_RUN` | `D` | Dry-run mode (implies `ERRORS_AS_REPLIES`) |
| `VALKEYMODULE_CALL_ARGV_ALLOW_BLOCK` | `K` | Allow the inner command to block |
| `VALKEYMODULE_CALL_ARGV_REPLY_EXACT` | `X` | Disable reply type coercion |
---
### `ValkeyModuleReplyHandlers`
```c
typedef struct ValkeyModuleReplyHandlers {
/* Scalar types */
void (*null)(void *ctx);
void (*bulkString)(void *ctx, const char *str, size_t len);
void (*simpleString)(void *ctx, const char *str, size_t len);
void (*error)(void *ctx, const char *msg, size_t len);
void (*integer)(void *ctx, long long val);
void (*doubleVal)(void *ctx, double val);
void (*boolVal)(void *ctx, int val);
void (*bigNumber)(void *ctx, const char *str, size_t len);
void (*verbatimString)(void *ctx, const char *str, size_t len, const char *fmt);
/* Collection types — paired start/end callbacks */
void (*arrayStart)(void *ctx, size_t len);
void (*arrayEnd)(void *ctx);
void (*mapStart)(void *ctx, size_t len);
void (*mapEnd)(void *ctx);
void (*setStart)(void *ctx, size_t len);
void (*setEnd)(void *ctx);
void (*attributeStart)(void *ctx, size_t len);
void (*attributeEnd)(void *ctx);
/* Invoked on parse error */
void (*replyParsingError)(void *ctx);
/* Raw-bytes intercept — called before per-type callbacks */
int (*onRespAvailable)(void *ctx, ValkeyModuleCtx *module_ctx, const char *proto, size_t proto_len);
/* Blocking command support */
void (*onBlocked)(void *ctx, ValkeyModuleCtx *module_ctx, ValkeyModuleCallArgvBlockedHandle *handle);
void *context; /* Passed as first argument to every callback */
} ValkeyModuleReplyHandlers;
```
All fields are optional; set unused ones to `NULL`.
**`onRespAvailable`** is called first, before any per-type callback,
with the complete raw RESP reply in the `proto`/`proto_len` arguments.
Return `1` to continue into the per-type callbacks; return `0` to stop
after the raw bytes. Pass `NULL` to skip and go straight to per-type
dispatching.
**`onBlocked`** is only called when `VALKEYMODULE_CALL_ARGV_ALLOW_BLOCK`
is set and the inner command blocks. The `handle` can be passed to
`ValkeyModule_CallArgvAbort` to cancel the call. The handle is valid
until either `onRespAvailable` is invoked or
`ValkeyModule_CallArgvAbort` is called, whichever comes first.
---
### `ValkeyModule_CallArgvAbort`
```c
int ValkeyModule_CallArgvAbort(ValkeyModuleCallArgvBlockedHandle *handle);
```
Cancels a blocking call whose `onBlocked` callback has already fired.
Returns `VALKEYMODULE_OK` if the abort succeeded (neither
`onRespAvailable` nor any other reply callback will be invoked), or
`VALKEYMODULE_ERR` if the call has already completed.
---
### `ValkeyModule_ReplyRaw`
```c
int ValkeyModule_ReplyRaw(ValkeyModuleCtx *ctx, const char *proto, size_t proto_len);
```
Appends raw RESP bytes directly to the calling client's output buffer,
with no parsing or re-serialization. Intended for use inside an
`onRespAvailable` callback to implement zero-copy reply forwarding.
---
## Usage Example: `ValkeyModule_Call` vs `ValkeyModule_CallArgv`
The following two implementations of a generic pass-through proxy
command are functionally equivalent, but use different code paths
internally.
### With `ValkeyModule_Call` (existing API)
```c
int ProxyCommand(ValkeyModuleCtx *ctx, ValkeyModuleString **argv, int argc) {
if (argc < 2) return ValkeyModule_WrongArity(ctx);
/* VM_Call builds a new argv array internally, creates a string object for
* the command name, and calls incrRefCount/decrRefCount on every argument.
* The reply is parsed into a CallReply tree, then re-serialized to RESP. */
ValkeyModuleCallReply *reply =
ValkeyModule_Call(ctx, ValkeyModule_StringPtrLen(argv[1], NULL), "v",
argv + 2, (size_t)(argc - 2));
return ValkeyModule_ReplyWithCallReply(ctx, reply);
}
```
### With `ValkeyModule_CallArgv` and `onRespAvailable` (new API —
pass-through)
```c
static int forwardRawReply(void *ctx, ValkeyModuleCtx *mctx,
                           const char *proto, size_t proto_len) {
(void)ctx;
/* Write the raw RESP bytes directly to the outer client — no parse, no
* re-serialize, no intermediate allocation. */
ValkeyModule_ReplyRaw(mctx, proto, proto_len);
return 0; /* stop — do not invoke per-type callbacks */
}
static ValkeyModuleReplyHandlers passthrough_handlers = {
.onRespAvailable = forwardRawReply,
};
int ProxyCommand(ValkeyModuleCtx *ctx, ValkeyModuleString **argv, int argc) {
if (argc < 2) return ValkeyModule_WrongArity(ctx);
/* VM_CallArgv borrows argv directly (no copy, no refcount changes).
* The reply is written straight into the outer client's output buffer
* through the onRespAvailable callback above. */
if (ValkeyModule_CallArgv(ctx, argv + 1, argc - 1,
VALKEYMODULE_CALL_ARGV_RESP_AUTO,
&passthrough_handlers,
NULL) == VALKEYMODULE_ERR) {
return ValkeyModule_ReplyWithError(ctx, "ERR command failed");
}
return VALKEYMODULE_OK;
}
```
Use `VALKEYMODULE_CALL_ARGV_RESP_AUTO` to ensure the inner command
replies in the same protocol version as the outer client (RESP2 or
RESP3), so the raw bytes can be forwarded without modification.
---
## Internal changes
### `argv_borrowed` client flag
A new `argv_borrowed` bit is added to the client flags struct. When
`VM_CallArgv` runs the inner command on a temporary client, it sets this
flag to signal that the temp client does not own the argv array.
`freeClientArgv` and `freeClientOriginalArgv` respect the flag and skip
freeing. `resetClient` clears it unconditionally after each command.
Several string commands (`SET`, `MSET`, `APPEND`, …) that previously
encoded or reused argv objects in-place are guarded so they do not
mutate argv elements they do not own.
A debug-only assertion (enabled when `enable-debug-assert yes` is set in
the config) verifies at runtime that no command modifies the argv array
or decrements any element's ref-count when `argv_borrowed` is set.
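As a rough illustration of the `argv_borrowed` guard, here is a minimal sketch. The struct and function names below are invented for the example and are not the real Valkey client struct or free path.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical simplified client: the real struct holds robj pointers and
 * refcounts; here we use plain heap strings to keep the sketch runnable. */
typedef struct fakeClient {
    char **argv;
    int argc;
    unsigned argv_borrowed : 1; /* set when argv belongs to a module caller */
} fakeClient;

static void freeClientArgvSketch(fakeClient *c) {
    if (!c->argv_borrowed) {
        /* owned path: release every argument and the array itself */
        for (int i = 0; i < c->argc; i++) free(c->argv[i]);
        free(c->argv);
    }
    /* borrowed or owned, the client drops its reference either way */
    c->argv = NULL;
    c->argc = 0;
}
```

The key property is that a borrowed array survives the client's cleanup untouched, so `ValkeyModule_CallArgv` callers keep full ownership.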
---------
Signed-off-by: Ricardo Dias <ricardo.dias@percona.com>
Signed-off-by: Ricardo Dias <rjd15372@gmail.com>
#### Summary
This PR redesigns the IO threading communication model, replacing the
inefficient client-list polling approach with a high-performance,
lock-free queue architecture. This change improves throughput by
**8–17%** across various workloads and lays the groundwork for
offloading command execution to IO threads in follow-up PRs.
### Performance Comparison: Unstable vs New IO Queues
| Type | Operation | Unstable Branch (M TPS) | New IO Queues (M TPS) | Difference (%) |
| :--- | :--- | :--- | :--- | :--- |
| **CME**<sup>1</sup> | SET | 1.02 | 1.19 | **+16.67%** |
| **CME** | GET | 1.30 | 1.47 | **+13.08%** |
| **CMD**<sup>2</sup> | SET | 1.15 | 1.35 | **+17.39%** |
| **CMD** | GET | 1.52 | 1.64 | **+7.89%** |
<sup>1</sup> Amazon terminology for cluster mode
<sup>2</sup> Amazon terminology for standalone mode, i.e. config
`cluster-enabled no`
- Test Configuration: 8 IO threads • 400 clients • 512-byte values • 3M
keys
#### Motivation
The previous IO model had several limitations that created performance
bottlenecks:
* **Inefficient Polling:** The main thread lacks a direct notification
mechanism for completed work. Instead, it must constantly iterate
through a list of all pending clients to check their state, wasting
significant CPU cycles.
* **Manual Load Balancing:** Jobs are assigned to specific threads
upfront. This requires the main thread to predict which thread to use,
often leaving some threads idle while others are overloaded.
* **Static Scaling:** Thread activation relies on a fixed heuristic
(e.g., 1 thread per 2 events). This approach fails to adapt to varying
workloads, such as TLS connections or differing read/write sizes.
### The Solution
To address these inefficiencies, this PR replaces the single SPSC queue
used currently with three specialized queues to handle communication and
load balancing more effectively.
#### 1. Main > IO: Shared Queue (Single Producer Multi Consumer)
Single queue from the main-thread to IO threads.
* **Automatic Load Balancing:** All threads pull from the same source.
Busy threads take less work, and idle threads take more, so we don't
need to manually select a thread.
* **Adaptive Scaling:** We now use the queue depth to decide when to add
or remove threads. If the queue is full, we scale up; if it's empty, we
scale down.
* *Ignition:* To get things started before the queue fills up, we
monitor the main thread's CPU. If usage goes over 30%, we wake up the
first IO thread.
* **Implementation:** To prevent contention among consumers, each item
in the ring buffer is padded to reside in its own cache line. Sequence
numbers are utilized to indicate whether a cell is empty or populated,
allowing threads to safely claim work.
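The padded-cell and sequence-number scheme above is in the spirit of Vyukov-style bounded queues. Below is a minimal single-file sketch under that assumption; all names and sizes are illustrative and do not reflect the actual Valkey source.

```c
#include <stdatomic.h>
#include <stddef.h>

#define QUEUE_SIZE 1024 /* power of two */

/* Each cell is padded to a cache line so consumers don't false-share. */
typedef struct {
    _Atomic size_t seq; /* == index: free for producer; == index+1: ready */
    void *job;
    char pad[64 - sizeof(_Atomic size_t) - sizeof(void *)];
} cell_t;

typedef struct {
    cell_t cells[QUEUE_SIZE];
    _Atomic size_t tail; /* single producer (main thread) */
    _Atomic size_t head; /* many consumers (IO threads) */
} spmc_t;

void spmcInit(spmc_t *q) {
    for (size_t i = 0; i < QUEUE_SIZE; i++) atomic_init(&q->cells[i].seq, i);
    atomic_init(&q->tail, 0);
    atomic_init(&q->head, 0);
}

/* Single producer: returns 0 when the queue is full. */
int spmcEnqueue(spmc_t *q, void *job) {
    size_t pos = atomic_load_explicit(&q->tail, memory_order_relaxed);
    cell_t *c = &q->cells[pos & (QUEUE_SIZE - 1)];
    if (atomic_load_explicit(&c->seq, memory_order_acquire) != pos) return 0;
    c->job = job;
    atomic_store_explicit(&c->seq, pos + 1, memory_order_release); /* publish */
    atomic_store_explicit(&q->tail, pos + 1, memory_order_relaxed);
    return 1;
}

/* Any consumer: claims a published cell by CAS on head; 0 when empty. */
int spmcDequeue(spmc_t *q, void **job) {
    size_t pos = atomic_load_explicit(&q->head, memory_order_relaxed);
    for (;;) {
        cell_t *c = &q->cells[pos & (QUEUE_SIZE - 1)];
        size_t seq = atomic_load_explicit(&c->seq, memory_order_acquire);
        if (seq == pos + 1) {
            if (atomic_compare_exchange_weak_explicit(&q->head, &pos, pos + 1,
                                                      memory_order_relaxed,
                                                      memory_order_relaxed)) {
                *job = c->job;
                /* hand the cell back to the producer, one lap ahead */
                atomic_store_explicit(&c->seq, pos + QUEUE_SIZE, memory_order_release);
                return 1;
            } /* CAS failure refreshed pos; retry */
        } else if (seq < pos + 1) {
            return 0; /* cell not yet published: queue is empty */
        } else {
            /* another consumer already took this cell; refresh head */
            pos = atomic_load_explicit(&q->head, memory_order_relaxed);
        }
    }
}
```

Because idle threads simply keep dequeuing from the same head, load balancing falls out of the structure with no per-thread job assignment.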
#### 2. IO > Main: The Response Channel (MPSC Queue)
We replaced the old polling loop with a response queue.
* **Faster Completion:** IO threads push completed jobs into this
queue. The main thread detects new data simply by checking if the queue
is not empty, removing the need to scan pending clients.
* **Contention Management:** To avoid lock contention, each thread
reserves a slot by atomically incrementing the tail index. In the rare
event that the queue is full, pending jobs are buffered in a local
temporary list until space becomes available.
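The slot-reservation idea above can be sketched as follows. This is illustrative only (invented names, not the actual Valkey code), and where the real design buffers jobs in a local list when the queue is full, this sketch simply spins.

```c
#include <stdatomic.h>
#include <stddef.h>

#define MPSC_SIZE 512 /* power of two */

/* Each producer reserves a slot by atomically incrementing the tail, then
 * publishes the cell with a sequence number so the single consumer (the
 * main thread) only sees it once the job pointer is actually written. */
typedef struct {
    _Atomic size_t seq; /* == slot index + 1 once the job is published */
    void *job;
} respCell;

typedef struct {
    respCell cells[MPSC_SIZE];
    _Atomic size_t tail; /* shared among producer threads */
    size_t head;         /* single consumer: plain variable suffices */
} respQueue;

void respQueueInit(respQueue *q) {
    for (size_t i = 0; i < MPSC_SIZE; i++) atomic_init(&q->cells[i].seq, i);
    atomic_init(&q->tail, 0);
    q->head = 0;
}

/* Producer: reserve, then publish. */
void respEnqueue(respQueue *q, void *job) {
    size_t pos = atomic_fetch_add_explicit(&q->tail, 1, memory_order_relaxed);
    respCell *c = &q->cells[pos & (MPSC_SIZE - 1)];
    while (atomic_load_explicit(&c->seq, memory_order_acquire) != pos)
        ; /* previous-lap occupant not yet consumed (real code buffers locally) */
    c->job = job;
    atomic_store_explicit(&c->seq, pos + 1, memory_order_release);
}

/* Consumer: "is there work?" is one atomic load, no client scanning. */
int respDequeue(respQueue *q, void **job) {
    respCell *c = &q->cells[q->head & (MPSC_SIZE - 1)];
    if (atomic_load_explicit(&c->seq, memory_order_acquire) != q->head + 1) return 0;
    *job = c->job;
    atomic_store_explicit(&c->seq, q->head + MPSC_SIZE, memory_order_release);
    q->head++;
    return 1;
}
```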
#### 3. Main > IO (Thread-Specific): Private Inbox (SPSC Queue)
We kept the existing Single-Producer Single-Consumer (SPSC) queues for
tasks that must happen on a specific thread (like freeing memory
allocated by that thread). IO threads always check their private inbox
before looking at the shared queue.
### Changes Required
* **Async client release**
The main thread no longer busy-waits for IO threads to finish with a
client. Since the client must be popped from the multi-producer queue
before it can be released, clients with pending IO are now marked for
asynchronous closure.
* **client eviction logic**
Updated evictClients() to account for memory pending release (clients
marked close_asap). freeClient() now returns a status code (1 for freed,
0 for async-close) to ensure the eviction loop does not over-evict by
ignoring memory that is about to be reclaimed.
* **events-per-io-thread config**
Replaced the `events-per-io-thread` configuration with
`io-threads-always-active`, as we no longer track events. Since this
config is used only for tests, no backward-compatibility issue arises.
* **packed job instead of handlers**
Jobs are now represented as tagged pointers (using lower 3 bits for job
type) instead of separate `{handler, data}` structs. This reduces memory
overhead and allows jobs to be passed through the queues as single
pointers.
* **head caching in spsc queue**
The SPSC queue now caches the `head` index on the producer side
(`head_cache`) to avoid frequent atomic loads. The producer only
refreshes from the atomic `head` when the cache indicates the queue
might be full, reducing cross-thread cache-line bouncing.
* **deferred commit in SPSC queue**
`spscEnqueue()` supports batching via a `commit` flag. Multiple jobs can
be enqueued with `commit=false`, then flushed with a single
`spscCommit()` call, reducing atomic operations and cache-line bouncing.
* **rollback on fullness check failure**
When `spmcEnqueue()` fails due to a full queue, the client state is
rolled back (e.g., `io_write_state` reset to `CLIENT_IDLE`). This
rollback approach removes the need to call an expensive `isFull` check
before every enqueue, we just attempt the enqueue and revert if it
fails.
* **epoll offloading via SPSC at high thread counts**
When `active_io_threads_num > 9`, poll jobs are sent to per-thread SPSC
queues (round-robin). Since threads check their private queue first,
this ensures poll jobs are processed promptly without waiting behind
jobs in the shared SPMC queue.
* **avoid offload write before read comes back**
Added a check `if (c->io_read_state == CLIENT_PENDING_IO) return C_OK`
in `trySendWriteToIOThreads()`. In the previous per-thread SPSC
implementation, we could send consecutive read and write jobs for the
same client knowing a single thread would handle them in order. With the
shared SPMC queue, different threads may pick up the jobs, so we must
wait for the read to complete before sending a write to avoid 2 threads
handling the same client.
* **removing pending_read_list_node from client and
clients_pending_io_read/write lists from server**
Removed `pending_read_list_node` from the `client` struct and
`clients_pending_io_read`/`clients_pending_io_write` lists from
`valkeyServer`, as the new MPSC queue eliminates the need for these
tracking structures.
* **added inst metrics for pending io jobs**
Added `instantaneous_io_pending_jobs` metric via `STATS_METRIC_IO_WAIT`
to track average queue depth over time.
* **added stat for current active threads number**
Added `active_io_threads_num` to the INFO stats output for better
visibility.
* **added internal inst metric for main-thread CPU (not available on
macOS)**
Added `STATS_METRIC_MAIN_THREAD_CPU_SYS` to track main thread CPU usage
via `getrusage(RUSAGE_THREAD)`. This powers the "ignition" policy, when
CPU exceeds 30%, the first IO thread is activated. `RUSAGE_THREAD` is
Linux-specific, so macOS falls back to event-count heuristics.
* **added stat for pending read and writes for io**
Added `io_threaded_reads_pending` and `io_threaded_writes_pending` stats
to track how many read/write jobs are currently in-flight to IO threads.
* **added volatile for crashed**
Changed `server.crashed` from `int` to `volatile int` to ensure the
crash flag is visible across threads immediately, allowing IO threads to
detect a crash and stop sending responses back to the main thread to
avoid deadlock on crash.
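Three of the mechanisms listed above (tagged-pointer jobs, producer-side head caching, and deferred commit in the SPSC queue) can be combined in one self-contained sketch. Every name and size below is an assumption for illustration, not the actual implementation.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define SPSC_SIZE 256 /* power of two */

/* Tagged-pointer jobs: type in the low 3 bits of an 8-byte-aligned pointer,
 * so a whole job travels through the queue as a single pointer. */
enum { JOB_READ = 0, JOB_WRITE = 1, JOB_POLL = 2 };
static inline void *jobPack(void *p, int type) { return (void *)((uintptr_t)p | (uintptr_t)type); }
static inline int jobType(void *job) { return (int)((uintptr_t)job & 0x7); }
static inline void *jobPtr(void *job) { return (void *)((uintptr_t)job & ~(uintptr_t)0x7); }

typedef struct {
    void *items[SPSC_SIZE];
    _Atomic size_t head; /* consumer position */
    _Atomic size_t tail; /* producer position, published only on commit */
    size_t head_cache;   /* producer-local snapshot of head */
    size_t tail_local;   /* producer-local tail for uncommitted items */
} spscQueue;

void spscInit(spscQueue *q) {
    atomic_init(&q->head, 0);
    atomic_init(&q->tail, 0);
    q->head_cache = 0;
    q->tail_local = 0;
}

/* Producer: refresh head_cache only when the queue looks full, and publish
 * tail only when commit is non-zero (deferred commit batches the release). */
int spscEnqueue(spscQueue *q, void *job, int commit) {
    if (q->tail_local - q->head_cache >= SPSC_SIZE) {
        q->head_cache = atomic_load_explicit(&q->head, memory_order_acquire);
        if (q->tail_local - q->head_cache >= SPSC_SIZE) return 0; /* truly full */
    }
    q->items[q->tail_local & (SPSC_SIZE - 1)] = job;
    q->tail_local++;
    if (commit) atomic_store_explicit(&q->tail, q->tail_local, memory_order_release);
    return 1;
}

void spscCommit(spscQueue *q) {
    atomic_store_explicit(&q->tail, q->tail_local, memory_order_release);
}

/* Consumer side. */
int spscDequeue(spscQueue *q, void **job) {
    size_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
    if (h == atomic_load_explicit(&q->tail, memory_order_acquire)) return 0;
    *job = q->items[h & (SPSC_SIZE - 1)];
    atomic_store_explicit(&q->head, h + 1, memory_order_release);
    return 1;
}
```

Batching several `spscEnqueue(..., 0)` calls and finishing with one `spscCommit()` means the consumer pays one acquire load per batch rather than per job, which is the cache-line-bouncing reduction described above.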
---------
Signed-off-by: Uri Yagelnik <uriy@amazon.com>
Signed-off-by: akash kumar <akumdev@amazon.com>
Co-authored-by: Uri Yagelnik <uriy@amazon.com>
Co-authored-by: Dan Touitou <dan.touitou@gmail.com>
…key-io#3499) Minor cleanup, the function will return -1 on error. It worked anyway before because it is implicitly converted: -1 => SIZE_MAX => -1 again when it's implicitly cast back to ssize_t. Signed-off-by: Binbin <binloveplay1314@qq.com>
The test introduced in valkey-io#2227 is fragile because it unconditionally asserted that "best ranked replica" would be logged for every shard's replica. This assumption was incorrect for two reasons: 1. myselfIsBestRankedReplica() requires failover_failed_primary_rank == 0. When two primaries fail simultaneously, only the shard with the lowest shard_id gets failed_primary_rank == 0. The other shard's replicas can only trigger "Myself become the best ranked replica" after the first shard completes failover and the result propagates via gossip. 2. clusterAllReplicasThinkPrimaryIsFail() requires all sibling replicas to have their MY_PRIMARY_FAIL flag propagated via gossip. If the election delay is shorter than the gossip propagation time, the replica may start the election before receiving the flag. The test is also too strict and checks too many things: we can check that there is no delay before starting the failover, but we shouldn't check that the rank 0 replica actually wins, because it may not. We also can't be sure the other replica does a psync afterwards; it may do a full sync instead. Restructure the test into three focused scenarios with retry logic: - Test 0 (3 primaries + 1 replica): The sole replica is deterministically the best ranked replica. All conditions are trivially satisfied. - Test 1 (3 primaries + 2 replicas): Single primary failure with two competing replicas. failed_primary_rank is deterministically 0, but MY_PRIMARY_FAIL gossip timing may prevent triggering in some runs. Uses internal retry with cluster recovery to ensure at least one successful trigger. - Test 2 (5 primaries + 7 replicas): Two primaries failure. Verifies no failover timeout and uses retry to cover the "best ranked replica" path for at least one shard. Signed-off-by: Binbin <binloveplay1314@qq.com>
Codecov Report

```
@@ Coverage Diff @@
##              9.1    #3519      +/-   ##
==========================================
+ Coverage   74.57%   76.46%   +1.88%
==========================================
  Files         130      161      +31
  Lines       72731    80488    +7757
==========================================
+ Hits        54239    61544    +7305
- Misses      18492    18944     +452
```
@sarthakaggarwal97 can you also backport #3306? It is targeted for 9.0 and 9.1, just got merged, would be awesome to include it here. Also do we need to mention it in release notes? Some users may see more client evictions that they have not seen previously because of improved tracking |
This PR helps support backporting commits from unstable to the 9.1 branch for the RC2 release.

Cherry-picks 46 of 50 commits from `unstable` into `9.1`, covering all PRs marked "To be backported" in the Valkey 9.1 project board and other PRs which are fixes / CI improvements.

**Bug Fixes (10)**
- Fix `valkey-cli --cluster del-node` for unreachable nodes #3209

**Performance & Features (8)**

**Test Fixes (9)**

**CI & Infrastructure (10)**

**Build Fixes (2)**

**Documentation & Maintenance (5)**