Streaming Compression support for RDB #3531

Open
sarthakaggarwal97 wants to merge 3 commits into valkey-io:unstable from sarthakaggarwal97:streaming-compression-rio-pr

sarthakaggarwal97 (Contributor) commented Apr 17, 2026

Related to #3195

Overview

Today, rdbcompression only affects individual string payloads inside an otherwise normal RDB stream. This PR adds Valkey Compressed Stream (VKCS) support for RDB persistence, with lz4 as the first supported whole-stream codec.

The RDB can now be wrapped in a VKCS envelope and compressed as a single stream at the rio layer. The default codec remains lzf, so default behavior is unchanged. On load, the RDB path probes for the envelope first. If it is present and valid, the input is transparently decompressed before normal RDB parsing continues. If it is absent, loading falls back to the existing plain RDB path. If the envelope is malformed or incompatible, the load fails early.

Architecture

+------------------------+       +------------------------------+       +----------------------+
|    RDB save/load       |       |         rio wrappers         |       |      transport       |
|                        |       |                              |       |                      |
| rdbSaveRio             |------>| compress_rio_t               |------>| dump.rdb / sync RDB  |
| rdbLoadRio             |<------| decompress_rio_t             |<------| file rio             |
| rdbSaveRawString       |       | stream_writer / reader       |       |                      |
| valkey-check-rdb       |       | VKCS envelope + codec state  |       |                      |
+------------------------+       +------------------------------+       +----------------------+

Key Design Decisions

  1. Whole-stream compression is implemented as a rio decorator rather than as part of object serialization.
  2. VKCS adds a small envelope ahead of the compressed payload so loaders can classify the stream before normal RDB parsing begins.
  3. lzf remains the default and plain RDB compatibility is preserved.
  4. Per-string LZF is skipped when whole-stream compression is active, so the two compression modes are not stacked.
  5. stream_kind validation rejects incompatible VKCS streams early.
  6. rdbLoad and valkey-check-rdb both handle plain and VKCS-wrapped inputs transparently.
  7. Diskless replication stays on the existing non-VKCS path in this PR.
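
The decorator idea in decision 1 can be illustrated with a minimal sketch. Everything here is hypothetical: `sketchStream`, `memSink`, and `wrapState` are illustrative names rather than the real rio types, the 4-byte tag is a placeholder for the actual envelope, and a toy run-length encoder stands in for the LZ4 frame encoder. The point is only that the serializer writes through one interface and never learns whether a wrapper is present.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical minimal stream interface; this is NOT the real rio struct. */
typedef struct sketchStream sketchStream;
struct sketchStream {
    /* Returns 1 on success and 0 on error. */
    int (*write)(sketchStream *s, const void *buf, size_t len);
    void *ctx; /* Backend-specific state. */
};

/* Innermost sink: appends bytes to a fixed in-memory buffer. */
typedef struct {
    unsigned char data[256];
    size_t used;
} memSink;

static int memSinkWrite(sketchStream *s, const void *buf, size_t len) {
    memSink *m = s->ctx;
    if (len > sizeof(m->data) - m->used) return 0; /* Buffer full. */
    memcpy(m->data + m->used, buf, len);
    m->used += len;
    return 1;
}

static void memSinkInit(sketchStream *s, memSink *m) {
    m->used = 0;
    s->write = memSinkWrite;
    s->ctx = m;
}

/* Decorator state: forwards transformed bytes to an inner stream. */
typedef struct {
    sketchStream *inner;
    int tagWritten;
} wrapState;

static int wrapWrite(sketchStream *s, const void *buf, size_t len) {
    wrapState *w = s->ctx;
    const unsigned char *p = buf;
    if (!w->tagWritten) {
        /* Placeholder 4-byte tag standing in for the real envelope. */
        if (!w->inner->write(w->inner, "VKCS", 4)) return 0;
        w->tagWritten = 1;
    }
    /* Toy run-length encoding as a stand-in for the LZ4 frame encoder:
     * each run of equal bytes becomes a (count, byte) pair. */
    size_t i = 0;
    while (i < len) {
        size_t run = 1;
        while (i + run < len && p[i + run] == p[i] && run < 255) run++;
        unsigned char pair[2] = {(unsigned char)run, p[i]};
        if (!w->inner->write(w->inner, pair, 2)) return 0;
        i += run;
    }
    return 1;
}

static void wrapInit(sketchStream *s, wrapState *w, sketchStream *inner) {
    assert(inner != NULL);
    w->inner = inner;
    w->tagWritten = 0;
    s->write = wrapWrite;
    s->ctx = w;
}

/* The serializer only sees the interface, so it is unaware of wrapping. */
static int serializeDemo(sketchStream *s) {
    return s->write(s, "aaaabbb", 7);
}
```

Calling `serializeDemo` on a bare `memSink` stream stores the 7 raw bytes, while calling it on a stream set up with `wrapInit` stores the tag followed by the encoded pairs. The serializer code is identical in both cases, which is the property the decorator approach is meant to preserve.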

Data Flow: Save Path

rdbSave()
   |
   | if rdbcompression=yes and algo=lz4
   v
rioInitWithCompress(...)
   |
   | write VKCS envelope
   | feed normal RDB bytes into LZ4 frame encoder
   v
dump.rdb

Important notes:

  • The VKCS header stores magic, version, codec, flags, and stream_kind.
  • rdbchecksum yes enables codec-native integrity signaling for the compressed stream.
  • When whole-stream compression is active, the existing per-string LZF path is bypassed.
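
To make the header fields concrete, here is a rough sketch of encoding and decoding such an envelope. The commit notes in this PR mention an 8-byte envelope, but the layout used below (a 4-byte "VKCS" magic followed by one byte each for version, codec, flags, and stream_kind) is an assumption for illustration, as are all `sketch*` names and the field values; the actual on-disk format is defined by the PR's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* ASSUMED layout: 4-byte magic followed by four 1-byte fields. */
#define SKETCH_ENVELOPE_LEN 8
static const char sketchMagic[4] = {'V', 'K', 'C', 'S'};

/* Hypothetical in-memory view of the envelope fields. */
typedef struct {
    uint8_t version;
    uint8_t codec;
    uint8_t flags;
    uint8_t streamKind;
} sketchEnvelope;

/* Serializes the envelope into out, which must hold SKETCH_ENVELOPE_LEN bytes. */
static void sketchEnvelopeEncode(const sketchEnvelope *e, uint8_t *out) {
    assert(e != NULL && out != NULL);
    memcpy(out, sketchMagic, 4);
    out[4] = e->version;
    out[5] = e->codec;
    out[6] = e->flags;
    out[7] = e->streamKind;
}

/* Returns 1 and fills *e when buf starts with a complete envelope, else 0.
 * Only structure is checked here; field values are validated by the caller. */
static int sketchEnvelopeDecode(const uint8_t *buf, size_t len, sketchEnvelope *e) {
    assert(buf != NULL && e != NULL);
    if (len < SKETCH_ENVELOPE_LEN) return 0;
    if (memcmp(buf, sketchMagic, 4) != 0) return 0;
    e->version = buf[4];
    e->codec = buf[5];
    e->flags = buf[6];
    e->streamKind = buf[7];
    return 1;
}
```

Keeping the envelope fixed-size and uncompressed means a loader can read it with a single short read before committing to any codec state.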

Data Flow: Load Path

rdbLoad() / valkey-check-rdb
   |
   | rdbInputStreamPrepare()
   | probe first bytes
   v
[plain RDB] ------------------------> passthrough rio -> normal RDB parser
[VKCS + kind=RDB + valid header] ---> decompress_rio -> normal RDB parser
[malformed/incompatible VKCS] ------> fail early
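
The three-way classification in the diagram above could look roughly like the following. This is a sketch under stated assumptions, not the PR's `rdbInputStreamPrepare()`: the function name `sketchProbe`, the enum values, and the byte layout (4-byte "VKCS" magic, then version, codec, flags, stream_kind as single bytes) are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef enum {
    SKETCH_PROBE_PLAIN,    /* No envelope: hand bytes to the normal RDB parser. */
    SKETCH_PROBE_VKCS_RDB, /* Valid envelope: insert a decompressing wrapper. */
    SKETCH_PROBE_INVALID   /* Envelope present but unusable: fail early. */
} sketchProbeResult;

/* Hypothetical field values; the real constants live in the PR. */
enum {
    SKETCH_VERSION = 1,
    SKETCH_CODEC_LZ4 = 1,
    SKETCH_KIND_RDB = 1
};

/* Classifies the first bytes of an input stream. */
static sketchProbeResult sketchProbe(const uint8_t *buf, size_t len) {
    assert(buf != NULL || len == 0);
    /* Without the magic this is not a VKCS stream; the plain RDB path
     * performs its own validation. */
    if (len < 4 || memcmp(buf, "VKCS", 4) != 0) return SKETCH_PROBE_PLAIN;
    /* Magic present but envelope truncated. */
    if (len < 8) return SKETCH_PROBE_INVALID;
    if (buf[4] != SKETCH_VERSION) return SKETCH_PROBE_INVALID;
    if (buf[5] != SKETCH_CODEC_LZ4) return SKETCH_PROBE_INVALID;
    /* stream_kind check: reject envelopes that wrap something other than RDB. */
    if (buf[7] != SKETCH_KIND_RDB) return SKETCH_PROBE_INVALID;
    return SKETCH_PROBE_VKCS_RDB;
}
```

Note the asymmetry: once the magic matches, any further problem is treated as an error rather than a fallback to the plain path, which matches the "fail early" branch for malformed or incompatible streams.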

This PR is intentionally scoped to RDB streaming compression and the persistence/load paths that need to understand it. It does not add diskless sync or replication compression APIs.

Benchmarks

Benchmarked on r7g.2xlarge (Graviton, 8 vCPUs, 61GB RAM, NVMe). All results averaged over 3 repeats.

Datasets

  • Improved Realistic JSON: synthetic JSON with natural-language text and varied field types. The natural-language content makes this data more compressible. Tested at 100B–10KB value sizes.
    [Figure: final_improved_combined]
  • BlockMesh Tweets: 1M/5M unique real tweets from BlockMesh/tweets. Multilingual, avg ~270B, zero cross-key repetition.
    [Figure: final_blockmesh_combined]

Notes:

  1. LZ4 streaming beats LZF on every metric at every tested size: 30-77% faster saves, 24-73% faster loads, 45-73% smaller RDBs.
  2. The LZ4 library is currently vendored from https://github.com/lz4/lz4. That decision was made in "[NEW] RDB Compression via LZ4, Batching, and Batch-Level Dictionary" #1962 (comment).

codecov Bot commented Apr 18, 2026

Codecov Report

❌ Patch coverage is 92.92605% with 132 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.73%. Comparing base (03c2d4c) to head (a241fd7).
⚠️ Report is 9 commits behind head on unstable.

Files with missing lines         Patch %   Lines
src/compression_stream.c         91.20%    35 Missing ⚠️
src/compression_rio.c            79.36%    26 Missing ⚠️
src/valkey-check-rdb.c           61.29%    24 Missing ⚠️
src/rio.c                        66.66%    21 Missing ⚠️
src/rdb.c                        85.71%    13 Missing ⚠️
src/compression.c                91.54%    6 Missing ⚠️
src/aof.c                        78.94%    4 Missing ⚠️
src/compression_lz4.c            97.59%    2 Missing ⚠️
src/unit/test_compression.cpp    99.89%    1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #3531      +/-   ##
============================================
+ Coverage     76.21%   76.73%   +0.52%     
============================================
  Files           159      164       +5     
  Lines         81672    81858     +186     
============================================
+ Hits          62243    62815     +572     
+ Misses        19429    19043     -386     
Files with missing lines         Coverage Δ
src/config.c                     78.15% <100.00%> (-0.15%) ⬇️
src/rdb.h                        100.00% <ø> (ø)
src/rio.h                        100.00% <ø> (ø)
src/server.h                     100.00% <ø> (ø)
src/unit/test_compression.cpp    99.89% <99.89%> (ø)
src/compression_lz4.c            97.59% <97.59%> (ø)
src/aof.c                        80.29% <78.94%> (-0.06%) ⬇️
src/compression.c                91.54% <91.54%> (ø)
src/rdb.c                        77.27% <85.71%> (+0.36%) ⬆️
src/rio.c                        83.51% <66.66%> (-1.33%) ⬇️
... and 3 more

... and 121 files with indirect coverage changes

sarthakaggarwal97 changed the title from "Compression: Add Streaming RIO-based RDB compression" to "Add Streaming Compression support for RDB" on Apr 20, 2026
sarthakaggarwal97 changed the title from "Add Streaming Compression support for RDB" to "Streaming Compression support for RDB" on Apr 20, 2026
sarthakaggarwal97 marked this pull request as ready for review on April 21, 2026 16:13
sarthakaggarwal97 force-pushed the streaming-compression-rio-pr branch from cf051d0 to e82019b on April 22, 2026 21:26
Add streaming LZ4-backed RDB compression with rio decorators, stream envelope handling, integration changes, and the follow-up fixes and config cleanup needed on top of unstable.

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
sarthakaggarwal97 force-pushed the streaming-compression-rio-pr branch from e82019b to 3764776 on April 22, 2026 21:31
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
- Remove dead code: rdbIsValidMagic() and unused #include <string.h> in rdb.h
- Remove redundant first RIO_FLAG_SKIP_RDB_CHECKSUM set in rdbSaveInternal
- Remove unrelated changes: config_parse_depth, USE_FAST_FLOAT, write-make-settings
- Validate full 8-byte VKCS envelope in aof.c rdbFileUsesStreamingCompression
- Add SAFETY comment for rdbRioHasCorruptCompressedInput cast invariant
- Rename all snake_case identifiers to camelCase per Valkey conventions:
  types (compression_algo_t -> compressionAlgo, stream_compressor_t ->
  streamCompressor, compress_rio_t -> compressRio, etc.), functions
  (stream_writer_create -> streamWriterCreate, compress_rio_finish ->
  compressRioFinish, write_vkcs_envelope -> writeVkcsEnvelope, etc.),
  and static variables (compression_lz4_codec_impl -> compressionLz4CodecImpl)
- Drop _t suffix from all types to match Valkey convention
- Fix typo: streamWriterIsErrord -> streamWriterIsErrored
- Replace silent dummy buffer allocation with assert(needed > 0) in
  streamWriterEnsureOutBuf

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>