perf(worker-core): replace AsyncArrowWriter with parquet-ext#1852

Merged
JohnSwan1503 merged 13 commits into main from
Swanson/worker-core/performance/replace-arrow-writer-with-parquet-ext
Feb 25, 2026

@JohnSwan1503 JohnSwan1503 commented Feb 24, 2026

This pull request introduces a new crate, parquet-ext, and integrates it into the workspace, primarily to provide extended and async Parquet writing capabilities. The changes include adding the new crate, updating dependencies, and refactoring the codebase to use the new async Parquet writer implementation. The most significant changes are grouped below:

1. Addition of the parquet-ext crate:

  • Added a new crate parquet-ext with support for async and zstd features, and implemented core modules for async Arrow writing, pipeline backends, and builder patterns. (crates/parquet-ext/Cargo.toml [1] crates/parquet-ext/src/arrow/async_writer/mod.rs [2] crates/parquet-ext/src/lib.rs [3] and related new files)

2. Integration into workspace and dependencies:

  • Registered crates/parquet-ext in the workspace members and added parking_lot as a workspace dependency. (Cargo.toml [1] [2])
  • Updated crates/core/worker-core to depend on the new parquet-ext crate with async and zstd features, and switched parking_lot to a workspace dependency. (crates/core/worker-core/Cargo.toml, L20-R28)

3. Refactoring to use the new async Arrow writer:

  • Replaced usage of parquet::arrow::AsyncArrowWriter with parquet_ext::arrow::async_writer::AsyncArrowWriter in parquet_writer.rs. (crates/core/worker-core/src/parquet_writer.rs [1] [2])

4. Monitoring and logging updates:

5. Implementation details in parquet-ext:

  • Implemented core pipeline, backend, and builder modules, including async batch handling, encoder factories, and shutdown logic for background threads. (e.g., crates/parquet-ext/src/writer/pipeline/encoder/batches.rs [1] crates/parquet-ext/src/backend/mod.rs [2] crates/parquet-ext/src/writer/pipeline/encoder/executor.rs [3] crates/parquet-ext/src/writer/pipeline/encoder/factory.rs [4] crates/parquet-ext/src/writer/pipeline/backend.rs [5] crates/parquet-ext/src/builder/mod.rs [6])
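
For readers unfamiliar with the pattern, here is a minimal, std-only sketch of a background encoder thread with explicit shutdown. It is not the parquet-ext implementation: `EncoderPipeline`, `Batch`, `Encoded`, and `encode` are hypothetical names, and the "encoding" is a placeholder. The point it illustrates is the shutdown order: drop the sending half of the channel first so the worker loop can end, and only then join the thread. Joining while the sender is still alive would block forever.

```rust
use std::sync::mpsc::{self, Receiver, SyncSender};
use std::thread::{self, JoinHandle};

/// Hypothetical stand-in for a record batch.
type Batch = Vec<u64>;
/// Hypothetical stand-in for an encoded chunk.
type Encoded = Vec<u8>;

/// Placeholder "encoding": the little-endian bytes of each value.
fn encode(batch: &Batch) -> Encoded {
    batch.iter().flat_map(|v| v.to_le_bytes()).collect()
}

/// Illustrative pipeline: one background thread encodes submitted batches.
pub struct EncoderPipeline {
    tx: Option<SyncSender<Batch>>,
    rx: Receiver<Encoded>,
    worker: Option<JoinHandle<()>>,
}

impl EncoderPipeline {
    /// `capacity` bounds the input queue, which gives backpressure to callers.
    pub fn new(capacity: usize) -> Self {
        let (tx, batch_rx) = mpsc::sync_channel::<Batch>(capacity);
        // The output channel is unbounded, so the worker never blocks on send.
        let (encoded_tx, rx) = mpsc::channel::<Encoded>();
        let worker = thread::spawn(move || {
            // This loop ends once every sender for `batch_rx` has been dropped.
            for batch in batch_rx {
                if encoded_tx.send(encode(&batch)).is_err() {
                    break;
                }
            }
        });
        Self { tx: Some(tx), rx, worker: Some(worker) }
    }

    /// Queues a batch; returns false if the pipeline is already shut down.
    pub fn submit(&self, batch: Batch) -> bool {
        match &self.tx {
            Some(tx) => tx.send(batch).is_ok(),
            None => false,
        }
    }

    /// Shuts the worker down and returns everything it encoded, in order.
    pub fn finish(mut self) -> Vec<Encoded> {
        self.shutdown();
        let encoded: Vec<Encoded> = self.rx.try_iter().collect();
        encoded
    }

    /// Idempotent: drop the sender first, then join the worker.
    fn shutdown(&mut self) {
        drop(self.tx.take());
        if let Some(worker) = self.worker.take() {
            let _ = worker.join();
        }
    }
}

impl Drop for EncoderPipeline {
    fn drop(&mut self) {
        // Dropping without calling `finish` must not leave the thread running.
        self.shutdown();
    }
}
```

Calling `submit` a few times and then `finish` returns the encoded chunks in submission order; dropping the pipeline without calling `finish` also stops the worker cleanly.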

These changes collectively introduce a new, extensible, and async-capable Parquet writing layer, and refactor the codebase to use this new implementation.

…rrowWriter is used.

Signed-off-by: John Swanson <jswanson@edgeandnode.com>

@LNSD LNSD left a comment

LGTM ✅

Replace the external git dependency on parquet-ext with a local
  workspace crate. The new crate provides a pipelined parquet writer
  with async support, row group encoding, progress tracking, and
  configurable writer properties. Update worker-core to use the
  new import path.

Signed-off-by: John Swanson <jswanson@edgeandnode.com>
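
To make "row group encoding" and "progress tracking" from the commit message above concrete, here is a small std-only sketch. It assumes a writer that buffers rows until a configured row-group size is reached and publishes counters through shared atomics. `Progress`, `RowGroupBuffer`, and `max_rows` are hypothetical names for illustration, not the crate's API.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Hypothetical progress counters, readable from another thread or task.
#[derive(Default)]
pub struct Progress {
    rows: AtomicU64,
    row_groups: AtomicU64,
}

impl Progress {
    /// Rows contained in row groups flushed so far.
    pub fn rows(&self) -> u64 {
        self.rows.load(Ordering::Relaxed)
    }

    /// Number of row groups flushed so far.
    pub fn row_groups(&self) -> u64 {
        self.row_groups.load(Ordering::Relaxed)
    }
}

/// Illustrative buffer that cuts incoming rows into fixed-size row groups.
pub struct RowGroupBuffer {
    max_rows: usize,
    pending: Vec<u64>,
    flushed: Vec<Vec<u64>>,
    progress: Arc<Progress>,
}

impl RowGroupBuffer {
    pub fn new(max_rows: usize, progress: Arc<Progress>) -> Self {
        assert!(max_rows > 0, "a row group must hold at least one row");
        Self { max_rows, pending: Vec::new(), flushed: Vec::new(), progress }
    }

    /// Appends rows, flushing a row group each time the size limit is reached.
    pub fn write(&mut self, rows: &[u64]) {
        for &row in rows {
            self.pending.push(row);
            if self.pending.len() == self.max_rows {
                self.flush();
            }
        }
    }

    /// Emits the pending rows as one row group and updates the counters.
    pub fn flush(&mut self) {
        if self.pending.is_empty() {
            return;
        }
        let group = std::mem::take(&mut self.pending);
        self.progress.rows.fetch_add(group.len() as u64, Ordering::Relaxed);
        self.progress.row_groups.fetch_add(1, Ordering::Relaxed);
        self.flushed.push(group);
    }

    /// Flushes any partial trailing group and returns all row groups.
    pub fn close(mut self) -> Vec<Vec<u64>> {
        self.flush();
        self.flushed
    }
}
```

In this sketch the counters only advance when a row group is flushed, so a reader holding a clone of the `Arc<Progress>` sees completed work rather than rows that are still buffered.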
…writer-with-parquet-ext

Signed-off-by: John Swanson <jswanson@edgeandnode.com>
Signed-off-by: John Swanson <jswanson@edgeandnode.com>
…r async operations

Signed-off-by: John Swanson <jswanson@edgeandnode.com>
…to prevent deadlock on shutdown

Signed-off-by: John Swanson <jswanson@edgeandnode.com>
…ltiple crates

Signed-off-by: John Swanson <jswanson@edgeandnode.com>
… item management

Signed-off-by: John Swanson <jswanson@edgeandnode.com>
…ered leaf collection

Signed-off-by: John Swanson <jswanson@edgeandnode.com>
Signed-off-by: John Swanson <jswanson@edgeandnode.com>
Add crate README documenting architecture, usage, feature flags, performance characteristics, and configuration. Add two criterion benchmark suites comparing parquet-ext against Arrow's built-in writer: arrow_writer (encoding microbenchmark) and end_to_end_writer (full async pipeline). Export EncoderFactory from the public API to support the encoding benchmark.

Signed-off-by: John Swanson <jswanson@edgeandnode.com>
format(benches): reorganize imports for better readability

Signed-off-by: John Swanson <jswanson@edgeandnode.com>
Signed-off-by: John Swanson <jswanson@edgeandnode.com>
@JohnSwan1503 JohnSwan1503 merged commit 3f89d56 into main Feb 25, 2026
7 checks passed
@JohnSwan1503 JohnSwan1503 deleted the Swanson/worker-core/performance/replace-arrow-writer-with-parquet-ext branch February 25, 2026 01:45

3 participants