Add backend performance workflow and tuning #5
Changes from 2 commits
2891602
60dd817
d9a660d
6b34d81
74836df
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,20 @@ | ||
| # AGENTS | ||
|
|
||
| ## Performance Work | ||
|
|
||
| Use the dedicated backend-performance workflow in [`docs/performance-workflow.md`](docs/performance-workflow.md). | ||
|
|
||
| Key entry points: | ||
|
|
||
| - Full deterministic regression gate: `scripts/compare_backends.sh` | ||
| - Focused backend workloads: `scripts/compare_workloads.sh` | ||
| - High-level API workloads: `scripts/compare_api_workloads.sh` | ||
| - Single-backend profiling: `scripts/profile_backend.sh` | ||
| - Optimized code inspection: `scripts/inspect_codegen.sh` | ||
|
|
||
| Important constraints: | ||
|
|
||
| - Never compare C and Rust sys backends in the same process. They export the same `lzma_*` symbols. | ||
| - Treat `scripts/compare_backends.sh` as the coarse deterministic gate. | ||
| - Use `qc` and `size` focused workloads for paths that are property-test-like or otherwise input-sensitive. | ||
| - Keep workload shape, iteration count, and profiler command stable while iterating on a hotspot. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,139 @@ | ||
| # Performance Workflow | ||
|
|
||
| This repository now has a repeatable loop for backend comparison, profiling, and code generation inspection. | ||
|
|
||
| Important: the C and Rust sys backends must never be linked into the same process when comparing performance. Both export the same `lzma_*` symbols, so shared-process comparisons can silently resolve to the wrong implementation. | ||
|
|
||
| ## 1. Baseline correctness | ||
|
|
||
| Keep correctness green before taking timings: | ||
|
|
||
| ```bash | ||
| cargo test --no-default-features --features rust-backend | ||
| cargo test --no-default-features --features c-backend | ||
| cargo test --no-default-features --features rust-backend,c-backend --test sys_equivalence | ||
| ``` | ||
|
|
||
| ## 2. Compare the full test suite | ||
|
|
||
| Use `hyperfine` to compare the end-to-end wall-clock time of the deterministic root test bundle and `systest`, with an isolated target directory per backend: | ||
|
|
||
| ```bash | ||
| scripts/compare_backends.sh --runs 10 --warmup 2 | ||
| ``` | ||
|
|
||
| This writes: | ||
|
|
||
| - `target/perf-results/root-tests.json` | ||
| - `target/perf-results/root-tests.md` | ||
| - `target/perf-results/systest.json` | ||
| - `target/perf-results/systest.md` | ||
|
|
||
| Those reports are the coarse regression gate. Run them after major porting or optimization work. | ||
|
|
||
| The root bundle intentionally skips QuickCheck-based unit tests because they generate different random inputs in separate backend processes. Cover those property-style paths with the focused `qc` and `size` workloads instead. | ||
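The reason QuickCheck paths are excluded can be illustrated with a seeded generator: when each backend process derives its input from the same fixed seed, both processes see identical bytes. The following is a minimal sketch only. `splitmix64` and `deterministic_input` are hypothetical helpers chosen for illustration, not the generator that `perf-probe` actually uses.

```rust
/// SplitMix64 step: a tiny deterministic generator, used here only so that
/// every process derives the same bytes from the same seed.
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Hypothetical helper: fill a buffer of `len` bytes from a fixed seed.
fn deterministic_input(seed: u64, len: usize) -> Vec<u8> {
    let mut state = seed;
    let mut out = Vec::with_capacity(len);
    while out.len() < len {
        let word = splitmix64(&mut state).to_le_bytes();
        // Take only as many bytes as are still missing from the buffer.
        let take = (len - out.len()).min(word.len());
        out.extend_from_slice(&word[..take]);
    }
    out
}

fn main() {
    // Two independent calls (standing in for two backend processes)
    // produce byte-identical input because the seed is fixed.
    let a = deterministic_input(42, 1 << 20);
    let b = deterministic_input(42, 1 << 20);
    assert_eq!(a, b);
    println!("generated {} identical bytes twice", a.len());
}
```

With a fixed seed, any timing difference between backends comes from the backend itself and not from the input.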
|
|
||
| ## 3. Compare focused workloads | ||
|
|
||
| Use `perf-probe`, a small standalone binary crate that links exactly one backend at a time. | ||
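One way a probe binary could enforce the one-backend rule at compile time is sketched below. It assumes that `perf-probe` selects its backend through the same `rust-backend` and `c-backend` feature names used by the root crate; that is an assumption, and `backend_name` is a hypothetical helper.

```rust
// Refuse to build if both backend features are enabled at once.
// Assumption: the probe crate reuses the root crate's feature names.
#[cfg(all(feature = "rust-backend", feature = "c-backend"))]
compile_error!("enable exactly one of `rust-backend` or `c-backend` for perf runs");

/// Hypothetical helper: name of the backend this binary was compiled against.
fn backend_name() -> &'static str {
    if cfg!(feature = "rust-backend") {
        "rust"
    } else if cfg!(feature = "c-backend") {
        "c"
    } else {
        // Neither feature set, e.g. when this sketch is compiled standalone.
        "none"
    }
}

fn main() {
    // Printing the backend name in every report makes mislabeled runs easy to spot.
    println!("backend: {}", backend_name());
}
```

A guard like this turns an accidental dual-backend build into a compile error instead of a silently wrong measurement.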
|
|
||
| Examples: | ||
|
|
||
| ```bash | ||
| scripts/compare_workloads.sh encode --input-kind random --size 1048576 --iters 20 --warmup 3 | ||
| scripts/compare_workloads.sh decode --input-kind random --size 1048576 --iters 50 --warmup 5 | ||
| scripts/compare_workloads.sh size --input-kind random --size 1048576 --iters 400 --warmup 40 | ||
| scripts/compare_workloads.sh crc64 --size 16777216 --iters 400 --warmup 20 | ||
| scripts/compare_workloads.sh crc64 --size 1048576 --chunk-size 16 --iters 400 --warmup 40 | ||
| ``` | ||
|
|
||
| This writes workload-specific `hyperfine` reports under `target/perf-results/`. | ||
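The `--chunk-size 16` variant of the `crc64` workload makes sense because a CRC can be fed incrementally: many small updates yield the same checksum as one large call, so the small-chunk run mostly measures per-call overhead. Below is a minimal bitwise sketch of CRC-64/XZ for illustration. It is not the table-driven implementation in either backend, and `crc64_update` and `crc64_chunked` are hypothetical names.

```rust
/// Reflected CRC-64/XZ polynomial (ECMA-182), as used by the `.xz` format.
const POLY: u64 = 0xC96C_5795_D787_0F42;

/// Bitwise CRC-64 update. Pass 0 for the first call and the previous
/// return value for each later chunk.
fn crc64_update(data: &[u8], crc: u64) -> u64 {
    // Undo the final inversion of the previous call (or apply the initial one).
    let mut crc = !crc;
    for &byte in data {
        crc ^= u64::from(byte);
        for _ in 0..8 {
            crc = if crc & 1 != 0 { (crc >> 1) ^ POLY } else { crc >> 1 };
        }
    }
    !crc
}

/// Feed `data` in pieces of `chunk_size` bytes (must be non-zero).
fn crc64_chunked(data: &[u8], chunk_size: usize) -> u64 {
    data.chunks(chunk_size)
        .fold(0, |crc, chunk| crc64_update(chunk, crc))
}

fn main() {
    let data: Vec<u8> = (0..4096u32).map(|i| (i * 31 % 251) as u8).collect();
    let one_shot = crc64_update(&data, 0);
    let chunked = crc64_chunked(&data, 16);
    // Same checksum either way; only the number of calls differs.
    assert_eq!(one_shot, chunked);
    println!("{one_shot:016x}");
}
```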
|
|
||
| For tiny inputs, increase `--iters` until one benchmarked command takes at least around a | ||
| second. Otherwise process startup noise can dominate the comparison even when the actual | ||
| backend work is near parity. | ||
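The one-second guideline above amounts to a ceiling division. As a rough sketch, with `iters_for_target` being a hypothetical helper that is not part of the scripts, the iteration count can be estimated from a measured per-iteration time:

```rust
use std::time::Duration;

/// Hypothetical helper: smallest iteration count whose total run time
/// reaches `target`, given a measured time for one iteration.
fn iters_for_target(per_iter: Duration, target: Duration) -> u64 {
    // Guard against a zero measurement to avoid dividing by zero.
    let per = per_iter.as_nanos().max(1);
    let tgt = target.as_nanos();
    // Ceiling division, with a floor of one iteration.
    let iters = ((tgt + per - 1) / per).max(1);
    u64::try_from(iters).unwrap_or(u64::MAX)
}

fn main() {
    let target = Duration::from_secs(1);
    for micros in [50u64, 2_500, 40_000] {
        let per_iter = Duration::from_micros(micros);
        println!(
            "{:?} per iteration -> --iters {}",
            per_iter,
            iters_for_target(per_iter, target)
        );
    }
}
```

For example, a workload that takes about 2.5 ms per iteration needs roughly 400 iterations to fill one second.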
|
|
||
| There is still a criterion bench in [`benches/backend_comparison.rs`](../benches/backend_comparison.rs), but it now measures one backend per run. Use it only with exactly one backend feature enabled. | ||
|
|
||
| For high-level API regressions, compare the root crate against the upstream XZ corpus: | ||
|
|
||
| ```bash | ||
| scripts/compare_api_workloads.sh standard-files --mode all --iters 200 --warmup 20 | ||
| scripts/compare_api_workloads.sh standard-files --mode good --iters 400 --warmup 40 | ||
| scripts/compare_api_workloads.sh standard-files --mode good --name-pattern delta --iters 400 --warmup 40 | ||
| scripts/compare_api_workloads.sh qc --mode both --cases 128 --max-size 4096 --iters 200 --warmup 20 | ||
| scripts/compare_api_workloads.sh bufread-trailing --mode both --input-size 1024 --trailing-size 123 --iters 1000 --warmup 100 | ||
| ``` | ||
|
|
||
| This uses [`examples/standard_files_probe.rs`](../examples/standard_files_probe.rs), which mirrors the `tests/xz.rs` `standard_files` path and writes reports to: | ||
|
|
||
| - `target/perf-results/api-standard-files.json` | ||
| - `target/perf-results/api-standard-files.md` | ||
|
|
||
| The `qc` workload uses [`examples/qc_probe.rs`](../examples/qc_probe.rs) to reproduce the | ||
| small-input repeated round-trip pattern from the root crate tests. This is useful when | ||
| overall regressions show up in `root-tests` even though large encode/decode probes look | ||
| good. | ||
|
|
||
| The `bufread-trailing` workload uses | ||
| [`examples/bufread_trailing_probe.rs`](../examples/bufread_trailing_probe.rs) to reproduce | ||
| the `bufread::tests::compressed_and_trailing_data` path with enough in-process repetition | ||
| to reduce test process startup noise. | ||
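In-process repetition of this kind can be sketched as a warmup loop followed by a timed loop. This is an illustrative assumption about the general shape only, not the code of the probe examples, and `run_timed` is a hypothetical name.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical helper: run `workload` `warmup` times untimed, then
/// `iters` times timed, and return the total timed duration.
fn run_timed<F: FnMut() -> u64>(warmup: u32, iters: u32, mut workload: F) -> Duration {
    for _ in 0..warmup {
        // black_box keeps the optimizer from discarding the result.
        black_box(workload());
    }
    let start = Instant::now();
    for _ in 0..iters {
        black_box(workload());
    }
    start.elapsed()
}

fn main() {
    // Stand-in workload: sum a small buffer. A real probe would call the
    // backend under test here.
    let data = vec![0xA5u8; 1024];
    let iters = 1000;
    let total = run_timed(100, iters, || data.iter().map(|&b| u64::from(b)).sum());
    println!("mean per iteration: {:?}", total / iters);
}
```

Because process startup is paid once for all iterations, its share of the measured time shrinks as `iters` grows.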
|
|
||
| The `size` workload isolates the `uncompressed_size()` path from [`src/lib.rs`](../src/lib.rs), | ||
| which corresponds to the QuickCheck-based `tests::size` unit test but with deterministic input. | ||
|
|
||
| Use `--name-pattern <substring>` to isolate a file family inside the XZ corpus when a full-corpus comparison is too mixed to identify the remaining gap. | ||
|
|
||
| ## 4. Profile a focused workload | ||
|
|
||
| `perf-probe` is a profiler-friendly executable that runs a single workload many times with deterministic input. | ||
|
|
||
| Examples: | ||
|
|
||
| ```bash | ||
| scripts/profile_backend.sh rust decode --size 1048576 --iters 800 --warmup 80 | ||
| scripts/profile_backend.sh rust size --input-kind random --size 1048576 --iters 800 --warmup 80 | ||
| scripts/profile_backend.sh rust encode --input-kind random --size 8388608 --iters 150 --warmup 20 | ||
| scripts/profile_backend.sh c crc64 --size 16777216 --iters 400 | ||
| ``` | ||
|
|
||
| Useful flags passed through to `perf-probe`: | ||
|
|
||
| - `--workload encode|decode|size|crc32|crc64` | ||
| - `--input <path>` | ||
| - `--compressed-input <path>` | ||
| - `--save-output <path>` | ||
| - `--input-kind text|random` | ||
| - `--size <bytes>` | ||
| - `--chunk-size <bytes>` | ||
| - `--expected-size <bytes>` | ||
| - `--iters <n>` | ||
| - `--warmup <n>` | ||
| - `--preset <level>` | ||
|
|
||
| On macOS the script prefers `samply`; on Linux it falls back to `perf`; otherwise it runs the workload plainly with release debuginfo enabled. | ||
|
Document the actual …
Based on learnings: Use …
|
||
|
|
||
| ## 5. Inspect the generated code | ||
|
|
||
| After a profile points to a hot Rust function, inspect its optimized output: | ||
|
|
||
| ```bash | ||
| scripts/inspect_codegen.sh xz::lzma::lzma_encoder::lzma_encode --package xz | ||
| scripts/inspect_codegen.sh xz::check::crc64_fast::lzma_crc64 --package xz --format llvm | ||
| ``` | ||
|
|
||
| This uses `cargo-asm` and builds under `target/codegen` by default. | ||
|
|
||
| ## 6. Iterate | ||
|
|
||
| The expected loop is: | ||
|
|
||
| 1. Reproduce the gap with `scripts/compare_backends.sh`. | ||
| 2. Use `scripts/compare_workloads.sh` to isolate the subsystem. | ||
| 3. Capture a focused profile with `scripts/profile_backend.sh`. | ||
| 4. Pick the hottest Rust function from the profile. | ||
| 5. Inspect its assembly or LLVM IR with `scripts/inspect_codegen.sh`. | ||
| 6. Change the Rust port, then repeat the same commands. | ||
|
|
||
| Keep the input shape, iteration count, and profiler command stable while working a hotspot so before/after numbers stay comparable. | ||
Keep the decode fixture identical across backend builds.
Line 114 precomputes `compressed` with the backend under test, so cross-backend decode numbers can drift with encoder output shape instead of decoder cost. Use one canonical compressed payload for every backend build.
Based on learnings: Keep workload shape, iteration count, and profiler command stable while iterating on a hotspot during performance optimization.
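One possible guard against the fixture drift described in this comment is to fingerprint the compressed payload and compare the value printed by each backend build. The sketch below uses FNV-1a purely as a cheap, dependency-free fingerprint. `fixture_fingerprint` is a hypothetical helper, and the byte string in `main` is only a placeholder for a real canonical `.xz` fixture.

```rust
/// Hypothetical helper: 64-bit FNV-1a hash of the compressed fixture bytes.
/// It is not cryptographic; it only needs to expose accidental differences.
fn fixture_fingerprint(bytes: &[u8]) -> u64 {
    const OFFSET_BASIS: u64 = 0xCBF2_9CE4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01B3;
    bytes.iter().fold(OFFSET_BASIS, |hash, &b| {
        (hash ^ u64::from(b)).wrapping_mul(PRIME)
    })
}

fn main() {
    // Placeholder payload. In practice this would be one canonical
    // compressed file loaded from disk by every backend build.
    let compressed: &[u8] = b"placeholder compressed payload";
    let fingerprint = fixture_fingerprint(compressed);
    // Each backend build prints this value; the decode comparison is only
    // meaningful when the values match.
    println!("decode fixture fingerprint: {fingerprint:016x}");
}
```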