perf(pm): fold preload + BFS into one streaming resolve pass #2931

elrrrrrrr wants to merge 4 commits into
Conversation
Replace the two-phase resolve (parallel preload then sequential BFS over warm cache) with a single streaming pass: `FuturesUnordered` drives registry fetches at `config.concurrency`; graph mutation runs inline on the main loop as each fetch lands. Eliminates the ~500-1000ms warm-cache walk on large workloads (ant-design ≈4.6k packages).

The same `BuildEvent` stream is emitted (`PreloadStart/Queued/Fetching/Progress`, `PackageResolved`, `Resolving/Resolved/Reused/Skipped`, `PackagePlaced`) so the install pipeline (tarball downloads triggered by `PackageResolved`) and progress bar work unchanged.

Non-registry edges (workspace/git/http/file) and Reuse hits short-circuit to inline handling — they don't benefit from the network parallelism.

Drops `BuildDepsConfig::skip_preload` field + `with_skip_preload` method + `gather_preload_deps` helper — all dead since there's no preload phase to skip or feed. Net diff: builder.rs +11 lines, api.rs -9 lines. No new modules, functions, or public API surface change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Code Review
This pull request refactors the dependency resolution process from a two-phase preload and build approach into a single streaming pass using `FuturesUnordered`. This change interleaves registry manifest fetches with graph mutations to improve efficiency. Feedback highlights an inaccurate comment regarding the parallelization of non-registry dependencies, a potential for redundant network fetches due to missing request coalescing, and a regression in progress reporting where a counter was hardcoded to zero.
> /// lands. Non-registry edges (workspace/git/http/file) and Reuse hits
> /// run inline since they don't benefit from network parallelism.
This comment is inaccurate. Non-registry dependencies (such as Git or HTTP tarballs) definitely benefit from network parallelism. They are run inline here primarily because `process_dependency` requires mutable access to the graph, which prevents concurrent execution in the current architecture, rather than a lack of potential performance gain from parallelism.
> count: pending.len(),
> });
>
> loop {
The current implementation may perform redundant network fetches for the same package spec if multiple dependencies on that spec are dispatched before the first one completes and creates a graph node. The previous implementation used a `HashSet` to track processed specs to avoid this. While the `RegistryClient` might have its own caching, adding a tracking mechanism here (or ensuring the client handles request coalescing) would prevent unnecessary concurrent requests for the same manifest.
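A minimal sketch of the suggested tracking mechanism; `should_dispatch` is an illustrative name, not ruborist API:

```rust
use std::collections::HashSet;

// Before dispatching a registry fetch, record the (name, spec) pair.
// A second edge on the same pair is deferred instead of triggering a
// duplicate in-flight request for the same manifest.
fn should_dispatch(in_flight: &mut HashSet<(String, String)>, name: &str, spec: &str) -> bool {
    // `insert` returns false when the pair was already present.
    in_flight.insert((name.to_string(), spec.to_string()))
}

fn main() {
    let mut in_flight = HashSet::new();
    assert!(should_dispatch(&mut in_flight, "ms", "^2.1.3")); // first edge: fetch
    assert!(!should_dispatch(&mut in_flight, "ms", "^2.1.3")); // duplicate: defer
}
```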
> receiver.on_event(BuildEvent::PreloadProgress {
>     name: &edge.name,
>     version: &resolved.version,
>     current: 0,
The `current` field in `BuildEvent::PreloadProgress` is hardcoded to 0. This is a regression that will break progress bar updates in the CLI, as it no longer tracks the number of resolved packages. You should maintain a counter (e.g., `resolved_count`) within `run_bfs_phase` and increment it for each successful resolution to provide accurate progress data.
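A minimal sketch of the suggested counter, with hypothetical names (`ProgressState` and `on_resolved` are illustrative, not ruborist API):

```rust
// Thread a resolved_count through the streaming loop's state and emit
// its value as `current` instead of the hardcoded 0.
struct ProgressState {
    resolved_count: usize,
}

impl ProgressState {
    // Called once per successful resolution; returns the value to report
    // in the progress event's `current` field.
    fn on_resolved(&mut self) -> usize {
        self.resolved_count += 1;
        self.resolved_count
    }
}

fn main() {
    let mut state = ProgressState { resolved_count: 0 };
    assert_eq!(state.on_resolved(), 1);
    assert_eq!(state.on_resolved(), 2);
}
```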
…ntity

`find_in_parent_chain` previously matched slots by `(slot_name, version)` only. For an `npm:` alias edge `"a": "npm:b@1.0.0"` and a real edge `"a": "^1.0.0"`, the underlying packages differ (`b` vs `a`) but the slot name + version coincide — the old code treated them as interchangeable.

The two-phase resolve hid this because BFS level ordering always placed root-level alias edges before any transitive, so the asymmetry never surfaced. Streaming fold has no such guarantee: when real `ms@^2.1.3` (transitive of `debug`) lands before the root's `"ms": "npm:raw-body@2.1.3"` fetch, the slot gets occupied by real `ms`, and the alias edge then wrongly reuses it.

Mirror npm's `Node.matches` + `dep-valid.js` semantics:

- Regular range edges keep their version-only check (a slot occupied by alias-as-raw-body@2.1.3 still satisfies a transitive `ms@^2.1.3` — alias is a pure directory-name occupation).
- `npm:` alias edges additionally require `child.manifest.name()` to equal the alias target. When a slot is occupied by a different underlying package the alias is explicit and reclaims the slot via the new `FindResult::Replace`.

`process_dependency` handles `Replace` by mutating the slot node in place: wipes its dep self-loops, swaps in the new manifest + version, re-emits dep edges from the new manifest. NodeIndex is preserved so existing in-edges stay structurally valid; for the common case where alias and previous occupant share a version (e.g. `raw-body@2.1.3` replacing `ms@2.1.3` when debug already resolved `ms@^2.1.3`), consumers remain satisfied. Consumer-edge re-validation for the version-divergent case is left as a TODO.

Fixes e2e case 11e (`top-level ms should be raw-body`) which the streaming-fold PR reintroduced.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
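The matching rule can be sketched roughly as follows — a simplified model where exact version equality stands in for real semver range satisfaction, and all names are illustrative rather than ruborist's actual signatures:

```rust
#[derive(Debug, PartialEq)]
enum FindResult {
    Reuse,   // slot satisfies the edge
    Replace, // npm: alias reclaims a slot held by a different package
    Miss,    // version mismatch; keep walking the parent chain
}

fn match_slot(
    slot_pkg_name: &str,        // occupant's manifest name
    slot_version: &str,
    alias_target: Option<&str>, // Some("raw-body") for "npm:raw-body@2.1.3"
    wanted_version: &str,
) -> FindResult {
    // Simplification: real code checks semver range satisfaction here.
    if slot_version != wanted_version {
        return FindResult::Miss;
    }
    match alias_target {
        // Alias edge whose target differs from the occupant's real
        // package: the alias is explicit, so it reclaims the slot.
        Some(target) if target != slot_pkg_name => FindResult::Replace,
        // Regular range edge (or alias matching the occupant):
        // version-only check suffices.
        _ => FindResult::Reuse,
    }
}

fn main() {
    // Transitive ms@2.1.3 against a slot holding real ms: reuse.
    assert_eq!(match_slot("ms", "2.1.3", None, "2.1.3"), FindResult::Reuse);
    // Root "ms": "npm:raw-body@2.1.3" against the same slot: reclaim.
    assert_eq!(match_slot("ms", "2.1.3", Some("raw-body"), "2.1.3"), FindResult::Replace);
}
```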
…loop

The streaming loop in `run_bfs_phase` did a Reuse pre-check before dispatching each registry fetch so it could short-circuit on cache hits. `process_dependency` then ran the same `find_compatible_node` walk again after the fetch landed — meaning every cache-miss edge paid two parent-chain walks (5000+ extra walks on ant-design's ~4600-package tree, each touching `normalize_spec` and a sibling scan).

Drop the pre-check. `UnifiedRegistry`'s OnceMap dedups duplicate (name, spec) fetches at the network layer, so the "wasted" dispatch for an edge that will eventually Reuse just hits cache and returns immediately. Net: one `find_compatible_node` per registry edge instead of two.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
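The OnceMap dedup this commit relies on can be modeled synchronously with std types — a simplified sketch, not `UnifiedRegistry`'s actual async implementation (all names here are illustrative):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

// Every caller for one (name, spec) key shares a OnceLock, so the fetch
// closure runs at most once; later callers read the cached manifest.
struct FetchMap {
    cells: Mutex<HashMap<(String, String), Arc<OnceLock<String>>>>,
}

impl FetchMap {
    fn new() -> Self {
        FetchMap { cells: Mutex::new(HashMap::new()) }
    }

    fn get_or_fetch(&self, name: &str, spec: &str, fetch: impl FnOnce() -> String) -> String {
        // Grab (or create) the shared cell, holding the map lock briefly.
        let cell = {
            let mut map = self.cells.lock().unwrap();
            map.entry((name.to_string(), spec.to_string()))
                .or_insert_with(|| Arc::new(OnceLock::new()))
                .clone()
        };
        // First caller runs `fetch`; everyone else gets the cached value.
        cell.get_or_init(fetch).clone()
    }
}

fn main() {
    let map = FetchMap::new();
    let mut calls = 0;
    let v1 = map.get_or_fetch("ms", "^2.1.3", || { calls += 1; "2.1.3".into() });
    let v2 = map.get_or_fetch("ms", "^2.1.3", || { calls += 1; "9.9.9".into() });
    assert_eq!(v1, "2.1.3");
    assert_eq!(v2, "2.1.3"); // second fetch never ran
    assert_eq!(calls, 1);
}
```

Under this scheme a "wasted" dispatch for an edge that will eventually Reuse costs only a map lookup, which is why dropping the pre-check is safe.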
📊 pm-bench-phases
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 13.23s | 1.03s | 8.15s | 7.99s | 684M | 363.4K |
| utoo-next | 10.86s | 1.44s | 8.41s | 9.77s | 944M | 124.3K |
| utoo-npm | 13.09s | 2.47s | 9.08s | 10.46s | 1.41G | 171.7K |
| utoo | 12.79s | 0.75s | 8.56s | 9.89s | 1.02G | 121.7K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 30.1K | 14.8K | 1.20G | 8M | 1.90G | 1.78G | 1M |
| utoo-next | 163.6K | 107.4K | 1.17G | 6M | 1.73G | 1.73G | 2M |
| utoo-npm | 183.4K | 136.4K | 1.17G | 7M | 1.73G | 1.73G | 2M |
| utoo | 171.1K | 106.8K | 1.17G | 6M | 1.73G | 1.71G | 2M |
p1_resolve
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 2.71s | 0.06s | 2.94s | 0.95s | 480M | 184.4K |
| utoo-next | 3.59s | 0.66s | 4.26s | 1.90s | 615M | 82.3K |
| utoo-npm | 3.24s | 0.13s | 4.19s | 1.88s | 596M | 82.1K |
| utoo | 3.31s | 0.05s | 4.47s | 1.99s | 599M | 85.3K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 19.9K | 2.0K | 204M | 4M | 107M | - | 1M |
| utoo-next | 93.0K | 117.6K | 201M | 3M | 7M | 3M | 2M |
| utoo-npm | 92.0K | 117.0K | 201M | 3M | 7M | 3M | 2M |
| utoo | 94.8K | 137.8K | 201M | 3M | 7M | 3M | 2M |
p3_cold_install
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 8.27s | 0.24s | 4.93s | 7.52s | 599M | 209.0K |
| utoo-next | 12.54s | 3.19s | 4.25s | 8.46s | 477M | 62.1K |
| utoo-npm | 8.81s | 0.48s | 4.55s | 8.86s | 821M | 114.9K |
| utoo | 11.67s | 3.00s | 4.21s | 8.39s | 476M | 60.3K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 14.0K | 6.7K | 1.00G | 5M | 1.78G | 1.78G | 1M |
| utoo-next | 146.9K | 46.1K | 1001M | 4M | 1.72G | 1.72G | 2M |
| utoo-npm | 145.4K | 77.1K | 1001M | 4M | 1.72G | 1.72G | 2M |
| utoo | 137.7K | 51.5K | 1001M | 4M | 1.72G | 1.72G | 2M |
p4_warm_link
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 6.24s | 0.92s | 0.17s | 1.95s | 140M | 34.1K |
| utoo-next | 2.90s | 0.36s | 0.39s | 2.88s | 79M | 18.3K |
| utoo-npm | 3.59s | 0.19s | 0.41s | 2.94s | 83M | 19.1K |
| utoo | 4.09s | 0.74s | 0.40s | 2.92s | 79M | 18.6K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 373 | 26 | 5M | 64K | 1.93G | 1.77G | 1M |
| utoo-next | 41.3K | 18.0K | 4K | 9K | 1.72G | 1.72G | 2M |
| utoo-npm | 44.3K | 18.0K | 23K | 36K | 1.72G | 1.72G | 2M |
| utoo | 42.3K | 17.1K | 5K | 6K | 1.73G | 1.72G | 2M |
npmmirror.com: no output captured.
…reaming loop"

This reverts commit 813c99f.
Closing — streaming fold experiment failed exit criteria for the perf-validation loop.

Bench data (GHA Linux, same-run apples-to-apples)

The 4-5× σ inflation of utoo-next is a structural signal — not noise. Even if absolute numbers vary with GHA network conditions, the relative regression is consistent.

Why streaming fold is structurally slower here
Removing the pre-check broke workspace dep resolution (verified e2e — …). The two-pass …

What's salvageable from this experiment

The npm-aware …

E2e status
Follow-up

Per the autonomous loop's fallback path, this PR is closed. The streaming fold direction is set aside; if perf work resumes on the resolve hot path, the more promising direction is the channel-based architecture from #2929 with a focus on bounding code size.
Summary
Replace ruborist's two-phase resolve (parallel preload then sequential BFS over warm cache) with a single streaming pass:
`FuturesUnordered` drives registry fetches at `config.concurrency`; graph mutation runs inline on the main loop as each fetch lands. Eliminates the ~500-1000ms warm-cache walk the BFS phase did on large workloads (ant-design ≈ 4.6k packages).
Before — two phases, sequential
The BFS phase is fully sequential per-edge. On a warm cache, every `process_dependency().await` is just a graph mutation (no network, no CPU). The await cost dominates the wall — pure overhead.

After — one streaming pass
Same fetch-loop shape as the old preload, plus inline graph mutation when each fetch lands. The transitive walk that BFS used to do is now driven by the same loop — every Created node's new edges go straight back into `pending`.

What stays the same
- `BuildEvent` stream: `PreloadStart/Queued/Fetching/Progress`, `PackageResolved`, `Resolving/Resolved/Reused/Skipped`, `PackagePlaced` — install pipeline (tarball downloads triggered by `PackageResolved`) and progress bar work unchanged.
- `manifests-concurrency-limit` default unchanged.
- `run_preload_phase` (its work is now folded into the main streaming loop)
- `gather_preload_deps` (only caller was `run_preload_phase`)
- `BuildDepsConfig::skip_preload` field + `with_skip_preload` method (no preload phase to skip)

Net diff
Test plan
- `cargo fmt --check` clean (poolab)
- `cargo clippy --all-targets -- -D warnings --no-deps` clean (poolab)
- `cargo build --profile release-local` (14s on poolab)
- `cargo test -p utoo-pm --release`: 261 passed, 3 ignored, 0 failed
- `cargo test -p utoo-ruborist --lib`: 163 passed, 0 failed
- `./e2e/utoo-pm.sh`

🤖 Generated with Claude Code