
perf(pm): fold preload + BFS into one streaming resolve pass#2931

Closed
elrrrrrrr wants to merge 4 commits into next from perf/resolve-fold-inline

Conversation

Contributor

@elrrrrrrr elrrrrrrr commented May 11, 2026

Summary

Replace ruborist's two-phase resolve (parallel preload then sequential BFS over warm cache) with a single streaming pass: FuturesUnordered drives registry fetches at config.concurrency, graph mutation runs inline on the main loop as each fetch lands.

Eliminates the ~500-1000ms warm-cache walk the BFS phase did on large workloads (ant-design ≈ 4.6k packages).

Before — two phases, sequential

   ┌──────────────────────────────────────┐
   │ run_preload_phase                    │
   │                                      │
   │   pending ──▶ FuturesUnordered       │   network parallel
   │                  │                   │
   │                  ▼                   │
   │            populate MemoryCache      │
   │                  │                   │
   │                  ▼                   │
   │      emit PackageResolved ───────────┼──▶ tgz pipeline (early)
   └──────────────────┬───────────────────┘
                      │
                      ▼  wait for all fetches
   ┌──────────────────────────────────────┐
   │ run_bfs_phase                        │
   │                                      │
   │   for level in BFS:                  │   sequential walk
   │     for node in level:               │
   │       for edge in unresolved:        │
   │         process_dependency().await   │   cache hit, just mutate
   │                                      │   ~500-1000ms wall, no I/O
   │         emit Resolved + PackagePlaced│
   └──────────────────────────────────────┘

The BFS phase is fully sequential per-edge. On a warm cache, every process_dependency().await is just a graph mutation (no network, no CPU). The await cost dominates the wall — pure overhead.

After — one streaming pass

   ┌──────────────────────────────────────────────┐
   │ run_bfs_phase (streaming fold)               │
   │                                              │
   │   pending ◀──┐                               │
   │      │       │ push new transitive           │
   │      ▼       │ edges                         │
   │   dispatch   │                               │
   │      │       │                               │
   │      ▼       │                               │
   │   FuturesUnordered (96-cap)                  │   network parallel
   │      │                                       │
   │      ▼  fetch lands                          │
   │   process_dependency(pre_resolved)           │   inline graph mutation
   │      │                                       │
   │      ├──▶ emit PackageResolved ──────────────┼──▶ tgz pipeline
   │      ├──▶ emit Resolved + PackagePlaced      │
   │      └──▶ collect new edges ─────────────────┘
   │                                              │
   │   (Reuse fast-path + non-registry edges      │
   │    short-circuit to inline handling          │
   │    without going through the futures pool)   │
   └──────────────────────────────────────────────┘

Same fetch-loop shape as the old preload, plus inline graph mutation when each fetch lands. The transitive walk that BFS used to do is now driven by the same loop — every Created node's new edges go straight back into pending.
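The fold's control flow can be sketched with a std-only worklist; this is a simplified stand-in, not ruborist's code: the real loop awaits fetches on a `FuturesUnordered` capped at `config.concurrency`, while here a synchronous `fetch_manifest` plays the "fetch lands" step and the names are illustrative.

```rust
use std::collections::VecDeque;

// Stand-in for the async registry fetch; the real code drives these through
// a FuturesUnordered at config.concurrency. Names and deps are illustrative.
fn fetch_manifest(name: &str) -> Vec<&'static str> {
    match name {
        "app" => vec!["debug", "ms"],
        "debug" => vec!["ms"],
        _ => vec![],
    }
}

// One loop: dispatch a fetch, and when it "lands", mutate the graph inline
// and push the newly discovered transitive edges back into `pending`.
fn streaming_resolve(root: &'static str) -> Vec<&'static str> {
    let mut pending = VecDeque::from([root]);
    let mut placed: Vec<&'static str> = Vec::new();
    while let Some(name) = pending.pop_front() {
        if placed.contains(&name) {
            continue; // Reuse fast path: slot already occupied
        }
        let deps = fetch_manifest(name); // dispatch + land (sync stand-in)
        placed.push(name);               // inline graph mutation
        pending.extend(deps);            // new edges feed the same loop
    }
    placed
}

fn main() {
    println!("{:?}", streaming_resolve("app")); // ["app", "debug", "ms"]
}
```

The key property the diagram shows survives the simplification: there is no second phase, because every Created node's edges re-enter the same queue that drives the fetches.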

What stays the same

  • Same BuildEvent stream: PreloadStart/Queued/Fetching/Progress, PackageResolved, Resolving/Resolved/Reused/Skipped, PackagePlaced — install pipeline (tarball downloads triggered by PackageResolved) and progress bar work unchanged.
  • Same correctness: non-registry edges (workspace/git/http/file) and Reuse hits short-circuit to inline handling, identical to before.
  • Same concurrency knob: manifests-concurrency-limit default unchanged.
  • No new modules / functions / public API: refactor is in-place.

What gets deleted

  • run_preload_phase (its work is now folded into the main streaming loop)
  • gather_preload_deps (only caller was run_preload_phase)
  • BuildDepsConfig::skip_preload field + with_skip_preload method (no preload phase to skip)
  • 2 unit tests that were probing graph state via gather_preload_deps

Net diff

crates/ruborist/src/resolver/builder.rs   +319 -308   (+11 net)
crates/ruborist/src/service/api.rs        +0   -9     (-9 net)

Test plan

  • cargo fmt --check clean (poolab)
  • cargo clippy --all-targets -- -D warnings --no-deps clean (poolab)
  • cargo build --profile release-local (14s on poolab)
  • cargo test -p utoo-pm --release: 261 passed, 3 ignored, 0 failed
  • cargo test -p utoo-ruborist --lib: 163 passed, 0 failed
  • pm-bench-phases on poolab: p0/p1/p3/p4 vs origin/next baseline (in flight)
  • pm-bench-phases on GHA: p0/p1/p3/p4 vs origin/next baseline (CI label triggered)
  • e2e: ./e2e/utoo-pm.sh

🤖 Generated with Claude Code

Replace the two-phase resolve (parallel preload then sequential BFS over
warm cache) with a single streaming pass: `FuturesUnordered` drives
registry fetches at `config.concurrency`, graph mutation runs inline on
the main loop as each fetch lands. Eliminates the ~500-1000ms warm-cache
walk on large workloads (ant-design ≈4.6k packages).

The same `BuildEvent` stream is emitted (PreloadStart/Queued/Fetching/
Progress, PackageResolved, Resolving/Resolved/Reused/Skipped,
PackagePlaced) so the install pipeline (tarball downloads triggered by
PackageResolved) and progress bar work unchanged. Non-registry edges
(workspace/git/http/file) and Reuse hits short-circuit to inline
handling — they don't benefit from the network parallelism.

Drops `BuildDepsConfig::skip_preload` field + `with_skip_preload` method
+ `gather_preload_deps` helper — all dead since there's no preload phase
to skip or feed.

Net diff: builder.rs +11 lines, api.rs -9 lines. No new modules,
functions, or public API surface change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@elrrrrrrr elrrrrrrr added the benchmark Run pm-bench on PR label May 11, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request refactors the dependency resolution process from a two-phase preload and build approach into a single streaming pass using FuturesUnordered. This change interleaves registry manifest fetches with graph mutations to improve efficiency. Feedback highlights an inaccurate comment regarding the parallelization of non-registry dependencies, a potential for redundant network fetches due to missing request coalescing, and a regression in progress reporting where a counter was hardcoded to zero.

Comment on lines +691 to +692
/// lands. Non-registry edges (workspace/git/http/file) and Reuse hits
/// run inline since they don't benefit from network parallelism.
Contributor


medium

This comment is inaccurate. Non-registry dependencies (such as Git or HTTP tarballs) definitely benefit from network parallelism. They are run inline here primarily because process_dependency requires mutable access to the graph, which prevents concurrent execution in the current architecture, rather than a lack of potential performance gain from parallelism.

count: pending.len(),
});

loop {
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The current implementation may perform redundant network fetches for the same package spec if multiple dependencies on that spec are dispatched before the first one completes and creates a graph node. The previous implementation used a HashSet to track processed specs to avoid this. While the RegistryClient might have its own caching, adding a tracking mechanism here (or ensuring the client handles request coalescing) would prevent unnecessary concurrent requests for the same manifest.
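The tracking the reviewer suggests amounts to an in-flight set checked before dispatch; a minimal sketch, assuming a `(name, spec)` key (the key shape and `dispatch_unique` are illustrative, not ruborist's types):

```rust
use std::collections::HashSet;

// Dedup manifest fetches by (name, spec) before dispatch so that concurrent
// edges on the same spec share one request. Returns how many fetches would
// actually be dispatched.
fn dispatch_unique(edges: &[(&str, &str)]) -> usize {
    let mut in_flight: HashSet<(String, String)> = HashSet::new();
    let mut dispatched = 0;
    for (name, spec) in edges {
        // insert() returns true only the first time a key is seen
        if in_flight.insert((name.to_string(), spec.to_string())) {
            dispatched += 1; // first sight of this spec: dispatch the fetch
        }
        // duplicates fall through; they are satisfied once the first
        // fetch lands and creates the graph node
    }
    dispatched
}

fn main() {
    let edges = [("ms", "^2.1.3"), ("ms", "^2.1.3"), ("debug", "^4.0.0")];
    println!("{}", dispatch_unique(&edges)); // 2 fetches for 3 edges
}
```

Whether this belongs in the loop or in the registry client (as request coalescing) is the design question the comment raises; a later commit in this PR relies on the client-side option.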

receiver.on_event(BuildEvent::PreloadProgress {
name: &edge.name,
version: &resolved.version,
current: 0,
Contributor


medium

The current field in BuildEvent::PreloadProgress is hardcoded to 0. This is a regression that will break progress bar updates in the CLI, as it no longer tracks the number of resolved packages. You should maintain a counter (e.g., resolved_count) within run_bfs_phase and increment it for each successful resolution to provide accurate progress data.
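The fix the reviewer describes is a running counter emitted as `current`; a minimal sketch, where the struct merely mimics the `PreloadProgress` fields and is not ruborist's event type:

```rust
// Illustrative stand-in for the progress event; only the two fields the
// review mentions are modeled.
struct PreloadProgress {
    current: usize, // packages resolved so far (was hardcoded to 0)
    count: usize,   // total pending
}

// Bump the counter on each successful resolution and report it,
// instead of emitting a constant 0.
fn emit_progress(resolved_count: &mut usize, total: usize) -> PreloadProgress {
    *resolved_count += 1;
    PreloadProgress { current: *resolved_count, count: total }
}

fn main() {
    let mut resolved = 0;
    emit_progress(&mut resolved, 4600);
    let p = emit_progress(&mut resolved, 4600);
    println!("{}/{}", p.current, p.count); // 2/4600
}
```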

@elrrrrrrr elrrrrrrr marked this pull request as draft May 11, 2026 10:08
elrrrrrrr and others added 2 commits May 11, 2026 23:19
…ntity

`find_in_parent_chain` previously matched slots by `(slot_name,
version)` only. For an `npm:` alias edge `"a": "npm:b@1.0.0"` and a
real edge `"a": "^1.0.0"`, the underlying packages differ (`b` vs `a`)
but the slot name + version coincide — the old code treated them as
interchangeable. The two-phase resolve hid this because BFS level
ordering always placed root-level alias edges before any transitive,
so the asymmetry never surfaced. Streaming fold has no such guarantee:
when real `ms@^2.1.3` (transitive of `debug`) lands before the root's
`"ms": "npm:raw-body@2.1.3"` fetch, the slot gets occupied by real
`ms`, and the alias edge then wrongly reuses it.

Mirror npm's `Node.matches` + `dep-valid.js` semantics:

- Regular range edges keep their version-only check (a slot occupied
  by alias-as-raw-body@2.1.3 still satisfies a transitive
  `ms@^2.1.3` — alias is a pure directory-name occupation).
- `npm:` alias edges additionally require `child.manifest.name()` to
  equal the alias target. When a slot is occupied by a different
  underlying package the alias is explicit and reclaims the slot via
  the new `FindResult::Replace`.

`process_dependency` handles `Replace` by mutating the slot node in
place: wipes its dep self-loops, swaps in the new manifest + version,
re-emits dep edges from the new manifest. NodeIndex is preserved so
existing in-edges stay structurally valid; for the common case where
alias and previous occupant share a version (e.g. `raw-body@2.1.3`
replacing `ms@2.1.3` when debug already resolved `ms@^2.1.3`),
consumers remain satisfied. Consumer-edge re-validation for the
version-divergent case is left as a TODO.
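The match rule above can be condensed into a std-only sketch. The types are illustrative stand-ins for ruborist's graph nodes: exact string equality replaces the real semver-range check, and the slot rewiring that `Replace` triggers in `process_dependency` is elided.

```rust
// Stand-in for an occupied slot in the parent chain.
struct Slot {
    version: &'static str,
    underlying_name: &'static str, // what the occupant's manifest actually names
}

#[derive(Debug, PartialEq)]
enum FindResult {
    Reuse,   // slot satisfies the edge as-is
    Replace, // explicit npm: alias reclaims the slot
    Miss,    // version does not match at all
}

// Regular range edges keep their version-only check; npm: alias edges
// additionally require the occupant's underlying package name to equal
// the alias target (mirroring npm's Node.matches packageName check).
fn find_match(slot: &Slot, wanted_version: &str, alias_target: Option<&str>) -> FindResult {
    if slot.version != wanted_version {
        return FindResult::Miss;
    }
    match alias_target {
        None => FindResult::Reuse, // plain range: alias occupants still satisfy it
        Some(target) if slot.underlying_name == target => FindResult::Reuse,
        Some(_) => FindResult::Replace,
    }
}

fn main() {
    // Slot occupied by real ms@2.1.3; alias edge "ms": "npm:raw-body@2.1.3"
    let slot = Slot { version: "2.1.3", underlying_name: "ms" };
    println!("{:?}", find_match(&slot, "2.1.3", Some("raw-body"))); // Replace
    println!("{:?}", find_match(&slot, "2.1.3", None));             // Reuse
}
```

This is the asymmetry the commit message describes: the old `(slot_name, version)` key would have returned Reuse in both calls.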

Fixes e2e case 11e (`top-level ms should be raw-body`) which the
streaming-fold PR reintroduced.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…loop

The streaming loop in `run_bfs_phase` did a Reuse pre-check before
dispatching each registry fetch so it could short-circuit on cache
hits. `process_dependency` then ran the same `find_compatible_node`
walk again after the fetch landed — meaning every cache-miss edge
paid two parent-chain walks (5000+ extra walks on ant-design's
~4600-package tree, each touching `normalize_spec` and a sibling
scan).

Drop the pre-check. `UnifiedRegistry`'s OnceMap dedups duplicate
(name, spec) fetches at the network layer, so the "wasted" dispatch
for an edge that will eventually Reuse just hits cache and returns
immediately. Net: one `find_compatible_node` per registry edge
instead of two.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

📊 pm-bench-phases · 54524af · linux (ubuntu-latest)

Workflow run — ant-design

PMs: utoo (this branch) · utoo-npm (latest published) · bun (latest)

npmjs.org

p0_full_cold

PM wall ±σ user sys RSS pgMinor
bun 13.23s 1.03s 8.15s 7.99s 684M 363.4K
utoo-next 10.86s 1.44s 8.41s 9.77s 944M 124.3K
utoo-npm 13.09s 2.47s 9.08s 10.46s 1.41G 171.7K
utoo 12.79s 0.75s 8.56s 9.89s 1.02G 121.7K
PM vCtx iCtx netRX netTX cache node_mod lock
bun 30.1K 14.8K 1.20G 8M 1.90G 1.78G 1M
utoo-next 163.6K 107.4K 1.17G 6M 1.73G 1.73G 2M
utoo-npm 183.4K 136.4K 1.17G 7M 1.73G 1.73G 2M
utoo 171.1K 106.8K 1.17G 6M 1.73G 1.71G 2M

p1_resolve

PM wall ±σ user sys RSS pgMinor
bun 2.71s 0.06s 2.94s 0.95s 480M 184.4K
utoo-next 3.59s 0.66s 4.26s 1.90s 615M 82.3K
utoo-npm 3.24s 0.13s 4.19s 1.88s 596M 82.1K
utoo 3.31s 0.05s 4.47s 1.99s 599M 85.3K
PM vCtx iCtx netRX netTX cache node_mod lock
bun 19.9K 2.0K 204M 4M 107M - 1M
utoo-next 93.0K 117.6K 201M 3M 7M 3M 2M
utoo-npm 92.0K 117.0K 201M 3M 7M 3M 2M
utoo 94.8K 137.8K 201M 3M 7M 3M 2M

p3_cold_install

PM wall ±σ user sys RSS pgMinor
bun 8.27s 0.24s 4.93s 7.52s 599M 209.0K
utoo-next 12.54s 3.19s 4.25s 8.46s 477M 62.1K
utoo-npm 8.81s 0.48s 4.55s 8.86s 821M 114.9K
utoo 11.67s 3.00s 4.21s 8.39s 476M 60.3K
PM vCtx iCtx netRX netTX cache node_mod lock
bun 14.0K 6.7K 1.00G 5M 1.78G 1.78G 1M
utoo-next 146.9K 46.1K 1001M 4M 1.72G 1.72G 2M
utoo-npm 145.4K 77.1K 1001M 4M 1.72G 1.72G 2M
utoo 137.7K 51.5K 1001M 4M 1.72G 1.72G 2M

p4_warm_link

PM wall ±σ user sys RSS pgMinor
bun 6.24s 0.92s 0.17s 1.95s 140M 34.1K
utoo-next 2.90s 0.36s 0.39s 2.88s 79M 18.3K
utoo-npm 3.59s 0.19s 0.41s 2.94s 83M 19.1K
utoo 4.09s 0.74s 0.40s 2.92s 79M 18.6K
PM vCtx iCtx netRX netTX cache node_mod lock
bun 373 26 5M 64K 1.93G 1.77G 1M
utoo-next 41.3K 18.0K 4K 9K 1.72G 1.72G 2M
utoo-npm 44.3K 18.0K 23K 36K 1.72G 1.72G 2M
utoo 42.3K 17.1K 5K 6K 1.73G 1.72G 2M

npmmirror.com: no output captured.

@elrrrrrrr
Contributor Author

Closing — streaming fold experiment failed exit criteria for the perf-validation loop.

Bench data (GHA Linux, same-run apples-to-apples)

Phase PR-D utoo-next Δ
p0_full_cold 17.36 ± 4.07s 13.20 ± 0.89s +31% ❌
p1_resolve 3.19 ± 0.19s 2.71 ± 0.10s +18% ❌
p3_cold_install 10.82 ± 4.23s 8.63 ± 1.51s +25% ❌
p4_warm_link 4.78 ± 0.55s 3.61 ± 0.36s +32% ❌

The 4-5× σ inflation relative to utoo-next is a structural signal, not noise. Even if absolute numbers vary with GHA network conditions, the relative regression is consistent.

Why streaming fold is structurally slower here

find_compatible_node is called twice per registry edge:

  1. Streaming loop pre-check (load-bearing for workspace dep resolution + Reuse fast path)
  2. Inside process_dependency after fetch lands

Removing the pre-check broke workspace dep resolution (verified e2e — workspaces-scoped-pkg failed with 404 because @ruyadorno/scoped-b is a workspace package, not a registry package). The pre-check isn't redundant, it's load-bearing.

The two-pass find_compatible_node per edge is ~5000 × 2 parent-chain walks for ant-design. Old preload + BFS only walks once per edge (inside process_dependency) and benefits from a cleaner separation: preload's fetch loop has no graph mutation overhead.

What's salvageable from this experiment

The npm-aware find_compatible_node fix (the Replace variant + child.manifest.name() underlying-identity check) is an architectural correctness improvement that's orthogonal to streaming fold — npm's Node.matches checks packageName, ours used to match by slot name + version only. Currently masked by BFS level ordering in origin/next; could be revisited as a defensive fix in a smaller PR.

E2e status

  • macOS aarch64 / macOS x86_64 / Windows: PASS (alias case 11e fix verified)
  • Ubuntu Linux: workspace dep test FAIL on intermediate commit, PASS on full PR-D

Follow-up

Per the autonomous loop's fallback path, this PR is closed. The streaming fold direction is set aside; if perf work resumes on the resolve hot path, the more promising direction is the channel-based architecture from #2929 with a focus on bounding code size.

@elrrrrrrr elrrrrrrr closed this May 11, 2026

Labels

benchmark Run pm-bench on PR
