
perf(pm): fold preload + BFS into one streaming resolve pass#2931

Closed
elrrrrrrr wants to merge 4 commits into next from perf/resolve-fold-inline

Conversation

Contributor

@elrrrrrrr elrrrrrrr commented May 11, 2026

Summary

Replace ruborist's two-phase resolve (parallel preload then sequential BFS over warm cache) with a single streaming pass: FuturesUnordered drives registry fetches at config.concurrency, graph mutation runs inline on the main loop as each fetch lands.

Eliminates the ~500-1000ms warm-cache walk the BFS phase did on large workloads (ant-design ≈ 4.6k packages).

Before — two phases, sequential

   ┌──────────────────────────────────────┐
   │ run_preload_phase                    │
   │                                      │
   │   pending ──▶ FuturesUnordered       │   network parallel
   │                  │                   │
   │                  ▼                   │
   │            populate MemoryCache      │
   │                  │                   │
   │                  ▼                   │
   │      emit PackageResolved ───────────┼──▶ tgz pipeline (early)
   └──────────────────┬───────────────────┘
                      │
                      ▼  wait for all fetches
   ┌──────────────────────────────────────┐
   │ run_bfs_phase                        │
   │                                      │
   │   for level in BFS:                  │   sequential walk
   │     for node in level:               │
   │       for edge in unresolved:        │
   │         process_dependency().await   │   cache hit, just mutate
   │                                      │   ~500-1000ms wall, no I/O
   │         emit Resolved + PackagePlaced│
   └──────────────────────────────────────┘

The BFS phase is fully sequential per-edge. On a warm cache, every process_dependency().await is just a graph mutation (no network, no CPU). The await cost dominates the wall — pure overhead.

After — one streaming pass

   ┌──────────────────────────────────────────────┐
   │ run_bfs_phase (streaming fold)               │
   │                                              │
   │   pending ◀──┐                               │
   │      │       │ push new transitive           │
   │      ▼       │ edges                         │
   │   dispatch   │                               │
   │      │       │                               │
   │      ▼       │                               │
   │   FuturesUnordered (96-cap)                  │   network parallel
   │      │                                       │
   │      ▼  fetch lands                          │
   │   process_dependency(pre_resolved)           │   inline graph mutation
   │      │                                       │
   │      ├──▶ emit PackageResolved ──────────────┼──▶ tgz pipeline
   │      ├──▶ emit Resolved + PackagePlaced      │
   │      └──▶ collect new edges ─────────────────┘
   │                                              │
   │   (Reuse fast-path + non-registry edges      │
   │    short-circuit to inline handling          │
   │    without going through the futures pool)   │
   └──────────────────────────────────────────────┘

Same fetch-loop shape as the old preload, plus inline graph mutation when each fetch lands. The transitive walk that BFS used to do is now driven by the same loop — every Created node's new edges go straight back into pending.
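The fold's control flow can be sketched with a std-only worklist; this is a simplified stand-in, not ruborist's code: the real loop awaits fetches on a `FuturesUnordered` capped at `config.concurrency`, while here a synchronous `fetch_manifest` plays the "fetch lands" step and the names are illustrative.

```rust
use std::collections::VecDeque;

// Stand-in for the async registry fetch; the real code drives these through
// a FuturesUnordered at config.concurrency. Names and deps are illustrative.
fn fetch_manifest(name: &str) -> Vec<&'static str> {
    match name {
        "app" => vec!["debug", "ms"],
        "debug" => vec!["ms"],
        _ => vec![],
    }
}

// One loop: dispatch a fetch, and when it "lands", mutate the graph inline
// and push the newly discovered transitive edges back into `pending`.
fn streaming_resolve(root: &'static str) -> Vec<&'static str> {
    let mut pending = VecDeque::from([root]);
    let mut placed: Vec<&'static str> = Vec::new();
    while let Some(name) = pending.pop_front() {
        if placed.contains(&name) {
            continue; // Reuse fast path: slot already occupied
        }
        let deps = fetch_manifest(name); // dispatch + land (sync stand-in)
        placed.push(name);               // inline graph mutation
        pending.extend(deps);            // new edges feed the same loop
    }
    placed
}

fn main() {
    println!("{:?}", streaming_resolve("app")); // ["app", "debug", "ms"]
}
```

The key property the diagram shows survives the simplification: there is no second phase, because every Created node's edges re-enter the same queue that drives the fetches.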

What stays the same

  • Same BuildEvent stream: PreloadStart/Queued/Fetching/Progress, PackageResolved, Resolving/Resolved/Reused/Skipped, PackagePlaced — install pipeline (tarball downloads triggered by PackageResolved) and progress bar work unchanged.
  • Same correctness: non-registry edges (workspace/git/http/file) and Reuse hits short-circuit to inline handling, identical to before.
  • Same concurrency knob: manifests-concurrency-limit default unchanged.
  • No new modules / functions / public API: refactor is in-place.

What gets deleted

  • run_preload_phase (its work is now folded into the main streaming loop)
  • gather_preload_deps (only caller was run_preload_phase)
  • BuildDepsConfig::skip_preload field + with_skip_preload method (no preload phase to skip)
  • 2 unit tests that were probing graph state via gather_preload_deps

Net diff

crates/ruborist/src/resolver/builder.rs   +319 -308   (+11 net)
crates/ruborist/src/service/api.rs        +0   -9     (-9 net)

Test plan

  • cargo fmt --check clean (poolab)
  • cargo clippy --all-targets -- -D warnings --no-deps clean (poolab)
  • cargo build --profile release-local (14s on poolab)
  • cargo test -p utoo-pm --release: 261 passed, 3 ignored, 0 failed
  • cargo test -p utoo-ruborist --lib: 163 passed, 0 failed
  • pm-bench-phases on poolab: p0/p1/p3/p4 vs origin/next baseline (in flight)
  • pm-bench-phases on GHA: p0/p1/p3/p4 vs origin/next baseline (CI label triggered)
  • e2e: ./e2e/utoo-pm.sh

🤖 Generated with Claude Code

Replace the two-phase resolve (parallel preload then sequential BFS over
warm cache) with a single streaming pass: `FuturesUnordered` drives
registry fetches at `config.concurrency`, graph mutation runs inline on
the main loop as each fetch lands. Eliminates the ~500-1000ms warm-cache
walk on large workloads (ant-design ≈4.6k packages).

The same `BuildEvent` stream is emitted (PreloadStart/Queued/Fetching/
Progress, PackageResolved, Resolving/Resolved/Reused/Skipped,
PackagePlaced) so the install pipeline (tarball downloads triggered by
PackageResolved) and progress bar work unchanged. Non-registry edges
(workspace/git/http/file) and Reuse hits short-circuit to inline
handling — they don't benefit from the network parallelism.

Drops `BuildDepsConfig::skip_preload` field + `with_skip_preload` method
+ `gather_preload_deps` helper — all dead since there's no preload phase
to skip or feed.

Net diff: builder.rs +11 lines, api.rs -9 lines. No new modules,
functions, or public API surface change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@elrrrrrrr elrrrrrrr added the benchmark Run pm-bench on PR label May 11, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request refactors the dependency resolution process from a two-phase preload and build approach into a single streaming pass using FuturesUnordered. This change interleaves registry manifest fetches with graph mutations to improve efficiency. Feedback highlights an inaccurate comment regarding the parallelization of non-registry dependencies, a potential for redundant network fetches due to missing request coalescing, and a regression in progress reporting where a counter was hardcoded to zero.

Comment on lines +691 to +692
/// lands. Non-registry edges (workspace/git/http/file) and Reuse hits
/// run inline since they don't benefit from network parallelism.
Contributor


medium

This comment is inaccurate. Non-registry dependencies (such as Git or HTTP tarballs) definitely benefit from network parallelism. They are run inline here primarily because process_dependency requires mutable access to the graph, which prevents concurrent execution in the current architecture, rather than a lack of potential performance gain from parallelism.

count: pending.len(),
});

loop {
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The current implementation may perform redundant network fetches for the same package spec if multiple dependencies on that spec are dispatched before the first one completes and creates a graph node. The previous implementation used a HashSet to track processed specs to avoid this. While the RegistryClient might have its own caching, adding a tracking mechanism here (or ensuring the client handles request coalescing) would prevent unnecessary concurrent requests for the same manifest.
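The tracking the reviewer suggests amounts to an in-flight set checked before dispatch; a minimal sketch, assuming a `(name, spec)` key (the key shape and `dispatch_unique` are illustrative, not ruborist's types):

```rust
use std::collections::HashSet;

// Dedup manifest fetches by (name, spec) before dispatch so that concurrent
// edges on the same spec share one request. Returns how many fetches would
// actually be dispatched.
fn dispatch_unique(edges: &[(&str, &str)]) -> usize {
    let mut in_flight: HashSet<(String, String)> = HashSet::new();
    let mut dispatched = 0;
    for (name, spec) in edges {
        // insert() returns true only the first time a key is seen
        if in_flight.insert((name.to_string(), spec.to_string())) {
            dispatched += 1; // first sight of this spec: dispatch the fetch
        }
        // duplicates fall through; they are satisfied once the first
        // fetch lands and creates the graph node
    }
    dispatched
}

fn main() {
    let edges = [("ms", "^2.1.3"), ("ms", "^2.1.3"), ("debug", "^4.0.0")];
    println!("{}", dispatch_unique(&edges)); // 2 fetches for 3 edges
}
```

Whether this belongs in the loop or in the registry client (as request coalescing) is the design question the comment raises; a later commit in this PR relies on the client-side option.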

receiver.on_event(BuildEvent::PreloadProgress {
name: &edge.name,
version: &resolved.version,
current: 0,
Contributor


medium

The current field in BuildEvent::PreloadProgress is hardcoded to 0. This is a regression that will break progress bar updates in the CLI, as it no longer tracks the number of resolved packages. You should maintain a counter (e.g., resolved_count) within run_bfs_phase and increment it for each successful resolution to provide accurate progress data.
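The fix the reviewer describes is a running counter emitted as `current`; a minimal sketch, where the struct merely mimics the `PreloadProgress` fields and is not ruborist's event type:

```rust
// Illustrative stand-in for the progress event; only the two fields the
// review mentions are modeled.
struct PreloadProgress {
    current: usize, // packages resolved so far (was hardcoded to 0)
    count: usize,   // total pending
}

// Bump the counter on each successful resolution and report it,
// instead of emitting a constant 0.
fn emit_progress(resolved_count: &mut usize, total: usize) -> PreloadProgress {
    *resolved_count += 1;
    PreloadProgress { current: *resolved_count, count: total }
}

fn main() {
    let mut resolved = 0;
    emit_progress(&mut resolved, 4600);
    let p = emit_progress(&mut resolved, 4600);
    println!("{}/{}", p.current, p.count); // 2/4600
}
```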

@elrrrrrrr elrrrrrrr marked this pull request as draft May 11, 2026 10:08
elrrrrrrr and others added 2 commits May 11, 2026 23:19
…ntity

`find_in_parent_chain` previously matched slots by `(slot_name,
version)` only. For an `npm:` alias edge `"a": "npm:b@1.0.0"` and a
real edge `"a": "^1.0.0"`, the underlying packages differ (`b` vs `a`)
but the slot name + version coincide — the old code treated them as
interchangeable. The two-phase resolve hid this because BFS level
ordering always placed root-level alias edges before any transitive,
so the asymmetry never surfaced. Streaming fold has no such guarantee:
when real `ms@^2.1.3` (transitive of `debug`) lands before the root's
`"ms": "npm:raw-body@2.1.3"` fetch, the slot gets occupied by real
`ms`, and the alias edge then wrongly reuses it.

Mirror npm's `Node.matches` + `dep-valid.js` semantics:

- Regular range edges keep their version-only check (a slot occupied
  by alias-as-raw-body@2.1.3 still satisfies a transitive
  `ms@^2.1.3` — alias is a pure directory-name occupation).
- `npm:` alias edges additionally require `child.manifest.name()` to
  equal the alias target. When a slot is occupied by a different
  underlying package the alias is explicit and reclaims the slot via
  the new `FindResult::Replace`.

`process_dependency` handles `Replace` by mutating the slot node in
place: wipes its dep self-loops, swaps in the new manifest + version,
re-emits dep edges from the new manifest. NodeIndex is preserved so
existing in-edges stay structurally valid; for the common case where
alias and previous occupant share a version (e.g. `raw-body@2.1.3`
replacing `ms@2.1.3` when debug already resolved `ms@^2.1.3`),
consumers remain satisfied. Consumer-edge re-validation for the
version-divergent case is left as a TODO.
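The match rule above can be condensed into a std-only sketch. The types are illustrative stand-ins for ruborist's graph nodes: exact string equality replaces the real semver-range check, and the slot rewiring that `Replace` triggers in `process_dependency` is elided.

```rust
// Stand-in for an occupied slot in the parent chain.
struct Slot {
    version: &'static str,
    underlying_name: &'static str, // what the occupant's manifest actually names
}

#[derive(Debug, PartialEq)]
enum FindResult {
    Reuse,   // slot satisfies the edge as-is
    Replace, // explicit npm: alias reclaims the slot
    Miss,    // version does not match at all
}

// Regular range edges keep their version-only check; npm: alias edges
// additionally require the occupant's underlying package name to equal
// the alias target (mirroring npm's Node.matches packageName check).
fn find_match(slot: &Slot, wanted_version: &str, alias_target: Option<&str>) -> FindResult {
    if slot.version != wanted_version {
        return FindResult::Miss;
    }
    match alias_target {
        None => FindResult::Reuse, // plain range: alias occupants still satisfy it
        Some(target) if slot.underlying_name == target => FindResult::Reuse,
        Some(_) => FindResult::Replace,
    }
}

fn main() {
    // Slot occupied by real ms@2.1.3; alias edge "ms": "npm:raw-body@2.1.3"
    let slot = Slot { version: "2.1.3", underlying_name: "ms" };
    println!("{:?}", find_match(&slot, "2.1.3", Some("raw-body"))); // Replace
    println!("{:?}", find_match(&slot, "2.1.3", None));             // Reuse
}
```

This is the asymmetry the commit message describes: the old `(slot_name, version)` key would have returned Reuse in both calls.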

Fixes e2e case 11e (`top-level ms should be raw-body`) which the
streaming-fold PR reintroduced.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…loop

The streaming loop in `run_bfs_phase` did a Reuse pre-check before
dispatching each registry fetch so it could short-circuit on cache
hits. `process_dependency` then ran the same `find_compatible_node`
walk again after the fetch landed — meaning every cache-miss edge
paid two parent-chain walks (5000+ extra walks on ant-design's
~4600-package tree, each touching `normalize_spec` and a sibling
scan).

Drop the pre-check. `UnifiedRegistry`'s OnceMap dedups duplicate
(name, spec) fetches at the network layer, so the "wasted" dispatch
for an edge that will eventually Reuse just hits cache and returns
immediately. Net: one `find_compatible_node` per registry edge
instead of two.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

📊 pm-bench-phases · 54524af · linux (ubuntu-latest)

Workflow run — ant-design

PMs: utoo (this branch) · utoo-npm (latest published) · bun (latest)

npmjs.org

p0_full_cold

PM wall ±σ user sys RSS pgMinor
bun 13.23s 1.03s 8.15s 7.99s 684M 363.4K
utoo-next 10.86s 1.44s 8.41s 9.77s 944M 124.3K
utoo-npm 13.09s 2.47s 9.08s 10.46s 1.41G 171.7K
utoo 12.79s 0.75s 8.56s 9.89s 1.02G 121.7K
PM vCtx iCtx netRX netTX cache node_mod lock
bun 30.1K 14.8K 1.20G 8M 1.90G 1.78G 1M
utoo-next 163.6K 107.4K 1.17G 6M 1.73G 1.73G 2M
utoo-npm 183.4K 136.4K 1.17G 7M 1.73G 1.73G 2M
utoo 171.1K 106.8K 1.17G 6M 1.73G 1.71G 2M

p1_resolve

PM wall ±σ user sys RSS pgMinor
bun 2.71s 0.06s 2.94s 0.95s 480M 184.4K
utoo-next 3.59s 0.66s 4.26s 1.90s 615M 82.3K
utoo-npm 3.24s 0.13s 4.19s 1.88s 596M 82.1K
utoo 3.31s 0.05s 4.47s 1.99s 599M 85.3K
PM vCtx iCtx netRX netTX cache node_mod lock
bun 19.9K 2.0K 204M 4M 107M - 1M
utoo-next 93.0K 117.6K 201M 3M 7M 3M 2M
utoo-npm 92.0K 117.0K 201M 3M 7M 3M 2M
utoo 94.8K 137.8K 201M 3M 7M 3M 2M

p3_cold_install

PM wall ±σ user sys RSS pgMinor
bun 8.27s 0.24s 4.93s 7.52s 599M 209.0K
utoo-next 12.54s 3.19s 4.25s 8.46s 477M 62.1K
utoo-npm 8.81s 0.48s 4.55s 8.86s 821M 114.9K
utoo 11.67s 3.00s 4.21s 8.39s 476M 60.3K
PM vCtx iCtx netRX netTX cache node_mod lock
bun 14.0K 6.7K 1.00G 5M 1.78G 1.78G 1M
utoo-next 146.9K 46.1K 1001M 4M 1.72G 1.72G 2M
utoo-npm 145.4K 77.1K 1001M 4M 1.72G 1.72G 2M
utoo 137.7K 51.5K 1001M 4M 1.72G 1.72G 2M

p4_warm_link

PM wall ±σ user sys RSS pgMinor
bun 6.24s 0.92s 0.17s 1.95s 140M 34.1K
utoo-next 2.90s 0.36s 0.39s 2.88s 79M 18.3K
utoo-npm 3.59s 0.19s 0.41s 2.94s 83M 19.1K
utoo 4.09s 0.74s 0.40s 2.92s 79M 18.6K
PM vCtx iCtx netRX netTX cache node_mod lock
bun 373 26 5M 64K 1.93G 1.77G 1M
utoo-next 41.3K 18.0K 4K 9K 1.72G 1.72G 2M
utoo-npm 44.3K 18.0K 23K 36K 1.72G 1.72G 2M
utoo 42.3K 17.1K 5K 6K 1.73G 1.72G 2M

npmmirror.com: no output captured.

@elrrrrrrr
Contributor Author

Closing — streaming fold experiment failed exit criteria for the perf-validation loop.

Bench data (GHA Linux, same-run apples-to-apples)

Phase PR-D utoo-next Δ
p0_full_cold 17.36 ± 4.07s 13.20 ± 0.89s +31% ❌
p1_resolve 3.19 ± 0.19s 2.71 ± 0.10s +18% ❌
p3_cold_install 10.82 ± 4.23s 8.63 ± 1.51s +25% ❌
p4_warm_link 4.78 ± 0.55s 3.61 ± 0.36s +32% ❌

The 4-5× σ inflation relative to utoo-next is a structural signal, not noise. Even if absolute numbers vary with GHA network conditions, the relative regression is consistent.

Why streaming fold is structurally slower here

find_compatible_node is called twice per registry edge:

  1. Streaming loop pre-check (load-bearing for workspace dep resolution + Reuse fast path)
  2. Inside process_dependency after fetch lands

Removing the pre-check broke workspace dep resolution (verified e2e — workspaces-scoped-pkg failed with 404 because @ruyadorno/scoped-b is a workspace package, not a registry package). The pre-check isn't redundant, it's load-bearing.

The two-pass find_compatible_node per edge is ~5000 × 2 parent-chain walks for ant-design. Old preload + BFS only walks once per edge (inside process_dependency) and benefits from a cleaner separation: preload's fetch loop has no graph mutation overhead.

What's salvageable from this experiment

The npm-aware find_compatible_node fix (the Replace variant + child.manifest.name() underlying-identity check) is an architectural correctness improvement that's orthogonal to streaming fold — npm's Node.matches checks packageName, ours used to match by slot name + version only. Currently masked by BFS level ordering in origin/next; could be revisited as a defensive fix in a smaller PR.

E2e status

  • macOS aarch64 / macOS x86_64 / Windows: PASS (alias case 11e fix verified)
  • Ubuntu Linux: workspace dep test FAIL on intermediate commit, PASS on full PR-D

Follow-up

Per the autonomous loop's fallback path, this PR is closed. The streaming fold direction is set aside; if perf work resumes on the resolve hot path, the more promising direction is the channel-based architecture from #2929 with a focus on bounding code size.

@elrrrrrrr elrrrrrrr closed this May 11, 2026

Labels

benchmark Run pm-bench on PR
