perf(pm): two-phase sibling parse + 4× channel buffer (p1 attack)#2929

Draft
elrrrrrrr wants to merge 40 commits into
next from perf/p1-sibling-two-phase

Conversation

@elrrrrrrr
Contributor

Two micro-optimizations on the PR #2924 stack, targeting the 0.5s framework overhead vs manifest-bench. See the commit messages.

elrrrrrrr and others added 30 commits May 8, 2026 21:56
…down

p1_resolve has been ~0.9s behind bun on phases bench for the past
several PRs. Pcap on prior runs measured bun opening ~260 parallel
TCP streams against registry.npmjs.org for resolve, while utoo
opened ~70 (the 64 manifests-concurrency-limit cap was at saturation).

Adding fetch-breakdown timing in ruborist showed where p1's 22s
(local Mac) actually goes:

  fetch-timings: n=2730
    sum_request   = 1089s   (88% — TCP+TLS+HTTP RTT to first byte)
    sum_body      = 138s    (11% — body download)
    sum_parse     = 2s      (0.16% — simd_json on rayon)

The dominant cost is per-request RTT, not parsing or body transfer.
The lever is the cap on concurrent in-flight requests.

This commit:

1. Adds `crates/ruborist/src/util/timing.rs` — process-wide atomic
   accumulator that records per-fetch (request_us, body_us,
   parse_us, bytes) inside both `fetch_full_manifest` and
   `fetch_version_manifest`. Reset before each preload phase, dumped
   at INFO level after preload + bfs.

2. Bumps `manifests-concurrency-limit` default 64 → 256 to match
   bun's observed working point against npmjs.org.

CI bench will validate. Expected: p1 utoo wall drops toward bun's
range (~2.3s on GHA).
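The accumulator in point 1 can be sketched with std atomics alone. The struct, field, and function names below are illustrative stand-ins for `util/timing.rs`, not its actual API:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Process-wide accumulator in the spirit of `util/timing.rs`
/// (names here are illustrative, not the real API).
pub struct FetchTimings {
    count: AtomicU64,
    request_us: AtomicU64,
    body_us: AtomicU64,
    parse_us: AtomicU64,
    bytes: AtomicU64,
}

impl FetchTimings {
    const fn new() -> Self {
        Self {
            count: AtomicU64::new(0),
            request_us: AtomicU64::new(0),
            body_us: AtomicU64::new(0),
            parse_us: AtomicU64::new(0),
            bytes: AtomicU64::new(0),
        }
    }

    /// Record one fetch. Relaxed ordering suffices: totals are only
    /// read after the phase has quiesced.
    pub fn record(&self, request_us: u64, body_us: u64, parse_us: u64, bytes: u64) {
        self.count.fetch_add(1, Ordering::Relaxed);
        self.request_us.fetch_add(request_us, Ordering::Relaxed);
        self.body_us.fetch_add(body_us, Ordering::Relaxed);
        self.parse_us.fetch_add(parse_us, Ordering::Relaxed);
        self.bytes.fetch_add(bytes, Ordering::Relaxed);
    }

    /// Reset before each preload phase.
    pub fn reset(&self) {
        for c in [&self.count, &self.request_us, &self.body_us, &self.parse_us, &self.bytes] {
            c.store(0, Ordering::Relaxed);
        }
    }
}

pub static TIMINGS: FetchTimings = FetchTimings::new();

fn main() {
    TIMINGS.record(399_000, 50_000, 730, 115_000);
    TIMINGS.record(401_000, 48_000, 700, 98_000);
    println!(
        "fetch-timings: n={} sum_request_us={}",
        TIMINGS.count.load(Ordering::Relaxed),
        TIMINGS.request_us.load(Ordering::Relaxed)
    );
}
```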

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two changes, after the GHA bench on the previous commit (PR #2916,
run 25559625024) showed the concurrency=256 hypothesis was wrong in
GHA's environment.

Revert concurrency 256 → 64
---------------------------

The new fetch-timing instrumentation shipped in the previous commit
caught the surprise: GHA's pcap-vs-local profile is the *opposite*
of what local Mac measurements suggested.

  metric          local Mac    GHA Linux
  avg_request     399ms        70ms      ← network MUCH faster on GHA
  avg_body         50ms        20ms
  avg_parse       730µs        266ms     ← parse 365× SLOWER on GHA

Mechanism: `parse_json_off_runtime` dispatches to `rayon::spawn`,
and rayon's pool size is `num_cpus` (= 2 on GHA ubuntu-latest).
Bumping concurrency 64 → 256 queued 256 manifest parses behind 2
rayon workers — head-of-line blocking. avg_parse jumped from ~10ms
to 266ms wall, dragging p1 utoo wall from 3.10s up to 3.33s.

Restore manifest-bench
----------------------

Brought back `crates/manifest-bench` (originally landed in the
post-#2818 driver hunt, dropped in af714eb once #2818 graduated).
It's a single-binary HTTP-only fetch tool that strips out the
ruborist pipeline (no BFS, no dedup, no parse, no project cache,
no lockfile write) — fires `GET <registry>/<name>` in parallel
and reports the same diag shape as the new `p1-breakdown` lines.

Goal: separate the network ceiling from the resolver pipeline so
the next round of p1 experiments (parse offload, partial parse,
dedicated parse pool, etc.) can be evaluated against a stable
"pure network" baseline.

Knobs (unchanged from the original drop):
  --concurrency N    sweep without rebuilding utoo
  --reps N           run same workload back-to-back
  --single-version   use /<name>/latest (smaller bodies)
  --user-agent X     UA-fingerprint experiments
  --http1-only       H2 vs H1 toggle
  --accept X         override Accept header

Same TLS stack as ruborist (rustls + aws-lc-rs, native roots).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…inux

build-linux now also builds + uploads `manifest-bench` when a phases
bench is going to run (label or dispatch). bench-phases-linux
downloads the binary and runs it after the regular phase-isolated
benchmark.

Sweep mirrors the original (#2818-era) wire-in:

  concurrency: 32 / 64 / 96 / 128 / 192 / 256  (HTTP/1.1, full manifest)
  protocol:    H1 vs H2-negotiate  (cap=128)
  endpoint:    full vs `/<name>/latest`  (cap=128, smaller bodies)
  UA:          default vs `Bun/1.2.21`  (cap=128)

Output goes to /tmp/pm-bench-output/manifest-bench-npmjs.log and
ships in the existing pm-bench-logs-linux artifact — no PR comment
surface (the headline phases bench comment stays the same).

Why now: the new ruborist `p1-breakdown` instrumentation showed
sum_parse on GHA can dominate when concurrency is bumped (256:
sum_parse 728s vs sum_request 193s). To attribute the bun-vs-utoo
gap on p1_resolve we need a "pure HTTP" baseline that strips out
ruborist's parse / BFS / dedup / lockfile path. manifest-bench is
that baseline: same TLS stack as ruborist (rustls + aws-lc-rs,
native roots), no resolver pipeline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI fetch-breakdown on GHA (run 25562552058, conc=64) showed parse
queueing on rayon dominates the gap to manifest-bench's pure-HTTP
baseline:

  manifest-bench (pure HTTP, conc=64): 2.12s wall
  utoo p1 (full ruborist):             3.10s wall  ← +1.0s overhead
  ↑ sum_parse 95s vs sum_request 95s, parse 50% of work-time
  ↑ avg_parse 30ms wall vs ~5ms actual CPU — the 25ms extra is rayon
    queue wait

Mechanism: 64 concurrent tasks all dispatching parse to rayon's pool
(size = num_cpus = 2 on GHA). Queue depth grows to ~32 per worker.
Each parse waits 25ms+ in queue before running its 5ms of CPU work.

Round 1 fix: inline parse, drop the rayon hop. simd_json on a tokio
worker thread is fast (~5ms for 115KB JSON), and the tokio runtime's
cooperative budget naturally rebalances CPU across the 64 tasks.

Expected on next CI:
- avg_parse drops from 30ms wall → ~5-10ms wall (close to CPU-only)
- preload_wall drops from 5.4s → ~3.5-4s for cold runs
- p1 hyperfine wall drops from 3.10s → 2.3-2.5s, narrowing the gap
  to manifest-bench's 2.12s ceiling

If parse becomes the new bottleneck (CPU-bound), next round could
look at partial parse / lazy field access. If wall doesn't drop,
hypothesis is wrong and we look elsewhere (BFS, dedup, lockfile).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Round 1 (inline parse) reverted on data: GHA showed a +0.37s p1
regression because parse blocked tokio runtime workers, dropping
eff_parallel 42 → 35 even though per-fetch work-time fell. avg_request
rose from 35ms to 52ms — symptomatic of socket reads being delayed
by the parsing task on the same worker.

  metric           round 0 (rayon)  round 1 (inline)
  p1 wall          3.27s            3.64s   ⚠️ +0.37s
  avg_parse        30ms (queued)    300µs   ✓
  avg_request      35ms             52ms    ⚠️ +17ms (worker contention)
  eff_parallel     42               35      ⚠️

Round 2 attempts the third option: `tokio::task::spawn_blocking`.

  - rayon's pool was too small (num_cpus = 2 on GHA) — 64 concurrent
    parses queued behind 2 workers, parse wall 30ms.
  - inline parse held tokio worker hostage during simd_json call,
    starving in-flight socket reads.
  - tokio's blocking pool has a much larger default cap (512), so 64
    concurrent parses never queue. Unlike rayon there's no contention
    with the install path's parallel-write rayon usage. Unlike inline
    the tokio runtime workers stay free to drive network I/O.

Expected on next CI:
  - avg_parse drops to ~5-10ms wall (close to CPU floor, no queue)
  - avg_request stays ~35ms (workers free for I/O)
  - eff_parallel returns to ~50, possibly higher
  - p1 wall drops toward manifest-bench's 2.10s ceiling

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Round 2 moved parse_json_off_runtime off rayon (-0.11s p1). But
fetch-breakdown still showed avg_request 41ms vs round 0's 35ms,
hinting at a second source of rayon contention.

Found it: `extract_core_version_off_runtime` is also on
`rayon::spawn`. On npmjs.org's `!supports_semver` path EVERY fetch
resolves through `resolve_via_full_manifest`, which fetches the
full packument once per package name (deduped via inflight_full)
and then calls `extract_core_version_off_runtime` per (name, spec)
to materialize the chosen version into a `CoreVersionManifest`.

So per fetch we hit rayon TWICE — once for the JSON parse (round 2
moved to spawn_blocking), and once for `get_core_version` (still on
rayon). The second hop has the same head-of-line blocking signature
as the first: 64 concurrent resolves dispatching to a 2-thread
rayon pool.

Round 3: move extract_core_version_off_runtime to spawn_blocking
for the same reasons. The work is JSON lazy-reparse (`raw_json`
sub-tree decoding) — genuinely blocking, well-suited for tokio's
blocking pool.

Expected: utoo p1 wall drops further toward manifest-bench's 2.10s
ceiling. avg_request should fall back from 41ms → ~35ms (rayon
contention removed from the fetch task's await chain).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two changes for round 4 of p1 optimization:

1. Revert `extract_core_version_off_runtime` from spawn_blocking back
   to rayon::spawn (round 3). Within-run measurement showed +0.42s
   regression vs utoo-next (round 2 was +0.11s). Likely cause: this
   function is called per (name, spec), so multi-spec packages call
   it 2-5x per fetch. spawn_blocking's per-dispatch overhead exceeds
   rayon queue savings at this multiplier.

2. Add `serialize_us` and `cache_export_us` to the p1-breakdown line
   so we can attribute the remaining gap. Currently:

     manifest-bench wall:     2.10s   (pure HTTP ceiling)
     utoo p1 wall (round 2):  3.16s
     gap:                     1.06s

   We have:
     preload_wall  ≈ 2.7s   (logged)
     bfs_wall      ≈ 0.3s   (logged)
     serialize_us  ?
     cache_export_us ?      ← suspected: full manifest deep-clone
                              into ProjectCacheData for ~2730 entries

   Next round will have data to choose between attacking serialize,
   cache export, or the BFS loop body.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Round 4 measured serialize_us = 15ms and cache_export_us = 34ms — both
tiny — confirming the 1s gap from manifest-bench (utoo p1 = 3.16s vs
mb wall = 2.10s) is not in post-build code.

Per-fetch math also pointed at main-loop bookkeeping:

  manifest-bench: eff_parallel = 52 (sum_work 111s / wall 2.14s)
  utoo preload  : eff_parallel = 43 (sum_work 120s / wall 2.85s)

Same conc=64 cap, but utoo loses 9 effective slots — most likely
the main loop's serial bookkeeping (dedup hash insert, format!
key, extract_transitive_deps, queue push, 3-4 receiver events)
stalls the flow between futures.next() returning and the next
fetch dispatch.

This commit splits the main loop into two timed segments:

  preload_loop_dispatch_us: time spent in the `while in_flight <
                            concurrency` block — popping pending,
                            dedup check, futures.push.
  preload_loop_result_us:   time spent processing each completed
                            future — extract_transitive_deps,
                            pending.extend, on_manifest.

If dispatch+result sum approaches preload_wall, the main loop is
the bottleneck and we need to either (a) split processing onto a
dedicated task, or (b) use unbounded futures with a downstream
consumer. If they're small, the gap is elsewhere (per-task
overhead in resolve_package's inflight gates).
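The split can be sketched with std timers alone. The queue and the sleeps below are stand-ins for the real dispatch/result work; only the measurement shape matters:

```rust
use std::time::{Duration, Instant};

// Stand-in for one segment's bookkeeping; the sleep simulates work.
fn busy(us: u64) {
    std::thread::sleep(Duration::from_micros(us));
}

/// Drain a queue, timing the "dispatch" and "result" halves of each
/// iteration separately. Returns (dispatch_us, result_us, wall_us).
fn timed_drain(mut pending: Vec<u32>) -> (u128, u128, u128) {
    let (mut dispatch_us, mut result_us) = (0u128, 0u128);
    let wall = Instant::now();
    while let Some(item) = pending.pop() {
        // dispatch segment: popping pending, dedup check, futures.push.
        let t = Instant::now();
        busy(200);
        dispatch_us += t.elapsed().as_micros();

        // result segment: processing the completed future.
        let t = Instant::now();
        let _deps = item * 2; // stand-in for extract_transitive_deps
        busy(200);
        result_us += t.elapsed().as_micros();
    }
    (dispatch_us, result_us, wall.elapsed().as_micros())
}

fn main() {
    let (d, r, w) = timed_drain((0..8).collect());
    // If d + r approaches w, the loop body is the bottleneck.
    println!("dispatch_us={d} result_us={r} wall_us={w}");
}
```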

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Round 5 main-loop instrumentation showed the preload main loop
itself is fast (15-25ms total dispatch+result). The 0.8s gap from
manifest-bench's 2.10s wall lives INSIDE the spawned fetch tasks.

Per-fetch wall (warm runs):
  measured: avg_request 30ms + avg_body 6ms + avg_parse 2.5ms = ~38ms
  derived:  preload_wall 2.4s × eff_parallel(43) / 2730 = 38ms
  delta:    ~12ms unaccounted per task

That 12ms is `extract_core_version_off_runtime` queueing on rayon's
2-thread pool. extract is called per (name, spec) — for ant-design
that's ~3000+ calls. With pool=2 and 64 concurrent fetches each
dispatching extract, the queue depth grows; each task waits its
turn before extract returns.

Bump rayon pool to `max(num_cpus, 8)` for non-Windows. Sizing the
pool above the CPU count for short blocking JSON ops (parse + extract)
replaces FIFO queueing with parallel dispatch. Real CPU contention
is bounded by num_cpus (the kernel scheduler still gates), so the
extra pool threads just hold ready-to-run dispatches in parallel
rather than serialised in a queue.

Why not just spawn_blocking (round 3 attempt): tokio's blocking pool
defaults to 512 threads, but its per-dispatch overhead was higher
than rayon's even when queueing — round 3 regressed by 0.5s.

Expected: extract queue wait drops from ~12ms to ~1-2ms wall, p1
preload_wall narrows toward manifest-bench's 2.10s.
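The sizing rule itself is just a floor on the detected core count; a minimal sketch (the real change configures rayon's global pool, which is omitted here):

```rust
use std::thread::available_parallelism;

/// max(num_cpus, 8): on small runners (GHA = 2 cores) the floor of 8
/// lets short blocking JSON ops dispatch in parallel instead of
/// queueing FIFO behind num_cpus workers; the kernel scheduler still
/// bounds real CPU contention at num_cpus.
fn pool_size() -> usize {
    let cpus = available_parallelism().map(|n| n.get()).unwrap_or(1);
    cpus.max(8)
}

fn main() {
    println!("rayon pool size = {}", pool_size());
}
```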

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds `BuildDepsOptions::skip_preload` so callers without a pipeline
consumer (utoo deps / package-lock-only) can drop the up-front
preload phase entirely. BFS now batches prefetch per level across
the whole frontier, then runs the existing sequential
process_dependency walk against the warmed cache.

For install paths (Context::pipeline_deps_options), skip_preload
stays false so PackageResolved events still feed the
download/clone pipeline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds resolver::fast_preload, a manifest-bench-style flat
FuturesUnordered over service::manifest::fetch_full_manifest. It
warms MemoryCache (both full_manifests and version_manifests slots)
synchronously after each fetch, so the BFS phase is pure cache-hit:
no rayon hop on extract_core_version, no OnceMap gates, no
DiskManifestStore writes, no PackageResolved events.

Wired into service::api::build_deps: when the caller asks to skip
preload (Context::build_deps for `utoo deps`) and there's no warm
project cache, fast_preload runs ahead of build_deps_with_config.
Install paths still go through preload_manifests so the pipeline
keeps its early-start signal.

Also reverts the per-level prefetch I added in 394f6c9 — with
fast_preload pre-warming everything, BFS doesn't need its own
prefetch wave.
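fast_preload's flat loop shape can be sketched with std threads and a channel standing in for FuturesUnordered; `fake_fetch` is a stand-in for fetch_full_manifest and the Vec for the MemoryCache warm:

```rust
use std::collections::VecDeque;
use std::sync::mpsc;
use std::thread;

// Stand-in for fetch_full_manifest: returns (name, body_bytes).
fn fake_fetch(name: String) -> (String, usize) {
    let bytes = name.len() * 1000;
    (name, bytes)
}

/// Keep up to `cap` fetches in flight, refill 1-for-1 as results
/// land, and warm the "cache" synchronously after each completion.
fn run_preload(mut pending: VecDeque<String>, cap: usize) -> usize {
    let (tx, rx) = mpsc::channel();
    let mut in_flight = 0;
    let mut cache = Vec::new();
    loop {
        // dispatch until the cap is reached or pending is empty
        while in_flight < cap {
            let Some(name) = pending.pop_front() else { break };
            let tx = tx.clone();
            thread::spawn(move || tx.send(fake_fetch(name)).unwrap());
            in_flight += 1;
        }
        if in_flight == 0 {
            break; // nothing in flight and nothing pending: done
        }
        let (name, bytes) = rx.recv().unwrap();
        in_flight -= 1;
        cache.push((name, bytes)); // synchronous cache warm
    }
    cache.len()
}

fn main() {
    let names: VecDeque<String> = (0..10).map(|i| format!("pkg-{i}")).collect();
    println!("cached={}", run_preload(names, 3));
}
```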

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
v1 of fast_preload called settle_spec inline on the tokio worker —
each settle ran simd_json::to_borrowed_value over the full
manifest's raw bytes (5–10ms per spec) right on the runtime
thread. CI showed it starved sibling fetches: avg_request rose
+3ms, avg_parse jumped 5→11ms, p1_resolve regressed +1.0s vs the
preload+BFS baseline (4.0s vs 3.0s).

Fix: route every settle through extract_core_version_off_runtime
(the same rayon::spawn helper the BFS path uses), and merge fetch
and settle completions into a single FuturesUnordered so
backpressure on either side throttles the other. Sibling specs
that arrived during a fetch are now stashed by name (HashMap, not
linear scan), then dispatched as their own settle futures when
the fetch lands.
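The sibling-spec stash described above, in std-only form: specs arriving while a fetch for the same name is in flight are bucketed under the name (a HashMap lookup, not a linear scan) and drained once the fetch lands. Names and types are simplified stand-ins, not ruborist's actual API:

```rust
use std::collections::HashMap;

type Stash = HashMap<String, Vec<String>>;

/// A spec arrived while `name` is already being fetched: stash it.
fn stash_sibling(stash: &mut Stash, name: &str, spec: &str) {
    stash.entry(name.to_string()).or_default().push(spec.to_string());
}

/// The fetch for `name` landed: drain its bucket, dispatching each
/// stashed spec as its own settle job (returned here as a Vec).
fn drain_siblings(stash: &mut Stash, name: &str) -> Vec<String> {
    stash.remove(name).unwrap_or_default()
}

fn main() {
    let mut stash = Stash::new();
    stash_sibling(&mut stash, "lodash", "^4.0.0");
    stash_sibling(&mut stash, "lodash", "~4.17.0");
    let settles = drain_siblings(&mut stash, "lodash");
    println!("settles={}", settles.len());
}
```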

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Standalone manifest-bench HTTP-only sweep (npmjs, h1) shows wall
bottoming at concurrency=96 (1817ms) — earlier 256 regression was
caused by rayon-queued parses behind 2 workers, no longer relevant
since fetch parse is on spawn_blocking and settle is rayon-dispatched
off the runtime.

fast_preload's wave-shaped transitive walk currently runs at
eff_parallel ~35 against the 64 cap because pending refills lag
settles; raising the cap to 96 gives headroom for sustained
in-flight on the deep waves without crossing the npmjs per-IP
tail-latency cliff that conc 128+ trips.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… path

`UnifiedRegistry::resolve_version_manifest`'s first cache check
(service/registry.rs:347) keys on `(name, spec)` — the original spec
string the caller passed, e.g. `^4.0.0`. settle_future was only
populating `(name, resolved_version)` (e.g. `4.17.21`), so on every
BFS edge for `lodash@^4.0.0`-style specs the warm path missed and
fell into the OnceMap inflight gate + `resolve_via_full_manifest`
re-walk before recovering the manifest from the
`(name, resolved_version)` slot we'd already set.

Now settle writes both keys so BFS hits the early-return at
service/registry.rs:347 with no further dispatch. Saves ~1
OnceMap+resolve_target_version round-trip per unique (name, spec)
the BFS encounters (≈3000 calls on ant-design-x).
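The double-key write can be sketched with a plain HashMap standing in for the real cache store:

```rust
use std::collections::HashMap;

type Cache = HashMap<(String, String), String>;

/// Settle writes the manifest under BOTH the caller's original spec
/// key and the resolved-version key, so the BFS warm-path lookup on
/// (name, spec) hits without re-resolving.
fn settle_write(cache: &mut Cache, name: &str, spec: &str, resolved: &str, manifest: &str) {
    // (name, resolved_version) slot, as before the fix:
    cache.insert((name.to_string(), resolved.to_string()), manifest.to_string());
    // (name, original spec) slot, added by the fix:
    cache.insert((name.to_string(), spec.to_string()), manifest.to_string());
}

fn main() {
    let mut cache = Cache::new();
    settle_write(&mut cache, "lodash", "^4.0.0", "4.17.21", "manifest-bytes");
    // BFS's first check keys on the original spec string and now hits.
    let hit = cache.contains_key(&("lodash".to_string(), "^4.0.0".to_string()));
    println!("warm-path hit: {hit}");
}
```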

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous fast_preload (v2) dispatched primary settles to rayon as
separate FuturesUnordered futures. CI breakdown showed
eff_parallel ~44 against the conc=96 cap — the wave-shaped
transitive walk was held back by settle dispatch RTT: each fetch
landed → primary settle queued → settle popped → only then did
`pending` get transitive deps and fill the next dispatch wave.

v3 folds the primary settle into the fetch task itself via
`tokio::task::spawn_blocking`. The fetch task does the network
round-trip and the primary version-extract on the same blocking
pool slot, then returns with the resolved CoreVersionManifest
attached. Main loop pulls one Fetched event, immediately extends
`pending`, no second `next().await` to wait through the queue.

Sibling specs (rare; same name, different range) still go through
the rayon settle_future path so the primary path stays lean.

Carries primary_spec through FastEvent so the fused path can
populate both `(name, primary_spec)` and `(name, resolved_version)`
cache slots — preserves the 6455852 BFS fast-path win.

FetchOutcome enum replaces by-value FetchManifestResult to avoid a
full FullManifest clone (HashMap+Vec) per fetch event.
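The clone-avoidance can be sketched as an enum that either moves the fetched value or shares a cached handle. The variant names and the stand-in FullManifest are illustrative, not the actual ruborist types:

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Stand-in with just enough shape to show the point; the real
// FullManifest carries HashMaps and Vecs that are expensive to clone.
struct FullManifest {
    versions: HashMap<String, String>,
}

enum FetchOutcome {
    /// Fresh fetch: ownership moves into the event, no clone.
    Fetched(FullManifest),
    /// Cache hit: a shared handle, no deep copy.
    Cached(Arc<FullManifest>),
}

fn version_count(outcome: &FetchOutcome) -> usize {
    match outcome {
        FetchOutcome::Fetched(m) => m.versions.len(),
        FetchOutcome::Cached(m) => m.versions.len(),
    }
}

fn main() {
    let fresh = FullManifest {
        versions: HashMap::from([("4.17.21".to_string(), "meta".to_string())]),
    };
    let shared = Arc::new(FullManifest { versions: HashMap::new() });
    println!(
        "fetched={} cached={}",
        version_count(&FetchOutcome::Fetched(fresh)),
        version_count(&FetchOutcome::Cached(shared))
    );
}
```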

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…json

The fast_preload hot path was paying TWO simd_json passes per
manifest:
  1. fetch_full_manifest's parse_json_off_runtime did a typed
     simd_json::serde::from_slice<FullManifest> (envelope + IgnoredAny
     visitor on `versions` keys, ~3-5ms on a 100KB body).
  2. Primary settle re-parsed the same raw bytes with
     simd_json::to_borrowed_value (~5-10ms) to extract one version's
     subtree.

Both passes went through simd_json's Tape constructor — duplicated
work. CI showed avg_parse 5-7ms × 2700 fetches = 14-19s of CPU sum
on 2-core GHA, where the spawn_blocking pool's overlapping schedule
masked some of the cost but not all.

Adds `service::manifest::fetch_full_manifest_with_settle`: same HTTP
+ retry + ETag machinery as `fetch_full_manifest`, but the parse
step does ONE `to_borrowed_value` and extracts:
  * envelope (`name`, `dist-tags`, `versions` keys) into FullManifest
    manually (no typed serde), and
  * the resolved version's subtree as a typed CoreVersionManifest
    (serde-deserializing that single subtree via the borrowed value).

fast_preload's fetch task switches to this entry point — primary
settle is now a free byproduct of the fetch parse, not a separate
`to_borrowed_value` pass. Sibling specs (same name, different
range) still go through the rayon settle_future path.
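The one-pass idea can be shown with a toy body format in place of simd_json's borrowed-value tape: a single walk over the records yields both the envelope's version keys and the one settled version's record.

```rust
/// One pass over a toy `version=meta;version=meta` body: collect the
/// envelope's version keys AND settle the requested version in the
/// same walk, instead of parsing the bytes a second time.
fn parse_with_settle(body: &str, want: &str) -> (Vec<String>, Option<String>) {
    let mut keys = Vec::new();
    let mut settled = None;
    for record in body.split(';') {
        if let Some((version, meta)) = record.split_once('=') {
            if version == want {
                settled = Some(meta.to_string()); // settle: free byproduct
            }
            keys.push(version.to_string()); // envelope key
        }
    }
    (keys, settled)
}

fn main() {
    let (keys, settled) = parse_with_settle("4.17.20=a;4.17.21=b", "4.17.21");
    println!("keys={} settled={:?}", keys.len(), settled);
}
```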

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After 671ac98's combined-parse fetch path eliminated the
double simd_json pass, the spawn_blocking pool's contention
ceiling rose enough that bumping concurrency past 96 no longer
queues parses behind the 2 CPU cores. manifest-bench's most recent
good-network sweep on GHA showed conc=128 hitting 1500ms vs
conc=96 at 1566ms — small but real headroom for fast_preload's
late-wave saturation now that initial waves fill faster.

Risk: on slower-network runs (npmjs per-IP throttle), conc=128
widens p99. Earlier conc-sweep data was mixed — accepting that
variance for the average-case improvement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
542d7f1's conc=128 bench landed in a slow-network run (mb best
2010ms vs 1500ms in the prior good-network run; bun also bumped
to 2.14s vs 1.83s). Adjusted gap to mb best stayed flat (~700ms
either way), so conc=128 didn't beat 96 across runs.

Picking 96 as the conservative default: at-or-near best on every
GHA run we've measured, never the worst, and leaves headroom for
npmjs's per-IP throttling to absorb without compounding p99.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…preload)

Adds resolver::mb_resolve module + service::build_deps_mb entry point
as a parallel-track alternative to fast_preload, structured to
match manifest-bench's main-loop shape as closely as correctness
allows. Hypothesis under test: fast_preload's eff_parallel caps at
~50/96 because the FastEvent enum match + cache writes + sibling
deferred bookkeeping in the main loop competes with tokio runtime
workers for the 2 CPU cores on GHA, stalling socket I/O drive.

mb_fetch pushes ALL per-fetch work into the spawned future itself
(including cache writes), so the main loop is reduced to:

  while let Some(deps) = futs.next().await {
      pending.extend(deps);
      refill_to_cap(...);
  }

Sibling specs (multiple ranges on same package) are NOT deferred at
queue level — racing fetches for the same name both proceed. The
race converges naturally: first fetch to land populates
full_manifests, subsequent racers find the cache hit on entry and
short-circuit to a sibling-style settle. Wastes ~5-50 network
requests in real workloads but eliminates the HashMap probe + drain
overhead from the hot loop.

Wired in via UTOO_RESOLVE=mb env var:
- Context::build_deps (utoo deps) routes through build_deps_mb
- pipeline::resolve_with_pipeline (utoo install) also routes
  through it; pipeline workers still start but don't pipeline
  during fetch (mb_fetch emits no PackageResolved events) — install
  becomes phase-sequential, useful for resolve-phase A/B.

bench script enables UTOO_RESOLVE=mb so CI measures the new path
against existing baselines (utoo-next/utoo-npm/bun ignore the env
var). Comment the export line to A/B back against fast_preload.

Old fast_preload + UnifiedRegistry paths untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
v1/v2 ran parse work in spawn_blocking inside each fetch future,
which competed with tokio runtime workers for the 2 GHA cores. CI
showed eff_parallel capped at 47/96 vs manifest-bench standalone's
75/96 on the same box. Hypothesis: parse CPU starves socket drive.

v3 separates the two phases:

* PHASE 1 — `mb_style_pure_fetch` is a structural copy of
  `manifest-bench`'s main loop: future body does ONLY GET + body
  recv, refill 1-for-1 on completion. Zero per-future CPU work, so
  tokio runtime workers retain full CPU for socket drive.

* PHASE 2 — bulk rayon par_iter parse: for each body, parse
  `FullManifest` envelope via simd_json::to_borrowed_value, resolve
  every queued spec for this name against the just-parsed manifest,
  populate cache slots, collect transitive deps. Runs off the
  tokio runtime entirely (spawn_blocking → rayon par_iter).

Phases alternate until pending exhausted. Typical project: 3-5
iterations as the dep tree fans out wave by wave.

The point of the split is the `phase1_http_wall` trace — measured
in isolation from any parse work, it should match manifest-bench's
standalone wall (~1.5-2.0s for 2733 names @ conc=96). If it does,
the remaining gap to mb is concentrated in phase 2 work, which is
inherent to discovering transitive deps from a non-flat name list.

Tracing per iteration:
  p1-breakdown mb_fetch iter=N phase1_http_wall=Xms n=Y bytes=Z
  p1-breakdown mb_fetch iter=N phase2_parse_wall=Xms settles=Y new_transitives=Z
  p1-breakdown mb_fetch total_wall=Xms iters=Y
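The phase alternation can be sketched std-only; the dep graph below is a toy stand-in for real registry responses, and phase 1 here just drains the frontier where the real code does pure GET + body recv:

```rust
use std::collections::{HashMap, HashSet};

/// Alternate phases until the frontier is empty: phase 1 "fetches"
/// every pending name with zero per-future CPU work; phase 2 parses
/// the bodies in bulk and yields the next wave of transitives.
/// Returns (iterations, total names fetched).
fn two_phase(graph: &HashMap<&str, Vec<&str>>, seed: &str) -> (usize, usize) {
    let mut done: HashSet<String> = HashSet::new();
    let mut pending: Vec<String> = vec![seed.to_string()];
    let mut iters = 0;
    while !pending.is_empty() {
        iters += 1;
        // PHASE 1: pure fetch, drain pending and collect one "body" each.
        let bodies: Vec<String> = pending.drain(..).collect();
        // PHASE 2: bulk parse, discover transitives, dedup against done.
        for name in bodies {
            for dep in graph.get(name.as_str()).into_iter().flatten() {
                if !done.contains(*dep) && !pending.contains(&dep.to_string()) {
                    pending.push(dep.to_string());
                }
            }
            done.insert(name);
        }
    }
    (iters, done.len())
}

fn main() {
    let graph = HashMap::from([
        ("root", vec!["a", "b"]),
        ("a", vec!["c"]),
        ("b", vec!["c"]),
        ("c", vec![]),
    ]);
    let (iters, fetched) = two_phase(&graph, "root");
    println!("iters={iters} fetched={fetched}");
}
```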

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
v3 dropped the (name, spec) HashSet from v1/v2 thinking name-level
dedup via done_names was sufficient. It wasn't: sibling-settle's
extract_transitive can re-introduce specs we've already settled
(peer/optional dep cycles trivially trigger this), so the outer
while-loop never terminated.

CI 25589397823 hung on `Run phase-isolated benchmark · npmjs` for
~25 min before being cancelled — the bench's first utoo p1_resolve
hyperfine run got stuck in an infinite settle loop.

Fix: maintain `seen_specs: HashSet<(String, String)>` across all
iterations; filter both initial seed and every wave of new
transitives through it before extending pending_specs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New crate `crates/preload-bench/` is a fully-standalone bench that:
* Uses the SAME HTTP setup as `manifest-bench` (own reqwest::Client
  built per rep with aws-lc-rs TLS, pool_max_idle_per_host(256), no
  proxy, default DNS, no retry, h1_only).
* Discovers names by walking transitive deps from a package.json
  root — instead of consuming a flat name list like manifest-bench.
* Per-future does GET + body recv + spawn_blocking parse → returns
  transitive deps → main loop refills on completion.
* No dependency on ruborist or any utoo internals (own simd_json,
  own dedup, own everything).

The point: prove (or disprove) that a fully ruborist-independent
streaming preload can hit standalone manifest-bench's wall on the
same workload. ruborist's path runs at ~2.18s for ant-design's
~2700 names; manifest-bench standalone runs the same workload at
~1.6s. The gap could be in any number of things — DNS layer, retry,
pool config, parse-CPU contention, registry single-flight gates.
preload-bench eliminates all of those simultaneously so we can read
the wall directly.

Wired into bench-phases-linux: builds + uploads preload-bench
binary alongside manifest-bench, then runs a conc=64/96/128 sweep
against the same project after the standalone manifest-bench sweep.

bench script reverts UTOO_RESOLVE=mb so utoo runs default
fast_preload — gives a third datapoint (utoo wall on integrated
path) alongside manifest-bench (HTTP-only ceiling) and preload-bench
(streaming-with-walk ceiling).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…y path

Step 1 of staged service-layer ablation. Rewrites mb_resolve as a
fully self-contained streaming preload mirroring preload-bench's
loop shape verbatim, but living inside ruborist so it can populate
MemoryCache for the BFS phase.

Bypasses every other ruborist service layer:
  * service::http::get_client — own reqwest::Client built per call,
    no global LazyLock, no shared_resolver dns layer, no
    connect_timeout, pool_max_idle_per_host(256).
  * service::manifest::fetch_full_manifest_with_settle — own GET +
    body.bytes() + spawn_blocking(simd_json::to_borrowed_value),
    no RetryIf, no FETCH_TIMINGS.
  * service::registry::UnifiedRegistry — no OnceMap, no
    ManifestStore, no EventReceiver.

Only service::* touched is MemoryCache writes (DashMap inserts) so
BFS has data to read from.

PM is unaware: dispatch happens entirely inside
service::api::build_deps when skip_preload=true and no warm cache.
Removes the previous UTOO_RESOLVE=mb env-var gating from
pm::helper::ruborist_context::Context::build_deps and
pipeline::resolve_with_pipeline. Removes the now-unused
service::api::build_deps_mb sibling entry point.

Expected: utoo p1_resolve drops from ~2.67s toward preload-bench's
~2.57s (or better since ruborist fetches fewer names than
preload-bench). The remaining gap to mb's ~1.99s would isolate
incremental layer effects we add back next:
  - tokio runtime config / cooperative scheduling
  - reqwest::Client provider differences (TLS, DNS)
  - cache layer (DashMap vs DiskManifestStore reads on the cold path)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mb_fetch

Step 2 of staged service-layer ablation. Targets the two gaps
left after step 1:

1. mb_fetch (in ruborist): 2300ms / 2735 = 0.84 ms/name
   manifest-bench (standalone): 2010ms / 2735 = 0.72 ms/name
   ~290ms gap on same workload, same conc.

2. BFS phase: 305ms wall against a fully-warm MemoryCache.
   Origin unclear — could be graph mutations, repeated cache
   lookups via the inflight gate, or event dispatch.

Changes:

* TLS provider — adds rustls (aws-lc-rs) + rustls-native-certs to
  non-wasm-non-macos targets. mb_resolve's `build_mb_client` now
  uses `use_preconfigured_tls(aws_lc_rs)` matching
  preload-bench / manifest-bench exactly. The reqwest crate's
  `rustls-tls-native-roots` feature on Linux still bundles ring
  for service::http's global client; the two providers coexist.

* mb_fetch instrumentation — per-future `wall_us` (network +
  parse + cache writes) and `net_us` (network only) reported in
  the trace line as `eff_par_full`, `eff_par_net`, `avg_wall`,
  `avg_net`. Same shape as manifest-bench's `avg_conc` so we can
  compare directly.

* BFS instrumentation — splits run_bfs_phase wall into:
    - `collect_us`: collect_unresolved_edges sum
    - `resolve_us`: process_dependency .await sum
    - `event_us`: post-resolve event dispatch (Resolved /
      PackagePlaced / Reused / Skipped) sum
  Plus `levels` and `edges` counters. Trace line lets us
  attribute the 305ms.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Step 3 of staged service-layer ablation. Targets the 305ms BFS
phase observed against a fully-warm MemoryCache — 100% attributed
to process_dependency.await sum (graph mutations) per d9fb207's
new bfs instrumentation.
new bfs instrumentation.

Adds:
* `process_dependency_with_resolved` in builder.rs — sync variant
  of process_dependency for the registry-resolved case. Skips
  spec-routing (only Registry handled), skips resolve_registry_dep
  (resolved is the parameter), skips override re-resolve. Reuses
  existing helpers (find_compatible_node, create_package_node,
  add_edges_from, mark_dependency_resolved, update_node_type_from_edge).
* `mb_fetch_with_graph` in mb_resolve.rs — folded streaming preload
  + graph build. Each fetch result triggers inline
  process_dependency_with_resolved for every parent edge waiting
  on (name, spec). New nodes' edges feed back into pending /
  edge_targets, so the walk continues streaming-style.
  CPU work (graph mutations, ~305 ms total) overlaps with network
  IO (mb_fetch's wall ~2.4 s).

Wires `service::api::build_deps` to use mb_fetch_with_graph for
the lockfile-only path (skip_preload + cold cache). The
follow-up build_deps_with_config still runs to handle any
non-registry edges left unresolved (workspace / git / http /
file); on registry-only workloads it's near no-op.

Install path unchanged — pipeline_deps_options keeps preload +
PackageResolved early-start signal for tgz download.

Expected: utoo p1 wall drops from ~2.76 s toward mb_fetch wall +
serialize ≈ 2.4-2.5 s on good network. Tracing line:
  p1-breakdown mb_fetch_with_graph wall=Xms ok=N fetch=N
  settle=N sum_wall=Xms sum_net=Xms sum_graph=Xms avg_net=Xus
  eff_par_full=N.N eff_par_net=N.N unresolved_targets=N

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
c02bb15 had unresolved_targets=583 in trace — `enqueue_node_edges`
was unconditionally pushing (parent, edge_id) into edge_targets
without checking if the (name, spec) was already cached. When a
later transitive's edge referenced an already-fetched (name, spec),
no fetch result would land to drain that bucket — the parent edges
sat unresolved, potentially missing packages from the lockfile.

Fix: enqueue_node_edges now checks cache.get_version_manifest
first. Cache hit → process_dependency_with_resolved inline (with a
work_stack to recurse into newly-Created nodes' edges). Cache
miss → original behavior (stash in edge_targets, push to pending).

Side effect: more inline graph mutation work in the seed phase
(workspace + root edges that hit warm cache from previous specs in
the same root). Should reduce the number of fetch-result events
that need to do graph mutations downstream, since orphan edges no
longer accumulate.

Targets the correctness bug from c02bb15 trace; perf impact TBD.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
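The cache-hit path above (process inline, then walk newly-created nodes' edges with an explicit work_stack instead of recursion) can be sketched std-only; the types and names here are illustrative stand-ins for the ruborist structures, not the actual code:

```rust
use std::collections::{HashMap, HashSet};

/// Illustrative sketch (not the ruborist API): drain a node's edges with an
/// explicit work stack so cache hits are processed inline, without recursion.
/// `cache` maps (name, spec) -> that version's own deps; misses go to `pending`
/// for the fetch loop, exactly as the commit's miss path describes.
pub fn enqueue_node_edges(
    root: Vec<(String, String)>,
    cache: &HashMap<(String, String), Vec<(String, String)>>,
    pending: &mut Vec<(String, String)>,
    seen: &mut HashSet<(String, String)>,
) -> usize {
    let mut work_stack = root;
    let mut processed = 0;
    while let Some(key) = work_stack.pop() {
        if !seen.insert(key.clone()) {
            continue; // already handled this (name, spec)
        }
        match cache.get(&key) {
            // Cache hit: process inline and recurse into the new node's
            // edges via the stack instead of waiting for a fetch result.
            Some(children) => {
                processed += 1;
                work_stack.extend(children.iter().cloned());
            }
            // Cache miss: original behavior — stash for the fetch loop.
            None => pending.push(key),
        }
    }
    processed
}
```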
The 700ms gap between utoo p1 (folded mb_fetch_with_graph) and
manifest-bench standalone needs network-layer evidence. Same
workload, same conc, same network; so why does utoo's wall trail by
700ms when per-fetch latency is matched (avg_net=53us vs mb
p50≈40us)?

Hypotheses to test via pcap diff:
* Fewer concurrent TCP streams in flight at any moment (utoo's
  main loop CPU steals tokio dispatch capacity → in-flight count
  drops below conc cap)
* More TLS handshakes (utoo's connection pool isn't reusing as
  effectively as mb's per-rep fresh client)
* Larger inter-packet gaps per stream (utoo's runtime pauses mid
  download)
* Different concurrent-stream-time profile (wave shape)

Adds two captures at end of pm-bench-pcap.sh:
  manifest-bench-c96 — flat lockfile-derived names @ conc=96
  preload-bench-c96  — transitive walk @ conc=96 (matches utoo's
                       walk shape, but no graph build)

Each captured with the same tcpdump + iostat as the existing
utoo / utoo-next / bun captures. analyze_pcap globs *.pcap so the
new files get the same TCP signal extraction (zwin / retx /
dup_ack / per-stream gap p50/p99/max / distinct streams).

Workflow: downloads manifest-bench-linux-x64 +
preload-bench-linux-x64 artifacts (built by build-linux's
benchmark-label conditional steps) into the pm-bench-pcap-linux
job env so pm-bench-pcap.sh can find them.

Trigger: workflow_dispatch with target=pm-bench-pcap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous pm-bench-pcap artifact was 2GB (raw .pcap files for every
PM × phase × bench), making the round-trip download impractical
just to read JSON metrics. Adds a separate `pm-bench-pcap-summaries`
artifact containing only the *.json / *.log / *.iostat.txt / dns.txt
files — KB scale, downloads in seconds.

Raw pcap artifact is preserved for cases where we want to re-run
tshark with different filters.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pm-bench-pcap artifact is ~2 GB (pcap binaries dominate). gh
run download keeps timing out before completion. Two fixes:

1. New `pm-bench-pcap-summaries` artifact uploads only the JSON
   summaries + .log + iostat.txt + dns.txt (small, fast download).
   The full pcap artifact stays for deep inspection when needed.

2. End of pm-bench-pcap.sh prints a tab-separated comparison
   table (name, wall_s, packets, streams, zwin, retx, dup_ack,
   gap_p99_us, gap_max_us) to stdout, so the data is visible in
   the CI run log without downloading anything.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…raph

The pcap evidence (utoo-resolve zwin=71 vs mb-c96 zwin=49) confirmed
main loop CPU was starving tokio runtime workers from polling
sockets. Inline graph mutations (sum_graph=450ms across the fetch
loop) blocked the worker between awaits, so TCP receive buffers
filled and the server paused sending — directly extending wall.

This refactor:
* Spawns `graph_worker` as a separate tokio task (gets its own
  runtime worker thread on multi-thread runtime). Owns the
  DependencyGraph + edge_targets + seen_specs.
* Main loop owns FuturesUnordered + body_cache + dispatch state.
  No graph mutations on this path.
* mpsc channels: main → graph (FetchEventMsg, just the name —
  cache writes already in the future), graph → main (Vec<Dep>
  new pending specs to extend the fetch queue).
* `tokio::select!` with `biased` drains specs first to unblock
  fetch dispatch.
* `in_flight_graph` counter tracks outstanding messages to graph
  worker — termination = futs empty + in_flight_graph == 0.

Function signature changed: takes `mut graph: DependencyGraph` by
value, returns `(DependencyGraph, MbFetchStats)` since the worker
task needs ownership of the graph (can't borrow across spawn).
api.rs caller threads the graph through.

Expected: zwin drops back toward mb's ~49 (no more main loop
starvation), eff_par_net climbs from 56 toward mb's 72, wall
saves ~200ms.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
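The main-loop / graph-worker split and its termination condition (futures empty + in_flight_graph == 0) can be illustrated with a std-only analogue. The real code uses tokio tasks and tokio mpsc channels; here plain threads and `std::sync::mpsc` stand in, and the "fetch" is simulated by a lookup table:

```rust
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;

// Illustrative std-only sketch of the two-task split. The "graph worker"
// owns the seen-set; the main loop owns dispatch. Termination: dispatch
// queue empty AND no messages outstanding to the worker.
pub fn resolve(seed: Vec<String>, deps: HashMap<String, Vec<String>>) -> usize {
    let (fetch_tx, fetch_rx) = mpsc::channel::<String>();
    let (specs_tx, specs_rx) = mpsc::channel::<Vec<String>>();

    // Graph worker: for each fetched name, reply with its deps
    // (empty reply for already-seen names, so the walk terminates).
    let graph = thread::spawn(move || {
        let mut seen = std::collections::HashSet::new();
        let mut nodes = 0usize;
        for name in fetch_rx {
            if seen.insert(name.clone()) {
                nodes += 1;
                specs_tx.send(deps.get(&name).cloned().unwrap_or_default()).unwrap();
            } else {
                specs_tx.send(Vec::new()).unwrap();
            }
        }
        nodes
    });

    // Main loop: each send to the worker gets exactly one reply, so
    // in_flight_graph counts messages not yet answered.
    let mut queue = seed;
    let mut in_flight_graph = 0usize;
    loop {
        while let Some(name) = queue.pop() {
            fetch_tx.send(name).unwrap();
            in_flight_graph += 1;
        }
        if in_flight_graph == 0 {
            break; // queue drained + nothing outstanding => done
        }
        queue.extend(specs_rx.recv().unwrap());
        in_flight_graph -= 1;
    }
    drop(fetch_tx); // closes the channel so the worker's loop exits
    graph.join().unwrap()
}
```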
elrrrrrrr and others added 8 commits May 9, 2026 23:54
Plumb the PipelineReceiver through the folded mb_fetch_with_graph path
so install (`utoo install`) gets the same channel-separated fetch + graph
architecture as `utoo deps`, with download/clone pipelines starting as
early as the legacy preload+BFS path:

- mb_fetch_with_graph now takes Arc<R: EventReceiver + 'static>; main
  loop emits PackageResolved on each fetch land (looked up via cache
  with the new FetchOutcome::primary_spec), graph_worker emits
  PackagePlaced on ProcessResult::Created.
- service::api::build_deps wraps the caller-supplied receiver in Arc
  once and shares it between mb_fetch_with_graph and
  build_deps_with_config; adds + 'static bound on R.
- pipeline_deps_options sets skip_preload=true so install routes
  through the same folded path as the lockfile-only command.

CI will validate that p1 resolve continues at/below 2.5s while
p0_full_cold and p3_cold_install do not regress (download + clone
pipelines remain saturated via emitted events).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously spawn_fetch / spawn_settle used the raw dep key as both
the registry path segment and the cache lookup key. For an npm-alias
dep like `"ms": "npm:raw-body@2.1.3"` this hit
`registry/ms` instead of `registry/raw-body`, parsed ms's manifest
against `npm:raw-body@2.1.3`, and ultimately installed the real ms
into `node_modules/ms/` rather than raw-body. e2e
`utoo-pm.sh:466` ("top-level ms should be raw-body") caught this
on d1cf53e.

Fix:
- spawn_fetch / spawn_settle call `normalize_spec` to split out the
  real package name + spec; the URL hits `registry/{real_name}` and
  the combined parse runs against `real_spec` so version resolution
  sees the right manifest envelope.
- Cache writes go under both keys: the original
  `(alias_name, alias_spec)` so `graph_worker` finds the manifest
  via `edge_targets`, and the normalized
  `(real_name, resolved_version)` for direct-dep dedup.
- Main loop dedup state (in_flight_names / deferred_by_name /
  body_cache) keys by real_name so two distinct aliases pointing at
  the same registry package share dedup; deferred entries store
  `(alias_name, spec)` so the drain spawns spawn_settle with the
  correct cache key.
- Adds `real_name` to FetchOutcome so the deferred-drain step can
  look up by real name without re-normalizing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
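A minimal sketch of the normalization rule, assuming `normalize_spec` behaves as the commit describes; the body below is an illustration, not the ruborist implementation:

```rust
/// Illustrative sketch of npm-alias normalization. "npm:raw-body@2.1.3"
/// under dep name "ms" must fetch registry/raw-body and resolve against
/// "2.1.3"; non-alias specs pass through unchanged.
pub fn normalize_spec(name: &str, spec: &str) -> (String, String) {
    if let Some(rest) = spec.strip_prefix("npm:") {
        // Split on the last '@' so scoped names like "@scope/pkg@1.0.0"
        // keep their leading '@' inside the package name.
        if let Some(at) = rest.rfind('@').filter(|&i| i > 0) {
            return (rest[..at].to_string(), rest[at + 1..].to_string());
        }
        // "npm:pkg" with no version pin means any version.
        return (rest.to_string(), "*".to_string());
    }
    (name.to_string(), spec.to_string())
}
```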
GHA ubuntu-latest is a 2-core runner; tokio's default worker_threads
= num_cpus = 2. The install hot path multiplexes four concurrent task
families on those workers:
- mb_fetch_with_graph main loop (drives sockets + FuturesUnordered)
- graph_worker (`tokio::spawn`, CPU-heavy)
- pipeline download workers (PackageResolved → tarball fetch)
- pipeline clone workers (PackagePlaced → hardlink/clonefile)

Under that load, graph_worker can monopolize a worker thread for tens
of ms at a stretch, starving the main loop's socket polling. The
symptom is a wave-shape collapse (eff_par_full 73-77 → 40, mb_fetch
wall 4-6s → 10s+) that pushes p0_full_cold tail by 3-5s on the
affected run.

Floor worker_threads at `max(num_cpus, 4)` so the runtime always has
headroom to keep the resolve hot path on its own worker even when
the install pipeline saturates the others. No-op on 4+ core machines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
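The floor is computable with the standard library alone; feeding the result into `tokio::runtime::Builder::new_multi_thread().worker_threads(n)` is presumably how the commit applies it (the wiring is an assumption, the arithmetic is the commit's):

```rust
use std::num::NonZeroUsize;
use std::thread;

/// Floor the tokio worker count at 4 so 2-core CI runners still give the
/// resolve hot path its own worker. No-op on 4+ core machines.
pub fn worker_threads() -> usize {
    thread::available_parallelism()
        .map(NonZeroUsize::get)
        .unwrap_or(1) // fall back to 1 if the query fails
        .max(4)
}
```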
graph_worker is pure CPU + channel IO; on a multi-thread tokio runtime
it sat on a worker thread that the install pipeline (download + clone
+ extract) and the resolve main loop were also competing for. Under
the 2-core GHA ubuntu runner that contention produced the eff_par_full
collapse (73-77 → 40) and 4-6s → 10s+ mb_fetch wall on the p0/p1
outlier runs.

Move it to `tokio::task::spawn_blocking`:
- Convert graph_worker from `async fn` to sync `fn`; channel IO
  uses `mpsc::Receiver::blocking_recv` and `mpsc::Sender::blocking_send`,
  which are tokio-supported when called outside an async context.
- Spawn site wraps it in a `move ||` closure so spawn_blocking owns
  the captured state. `graph_handle.await` keeps the same shape, since
  a spawn_blocking JoinHandle is awaitable.

The blocking pool defaults to 512 threads, so reserving one slot for
graph_worker has no scheduling effect on download/clone/extract
spawn_blocking calls. Net effect: the resolve worker can no longer be
preempted by graph mutation bursts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…client builder

PR #2916/#2920's channel architecture broke utooweb-ci wasm build:

  error[E0277]: `Rc<RefCell<wasm_bindgen_futures::Inner>>`
                cannot be sent between threads safely
  error[E0599]: no method named `no_proxy` found for struct
                `ClientBuilder` in the current scope
  error[E0277]: `*mut u8` cannot be sent between threads safely
  note: required because it appears within
        the type `reqwest::wasm::AbortGuard`

Two root causes:

1. `mb_resolve.rs::mb_fetch_with_graph` calls
   `tokio::task::spawn_blocking(move || graph_worker(...))`, which
   requires Send on the closure. wasm32 reqwest's `AbortGuard`
   contains `Rc<RefCell<...>>`, which is not Send. wasm tokio is
   single-threaded with no `spawn_blocking` semantics either.

2. `mb_resolve.rs::build_mb_client`'s wasm32 variant called
   `.no_proxy()`, which doesn't exist on wasm reqwest's
   `ClientBuilder` (proxy settings live in browser fetch config).

Fix:

- Gate `mb_fetch_with_graph` and its caller in `service::api::build_deps`
  with `#[cfg(not(target_arch = "wasm32"))]`. wasm32 falls back to
  the legacy preload + BFS path (`build_deps_with_config`).
- Drop `.no_proxy()` from the wasm32 `build_mb_client` body.

Net effect: native keeps the channel mb_fetch_with_graph for the p1
win; wasm regains a buildable resolve path via legacy preload+BFS.
The legacy code in `resolver::preload` and BFS in
`resolver::builder` must remain reachable for the wasm carve-out;
plans to delete legacy code post-channel must keep a wasm32-gated
fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #2924 cfg-gated mb_fetch_with_graph but utooweb-ci still failed:

  error[E0432]: unresolved import `crate::resolver::mb_resolve::mb_fetch_with_graph`
  error[E0277]: `Rc<RefCell<wasm_bindgen_futures::Inner>>` cannot be sent
                between threads safely
                in `crates/ruborist/src/resolver/fast_preload.rs:225`
                in `crates/ruborist/src/resolver/mb_resolve.rs:229`

Two more places miss wasm gating:

1. `fast_preload.rs` builds `Pin<Box<dyn Future + Send>>` over reqwest
   in its FuturesUnordered. wasm reqwest's response futures hold
   `Rc<RefCell<wasm_bindgen_futures::Inner>>`, which is !Send, so the
   entire module won't compile on wasm32.
2. `mb_resolve.rs::Fut` type alias is also `Pin<Box<dyn Future + Send>>`,
   so even non-mb_fetch_with_graph functions in the file fail Send.

Fix at the module boundary in `resolver/mod.rs`:

  #[cfg(not(target_arch = "wasm32"))]
  pub mod fast_preload;
  #[cfg(not(target_arch = "wasm32"))]
  pub mod mb_resolve;

Plus gate the import in `service::api` with the same cfg. wasm callers
keep using legacy preload + BFS via `build_deps_with_config`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #2916 added a `skip_preload` field to `BuildDepsOptions`. Native
callers in `pm/helper/ruborist_context.rs` were updated, but
`utoo-wasm/src/deps.rs` constructs the options inline and was missed:

  error[E0063]: missing field `skip_preload` in initializer of
                `BuildDepsOptions<_, _>`

Set `skip_preload: false` so wasm callers stay on the legacy preload
+ BFS path (channel mb_fetch_with_graph requires multi-thread tokio
+ Send-safe types, both unavailable on wasm32).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two micro-opts targeting the 0.5s framework overhead between utoo p1
(2.5-2.7s) and manifest-bench's pure-network floor (2.0s) on linux GHA.
PR #2924's clean linux p1 best run was 2503ms; manifest-bench cap=128
hit 2014ms.

1. **Sibling two-phase parse**: when spawn_settle fires, the original
   fetcher has already cached the FullManifest envelope under
   real_name. Skip the to_borrowed_value + dist_tags + versions_keys
   + FullManifest construction and only deserialize the requested
   version subtree via `extract_core_version_off_runtime` (rayon).
   For ant-design's 1843 siblings × ~1ms saved CPU = ~25ms wall on
   75-worker eff_par_net. Falls back to combined parse if cache miss
   (rare race: settle dispatched before fetcher's cache write committed).

2. **Channel buffer cap × 2 → cap × 4**: when graph_worker briefly
   bursts CPU on a wave-end batch, main loop's `fetch_tx.send().await`
   can fill the cap*2 channel and block. Blocked main loop stops
   dispatching → eff_par_net dips. 4× buffer absorbs ~2 full caps
   worth of pending events without backpressuring the fetch loop.
   Memory cost trivial (FetchEventMsg = 1 String, ~24 bytes).

Together expected: 50-150ms p1 wall reduction by lifting eff_par_net
from ~75 toward the manifest-bench reference of 92.

Falsification: if eff_par_net stays at ~75 in the breakdown line, the
bottleneck is elsewhere (sibling defer dedup, wave-shape, or
graph_worker batching) and we need the bigger experiments.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@elrrrrrrr elrrrrrrr added the benchmark Run pm-bench on PR label May 11, 2026

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces significant performance optimizations for the dependency resolution process, particularly for the lockfile-only path. Key changes include the addition of standalone benchmarks, a new folded streaming graph build that overlaps network I/O with graph mutations, and improved thread-pool management for Rayon and Tokio to prevent worker starvation. New instrumentation has also been added to provide detailed timing data for manifest fetches. Review feedback identifies a potential deadlock risk in the channel-based communication between the main loop and the graph worker, a performance bottleneck in edge-target processing due to linear scans, and the use of unstable Rust features like let_chains, which may impact compilation on stable toolchains.

// Send to graph worker. `send().await` only
// blocks if channel is full (cap * 2 buffer);
// under steady state shouldn't happen.
if fetch_tx.send(FetchEventMsg { name: out.name }).await.is_ok() {
Severity: high

There is a potential deadlock risk between the main loop and the graph_worker. The main loop blocks on fetch_tx.send(...).await when the channel is full, which prevents it from polling specs_rx. Simultaneously, the graph_worker blocks on specs_tx.blocking_send(...) when its channel is full, preventing it from receiving from fetch_rx. While the buffer size is large (cap * 4), a workload with many dependencies (e.g., a large lockfile or many cache hits) could still fill both channels and cause a hang. Consider using try_send with an internal overflow queue in the main loop, or ensuring the main loop continues to poll specs_rx while waiting for space in fetch_tx.
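The reviewer's suggested non-blocking alternative can be sketched std-only with `try_send` plus a local overflow queue. Names are illustrative; the real code would use tokio's `mpsc::Sender::try_send` rather than std's `SyncSender`:

```rust
use std::collections::VecDeque;
use std::sync::mpsc::{SyncSender, TrySendError};

/// Sketch of the suggestion: never block the sender on a full channel.
/// try_send first; on Full, park the message in a local overflow queue
/// that later calls drain, so the main loop stays free to poll specs_rx.
pub struct NonBlockingSender<T> {
    tx: SyncSender<T>,
    overflow: VecDeque<T>,
}

impl<T> NonBlockingSender<T> {
    pub fn new(tx: SyncSender<T>) -> Self {
        Self { tx, overflow: VecDeque::new() }
    }

    /// Queue `msg`, draining previously-overflowed messages first so
    /// ordering is preserved. Returns how many messages are still parked.
    pub fn send(&mut self, msg: T) -> usize {
        self.overflow.push_back(msg);
        while let Some(m) = self.overflow.pop_front() {
            match self.tx.try_send(m) {
                Ok(()) => {}
                Err(TrySendError::Full(m)) => {
                    self.overflow.push_front(m); // channel full: keep for later
                    break;
                }
                Err(TrySendError::Disconnected(m)) => {
                    self.overflow.push_front(m);
                    break;
                }
            }
        }
        self.overflow.len()
    }
}
```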

Comment on lines +1002 to +1006
for (k_name, k_spec) in primary_keys {
let Some(core_arc) = cache.get_version_manifest(&k_name, &k_spec) else {
continue;
};
let resolved = ResolvedPackage {
Severity: medium

Iterating over all keys in edge_targets to filter by msg.name results in $O(N)$ complexity per message, leading to $O(N^2)$ overall complexity for the resolution process. For large dependency graphs, this linear scan will significantly degrade performance. It is recommended to refactor EdgeTargets into a nested map structure (e.g., HashMap<String, HashMap<String, Vec<...>>>) to allow $O(1)$ access to all relevant specs for a given package name.

Comment on lines +1041 to +1043
if let Some(node) = graph.get_node(new_idx)
&& let NodeManifest::Registry(ref manifest) = node.manifest
{
Severity: medium

The use of let_chains (combining let bindings with &&) is an unstable Rust feature. Unless the project is explicitly configured to use a nightly toolchain with the #![feature(let_chains)] attribute, this code will fail to compile on stable Rust. It is recommended to use nested if let statements instead. This pattern is also used in several other places in this PR (e.g., line 886 and in preload-bench/src/main.rs).
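The stable-toolchain rewrite the comment recommends, shown on illustrative stand-in types (not the actual ruborist `Node`/`NodeManifest` definitions):

```rust
// Illustrative types standing in for the graph node lookup in the diff.
enum NodeManifest {
    Registry(String),
    Other,
}
struct Node {
    manifest: NodeManifest,
}

fn registry_manifest(node: Option<&Node>) -> Option<&str> {
    // Let-chain form (unstable on older toolchains):
    //   if let Some(node) = node
    //       && let NodeManifest::Registry(ref m) = node.manifest { ... }
    // Stable rewrite: nest the two `if let`s.
    if let Some(node) = node {
        if let NodeManifest::Registry(ref m) = node.manifest {
            return Some(m.as_str());
        }
    }
    None
}
```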

elrrrrrrr and others added 2 commits May 11, 2026 11:58
Two changes targeting the eff_par_net 75 → 92 gap (vs manifest-bench).

1. **Revert two-phase sibling parse** (from prior commit). Bench data
   on PR #2929 showed eff_par_full collapsed 97 → 76 because
   \`extract_core_version_off_runtime\` uses \`rayon::spawn\` whose
   dispatch latency dropped parallelism. Wall stayed flat (~2.5s) but
   network parallelism halved. Two-phase only saves ~1ms CPU per
   sibling (envelope extract); rayon dispatch latency exceeds that.
   Stick with combined \`spawn_blocking\` parse — bounded by 512-slot
   blocking pool.

2. **`name_index: HashMap<String, Vec<String>>` side index for
   graph_worker O(1) lookup**. Previously graph_worker scanned all
   `edge_targets.keys()` per fetched-name message to find primary
   keys for that name. With ant-design's 4645 names × 4645 keys =
   21M iterations across the resolve phase, this added ~250-500ms
   wall on the graph_worker hot loop and bottlenecked the main
   loop's fetch dispatch (graph back-pressure → eff_par_net stuck
   at ~62-75 vs manifest-bench's 92).

   The index is maintained in lockstep with `edge_targets`: insert
   on first-seen spec (gated by `seen_specs.insert`), remove on
   graph_worker drain. Both `enqueue_node_edges` and
   `enqueue_node_edges_into` thread the index through.

Falsification: bench breakdown should show eff_par_net climbing from
~62 (PR #2929) toward manifest-bench's 92. p1 wall expected to drop
from 2.5s toward 2.1-2.2s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
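The lockstep maintenance can be sketched as follows; the types are illustrative, not the ruborist definitions:

```rust
use std::collections::HashMap;

/// Sketch of the side index: edge_targets keys by (name, spec);
/// name_index maps name -> specs with waiters so the graph worker can
/// find every bucket for a fetched name without scanning all keys.
/// Both structures are updated in lockstep.
#[derive(Default)]
pub struct EdgeTargets {
    targets: HashMap<(String, String), Vec<u32>>, // (name, spec) -> waiting edge ids
    name_index: HashMap<String, Vec<String>>,     // name -> specs with waiters
}

impl EdgeTargets {
    pub fn insert(&mut self, name: &str, spec: &str, edge: u32) {
        let key = (name.to_string(), spec.to_string());
        let bucket = self.targets.entry(key).or_default();
        if bucket.is_empty() {
            // First waiter for this (name, spec): index it exactly once.
            self.name_index
                .entry(name.to_string())
                .or_default()
                .push(spec.to_string());
        }
        bucket.push(edge);
    }

    /// Drain every (name, spec) bucket for a fetched name in one
    /// O(specs-for-name) pass instead of O(all-keys).
    pub fn drain_name(&mut self, name: &str) -> Vec<(String, Vec<u32>)> {
        let specs = self.name_index.remove(name).unwrap_or_default();
        specs
            .into_iter()
            .filter_map(|spec| {
                let edges = self.targets.remove(&(name.to_string(), spec.clone()))?;
                Some((spec, edges))
            })
            .collect()
    }
}
```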
Bug in the prior O(N²)→O(N) lookup change: name_index.remove() pulled
all specs for the fetched name, and any spec that hit a cache miss in
the `for k_spec in primary_keys` loop got dropped from name_index but
left dangling in edge_targets. Result: 821-882 unresolved_targets at
the end of mb_fetch_with_graph (was 0 before). Tree-shape correctness
risk for sibling specs that defer past the fetcher's msg.

Race scenario:
1. ("react", "^17") + ("react", "^18") in name_index/edge_targets
2. spawn_fetch fires for "^17" only ("^18" deferred via in_flight_real_names)
3. fetcher writes cache for (react, ^17), NOT for (react, ^18) yet
4. fetcher sends fetch_tx{name="react"}
5. graph_worker drains name_index["react"] = ["^17", "^18"]
6. "^17" hits cache → processed + edge_targets.remove ok
7. "^18" misses cache → continue → spec lost from name_index
8. (Later) sibling settle for "^18" lands, sends fetch_tx{name="react"}
9. graph_worker drains name_index["react"] = None → no-op
10. (react, ^18) orphaned in edge_targets

Fix: collect cache-missed specs into `retry_specs`, re-insert into
name_index at the end of the msg loop. The next msg for the same name
will see them. Constant overhead: at most O(siblings) per msg.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
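The fix as a std-only sketch; the names follow the commit message but the function itself is illustrative:

```rust
use std::collections::{HashMap, HashSet};

/// Sketch of the retry fix: when draining name_index for a fetched name,
/// specs whose manifest isn't cached yet must be re-inserted, not dropped,
/// or their edge_targets buckets are orphaned.
pub fn drain_with_retry(
    name_index: &mut HashMap<String, Vec<String>>,
    cache: &HashSet<(String, String)>, // (name, spec) pairs with a cached manifest
    name: &str,
) -> Vec<String> {
    let specs = name_index.remove(name).unwrap_or_default();
    let mut ready = Vec::new();
    let mut retry_specs = Vec::new();
    for spec in specs {
        if cache.contains(&(name.to_string(), spec.clone())) {
            ready.push(spec); // cache hit: safe to resolve now
        } else {
            retry_specs.push(spec); // miss: sibling fetch hasn't landed yet
        }
    }
    if !retry_specs.is_empty() {
        // Re-insert so the next fetch event for `name` sees these specs
        // instead of leaving them dangling in edge_targets.
        name_index.insert(name.to_string(), retry_specs);
    }
    ready
}
```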
@github-actions

📊 pm-bench-phases · 094cd17 · linux (ubuntu-latest)

Workflow run — ant-design

PMs: utoo (this branch) · utoo-npm (latest published) · bun (latest)

npmjs.org

p0_full_cold

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 9.27s | 0.41s | 10.39s | 10.14s | 744M | 325.5K |
| utoo-next | 8.27s | 0.12s | 10.65s | 12.05s | 966M | 122.2K |
| utoo-npm | 8.49s | 0.21s | 11.18s | 12.66s | 1.35G | 192.5K |
| utoo | 7.94s | 0.50s | 10.82s | 11.97s | 1.63G | 213.4K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 13.9K | 16.6K | 1.20G | 6M | 1.89G | 1.76G | 1M |
| utoo-next | 115.3K | 85.7K | 1.17G | 4M | 1.73G | 1.73G | 2M |
| utoo-npm | 125.1K | 88.7K | 1.17G | 4M | 1.73G | 1.73G | 2M |
| utoo | 108.3K | 86.1K | 1.17G | 5M | 1.73G | 1.73G | 3M |

p1_resolve

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 1.93s | 0.06s | 4.05s | 1.01s | 495M | 168.2K |
| utoo-next | 3.10s | 0.06s | 5.50s | 1.89s | 614M | 81.5K |
| utoo-npm | 3.16s | 0.02s | 5.55s | 1.91s | 615M | 86.0K |
| utoo | 2.63s | 0.04s | 5.36s | 0.97s | 1.03G | 147.8K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 7.9K | 4.3K | 205M | 3M | 108M | - | 1M |
| utoo-next | 68.4K | 115.7K | 201M | 2M | 7M | 3M | 2M |
| utoo-npm | 68.5K | 111.6K | 201M | 2M | 7M | 3M | 3M |
| utoo | 38.8K | 8.5K | 201M | 2M | - | 3M | 3M |

p3_cold_install

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 6.74s | 0.43s | 6.25s | 9.88s | 636M | 215.9K |
| utoo-next | 6.85s | 1.47s | 5.03s | 10.90s | 526M | 63.5K |
| utoo-npm | 8.15s | 2.59s | 5.47s | 11.25s | 991M | 129.1K |
| utoo | 7.32s | 2.62s | 5.12s | 10.92s | 669M | 82.4K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 3.5K | 6.4K | 1.00G | 3M | 1.78G | 1.78G | 1M |
| utoo-next | 106.1K | 54.5K | 1000M | 3M | 1.73G | 1.73G | 3M |
| utoo-npm | 103.5K | 61.3K | 1000M | 2M | 1.73G | 1.73G | 3M |
| utoo | 99.8K | 76.0K | 1000M | 3M | 1.73G | 1.73G | 3M |

p4_warm_link

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 3.54s | 0.01s | 0.19s | 2.44s | 139M | 32.3K |
| utoo-next | 2.31s | 0.05s | 0.49s | 3.83s | 79M | 18.4K |
| utoo-npm | 2.34s | 0.14s | 0.53s | 3.84s | 82M | 19.3K |
| utoo | 2.34s | 0.08s | 0.52s | 3.84s | 79M | 18.4K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 267 | 69 | 29K | 19K | 1.89G | 1.76G | 1M |
| utoo-next | 41.1K | 19.0K | 2K | 10K | 1.73G | 1.72G | 2M |
| utoo-npm | 43.9K | 20.4K | 15K | 15K | 1.73G | 1.72G | 2M |
| utoo | 40.4K | 17.8K | 3K | 10K | 1.73G | 1.72G | 2M |

npmmirror.com

p0_full_cold

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 42.06s | 28.39s | 9.02s | 9.53s | 539M | 412.7K |
| utoo-next | 20.11s | 0.66s | 7.05s | 13.05s | 503M | 57.5K |
| utoo-npm | 47.12s | 33.81s | 7.66s | 14.38s | 834M | 113.9K |
| utoo | 20.29s | 8.08s | 11.80s | 13.36s | 1.57G | 210.3K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 58.4K | 3.4K | 1.16G | 10M | 1.89G | 1.77G | 2M |
| utoo-next | 206.0K | 102.9K | 1019M | 8M | 1.73G | 1.73G | 2M |
| utoo-npm | 241.1K | 142.1K | 1020M | 10M | 1.73G | 1.73G | 2M |
| utoo | 194.0K | 145.9K | 1.13G | 9M | 1.73G | 1.73G | 3M |

p1_resolve

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 3.79s | 0.03s | 3.66s | 1.30s | 514M | 228.9K |
| utoo-next | 5.48s | 0.02s | 1.59s | 0.90s | 87M | 21.3K |
| utoo-npm | 5.46s | 0.03s | 1.43s | 0.95s | 88M | 21.5K |
| utoo | 3.39s | 0.04s | 5.48s | 1.12s | 1.02G | 136.7K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 25.2K | 2.1K | 155M | 4M | 109M | - | 2M |
| utoo-next | 46.5K | 26.7K | 16M | 2M | - | 3M | 2M |
| utoo-npm | 45.9K | 34.3K | 16M | 2M | - | 3M | 2M |
| utoo | 42.9K | 7.7K | 150M | 3M | - | 3M | 3M |

p3_cold_install

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 28.10s | 7.15s | 5.61s | 8.22s | 257M | 137.1K |
| utoo-next | 47.12s | 36.30s | 5.61s | 11.73s | 360M | 50.1K |
| utoo-npm | 57.99s | 30.01s | 6.23s | 13.08s | 692M | 93.2K |
| utoo | 20.27s | 0.78s | 5.62s | 11.81s | 404M | 57.0K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 26.3K | 1.9K | 1020M | 6M | 1.74G | 1.74G | 2M |
| utoo-next | 168.6K | 46.4K | 1003M | 6M | 1.73G | 1.73G | 3M |
| utoo-npm | 191.4K | 82.3K | 1004M | 6M | 1.73G | 1.73G | 3M |
| utoo | 151.9K | 115.8K | 1004M | 6M | 1.73G | 1.73G | 3M |

p4_warm_link

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 3.27s | 0.11s | 0.19s | 2.30s | 141M | 32.7K |
| utoo-next | 2.30s | 0.09s | 0.51s | 3.81s | 81M | 18.4K |
| utoo-npm | 2.37s | 0.04s | 0.51s | 3.84s | 82M | 19.0K |
| utoo | 2.36s | 0.10s | 0.50s | 3.80s | 80M | 18.3K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 314 | 21 | 2M | 32K | 1.87G | 1.77G | 2M |
| utoo-next | 43.1K | 18.4K | 21K | 10K | 1.73G | 1.72G | 2M |
| utoo-npm | 47.4K | 20.5K | 22K | 13K | 1.73G | 1.72G | 2M |
| utoo | 42.4K | 18.7K | 22K | 27K | 1.73G | 1.72G | 2M |

@github-actions

📊 pm-bench-phases · 094cd17 · mac (macos-latest)

Workflow run — ant-design

PMs: utoo (this branch) · utoo-npm (latest published) · bun (latest)

npmjs.org

p0_full_cold

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 15.37s | 1.02s | 6.19s | 17.23s | 802M | 51.8K |
| utoo-npm | 21.69s | 4.68s | 10.40s | 25.54s | 1.09G | 109.2K |
| utoo | 20.15s | 2.31s | 9.36s | 23.40s | 919M | 68.7K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 16.2K | 139.1K | - | - | 1.82G | 1.94G | 1M |
| utoo-npm | 12.9K | 319.3K | - | - | 1.67G | 1.87G | 2M |
| utoo | 11.6K | 311.7K | - | - | 1.64G | 1.91G | 3M |

p1_resolve

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 2.41s | 0.24s | 2.70s | 1.26s | 499M | 32.5K |
| utoo-npm | 3.20s | 0.25s | 3.82s | 2.05s | 591M | 39.6K |
| utoo | 2.41s | 0.16s | 3.72s | 0.92s | 627M | 41.9K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 17 | 21.7K | - | - | 112M | - | 1M |
| utoo-npm | 22 | 170.3K | - | - | 27M | 3M | 2M |
| utoo | 21 | 54.6K | - | - | - | - | 2M |

p3_cold_install

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 13.77s | 0.46s | 3.65s | 17.20s | 538M | 35.0K |
| utoo-npm | 15.91s | 4.47s | 4.39s | 20.98s | 645M | 75.8K |
| utoo | 14.42s | 4.14s | 3.73s | 19.54s | 531M | 36.2K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 3.9K | 132.4K | - | - | 1.73G | 1.97G | 1M |
| utoo-npm | 1.5K | 234.9K | - | - | 1.64G | 1.90G | 3M |
| utoo | 1.4K | 248.3K | - | - | 1.64G | 1.90G | 3M |

p4_warm_link

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 6.39s | 0.73s | 0.14s | 2.72s | 53M | 4.0K |
| utoo-npm | 4.99s | 0.24s | 0.59s | 3.78s | 92M | 7.0K |
| utoo | 3.77s | 0.39s | 0.28s | 2.69s | 86M | 6.5K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 17.2K | 787 | - | - | 1.89G | 1.94G | 1M |
| utoo-npm | 12.7K | 72.2K | - | - | 1.64G | 1.90G | 2M |
| utoo | 15.2K | 55.5K | - | - | 1.64G | 1.90G | 2M |

npmmirror.com

p0_full_cold

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 33.25s | 7.72s | 7.97s | 24.07s | 607M | 39.3K |
| utoo-npm | 80.23s | 32.40s | 9.05s | 26.45s | 807M | 79.8K |
| utoo | 34.91s | 2.57s | 9.89s | 21.58s | 1000M | 67.4K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 14.6K | 160.6K | - | - | 1.80G | 1.95G | 2M |
| utoo-npm | 1.6K | 469.6K | - | - | 1.64G | 1.87G | 2M |
| utoo | 2.5K | 443.4K | - | - | 1.64G | 1.87G | 2M |

p1_resolve

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 8.07s | 4.71s | 2.89s | 1.53s | 568M | 37.0K |
| utoo-npm | 0.98s | 0.07s | 1.45s | 0.83s | 97M | 6.9K |
| utoo | 2.20s | 0.01s | 3.90s | 0.91s | 635M | 42.7K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 7 | 23.0K | - | - | 114M | - | 2M |
| utoo-npm | 4 | 88.5K | - | - | - | 3M | 2M |
| utoo | 18 | 51.1K | - | - | - | 3M | 2M |

p3_cold_install

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 26.16s | 1.70s | 3.81s | 15.20s | 233M | 15.4K |
| utoo-npm | 82.19s | 30.07s | 6.48s | 21.94s | 769M | 81.7K |
| utoo | 63.08s | 26.53s | 5.92s | 20.18s | 478M | 32.2K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 1.7K | 164.1K | - | - | 1.68G | 1.95G | 2M |
| utoo-npm | 1.5K | 360.7K | - | - | 1.64G | 1.94G | 2M |
| utoo | 1.7K | 397.2K | - | - | 1.64G | 1.94G | 2M |

p4_warm_link

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 3.24s | 0.09s | 0.09s | 1.79s | 47M | 3.6K |
| utoo-npm | 2.87s | 0.17s | 0.35s | 2.36s | 95M | 7.0K |
| utoo | 3.15s | 0.19s | 0.33s | 2.33s | 82M | 6.3K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 12.8K | 837 | - | - | 1.81G | 1.94G | 2M |
| utoo-npm | 12.3K | 65.0K | - | - | 1.64G | 1.87G | 2M |
| utoo | 13.8K | 55.0K | - | - | 1.64G | 1.87G | 2M |
