perf(pm): optimal — channel resolve on deps, legacy install on p0/p3 #2923
Conversation
…down
p1_resolve has been ~0.9s behind bun on the phases bench for the past
several PRs. Pcap on prior runs measured bun opening ~260 parallel
TCP streams against registry.npmjs.org during resolve, while utoo
opened ~70 (the manifests-concurrency-limit cap of 64 was saturated).
Adding fetch-breakdown timing in ruborist showed where p1's 22s
(local Mac) actually goes:
fetch-timings: n=2730
sum_request = 1089s (88% — TCP+TLS+HTTP RTT to first byte)
sum_body = 138s (11% — body download)
sum_parse = 2s (0.16% — simd_json on rayon)
The dominant cost is per-request RTT, not parsing or body transfer.
The lever is the cap on concurrent in-flight requests.
This commit:
1. Adds `crates/ruborist/src/util/timing.rs` — process-wide atomic
accumulator that records per-fetch (request_us, body_us,
parse_us, bytes) inside both `fetch_full_manifest` and
`fetch_version_manifest`. Reset before each preload phase, dumped
at INFO level after preload + bfs.
2. Bumps `manifests-concurrency-limit` default 64 → 256 to match
bun's observed working point against npmjs.org.
CI bench will validate. Expected: p1 utoo wall drops toward bun's
range (~2.3s on GHA).
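A minimal sketch of the accumulator shape (the field names are from this
commit; the exact struct layout in timing.rs is assumed):

    use std::sync::atomic::{AtomicU64, Ordering};

    /// Process-wide totals; reset before each preload phase, dumped after.
    pub struct FetchTimings {
        pub n: AtomicU64,
        pub request_us: AtomicU64,
        pub body_us: AtomicU64,
        pub parse_us: AtomicU64,
        pub bytes: AtomicU64,
    }

    pub static FETCH_TIMINGS: FetchTimings = FetchTimings {
        n: AtomicU64::new(0),
        request_us: AtomicU64::new(0),
        body_us: AtomicU64::new(0),
        parse_us: AtomicU64::new(0),
        bytes: AtomicU64::new(0),
    };

    impl FetchTimings {
        /// Called once per fetch from both manifest fetch entry points.
        pub fn record(&self, request_us: u64, body_us: u64, parse_us: u64, bytes: u64) {
            self.n.fetch_add(1, Ordering::Relaxed);
            self.request_us.fetch_add(request_us, Ordering::Relaxed);
            self.body_us.fetch_add(body_us, Ordering::Relaxed);
            self.parse_us.fetch_add(parse_us, Ordering::Relaxed);
            self.bytes.fetch_add(bytes, Ordering::Relaxed);
        }
    }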
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two changes, after the GHA bench on the previous commit (PR #2916, run
25559625024) showed the concurrency=256 hypothesis was wrong in GHA's
environment.

Revert concurrency 256 → 64
---------------------------
The new fetch-timing instrumentation shipped in the previous commit
caught the surprise: GHA's pcap-vs-local profile is the *opposite* of
what local Mac measurements suggested.

    metric        local Mac   GHA Linux
    avg_request   399ms       70ms     ← network MUCH faster on GHA
    avg_body      50ms        20ms
    avg_parse     730µs       266ms    ← parse 365× SLOWER on GHA

Mechanism: `parse_json_off_runtime` dispatches to `rayon::spawn`, and
rayon's pool size is `num_cpus` (= 2 on GHA ubuntu-latest). Bumping
concurrency 64 → 256 queued 256 manifest parses behind 2 rayon
workers — head-of-line blocking. avg_parse jumped from ~10ms to 266ms
wall, dragging p1 utoo wall from 3.10s up to 3.33s.

Restore manifest-bench
----------------------
Brought back `crates/manifest-bench` (originally landed in the
post-#2818 driver hunt, dropped in af714eb once #2818 graduated). It's
a single-binary HTTP-only fetch tool that strips out the ruborist
pipeline (no BFS, no dedup, no parse, no project cache, no lockfile
write) — fires `GET <registry>/<name>` in parallel and reports the
same diag shape as the new `p1-breakdown` lines.

Goal: separate the network ceiling from the resolver pipeline so the
next round of p1 experiments (parse offload, partial parse, dedicated
parse pool, etc.) can be evaluated against a stable "pure network"
baseline.

Knobs (unchanged from the original drop):
    --concurrency N     sweep without rebuilding utoo
    --reps N            run same workload back-to-back
    --single-version    use /<name>/latest (smaller bodies)
    --user-agent X      UA-fingerprint experiments
    --http1-only        H2 vs H1 toggle
    --accept X          override Accept header

Same TLS stack as ruborist (rustls + aws-lc-rs, native roots).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
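For reference, the rayon hop at issue looks roughly like this (a
sketch with a hypothetical signature, not the actual
parse_json_off_runtime):

    use tokio::sync::oneshot;

    async fn parse_off_runtime(
        mut body: Vec<u8>,
    ) -> Result<simd_json::OwnedValue, simd_json::Error> {
        let (tx, rx) = oneshot::channel();
        rayon::spawn(move || {
            // With pool = num_cpus = 2 and 256 in-flight fetches, jobs
            // queue FIFO here: the head-of-line blocking behind the
            // 266ms avg_parse measured on GHA.
            let _ = tx.send(simd_json::to_owned_value(&mut body));
        });
        rx.await.expect("rayon worker dropped the result channel")
    }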
…inux

build-linux now also builds + uploads `manifest-bench` when a phases
bench is going to run (label or dispatch). bench-phases-linux
downloads the binary and runs it after the regular phase-isolated
benchmark.

Sweep mirrors the original (#2818-era) wire-in:
    concurrency: 32 / 64 / 96 / 128 / 192 / 256 (HTTP/1.1, full manifest)
    protocol:    H1 vs H2-negotiate (cap=128)
    endpoint:    full vs `/<name>/latest` (cap=128, smaller bodies)
    UA:          default vs `Bun/1.2.21` (cap=128)

Output goes to /tmp/pm-bench-output/manifest-bench-npmjs.log and
ships in the existing pm-bench-logs-linux artifact — no PR comment
surface (the headline phases bench comment stays the same).

Why now: the new ruborist `p1-breakdown` instrumentation showed
sum_parse on GHA can dominate when concurrency is bumped (256:
sum_parse 728s vs sum_request 193s). To attribute the bun-vs-utoo gap
on p1_resolve we need a "pure HTTP" baseline that strips out
ruborist's parse / BFS / dedup / lockfile path. manifest-bench is
that baseline: same TLS stack as ruborist (rustls + aws-lc-rs, native
roots), no resolver pipeline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI fetch-breakdown on GHA (run 25562552058, conc=64) showed parse
queueing on rayon dominates the gap to manifest-bench's pure-HTTP
baseline:
manifest-bench (pure HTTP, conc=64): 2.12s wall
utoo p1 (full ruborist): 3.10s wall ← +1.0s overhead
↑ sum_parse 95s vs sum_request 95s, parse 50% of work-time
↑ avg_parse 30ms wall vs ~5ms actual CPU — the 25ms extra is rayon
queue wait
Mechanism: 64 concurrent tasks all dispatching parse to rayon's pool
(size = num_cpus = 2 on GHA). Queue depth grows to ~32 per worker.
Each parse waits 25ms+ in queue before running its 5ms of CPU work.
Round 1 fix: inline parse, drop the rayon hop. simd_json on a tokio
worker thread is fast (~5ms for 115KB JSON), and the tokio runtime's
cooperative budget naturally rebalances CPU across the 64 tasks.
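The round 1 shape, sketched (call site assumed; FullManifest is the
project's typed target):

    // Round 1 sketch: parse inline on the tokio worker, no rayon hop.
    // `resp` is the reqwest::Response for the manifest GET.
    let mut body = resp.bytes().await?.to_vec();
    let manifest: FullManifest = simd_json::serde::from_slice(&mut body)?;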
Expected on next CI:
- avg_parse drops from 30ms wall → ~5-10ms wall (close to CPU-only)
- preload_wall drops from 5.4s → ~3.5-4s for cold runs
- p1 hyperfine wall drops from 3.10s → 2.3-2.5s, narrowing the gap
to manifest-bench's 2.12s ceiling
If parse becomes the new bottleneck (CPU-bound), next round could
look at partial parse / lazy field access. If wall doesn't drop,
hypothesis is wrong and we look elsewhere (BFS, dedup, lockfile).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Round 1 (inline parse) reverted on data: GHA showed a +0.37s p1
regression because parse blocked tokio runtime workers, dropping
eff_parallel 42 → 35 even though per-fetch work-time fell.
avg_request rose from 35ms to 52ms — symptomatic of socket reads
being delayed by the parsing task on the same worker.

    metric        round 0 (rayon)   round 1 (inline)
    p1 wall       3.27s             3.64s ⚠️ +0.37s
    avg_parse     30ms (queued)     300µs ✓
    avg_request   35ms              52ms ⚠️ +17ms (worker contention)
    eff_parallel  42                35 ⚠️

Round 2 attempts the third option: `tokio::task::spawn_blocking`.
- rayon's pool was too small (num_cpus = 2 on GHA) — 64 concurrent
  parses queued behind 2 workers, parse wall 30ms.
- inline parse held the tokio worker hostage during the simd_json
  call, starving in-flight socket reads.
- tokio's blocking pool has a much larger default cap (512), so 64
  concurrent parses never queue. Unlike rayon, there's no contention
  with the install path's parallel-write rayon usage. Unlike inline,
  the tokio runtime workers stay free to drive network I/O.

Expected on next CI:
- avg_parse drops to ~5-10ms wall (close to CPU floor, no queue)
- avg_request stays ~35ms (workers free for I/O)
- eff_parallel returns to ~50, possibly higher
- p1 wall drops toward manifest-bench's 2.10s ceiling

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
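Round 2's dispatch, under the same assumptions as the earlier sketch:

    // Round 2 sketch: same off-runtime intent, but on tokio's blocking
    // pool (default cap 512) instead of rayon's num_cpus-sized pool.
    async fn parse_off_runtime(
        mut body: Vec<u8>,
    ) -> Result<simd_json::OwnedValue, simd_json::Error> {
        tokio::task::spawn_blocking(move || simd_json::to_owned_value(&mut body))
            .await
            .expect("blocking parse task panicked")
    }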
Round 2 moved parse_json_off_runtime off rayon (-0.11s p1). But
fetch-breakdown still showed avg_request 41ms vs round 0's 35ms,
hinting at a second source of rayon contention.

Found it: `extract_core_version_off_runtime` is also on
`rayon::spawn`. On npmjs.org's `!supports_semver` path EVERY fetch
resolves through `resolve_via_full_manifest`, which fetches the full
packument once per package name (deduped via inflight_full) and then
calls `extract_core_version_off_runtime` per (name, spec) to
materialize the chosen version into a `CoreVersionManifest`. So per
fetch we hit rayon TWICE — once for the JSON parse (round 2 moved it
to spawn_blocking), and once for `get_core_version` (still on rayon).
The second hop has the same head-of-line blocking signature as the
first: 64 concurrent resolves dispatching to a 2-thread rayon pool.

Round 3: move extract_core_version_off_runtime to spawn_blocking for
the same reasons. The work is JSON lazy-reparse (`raw_json` sub-tree
decoding) — genuinely blocking, well-suited for tokio's blocking
pool.

Expected: utoo p1 wall drops further toward manifest-bench's 2.10s
ceiling. avg_request should fall back from 41ms → ~35ms (rayon
contention removed from the fetch task's await chain).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two changes for round 4 of p1 optimization:
1. Revert `extract_core_version_off_runtime` from spawn_blocking back
to rayon::spawn (round 3). Within-run measurement showed +0.42s
regression vs utoo-next (round 2 was +0.11s). Likely cause: this
function is called per (name, spec), so multi-spec packages call
it 2-5x per fetch. spawn_blocking's per-dispatch overhead exceeds
rayon queue savings at this multiplier.
2. Add `serialize_us` and `cache_export_us` to the p1-breakdown line
so we can attribute the remaining gap. Currently:
manifest-bench wall: 2.10s (pure HTTP ceiling)
utoo p1 wall (round 2): 3.16s
gap: 1.06s
We have:
preload_wall ≈ 2.7s (logged)
bfs_wall ≈ 0.3s (logged)
serialize_us ?
cache_export_us ? ← suspected: full manifest deep-clone
into ProjectCacheData for ~2730 entries
Next round will have data to choose between attacking serialize,
cache export, or the BFS loop body.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Round 4 measured serialize_us = 15ms and cache_export_us = 34ms — both
tiny — confirming the 1s gap from manifest-bench (utoo p1 = 3.16s vs
mb wall = 2.10s) is not in post-build code.
Per-fetch math also pointed at main-loop bookkeeping:
    manifest-bench: eff_parallel = 52 (sum_work 111s / wall 2.14s)
    utoo preload:   eff_parallel = 43 (sum_work 120s / wall 2.85s)
Same conc=64 cap, but utoo loses 9 effective slots — most likely
the main loop's serial bookkeeping (dedup hash insert, format!
key, extract_transitive_deps, queue push, 3-4 receiver events)
holds the flow between futures.next() returning and the next
fetch dispatch.
This commit splits the main loop into two timed segments:
preload_loop_dispatch_us: time spent in the `while in_flight <
concurrency` block — popping pending,
dedup check, futures.push.
preload_loop_result_us: time spent processing each completed
future — extract_transitive_deps,
pending.extend, on_manifest.
If dispatch+result sum approaches preload_wall, the main loop is
the bottleneck and we need to either (a) split processing onto a
dedicated task, or (b) use unbounded futures with a downstream
consumer. If they're small, the gap is elsewhere (per-task
overhead in resolve_package's inflight gates).
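A sketch of the split (accumulator variables and helpers like
dep_key/spawn_fetch are hypothetical stand-ins):

    use std::time::Instant;

    // Dispatch segment: pop pending, dedup, push futures.
    let t = Instant::now();
    while in_flight < concurrency {
        let Some(dep) = pending.pop() else { break };
        if seen.insert(dep_key(&dep)) {
            futs.push(spawn_fetch(dep));
            in_flight += 1;
        }
    }
    preload_loop_dispatch_us += t.elapsed().as_micros() as u64;

    // Result segment: process one completed future.
    if let Some(result) = futs.next().await {
        let t = Instant::now();
        in_flight -= 1;
        pending.extend(extract_transitive_deps(&result));
        on_manifest(result);
        preload_loop_result_us += t.elapsed().as_micros() as u64;
    }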
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Round 5 main-loop instrumentation showed the preload main loop itself
is fast (15-25ms total dispatch+result). The 0.8s gap from
manifest-bench's 2.10s wall lives INSIDE the spawned fetch tasks.

Per-fetch wall (warm runs):
    measured: avg_request 30ms + avg_body 6ms + avg_parse 2.5ms = ~38ms
    derived:  preload_wall 2.4s × eff_parallel(43) / 2730 = 38ms
    delta:    ~12ms unaccounted per task

That 12ms is `extract_core_version_off_runtime` queueing on rayon's
2-thread pool. extract is called per (name, spec) — for ant-design
that's ~3000+ calls. With pool=2 and 64 concurrent fetches each
dispatching extract, the queue depth grows; each task waits its turn
before extract returns.

Bump rayon pool to `max(num_cpus, 8)` for non-Windows. Sizing the
pool above the CPU count for short blocking JSON ops (parse + extract)
replaces FIFO queueing with parallel dispatch. Real CPU contention is
bounded by num_cpus (the kernel scheduler still gates), so the extra
pool threads just hold ready-to-run dispatches in parallel rather
than serialised in a queue.

Why not just spawn_blocking (round 3 attempt): tokio's blocking pool
defaults to 512 threads, but its per-dispatch overhead was higher
than rayon's even when queueing — round 3 regressed by 0.5s.

Expected: extract queue wait drops from ~12ms to ~1-2ms wall, p1
preload_wall narrows toward manifest-bench's 2.10s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
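The bump, sketched (the actual wiring inside utoo is assumed):

    // Size the global rayon pool above num_cpus so short blocking JSON
    // ops dispatch in parallel instead of queueing FIFO behind 2 workers.
    #[cfg(not(windows))]
    fn init_rayon() {
        let threads = std::cmp::max(num_cpus::get(), 8);
        rayon::ThreadPoolBuilder::new()
            .num_threads(threads)
            .build_global()
            .expect("rayon global pool initialized twice");
    }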
Adds `BuildDepsOptions::skip_preload` so callers without a pipeline
consumer (utoo deps / package-lock-only) can drop the up-front
preload phase entirely. BFS now batches prefetch per level across the
whole frontier, then runs the existing sequential process_dependency
walk against the warmed cache.

For install paths (Context::pipeline_deps_options), skip_preload
stays false so PackageResolved events still feed the download/clone
pipeline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds resolver::fast_preload, a manifest-bench-style flat
FuturesUnordered over service::manifest::fetch_full_manifest. It
warms MemoryCache (both full_manifests and version_manifests slots)
synchronously after each fetch, so the BFS phase is pure cache-hit:
no rayon hop on extract_core_version, no OnceMap gates, no
DiskManifestStore writes, no PackageResolved events.

Wired into service::api::build_deps: when the caller asks to skip
preload (Context::build_deps for `utoo deps`) and there's no warm
project cache, fast_preload runs ahead of build_deps_with_config.
Install paths still go through preload_manifests so the pipeline
keeps its early-start signal.

Also reverts the per-level prefetch I added in 394f6c9 — with
fast_preload pre-warming everything, BFS doesn't need its own
prefetch wave.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
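Reduced to a sketch, the loop shape looks like this (fetch_and_warm
is a hypothetical stand-in for fetch_full_manifest plus the
synchronous cache warm; it returns newly discovered transitive dep
names):

    use futures::stream::{FuturesUnordered, StreamExt};
    use std::collections::HashSet;

    async fn fast_preload(seed: Vec<String>, cap: usize) {
        let mut pending = seed;
        let mut seen: HashSet<String> = HashSet::new();
        let mut futs = FuturesUnordered::new();
        loop {
            // Refill to the concurrency cap from the pending frontier.
            while futs.len() < cap {
                let Some(name) = pending.pop() else { break };
                if seen.insert(name.clone()) {
                    futs.push(fetch_and_warm(name));
                }
            }
            match futs.next().await {
                Some(new_deps) => pending.extend(new_deps),
                None => break, // frontier exhausted, nothing in flight
            }
        }
    }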
v1 of fast_preload called settle_spec inline on the tokio worker —
each settle ran simd_json::to_borrowed_value over the full manifest's
raw bytes (5–10ms per spec) right on the runtime thread. CI showed it
starved sibling fetches: avg_request rose +3ms, avg_parse jumped
5→11ms, p1_resolve regressed +1.0s vs the preload+BFS baseline (4.0s
vs 3.0s).

Fix: route every settle through extract_core_version_off_runtime (the
same rayon::spawn helper the BFS path uses), and merge fetch and
settle completions into a single FuturesUnordered so backpressure on
either side throttles the other. Sibling specs that arrived during a
fetch are now stashed by name (HashMap, not linear scan), then
dispatched as their own settle futures when the fetch lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Standalone manifest-bench HTTP-only sweep (npmjs, h1) shows wall
bottoming at concurrency=96 (1817ms). The earlier 256 regression was
caused by rayon-queued parses behind 2 workers — no longer relevant
now that fetch parse is on spawn_blocking and settle is
rayon-dispatched off the runtime.

fast_preload's wave-shaped transitive walk currently runs at
eff_parallel ~35 against the 64 cap because pending refills lag
settles; raising the cap to 96 gives headroom for sustained in-flight
on the deep waves without crossing the npmjs per-IP tail-latency
cliff that conc 128+ trips.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… path

`UnifiedRegistry::resolve_version_manifest`'s first cache check
(service/registry.rs:347) keys on `(name, spec)` — the original spec
string the caller passed, e.g. `^4.0.0`. settle_future was only
populating `(name, resolved_version)` (e.g. `4.17.21`), so on every
BFS edge for `lodash@^4.0.0`-style specs the warm path missed and
fell into the OnceMap inflight gate + `resolve_via_full_manifest`
re-walk before recovering the manifest from the
`(name, resolved_version)` slot we'd already set.

Now settle writes both keys so BFS hits the early-return at
service/registry.rs:347 with no further dispatch. Saves ~1
OnceMap+resolve_target_version round-trip per unique (name, spec) the
BFS encounters (≈3000 calls on ant-design-x).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
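The dual-key write, sketched against an assumed MemoryCache API (the
method name is hypothetical):

    // Settle now populates BOTH slots, so the first cache check at
    // service/registry.rs:347 hits on the caller's original spec string.
    fn warm_both_keys(
        cache: &MemoryCache,
        name: &str,
        original_spec: &str,    // e.g. "^4.0.0" (what BFS looks up)
        resolved_version: &str, // e.g. "4.17.21"
        manifest: CoreVersionManifest,
    ) {
        cache.insert_version_manifest(name, original_spec, manifest.clone());
        cache.insert_version_manifest(name, resolved_version, manifest);
    }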
Previous fast_preload (v2) dispatched primary settles to rayon as
separate FuturesUnordered futures. CI breakdown showed eff_parallel
~44 against the conc=96 cap — the wave-shaped transitive walk was
held back by settle dispatch RTT: each fetch landed → primary settle
queued → settle popped → only then did `pending` get transitive deps
and fill the next dispatch wave.

v3 folds the primary settle into the fetch task itself via
`tokio::task::spawn_blocking`. The fetch task does the network
round-trip and the primary version-extract on the same blocking pool
slot, then returns with the resolved CoreVersionManifest attached.
Main loop pulls one Fetched event, immediately extends `pending` — no
second `next().await` to wait through the queue. Sibling specs (rare;
same name, different range) still go through the rayon settle_future
path so the primary path stays lean.

Carries primary_spec through FastEvent so the fused path can populate
both `(name, primary_spec)` and `(name, resolved_version)` cache
slots — preserves the 6455852 BFS fast-path win.

FetchOutcome enum replaces by-value FetchManifestResult to avoid a
full FullManifest clone (HashMap+Vec) per fetch event.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…json
The fast_preload hot path was paying TWO simd_json passes per
manifest:
1. fetch_full_manifest's parse_json_off_runtime did a typed
simd_json::serde::from_slice<FullManifest> (envelope + IgnoredAny
visitor on `versions` keys, ~3-5ms on a 100KB body).
2. Primary settle re-parsed the same raw bytes with
simd_json::to_borrowed_value (~5-10ms) to extract one version's
subtree.
Both passes went through simd_json's Tape constructor — duplicated
work. CI showed avg_parse 5-7ms × 2700 fetches = 14-19s of CPU sum
on 2-core GHA, where the spawn_blocking pool's overlapping schedule
masked some of the cost but not all.
Adds `service::manifest::fetch_full_manifest_with_settle`: same HTTP
+ retry + ETag machinery as `fetch_full_manifest`, but the parse
step does ONE `to_borrowed_value` and extracts:
* envelope (`name`, `dist-tags`, `versions` keys) into FullManifest
manually (no typed serde), and
* the resolved version's subtree as a typed CoreVersionManifest
(serde-deserializing that single subtree via the borrowed value).
fast_preload's fetch task switches to this entry point — primary
settle is now a free byproduct of the fetch parse, not a separate
`to_borrowed_value` pass. Sibling specs (same name, different
range) still go through the rayon settle_future path.
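A sketch of the single-pass extraction (simplified; the real
fetch_full_manifest_with_settle also pulls name and dist-tags into
the envelope):

    use simd_json::prelude::*;
    use simd_json::BorrowedValue;

    // One to_borrowed_value pass: version keys for the envelope, plus
    // the resolved version's subtree for the typed settle.
    fn parse_with_settle<'a>(
        buf: &'a mut [u8],
        resolved: &str,
    ) -> Option<(Vec<String>, BorrowedValue<'a>)> {
        let doc = simd_json::to_borrowed_value(buf).ok()?;
        let versions = doc.get("versions")?;
        let keys: Vec<String> = versions
            .as_object()?
            .keys()
            .map(|k| k.to_string())
            .collect();
        // Deep-copies the node structure; string data still borrows `buf`.
        let subtree = versions.get(resolved)?.clone();
        Some((keys, subtree))
    }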
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After 671ac98's combined-parse fetch path eliminated the double
simd_json pass, the spawn_blocking pool's contention ceiling rose
enough that bumping concurrency past 96 no longer queues parses
behind the 2-core CPU. manifest-bench's most recent good-network
sweep on GHA showed conc=128 hitting 1500ms vs conc=96 at 1566ms —
small but real headroom for fast_preload's late-wave saturation now
that initial waves fill faster.

Risk: on slower-network runs (npmjs per-IP throttle), conc=128 widens
p99. Earlier conc-sweep data was mixed — accepting that variance for
the average-case improvement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
542d7f1's conc=128 bench landed in a slow-network run (mb best 2010ms
vs 1500ms in the prior good-network run; bun also bumped to 2.14s vs
1.83s). The adjusted gap to mb best stayed flat (~700ms either way),
so conc=128 didn't beat 96 across runs. Picking 96 as the
conservative default: it is at or near best on every GHA run we've
measured, never the worst, and leaves headroom for npmjs's per-IP
throttling to absorb without compounding p99.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…preload)
Adds resolver::mb_resolve module + service::build_deps_mb entry point
as a parallel-track alternative to fast_preload, structured to
match manifest-bench's main-loop shape as closely as correctness
allows. Hypothesis under test: fast_preload's eff_parallel caps at
~50/96 because the FastEvent enum match + cache writes + sibling
deferred bookkeeping in the main loop competes with tokio runtime
workers for the 2 CPU cores on GHA, stalling socket I/O drive.
mb_fetch pushes ALL per-fetch work into the spawned future itself
(including cache writes), so the main loop is reduced to:
    while let Some(deps) = futs.next().await {
        pending.extend(deps);
        refill_to_cap(...);
    }
Sibling specs (multiple ranges on same package) are NOT deferred at
queue level — racing fetches for the same name both proceed. The
race converges naturally: first fetch to land populates
full_manifests, subsequent racers find the cache hit on entry and
short-circuit to a sibling-style settle. Wastes ~5-50 network
requests in real workloads but eliminates the HashMap probe + drain
overhead from the hot loop.
Wired in via UTOO_RESOLVE=mb env var:
- Context::build_deps (utoo deps) routes through build_deps_mb
- pipeline::resolve_with_pipeline (utoo install) also routes
through it; pipeline workers still start but don't pipeline
during fetch (mb_fetch emits no PackageResolved events) — install
becomes phase-sequential, useful for resolve-phase A/B.
bench script enables UTOO_RESOLVE=mb so CI measures the new path
against existing baselines (utoo-next/utoo-npm/bun ignore the env
var). Comment the export line to A/B back against fast_preload.
Old fast_preload + UnifiedRegistry paths untouched.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
v1/v2 ran parse work in spawn_blocking inside each fetch future,
which competed with tokio runtime workers for the 2 GHA cores. CI
showed eff_parallel capped at 47/96 vs manifest-bench standalone's
75/96 on the same box. Hypothesis: parse CPU starves socket drive.

v3 separates the two phases:
* PHASE 1 — `mb_style_pure_fetch` is a structural copy of
  `manifest-bench`'s main loop: the future body does ONLY GET + body
  recv, refilling 1-for-1 on completion. Zero per-future CPU work, so
  tokio runtime workers retain full CPU for socket drive.
* PHASE 2 — bulk rayon par_iter parse: for each body, parse the
  `FullManifest` envelope via simd_json::to_borrowed_value, resolve
  every queued spec for this name against the just-parsed manifest,
  populate cache slots, collect transitive deps. Runs off the tokio
  runtime entirely (spawn_blocking → rayon par_iter).

Phases alternate until pending is exhausted. Typical project: 3-5
iterations as the dep tree fans out wave by wave.

The point of the split is the `phase1_http_wall` trace — measured in
isolation from any parse work, it should match manifest-bench's
standalone wall (~1.5-2.0s for 2733 names @ conc=96). If it does, the
remaining gap to mb is concentrated in phase 2 work, which is
inherent to discovering transitive deps from a non-flat name list.

Tracing per iteration:
    p1-breakdown mb_fetch iter=N phase1_http_wall=Xms n=Y bytes=Z
    p1-breakdown mb_fetch iter=N phase2_parse_wall=Xms settles=Y new_transitives=Z
    p1-breakdown mb_fetch total_wall=Xms iters=Y

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
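The alternation, sketched (mb_style_pure_fetch and parse_and_settle
are hypothetical stand-ins for the phase bodies described above):

    use rayon::prelude::*;

    // Alternate: pure-network wave, then bulk parse off the runtime.
    while !pending.is_empty() {
        // PHASE 1: futures do GET + body recv only; zero CPU work.
        let bodies: Vec<(String, Vec<u8>)> =
            mb_style_pure_fetch(&client, std::mem::take(&mut pending), 96).await;

        // PHASE 2: parse + settle + cache writes on rayon, behind one
        // spawn_blocking hop so the tokio workers stay free.
        let new_deps: Vec<(String, String)> = tokio::task::spawn_blocking(move || {
            bodies
                .into_par_iter()
                .flat_map(|(name, mut body)| parse_and_settle(&name, &mut body))
                .collect()
        })
        .await
        .expect("phase-2 parse task panicked");

        pending.extend(new_deps.into_iter().filter(|d| seen_specs.insert(d.clone())));
    }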
v3 dropped the (name, spec) HashSet from v1/v2 on the assumption that
name-level dedup via done_names was sufficient. It wasn't:
sibling-settle's extract_transitive can re-introduce specs we've
already settled (peer/optional dep cycles trivially trigger this), so
the outer while-loop never terminated. CI 25589397823 hung on `Run
phase-isolated benchmark · npmjs` for ~25 min before being
cancelled — the bench's first utoo p1_resolve hyperfine run got stuck
in an infinite settle loop.

Fix: maintain `seen_specs: HashSet<(String, String)>` across all
iterations; filter both the initial seed and every wave of new
transitives through it before extending pending_specs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
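The fix, in sketch form (variable names from the commit; the
surrounding wave loop is assumed):

    use std::collections::HashSet;

    // Lives OUTSIDE the wave loop, so a spec settled in iteration N
    // can never be re-enqueued by a sibling settle in iteration N+k.
    let mut seen_specs: HashSet<(String, String)> = HashSet::new();

    // Applied to the initial seed and to every wave of new transitives:
    pending_specs.extend(
        new_transitives
            .into_iter()
            .filter(|spec| seen_specs.insert(spec.clone())),
    );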
New crate `crates/preload-bench/` is a fully-standalone bench that:

* Uses the SAME HTTP setup as `manifest-bench` (own reqwest::Client
  built per rep with aws-lc-rs TLS, pool_max_idle_per_host(256), no
  proxy, default DNS, no retry, h1_only).
* Discovers names by walking transitive deps from a package.json
  root — instead of consuming a flat name list like manifest-bench.
* Per-future does GET + body recv + spawn_blocking parse → returns
  transitive deps → main loop refills on completion.
* No dependency on ruborist or any utoo internals (own simd_json, own
  dedup, own everything).

The point: prove (or disprove) that a fully ruborist-independent
streaming preload can hit standalone manifest-bench's wall on the
same workload. ruborist's path runs at ~2.18s for ant-design's ~2700
names; manifest-bench standalone runs the same workload at ~1.6s. The
gap could be in any number of things — DNS layer, retry, pool config,
parse-CPU contention, registry single-flight gates. preload-bench
eliminates all of those simultaneously so we can read the wall
directly.

Wired into bench-phases-linux: builds + uploads the preload-bench
binary alongside manifest-bench, then runs a conc=64/96/128 sweep
against the same project after the standalone manifest-bench sweep.
The bench script reverts UTOO_RESOLVE=mb so utoo runs default
fast_preload — giving a third datapoint (utoo wall on the integrated
path) alongside manifest-bench (HTTP-only ceiling) and preload-bench
(streaming-with-walk ceiling).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…y path
Step 1 of staged service-layer ablation. Rewrites mb_resolve as a
fully self-contained streaming preload mirroring preload-bench's
loop shape verbatim, but living inside ruborist so it can populate
MemoryCache for the BFS phase.
Bypasses every other ruborist service layer:
* service::http::get_client — own reqwest::Client built per call,
no global LazyLock, no shared_resolver dns layer, no
connect_timeout, pool_max_idle_per_host(256).
* service::manifest::fetch_full_manifest_with_settle — own GET +
body.bytes() + spawn_blocking(simd_json::to_borrowed_value),
no RetryIf, no FETCH_TIMINGS.
* service::registry::UnifiedRegistry — no OnceMap, no
ManifestStore, no EventReceiver.
Only service::* touched is MemoryCache writes (DashMap inserts) so
BFS has data to read from.
PM is unaware: dispatch happens entirely inside
service::api::build_deps when skip_preload=true and no warm cache.
Removes the previous UTOO_RESOLVE=mb env-var gating from
pm::helper::ruborist_context::Context::build_deps and
pipeline::resolve_with_pipeline. Removes the now-unused
service::api::build_deps_mb sibling entry point.
Expected: utoo p1_resolve drops from ~2.67s toward preload-bench's
~2.57s (or better since ruborist fetches fewer names than
preload-bench). The remaining gap to mb's ~1.99s would isolate
incremental layer effects we add back next:
- tokio runtime config / cooperative scheduling
- reqwest::Client provider differences (TLS, DNS)
- cache layer (DashMap vs DiskManifestStore reads on the cold path)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mb_fetch
Step 2 of staged service-layer ablation. Targets the two gaps
left after step 1:
1. mb_fetch (in ruborist): 2300ms / 2735 = 0.84 ms/name
manifest-bench (standalone): 2010ms / 2735 = 0.72 ms/name
~290ms gap on same workload, same conc.
2. BFS phase: 305ms wall against a fully-warm MemoryCache.
Origin unclear — could be graph mutations, repeated cache
lookups via the inflight gate, or event dispatch.
Changes:
* TLS provider — adds rustls (aws-lc-rs) + rustls-native-certs to
non-wasm-non-macos targets. mb_resolve's `build_mb_client` now
uses `use_preconfigured_tls(aws_lc_rs)` matching
preload-bench / manifest-bench exactly. The reqwest crate's
`rustls-tls-native-roots` feature on Linux still bundles ring
for service::http's global client; the two providers coexist.
* mb_fetch instrumentation — per-future `wall_us` (network +
parse + cache writes) and `net_us` (network only) reported in
the trace line as `eff_par_full`, `eff_par_net`, `avg_wall`,
`avg_net`. Same shape as manifest-bench's `avg_conc` so we can
compare directly.
* BFS instrumentation — splits run_bfs_phase wall into:
- `collect_us`: collect_unresolved_edges sum
- `resolve_us`: process_dependency .await sum
- `event_us`: post-resolve event dispatch (Resolved /
PackagePlaced / Reused / Skipped) sum
Plus `levels` and `edges` counters. Trace line lets us
attribute the 305ms.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Step 3 of staged service-layer ablation. Targets the 305ms BFS phase
observed against a fully-warm MemoryCache — 100% attributed to the
process_dependency.await sum (graph mutations) per d9fb207's new bfs
instrumentation.

Adds:
* `process_dependency_with_resolved` in builder.rs — sync variant of
  process_dependency for the registry-resolved case. Skips
  spec-routing (only Registry handled), skips resolve_registry_dep
  (resolved is the parameter), skips override re-resolve. Reuses
  existing helpers (find_compatible_node, create_package_node,
  add_edges_from, mark_dependency_resolved,
  update_node_type_from_edge).
* `mb_fetch_with_graph` in mb_resolve.rs — folded streaming preload +
  graph build. Each fetch result triggers inline
  process_dependency_with_resolved for every parent edge waiting on
  (name, spec). New nodes' edges feed back into pending /
  edge_targets, so the walk continues streaming-style. CPU work
  (graph mutations, ~305ms total) overlaps with network IO
  (mb_fetch's wall ~2.4s).

Wires `service::api::build_deps` to use mb_fetch_with_graph for the
lockfile-only path (skip_preload + cold cache). The follow-up
build_deps_with_config still runs to handle any non-registry edges
left unresolved (workspace / git / http / file); on registry-only
workloads it's a near no-op. Install path unchanged —
pipeline_deps_options keeps preload + the PackageResolved early-start
signal for tgz download.

Expected: utoo p1 wall drops from ~2.76s toward mb_fetch wall +
serialize ≈ 2.4-2.5s on good network.

Tracing line:
    p1-breakdown mb_fetch_with_graph wall=Xms ok=N fetch=N settle=N
    sum_wall=Xms sum_net=Xms sum_graph=Xms avg_net=Xus
    eff_par_full=N.N eff_par_net=N.N unresolved_targets=N

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
c02bb15 had unresolved_targets=583 in the trace —
`enqueue_node_edges` was unconditionally pushing (parent, edge_id)
into edge_targets without checking whether the (name, spec) was
already cached. When a later transitive's edge referenced an
already-fetched (name, spec), no fetch result would land to drain
that bucket — the parent edges sat unresolved, potentially missing
packages from the lockfile.

Fix: enqueue_node_edges now checks cache.get_version_manifest first.
Cache hit → process_dependency_with_resolved inline (with a
work_stack to recurse into newly-Created nodes' edges). Cache miss →
original behavior (stash in edge_targets, push to pending).

Side effect: more inline graph mutation work in the seed phase
(workspace + root edges that hit warm cache from previous specs in
the same root). Should reduce the number of fetch-result events that
need to do graph mutations downstream, since orphan edges no longer
accumulate.

Targets the correctness bug from the c02bb15 trace; perf impact TBD.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 700ms gap between utoo p1 (folded mb_fetch_with_graph) and
manifest-bench standalone needs network-layer evidence. Same
workload, same conc, same network — so why does utoo's wall trail by
700ms when per-fetch latency is roughly matched (utoo avg_net=53us vs
mb p50=40us)?
Hypotheses to test via pcap diff:
* Fewer concurrent TCP streams in flight at any moment (utoo's
main loop CPU steals tokio dispatch capacity → in-flight count
drops below conc cap)
* More TLS handshakes (utoo's connection pool isn't reusing as
effectively as mb's per-rep fresh client)
* Larger inter-packet gaps per stream (utoo's runtime pauses mid
download)
* Different concurrent-stream-time profile (wave shape)
Adds two captures at end of pm-bench-pcap.sh:
manifest-bench-c96 — flat lockfile-derived names @ conc=96
preload-bench-c96 — transitive walk @ conc=96 (matches utoo's
walk shape, but no graph build)
Each captured with the same tcpdump + iostat as the existing
utoo / utoo-next / bun captures. analyze_pcap globs *.pcap so the
new files get the same TCP signal extraction (zwin / retx /
dup_ack / per-stream gap p50/p99/max / distinct streams).
Workflow: downloads manifest-bench-linux-x64 +
preload-bench-linux-x64 artifacts (built by build-linux's
benchmark-label conditional steps) into the pm-bench-pcap-linux
job env so pm-bench-pcap.sh can find them.
Trigger: workflow_dispatch with target=pm-bench-pcap.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous pm-bench-pcap artifact was 2GB (raw .pcap files for every PM × phase × bench), making the round-trip download impractical just to read JSON metrics. Adds a separate `pm-bench-pcap-summaries` artifact containing only the *.json / *.log / *.iostat.txt / dns.txt files — KB scale, downloads in seconds. Raw pcap artifact is preserved for cases where we want to re-run tshark with different filters. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pm-bench-pcap artifact is ~2 GB (pcap binaries dominate). gh run
download keeps timing out before completion. Two fixes:

1. New `pm-bench-pcap-summaries` artifact uploads only the JSON
   summaries + .log + iostat.txt + dns.txt (small, fast download).
   The full pcap artifact stays for deep inspection when needed.
2. End of pm-bench-pcap.sh prints a tab-separated comparison table
   (name, wall_s, packets, streams, zwin, retx, dup_ack, gap_p99_us,
   gap_max_us) to stdout, so the data is visible in the CI run log
   without downloading anything.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…raph

The pcap evidence (utoo-resolve zwin=71 vs mb-c96 zwin=49) confirmed
the main loop's CPU was starving tokio runtime workers from polling
sockets. Inline graph mutations (sum_graph=450ms across the fetch
loop) blocked the worker between awaits, so TCP receive buffers
filled and the server paused sending — directly extending wall.

This refactor:
* Spawns `graph_worker` as a separate tokio task (gets its own
  runtime worker thread on a multi-thread runtime). Owns the
  DependencyGraph + edge_targets + seen_specs.
* Main loop owns FuturesUnordered + body_cache + dispatch state. No
  graph mutations on this path.
* mpsc channels: main → graph (FetchEventMsg, just the name — cache
  writes already happen in the future), graph → main (Vec<Dep> new
  pending specs to extend the fetch queue).
* `tokio::select!` with `biased` drains specs first to unblock fetch
  dispatch.
* `in_flight_graph` counter tracks outstanding messages to the graph
  worker — termination = futs empty + in_flight_graph == 0.

Function signature changed: takes `mut graph: DependencyGraph` by
value, returns `(DependencyGraph, MbFetchStats)` since the worker
task needs ownership of the graph (can't borrow across spawn). The
api.rs caller threads the graph through.

Expected: zwin drops back toward mb's ~49 (no more main loop
starvation), eff_par_net climbs from 56 toward mb's 72, wall saves
~200ms.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
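The main-loop shape after the split, sketched (channel names
hypothetical; the real termination condition tracks in_flight_graph
exactly as described above):

    loop {
        tokio::select! {
            biased; // drain graph output first so fetch dispatch never starves

            Some(new_specs) = spec_rx.recv(), if in_flight_graph > 0 => {
                in_flight_graph -= 1;
                pending.extend(new_specs);
                refill_to_cap(&mut futs, &mut pending, cap);
            }

            Some(outcome) = futs.next() => {
                in_flight_graph += 1;
                // Cache writes already happened inside the fetch future;
                // only the name crosses to the graph worker.
                fetch_tx
                    .send(FetchEventMsg { name: outcome.name })
                    .await
                    .expect("graph worker channel closed");
            }

            else => break, // futs empty, nothing outstanding at the worker
        }
    }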
Plumb the PipelineReceiver through the folded mb_fetch_with_graph
path so install (`utoo install`) gets the same channel-separated
fetch + graph architecture as `utoo deps`, with download/clone
pipelines starting as early as the legacy preload+BFS path:

- mb_fetch_with_graph now takes Arc<R: EventReceiver + 'static>; the
  main loop emits PackageResolved on each fetch land (looked up via
  cache with the new FetchOutcome::primary_spec), and graph_worker
  emits PackagePlaced on ProcessResult::Created.
- service::api::build_deps wraps the caller-supplied receiver in Arc
  once and shares it between mb_fetch_with_graph and
  build_deps_with_config; adds a + 'static bound on R.
- pipeline_deps_options sets skip_preload=true so install routes
  through the same folded path as the lockfile-only command.

CI will validate that p1 resolve continues at/below 2.5s while
p0_full_cold and p3_cold_install do not regress (download + clone
pipelines remain saturated via emitted events).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously spawn_fetch / spawn_settle used the raw dep key as both
the registry path segment and the cache lookup key. For an npm-alias
dep like `"ms": "npm:raw-body@2.1.3"` this hit `registry/ms` instead
of `registry/raw-body`, parsed ms's manifest against
`npm:raw-body@2.1.3`, and ultimately installed the real ms into
`node_modules/ms/` rather than raw-body. e2e `utoo-pm.sh:466`
("top-level ms should be raw-body") caught this on d1cf53e.

Fix:
- spawn_fetch / spawn_settle call `normalize_spec` to split out the
  real package name + spec; the URL hits `registry/{real_name}` and
  the combined parse runs against `real_spec` so version resolution
  sees the right manifest envelope.
- Cache writes go under both keys: the original
  `(alias_name, alias_spec)` so `graph_worker` finds the manifest via
  `edge_targets`, and the normalized `(real_name, resolved_version)`
  for direct-dep dedup.
- Main loop dedup state (in_flight_names / deferred_by_name /
  body_cache) keys by real_name so two distinct aliases pointing at
  the same registry package share dedup; deferred entries store
  `(alias_name, spec)` so the drain spawns spawn_settle with the
  correct cache key.
- Adds `real_name` to FetchOutcome so the deferred-drain step can
  look up by real name without re-normalizing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GHA ubuntu-latest is a 2-core runner; tokio's default worker_threads
= num_cpus = 2. The install hot path multiplexes four concurrent task
families on those workers:

- mb_fetch_with_graph main loop (drives sockets + FuturesUnordered)
- graph_worker (`tokio::spawn`, CPU-heavy)
- pipeline download workers (PackageResolved → tarball fetch)
- pipeline clone workers (PackagePlaced → hardlink/clonefile)

Under that load, graph_worker can monopolize a worker thread for tens
of ms at a stretch, starving the main loop's socket polling. The
symptom is a wave-shape collapse (eff_par_full 73-77 → 40, mb_fetch
wall 4-6s → 10s+) that pushes the p0_full_cold tail by 3-5s on the
affected run.

Floor worker_threads at `max(num_cpus, 4)` so the runtime always has
headroom to keep the resolve hot path on its own worker even when the
install pipeline saturates the others. No-op on 4+ core machines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
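The floor, sketched (placement in utoo's runtime setup assumed):

    // Keep at least 4 runtime workers so the resolve hot path is not
    // starved when graph + pipeline tasks saturate 2-core runners.
    let rt = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(std::cmp::max(num_cpus::get(), 4))
        .enable_all()
        .build()
        .expect("failed to build tokio runtime");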
graph_worker is pure CPU + channel IO; on a multi-thread tokio
runtime it sat on a worker thread that the install pipeline (download
+ clone + extract) and the resolve main loop were also competing for.
Under the 2-core GHA ubuntu runner, that contention produced the
eff_par_full collapse (73-77 → 40) and the 4-6s → 10s+ mb_fetch wall
on the p0/p1 outlier runs.

Move it to `tokio::task::spawn_blocking`:
- Convert graph_worker from an `async fn` to a sync `fn`; channel IO
  uses `mpsc::Receiver::blocking_recv` and
  `mpsc::Sender::blocking_send`, which are tokio-supported when
  called outside an async context.
- The spawn site wraps it in a `move ||` closure so spawn_blocking
  owns the captured state. `graph_handle.await` keeps the same
  shape — a spawn_blocking JoinHandle is awaitable.

The blocking pool defaults to 512 threads, so reserving one slot for
graph_worker has no scheduling effect on the download/clone/extract
spawn_blocking calls. Net effect: the resolve worker can no longer be
preempted by graph mutation bursts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
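Worker and spawn site, sketched (types simplified;
apply_fetch_to_graph is a hypothetical stand-in for the graph
mutation step):

    // Sync worker: pure CPU + blocking channel IO on the 512-slot
    // blocking pool, never on a runtime worker.
    fn graph_worker(
        mut graph: DependencyGraph,
        mut fetch_rx: tokio::sync::mpsc::Receiver<FetchEventMsg>,
        spec_tx: tokio::sync::mpsc::Sender<Vec<Dep>>,
    ) -> DependencyGraph {
        while let Some(msg) = fetch_rx.blocking_recv() {
            let new_specs = apply_fetch_to_graph(&mut graph, msg); // graph mutations
            if spec_tx.blocking_send(new_specs).is_err() {
                break; // main loop gone
            }
        }
        graph
    }

    // Spawn site: same awaitable JoinHandle shape as tokio::spawn.
    let graph_handle =
        tokio::task::spawn_blocking(move || graph_worker(graph, fetch_rx, spec_tx));
    let graph = graph_handle.await.expect("graph worker panicked");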
Optimal-hypothesis A/B from PR #2920's analysis:

PR #2920's install integration sets `skip_preload=true` on
`pipeline_deps_options`, so `utoo install` routes through channel
`mb_fetch_with_graph` and emits PackageResolved at full fetch rate.
Bench data showed this floods the download pipeline with concurrent
tarball requests that fight extract workers for blocking-pool slots —
a net p3_cold_install regression of ~0.7s vs the legacy install path
(5.95s) and Plan B's inline+spawn_blocking variant (5.75s).

Drop `skip_preload=true` from `pipeline_deps_options`. `utoo install`
now routes through legacy preload+BFS, recovering the p3 win. `utoo
deps` (lockfile-only) still hits channel `mb_fetch_with_graph` via
`Self::build_deps`, which sets `skip_preload=true` independently —
keeping the p1 = 2.78s win.

Expected outcome:
- p1 (utoo deps): unchanged from PR #2920 ≈ 2.78s
- p3 (utoo install): reverts to legacy install ≈ 5.95s, vs PR #2920's
  6.70s
- p0 (full cold): roughly Plan A's 7.95-8.79s

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Code Review
This pull request introduces significant performance optimizations to the manifest resolution and preloading pipeline, specifically targeting lockfile-only workloads. It adds standalone benchmarking tools, manifest-bench and preload-bench, to isolate and measure network and resolution overhead. Key improvements include a folded streaming graph build that integrates preloading with graph construction, switching JSON parsing to tokio::task::spawn_blocking to prevent runtime worker starvation, and increasing default fetch concurrency. Detailed instrumentation has been added to provide per-phase timing breakdowns. Review feedback highlights a bug in the success/fail metric tracking for fetches, a potential O(N) performance bottleneck in edge target processing, and an opportunity to eliminate redundant buffer clones during JSON parsing.
        // Empty result from a fetch is ambiguous (no transitives
        // OR a fetch/parse failure). Track conservatively as
        // success — the FETCH_TIMINGS-equivalent counter is
        // omitted in this path on purpose to keep the future
        // body lean.
        stats.success += 1;
    } else if out.fetched {
        stats.success += 1;
    }
The logic for tracking successful fetches is flawed. If a request fails (e.g., network error resulting in status == 0) or returns a non-200 status (e.g., 404), transitives will be empty and fetched will be true. The current code increments stats.success in both cases, leading to inaccurate performance metrics in the logs.
Suggested change:

    if out.fetched && out.status == 200 {
        stats.success += 1;
    } else if out.fetched {
        stats.fail += 1;
    }
    let primary_keys: Vec<(String, String)> = edge_targets
        .keys()
        .filter(|(n, _)| n == &msg.name)
        .cloned()
        .collect();
Iterating over all keys in edge_targets for every FetchEventMsg is an
O(N) scan of the map per event. Consider maintaining a secondary
index keyed by package name (e.g., a HashMap<String, Vec<String>>) to
quickly retrieve all specs associated with a package name.
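A sketch of the suggested index (names hypothetical):

    use std::collections::HashMap;

    // Maintained alongside edge_targets: specs waiting on each name.
    let mut specs_by_name: HashMap<String, Vec<String>> = HashMap::new();

    // On stashing an edge target:
    specs_by_name.entry(name.clone()).or_default().push(spec.clone());

    // On FetchEventMsg: O(1) lookup by name instead of scanning every key.
    let primary_keys: Vec<(String, String)> = specs_by_name
        .remove(&msg.name)
        .unwrap_or_default()
        .into_iter()
        .map(|spec| (msg.name.clone(), spec))
        .collect();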
    ) -> Result<(FullManifest, Option<PrimarySettleResult>)> {
        use simd_json::prelude::{ValueAsScalar, ValueObjectAccess};
        ...
        let mut buf = (*raw).to_vec();
Cloning the entire manifest buffer into a Vec<u8> for simd_json is expensive on the hot path. Since simd_json requires a mutable buffer for in-place unescaping, and the raw bytes are stored in an immutable Arc<[u8]> within FullManifest, this copy is currently necessary. However, if performance remains a concern, exploring ways to pass ownership of the initial Vec<u8> from the network response directly to the parser before wrapping it in an Arc could eliminate this allocation.
The previous `max(num_cpus, 8)` rayon pool was tuned for resolve-path
manifest parse, but the install path's `par_chunks(64)` tarball
extract also rides the global rayon pool. Bench A/B on PR #2923
(channel resolve + legacy install) showed:

- mac run1 (clean CI): utoo p3 9.49s ±0.19, beats utoo-npm 12.64s
- mac run2 (busier CI): utoo p3 14.18s ±0.42, loses to utoo-npm
  12.56s by 1.6s — utoo regressed +4.7s while utoo-npm moved only
  ±0.08s and bun +0.8s
- linux: all 3 runs show a 1-3s p3 gap vs utoo-next

Pattern: utoo oversubscribes 8 rayon workers × par_chunks(64) = 512
in-flight disk writes, which is fine on idle hardware but compounds
under load.

The resolve path doesn't actually need the bigger rayon pool: parse
already runs through `tokio::task::spawn_blocking` (512-slot pool) in
`mb_resolve.rs`, not rayon. The bigger rayon pool was solving a
problem in a path that was later swapped to a different mechanism.

Restore origin/next behavior: rayon at default `num_cpus` on Unix,
8MB stack on Windows for libdeflater.

Expected outcome:
- p1: unchanged ≈ 2.66s (resolve uses the tokio blocking pool, not
  rayon)
- p3: recovers to Plan A's ~5.95s on linux, ~9.5s on mac stably
- p0: improves vs prior PR #2923

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #2916/#2920 switched manifest parse from `rayon::spawn` to
`tokio::task::spawn_blocking` on the theory that rayon's small pool
(num_cpus = 2 on GHA) added queue wait at 64× concurrent fetches.
That tradeoff worked for the channel mb_fetch_with_graph hot path
(`utoo deps`), but it routed the legacy preload+BFS install path's
manifest parse through tokio's blocking pool — the same pool used by
download / clone workers on `utoo install`. With 4647 deps × parse ×
2 hops, the contention with install IO workers added ~2s on
p3_cold_install on PR #2923 (channel resolve + legacy install).

Bench data on PR #2923 + rayon-pool revert (sha 39c8966):
- linux clean run: utoo p3 = 8.39s ±0.18 vs utoo-next p3 = 6.00s
  ±0.27 — utoo +2.39s gap with σ < 0.3 on both, ruling out CI noise
- utoo p1 = 2.79s avg (channel mb_fetch_with_graph still wins)

Restoring rayon::spawn keeps parse on its own pool (small but
isolated from install IO workers). Channel mb_fetch_with_graph in
mb_resolve.rs uses `tokio::task::spawn_blocking` directly for its own
combined parse, so this revert doesn't affect the resolve hot path's
p1 numbers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
📊 pm-bench-phases ·
p0_full_cold
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 9.19s | 0.23s | 10.36s | 10.35s | 769M | 339.4K |
| utoo-next | 8.25s | 0.30s | 10.69s | 12.46s | 989M | 126.0K |
| utoo-npm | 9.13s | 1.04s | 11.26s | 12.89s | 1.38G | 184.2K |
| utoo | 8.17s | 0.29s | 10.63s | 12.23s | 1007M | 123.0K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 13.7K | 16.8K | 1.20G | 6M | 1.89G | 1.77G | 1M |
| utoo-next | 134.4K | 97.2K | 1.17G | 5M | 1.73G | 1.73G | 2M |
| utoo-npm | 139.6K | 105.3K | 1.17G | 5M | 1.73G | 1.73G | 2M |
| utoo | 127.9K | 74.5K | 1.17G | 5M | 1.73G | 1.73G | 2M |
p1_resolve
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 1.90s | 0.04s | 4.04s | 1.03s | 484M | 163.0K |
| utoo-next | 3.16s | 0.05s | 5.52s | 1.97s | 614M | 88.5K |
| utoo-npm | 3.11s | 0.04s | 5.56s | 1.94s | 607M | 79.1K |
| utoo | 2.66s | 0.10s | 5.37s | 1.06s | 1.12G | 162.8K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 7.8K | 4.6K | 204M | 3M | 108M | - | 1M |
| utoo-next | 69.0K | 115.3K | 201M | 2M | 7M | 3M | 2M |
| utoo-npm | 68.7K | 113.7K | 201M | 2M | 7M | 3M | 2M |
| utoo | 38.8K | 8.4K | 201M | 2M | - | 3M | 3M |
p3_cold_install
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 6.59s | 0.14s | 6.39s | 10.02s | 615M | 208.4K |
| utoo-next | 6.05s | 0.12s | 4.94s | 10.83s | 479M | 58.1K |
| utoo-npm | 6.52s | 0.17s | 5.49s | 11.21s | 937M | 107.5K |
| utoo | 5.86s | 0.25s | 5.03s | 10.69s | 501M | 67.0K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 3.4K | 6.7K | 1.00G | 3M | 1.78G | 1.78G | 1M |
| utoo-next | 90.0K | 44.2K | 999M | 2M | 1.72G | 1.72G | 2M |
| utoo-npm | 100.5K | 60.5K | 999M | 2M | 1.72G | 1.72G | 2M |
| utoo | 85.2K | 44.5K | 999M | 2M | 1.72G | 1.72G | 2M |
p4_warm_link
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 3.27s | 0.08s | 0.20s | 2.39s | 139M | 33.1K |
| utoo-next | 2.51s | 0.38s | 0.53s | 3.86s | 80M | 18.1K |
| utoo-npm | 2.46s | 0.08s | 0.55s | 3.92s | 85M | 19.6K |
| utoo | 2.41s | 0.01s | 0.52s | 3.87s | 81M | 18.6K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 299 | 24 | 5M | 38K | 1.93G | 1.77G | 1M |
| utoo-next | 40.8K | 18.1K | 306K | 12K | 1.73G | 1.73G | 2M |
| utoo-npm | 47.4K | 21.5K | 319K | 26K | 1.73G | 1.73G | 2M |
| utoo | 43.9K | 21.1K | 306K | 12K | 1.73G | 1.73G | 2M |
npmmirror.com
p0_full_cold
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 33.22s | 1.92s | 9.67s | 10.47s | 534M | 396.4K |
| utoo-next | 28.63s | 0.53s | 7.41s | 13.81s | 510M | 76.5K |
| utoo-npm | 31.68s | 1.68s | 7.91s | 14.94s | 761M | 132.7K |
| utoo | 21.32s | 0.48s | 7.38s | 13.63s | 519M | 85.8K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 110.3K | 3.4K | 1.16G | 13M | 1.89G | 1.77G | 2M |
| utoo-next | 233.4K | 109.0K | 1.01G | 10M | 1.73G | 1.73G | 2M |
| utoo-npm | 242.0K | 135.3K | 1.02G | 10M | 1.73G | 1.73G | 2M |
| utoo | 213.7K | 119.5K | 1.03G | 10M | 1.73G | 1.73G | 2M |
p1_resolve
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 3.82s | 0.07s | 3.75s | 1.37s | 510M | 232.2K |
| utoo-next | 5.40s | 0.02s | 1.59s | 0.93s | 87M | 21.4K |
| utoo-npm | 15.94s | 17.47s | 1.47s | 1.02s | 87M | 21.2K |
| utoo | 3.49s | 0.19s | 5.65s | 1.13s | 1.03G | 140.2K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 24.3K | 2.5K | 155M | 4M | 109M | - | 2M |
| utoo-next | 45.7K | 34.9K | 16M | 2M | - | 3M | 2M |
| utoo-npm | 46.9K | 27.8K | 16M | 2M | - | 3M | 2M |
| utoo | 49.9K | 6.3K | 151M | 3M | - | 3M | 2M |
p3_cold_install
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 29.80s | 8.15s | 6.02s | 9.29s | 231M | 89.9K |
| utoo-next | 49.88s | 27.68s | 5.70s | 12.05s | 438M | 58.3K |
| utoo-npm | 40.12s | 12.15s | 6.23s | 13.05s | 616M | 101.1K |
| utoo | 28.11s | 1.35s | 5.71s | 11.97s | 447M | 64.1K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 74.6K | 2.8K | 1020M | 10M | 1.74G | 1.74G | 2M |
| utoo-next | 172.1K | 45.6K | 1002M | 7M | 1.72G | 1.72G | 3M |
| utoo-npm | 186.6K | 81.3K | 1001M | 7M | 1.72G | 1.72G | 3M |
| utoo | 163.8K | 50.2K | 1002M | 7M | 1.72G | 1.72G | 3M |
p4_warm_link
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 3.35s | 0.08s | 0.19s | 2.35s | 138M | 32.1K |
| utoo-next | 2.37s | 0.09s | 0.52s | 3.90s | 81M | 18.1K |
| utoo-npm | 2.32s | 0.05s | 0.54s | 3.95s | 86M | 19.4K |
| utoo | 2.40s | 0.16s | 0.52s | 3.92s | 82M | 18.9K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 519 | 24 | 2M | 60K | 1.87G | 1.77G | 2M |
| utoo-next | 41.8K | 19.2K | 337K | 19K | 1.73G | 1.73G | 3M |
| utoo-npm | 49.0K | 22.2K | 338K | 21K | 1.73G | 1.73G | 3M |
| utoo | 43.5K | 20.1K | 337K | 20K | 1.73G | 1.73G | 3M |
📊 pm-bench-phases ·
p0_full_cold
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 18.99s | 2.84s | 7.01s | 20.79s | 783M | 50.6K |
| utoo-npm | 16.99s | 2.07s | 8.44s | 18.88s | 1.09G | 107.3K |
| utoo | 19.36s | 3.59s | 8.73s | 21.07s | 899M | 60.6K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 16.1K | 145.7K | - | - | 1.80G | 1.94G | 1M |
| utoo-npm | 12.1K | 344.3K | - | - | 1.67G | 1.91G | 2M |
| utoo | 13.7K | 319.6K | - | - | 1.67G | 1.87G | 2M |
p1_resolve
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 2.80s | 0.18s | 2.82s | 1.35s | 488M | 31.7K |
| utoo-npm | 4.14s | 0.77s | 4.17s | 2.35s | 582M | 40.2K |
| utoo | 3.49s | 0.71s | 4.24s | 1.05s | 606M | 40.2K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 33 | 25.6K | - | - | 112M | - | 1M |
| utoo-npm | 6 | 167.2K | - | - | 27M | 3M | 2M |
| utoo | 9 | 61.4K | - | - | - | 3M | 3M |
p3_cold_install
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 13.92s | 0.58s | 3.58s | 16.66s | 532M | 34.6K |
| utoo-npm | 16.01s | 0.46s | 4.28s | 20.35s | 806M | 82.3K |
| utoo | 14.07s | 1.09s | 3.77s | 17.81s | 522M | 36.5K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 4.7K | 134.6K | - | - | 1.73G | 1.97G | 1M |
| utoo-npm | 1.7K | 233.8K | - | - | 1.64G | 1.88G | 2M |
| utoo | 1.5K | 211.8K | - | - | 1.64G | 1.88G | 2M |
p4_warm_link
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 6.16s | 2.13s | 0.12s | 2.52s | 53M | 4.0K |
| utoo-npm | 3.76s | 0.32s | 0.39s | 2.47s | 100M | 7.4K |
| utoo | 3.60s | 0.14s | 0.25s | 2.37s | 75M | 5.9K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 17.3K | 751 | - | - | 1.89G | 1.94G | 1M |
| utoo-npm | 12.8K | 67.5K | - | - | 1.64G | 1.87G | 2M |
| utoo | 14.0K | 53.8K | - | - | 1.64G | 1.87G | 2M |
npmmirror.com
p0_full_cold
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 35.47s | 2.78s | 6.29s | 16.71s | 559M | 36.2K |
| utoo-npm | 42.57s | 1.19s | 7.53s | 22.20s | 817M | 79.9K |
| utoo | 36.43s | 6.18s | 5.98s | 16.85s | 575M | 37.9K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 14.1K | 184.9K | - | - | 1.82G | 1.93G | 2M |
| utoo-npm | 1.0K | 480.3K | - | - | 1.64G | 1.90G | 2M |
| utoo | 1.1K | 420.5K | - | - | 1.64G | 1.88G | 2M |
p1_resolve
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 6.48s | 1.34s | 2.54s | 1.37s | 503M | 32.8K |
| utoo-npm | 3.74s | 0.02s | 1.25s | 0.75s | 97M | 6.9K |
| utoo | 3.28s | 0.63s | 4.08s | 1.02s | 579M | 39.3K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 32 | 29.1K | - | - | 114M | - | 2M |
| utoo-npm | 6 | 217.7K | - | - | - | 3M | 2M |
| utoo | 21 | 62.2K | - | - | - | 3M | 3M |
p3_cold_install
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 24.59s | 1.17s | 4.06s | 17.20s | 250M | 16.6K |
| utoo-npm | 52.12s | 4.43s | 5.93s | 20.21s | 695M | 79.4K |
| utoo | 50.71s | 5.49s | 4.99s | 18.25s | 516M | 35.5K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 1.7K | 151.3K | - | - | 1.67G | 1.94G | 2M |
| utoo-npm | 1.7K | 366.2K | - | - | 1.64G | 1.91G | 3M |
| utoo | 2.3K | 291.3K | - | - | 1.64G | 1.91G | 3M |
p4_warm_link
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 4.82s | 0.22s | 0.11s | 2.20s | 47M | 3.6K |
| utoo-npm | 3.52s | 0.10s | 0.39s | 2.72s | 98M | 7.3K |
| utoo | 3.24s | 0.29s | 0.36s | 2.58s | 89M | 6.7K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 12.8K | 866 | - | - | 1.81G | 1.94G | 2M |
| utoo-npm | 11.9K | 69.3K | - | - | 1.64G | 1.91G | 3M |
| utoo | 12.8K | 56.2K | - | - | 1.64G | 1.91G | 3M |
Optimal-hypothesis A/B from previous sweep. Stacks on PR #2920 +
reverts `skip_preload=true` from install pipeline_deps_options to
recover the p3 win. The lockfile-only path keeps channel
mb_fetch_with_graph for the p1 win. Expected: p1 ≈ 2.78s (PR #2920
baseline), p3 ≈ 5.95s (Plan A baseline).