perf(pm): MANIFESTS_CONCURRENCY 96→128 + TCP/TLS warmup by elrrrrrrr · Pull Request #2926 · utooland/utoo

elrrrrrrr · 2026-05-11T02:31:42Z

Two p1_resolve micro-opts on top of PR #2924. Target: linux p1 below 2.5s (current best ~2.42s, mean 2.80s). See commit message for theory + falsification conditions.

…down p1_resolve has been ~0.9s behind bun on phases bench for the past several PRs. Pcap on prior runs measured bun opening ~260 parallel TCP streams against registry.npmjs.org for resolve, while utoo opened ~70 (the 64 manifests-concurrency-limit cap was at saturation).

Adding fetch-breakdown timing in ruborist showed where p1's 22s (local Mac) actually goes:

    fetch-timings: n=2730
      sum_request = 1089s  (88%: TCP+TLS+HTTP RTT to first byte)
      sum_body    =  138s  (11%: body download)
      sum_parse   =    2s  (0.16%: simd_json on rayon)

The dominant cost is per-request RTT, not parsing or body transfer. The lever is the cap on concurrent in-flight requests.

This commit:

1. Adds `crates/ruborist/src/util/timing.rs`, a process-wide atomic accumulator that records per-fetch (request_us, body_us, parse_us, bytes) inside both `fetch_full_manifest` and `fetch_version_manifest`. Reset before each preload phase, dumped at INFO level after preload + bfs.
2. Bumps `manifests-concurrency-limit` default 64 → 256 to match bun's observed working point against npmjs.org. CI bench will validate.

Expected: p1 utoo wall drops toward bun's range (~2.3s on GHA).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
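The accumulator described in item 1 could look roughly like the sketch below. This is not the contents of `timing.rs`: the struct layout, method names, and the `request_share_pct` helper are assumptions made for illustration, and only the recorded fields (request_us, body_us, parse_us, bytes) come from the commit message.

```rust
use std::sync::atomic::{AtomicU64, Ordering::Relaxed};

/// Process-wide fetch counters. Field and method names are illustrative,
/// not taken from ruborist.
pub struct FetchTimings {
    n: AtomicU64,
    request_us: AtomicU64,
    body_us: AtomicU64,
    parse_us: AtomicU64,
    bytes: AtomicU64,
}

impl FetchTimings {
    pub const fn new() -> Self {
        Self {
            n: AtomicU64::new(0),
            request_us: AtomicU64::new(0),
            body_us: AtomicU64::new(0),
            parse_us: AtomicU64::new(0),
            bytes: AtomicU64::new(0),
        }
    }

    /// Record one completed fetch. Relaxed ordering suffices because the
    /// counters are independent sums that are read after the phase ends.
    pub fn record(&self, request_us: u64, body_us: u64, parse_us: u64, bytes: u64) {
        self.n.fetch_add(1, Relaxed);
        self.request_us.fetch_add(request_us, Relaxed);
        self.body_us.fetch_add(body_us, Relaxed);
        self.parse_us.fetch_add(parse_us, Relaxed);
        self.bytes.fetch_add(bytes, Relaxed);
    }

    /// Zero every counter; intended to run before each preload phase.
    pub fn reset(&self) {
        for c in [&self.n, &self.request_us, &self.body_us, &self.parse_us, &self.bytes] {
            c.store(0, Relaxed);
        }
    }

    /// Snapshot as (n, request_us, body_us, parse_us, bytes).
    pub fn totals(&self) -> (u64, u64, u64, u64, u64) {
        (
            self.n.load(Relaxed),
            self.request_us.load(Relaxed),
            self.body_us.load(Relaxed),
            self.parse_us.load(Relaxed),
            self.bytes.load(Relaxed),
        )
    }

    /// Percentage of summed work-time spent waiting on the request.
    pub fn request_share_pct(&self) -> f64 {
        let (_, req, body, parse, _) = self.totals();
        let total = req + body + parse;
        if total == 0 {
            0.0
        } else {
            100.0 * req as f64 / total as f64
        }
    }
}

/// Single shared instance, usable from any fetch task.
pub static FETCH_TIMINGS: FetchTimings = FetchTimings::new();
```

A fetch task would call `FETCH_TIMINGS.record(...)` once per response, and the phase driver would call `reset()` before preload and log the totals afterwards. Feeding in the sums quoted above (1089 / 138 / 2) yields a request share of roughly 88.6%.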

Two changes after the GHA bench on the previous commit (PR #2916, run 25559625024) showed the concurrency=256 hypothesis was wrong on GHA's environment.

Revert concurrency 256 → 64
---------------------------

The new fetch-timing instrumentation shipped in the previous commit caught the surprise: GHA's pcap-vs-local profile is the *opposite* of what local Mac measurements suggested.

    metric        local Mac   GHA Linux
    avg_request   399ms       70ms      ← network MUCH faster on GHA
    avg_body      50ms        20ms
    avg_parse     730µs       266ms     ← parse 365× SLOWER on GHA

Mechanism: `parse_json_off_runtime` dispatches to `rayon::spawn`, and rayon's pool size is `num_cpus` (= 2 on GHA ubuntu-latest). Bumping concurrency 64 → 256 queued 256 manifest parses behind 2 rayon workers, which is head-of-line blocking. avg_parse jumped from ~10ms to 266ms wall, dragging p1 utoo wall from 3.10s up to 3.33s.

Restore manifest-bench
----------------------

Brought back `crates/manifest-bench` (originally landed in the post-#2818 driver hunt, dropped in af714eb once #2818 graduated). It's a single-binary HTTP-only fetch tool that strips out the ruborist pipeline (no BFS, no dedup, no parse, no project cache, no lockfile write): it fires `GET <registry>/<name>` in parallel and reports the same diag shape as the new `p1-breakdown` lines.

Goal: separate the network ceiling from the resolver pipeline so the next round of p1 experiments (parse offload, partial parse, dedicated parse pool, etc.) can be evaluated against a stable "pure network" baseline.

Knobs (unchanged from the original drop):

    --concurrency N     sweep without rebuilding utoo
    --reps N            run same workload back-to-back
    --single-version    use /<name>/latest (smaller bodies)
    --user-agent X      UA-fingerprint experiments
    --http1-only        H2 vs H1 toggle
    --accept X          override Accept header

Same TLS stack as ruborist (rustls + aws-lc-rs, native roots).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…inux build-linux now also builds + uploads `manifest-bench` when a phases bench is going to run (label or dispatch). bench-phases-linux downloads the binary and runs it after the regular phase-isolated benchmark.

Sweep mirrors the original (#2818-era) wire-in:

    concurrency: 32 / 64 / 96 / 128 / 192 / 256 (HTTP/1.1, full manifest)
    protocol:    H1 vs H2-negotiate (cap=128)
    endpoint:    full vs `/<name>/latest` (cap=128, smaller bodies)
    UA:          default vs `Bun/1.2.21` (cap=128)

Output goes to /tmp/pm-bench-output/manifest-bench-npmjs.log and ships in the existing pm-bench-logs-linux artifact, with no PR comment surface (the headline phases bench comment stays the same).

Why now: the new ruborist `p1-breakdown` instrumentation showed sum_parse on GHA can dominate when concurrency is bumped (256: sum_parse 728s vs sum_request 193s). To attribute the bun-vs-utoo gap on p1_resolve we need a "pure HTTP" baseline that strips out ruborist's parse / BFS / dedup / lockfile path. manifest-bench is that baseline: same TLS stack as ruborist (rustls + aws-lc-rs, native roots), no resolver pipeline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CI fetch-breakdown on GHA (run 25562552058, conc=64) showed parse queueing on rayon dominates the gap to manifest-bench's pure-HTTP baseline:

    manifest-bench (pure HTTP, conc=64): 2.12s wall
    utoo p1 (full ruborist):             3.10s wall   ← +1.0s overhead
      ↑ sum_parse 95s vs sum_request 95s, parse 50% of work-time
      ↑ avg_parse 30ms wall vs ~5ms actual CPU; the 25ms extra is rayon queue wait

Mechanism: 64 concurrent tasks all dispatching parse to rayon's pool (size = num_cpus = 2 on GHA). Queue depth grows to ~32 per worker. Each parse waits 25ms+ in queue before running its 5ms of CPU work.

Round 1 fix: inline parse, drop the rayon hop. simd_json on a tokio worker thread is fast (~5ms for 115KB JSON), and the tokio runtime's cooperative budget naturally rebalances CPU across the 64 tasks.

Expected on next CI:

- avg_parse drops from 30ms wall → ~5-10ms wall (close to CPU-only)
- preload_wall drops from 5.4s → ~3.5-4s for cold runs
- p1 hyperfine wall drops from 3.10s → 2.3-2.5s, narrowing the gap to manifest-bench's 2.12s ceiling

If parse becomes the new bottleneck (CPU-bound), next round could look at partial parse / lazy field access. If wall doesn't drop, hypothesis is wrong and we look elsewhere (BFS, dedup, lockfile).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
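To make the head-of-line mechanism concrete, here is a deliberately simple burst model. It assumes all parses arrive at the same instant and each costs the same CPU time, which is not how the real arrivals behave; the function names are hypothetical and nothing here is measured.

```rust
/// Average queue wait (ms) when `jobs` parse requests arrive at the same
/// instant at a pool of `workers` threads, each job costing `service_ms`
/// of CPU. Job k has to sit through k / workers earlier rounds.
pub fn avg_burst_wait_ms(jobs: usize, workers: usize, service_ms: f64) -> f64 {
    assert!(workers > 0, "pool needs at least one worker");
    if jobs == 0 {
        return 0.0;
    }
    let rounds_waited: usize = (0..jobs).map(|k| k / workers).sum();
    service_ms * rounds_waited as f64 / jobs as f64
}

/// Queue wait (ms) seen by the last job of the same burst.
pub fn max_burst_wait_ms(jobs: usize, workers: usize, service_ms: f64) -> f64 {
    assert!(workers > 0, "pool needs at least one worker");
    if jobs == 0 {
        return 0.0;
    }
    service_ms * ((jobs - 1) / workers) as f64
}
```

With 64 jobs, 2 workers and 5 ms per parse, the model gives an average wait of 77.5 ms and a worst case of 155 ms. The ~25 ms reported above is lower, presumably because real dispatches are spread over time rather than arriving as one burst. Once the pool has at least as many workers as jobs, the modelled wait is zero.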

Round 1 (inline parse) reverted on data: GHA showed +0.37s p1 regression because parse blocked tokio runtime workers, dropping eff_parallel 42 → 35 even though per-fetch work-time fell. avg_request went up from 35ms → 52ms, symptomatic of socket reads being delayed by the parsing task on the same worker.

    metric         round 0 (rayon)   round 1 (inline)
    p1 wall        3.27s             3.64s   ⚠️ +0.37s
    avg_parse      30ms (queued)     300µs   ✓
    avg_request    35ms              52ms    ⚠️ +17ms (worker contention)
    eff_parallel   42                35      ⚠️

Round 2 attempts the third option: `tokio::task::spawn_blocking`.

- rayon's pool was too small (num_cpus = 2 on GHA): 64 concurrent parses queued behind 2 workers, parse wall 30ms.
- inline parse held tokio worker hostage during simd_json call, starving in-flight socket reads.
- tokio's blocking pool has a much larger default cap (512), so 64 concurrent parses never queue. Unlike rayon there's no contention with the install path's parallel-write rayon usage. Unlike inline the tokio runtime workers stay free to drive network I/O.

Expected on next CI:

- avg_parse drops to ~5-10ms wall (close to CPU floor, no queue)
- avg_request stays ~35ms (workers free for I/O)
- eff_parallel returns to ~50, possibly higher
- p1 wall drops toward manifest-bench's 2.10s ceiling

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Round 2 moved parse_json_off_runtime off rayon (-0.11s p1). But fetch-breakdown still showed avg_request 41ms vs round 0's 35ms, hinting at a second source of rayon contention. Found it: `extract_core_version_off_runtime` is also on `rayon::spawn`. On npmjs.org's `!supports_semver` path EVERY fetch resolves through `resolve_via_full_manifest`, which fetches the full packument once per package name (deduped via inflight_full) and then calls `extract_core_version_off_runtime` per (name, spec) to materialize the chosen version into a `CoreVersionManifest`. So per fetch we hit rayon TWICE — once for the JSON parse (round 2 moved to spawn_blocking), and once for `get_core_version` (still on rayon). The second hop has the same head-of-line blocking signature as the first: 64 concurrent resolves dispatching to a 2-thread rayon pool. Round 3: move extract_core_version_off_runtime to spawn_blocking for the same reasons. The work is JSON lazy-reparse (`raw_json` sub-tree decoding) — genuinely blocking, well-suited for tokio's blocking pool. Expected: utoo p1 wall drops further toward manifest-bench's 2.10s ceiling. avg_request should fall back from 41ms → ~35ms (rayon contention removed from the fetch task's await chain). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two changes for round 4 of p1 optimization:

1. Revert `extract_core_version_off_runtime` from spawn_blocking back to rayon::spawn (round 3). Within-run measurement showed +0.42s regression vs utoo-next (round 2 was +0.11s). Likely cause: this function is called per (name, spec), so multi-spec packages call it 2-5x per fetch. spawn_blocking's per-dispatch overhead exceeds rayon queue savings at this multiplier.
2. Add `serialize_us` and `cache_export_us` to the p1-breakdown line so we can attribute the remaining gap.

Currently:

    manifest-bench wall:      2.10s (pure HTTP ceiling)
    utoo p1 wall (round 2):   3.16s
    gap:                      1.06s

We have:

    preload_wall      ≈ 2.7s (logged)
    bfs_wall          ≈ 0.3s (logged)
    serialize_us      ?
    cache_export_us   ?   ← suspected: full manifest deep-clone into ProjectCacheData for ~2730 entries

Next round will have data to choose between attacking serialize, cache export, or the BFS loop body.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Round 4 measured serialize_us = 15ms and cache_export_us = 34ms, both tiny, confirming the 1s gap from manifest-bench (utoo p1 = 3.16s vs mb wall = 2.10s) is not in post-build code.

Per-fetch math also pointed at main-loop bookkeeping:

    manifest-bench: eff_parallel = 52 (sum_work 111s / wall 2.14s)
    utoo preload  : eff_parallel = 43 (sum_work 120s / wall 2.85s)

Same conc=64 cap, but utoo loses 9 effective slots. Most likely the main loop's serial bookkeeping (dedup hash insert, format! key, extract_transitive_deps, queue push, 3-4 receiver events) holds the flow between futures.next() returning and the next fetch dispatch.

This commit splits the main loop into two timed segments:

- preload_loop_dispatch_us: time spent in the `while in_flight < concurrency` block (popping pending, dedup check, futures.push).
- preload_loop_result_us: time spent processing each completed future (extract_transitive_deps, pending.extend, on_manifest).

If dispatch+result sum approaches preload_wall, the main loop is the bottleneck and we need to either (a) split processing onto a dedicated task, or (b) use unbounded futures with a downstream consumer. If they're small, the gap is elsewhere (per-task overhead in resolve_package's inflight gates).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
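The eff_parallel figure used above is just summed per-fetch work-time divided by wall time. A small sketch of that arithmetic follows; the function names are hypothetical and the inputs are the rounded values quoted in the message.

```rust
/// Effective parallelism: summed per-fetch work-time divided by wall time.
pub fn eff_parallel(sum_work_s: f64, wall_s: f64) -> f64 {
    assert!(wall_s > 0.0, "wall time must be positive");
    sum_work_s / wall_s
}

/// Slots left idle relative to the configured concurrency cap.
pub fn idle_slots(cap: usize, sum_work_s: f64, wall_s: f64) -> f64 {
    cap as f64 - eff_parallel(sum_work_s, wall_s)
}
```

Plugging in the rounded inputs gives about 51.9 for manifest-bench (111s / 2.14s) and about 42.1 for utoo preload (120s / 2.85s). The message quotes 52 and 43, presumably computed from unrounded sums. Against the conc=64 cap, manifest-bench leaves roughly 12 slots idle on average.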

Round 5 main-loop instrumentation showed the preload main loop itself is fast (15-25ms total dispatch+result). The 0.8s gap from manifest-bench's 2.10s wall lives INSIDE the spawned fetch tasks.

Per-fetch wall (warm runs):

    measured: avg_request 30ms + avg_body 6ms + avg_parse 2.5ms = ~38ms
    derived:  preload_wall 2.4s × eff_parallel(43) / 2730 = 38ms
    delta:    ~12ms unaccounted per task

That 12ms is `extract_core_version_off_runtime` queueing on rayon's 2-thread pool. extract is called per (name, spec); for ant-design that's ~3000+ calls. With pool=2 and 64 concurrent fetches each dispatching extract, the queue depth grows; each task waits its turn before extract returns.

Bump rayon pool to `max(num_cpus, 8)` for non-Windows. Sizing the pool above the CPU count for short blocking JSON ops (parse + extract) replaces FIFO queueing with parallel dispatch. Real CPU contention is bounded by num_cpus (the kernel scheduler still gates), so the extra pool threads just hold ready-to-run dispatches in parallel rather than serialised in a queue.

Why not just spawn_blocking (round 3 attempt): tokio's blocking pool defaults to 512 threads, but its per-dispatch overhead was higher than rayon's even when queueing; round 3 regressed by 0.5s.

Expected: extract queue wait drops from ~12ms to ~1-2ms wall, p1 preload_wall narrows toward manifest-bench's 2.10s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
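The `max(num_cpus, 8)` sizing rule can be sketched with the standard library alone. The helper names below are made up for illustration, the non-Windows condition from the message is left out, and the rayon call mentioned in the comment is shown only as the place where such a value would plausibly be passed.

```rust
use std::thread::available_parallelism;

/// Pool size for short blocking JSON jobs: the CPU count, but never
/// fewer than `floor` threads (8 in the commit message).
pub fn pool_threads(cpus: usize, floor: usize) -> usize {
    cpus.max(floor)
}

/// Same rule, reading the CPU count from the running machine.
/// Falls back to 1 if the platform cannot report it.
pub fn detected_pool_threads(floor: usize) -> usize {
    let cpus = available_parallelism().map(|n| n.get()).unwrap_or(1);
    // With rayon, this value would typically be handed to
    // `rayon::ThreadPoolBuilder::new().num_threads(n)` before first use.
    pool_threads(cpus, floor)
}
```

On a 2-core runner `pool_threads(2, 8)` is 8, while a 16-core machine keeps 16. The extra threads do not add CPU capacity; under the reasoning above they only let queued dispatches become runnable in parallel.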

Adds `BuildDepsOptions::skip_preload` so callers without a pipeline consumer (utoo deps / package-lock-only) can drop the up-front preload phase entirely. BFS now batches prefetch per level across the whole frontier, then runs the existing sequential process_dependency walk against the warmed cache. For install paths (Context::pipeline_deps_options), skip_preload stays false so PackageResolved events still feed the download/clone pipeline. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds resolver::fast_preload, a manifest-bench-style flat FuturesUnordered over service::manifest::fetch_full_manifest. It warms MemoryCache (both full_manifests and version_manifests slots) synchronously after each fetch, so the BFS phase is pure cache-hit: no rayon hop on extract_core_version, no OnceMap gates, no DiskManifestStore writes, no PackageResolved events. Wired into service::api::build_deps: when the caller asks to skip preload (Context::build_deps for `utoo deps`) and there's no warm project cache, fast_preload runs ahead of build_deps_with_config. Install paths still go through preload_manifests so the pipeline keeps its early-start signal. Also reverts the per-level prefetch I added in 394f6c9 — with fast_preload pre-warming everything, BFS doesn't need its own prefetch wave. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

v1 of fast_preload called settle_spec inline on the tokio worker — each settle ran simd_json::to_borrowed_value over the full manifest's raw bytes (5–10ms per spec) right on the runtime thread. CI showed it starved sibling fetches: avg_request rose +3ms, avg_parse jumped 5→11ms, p1_resolve regressed +1.0s vs the preload+BFS baseline (4.0s vs 3.0s). Fix: route every settle through extract_core_version_off_runtime (the same rayon::spawn helper the BFS path uses), and merge fetch and settle completions into a single FuturesUnordered so backpressure on either side throttles the other. Sibling specs that arrived during a fetch are now stashed by name (HashMap, not linear scan), then dispatched as their own settle futures when the fetch lands. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Standalone manifest-bench HTTP-only sweep (npmjs, h1) shows wall bottoming at concurrency=96 (1817ms) — earlier 256 regression was caused by rayon-queued parses behind 2 workers, no longer relevant since fetch parse is on spawn_blocking and settle is rayon-dispatched off the runtime. fast_preload's wave-shaped transitive walk currently runs at eff_parallel ~35 against the 64 cap because pending refills lag settles; raising the cap to 96 gives headroom for sustained in-flight on the deep waves without crossing the npmjs per-IP tail-latency cliff that conc 128+ trips. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… path `UnifiedRegistry::resolve_version_manifest`'s first cache check (service/registry.rs:347) keys on `(name, spec)` — the original spec string the caller passed, e.g. `^4.0.0`. settle_future was only populating `(name, resolved_version)` (e.g. `4.17.21`), so on every BFS edge for `lodash@^4.0.0`-style specs the warm path missed and fell into the OnceMap inflight gate + `resolve_via_full_manifest` re-walk before recovering the manifest from the `(name, resolved_version)` slot we'd already set. Now settle writes both keys so BFS hits the early-return at service/registry.rs:347 with no further dispatch. Saves ~1 OnceMap+resolve_target_version round-trip per unique (name, spec) the BFS encounters (≈3000 calls on ant-design-x). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
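A minimal sketch of the two-key write, using a plain `HashMap` in place of ruborist's MemoryCache. The `Manifest` and `VersionCache` types are stand-ins invented for this example; only the idea of writing both `(name, spec)` and `(name, resolved_version)` comes from the message.

```rust
use std::collections::HashMap;

/// Stand-in for a resolved version manifest (illustrative only).
#[derive(Clone, Debug, PartialEq)]
pub struct Manifest {
    pub name: String,
    pub version: String,
}

/// Toy cache keyed by (name, spec-or-version).
#[derive(Default)]
pub struct VersionCache {
    slots: HashMap<(String, String), Manifest>,
}

impl VersionCache {
    /// Store the manifest under the caller's original spec string AND
    /// under the resolved version, so either lookup is a direct hit.
    pub fn settle(&mut self, spec: &str, manifest: Manifest) {
        let by_spec = (manifest.name.clone(), spec.to_string());
        let by_version = (manifest.name.clone(), manifest.version.clone());
        self.slots.insert(by_spec, manifest.clone());
        self.slots.insert(by_version, manifest);
    }

    /// Look up by name plus either a range spec or an exact version.
    pub fn get(&self, name: &str, key: &str) -> Option<&Manifest> {
        self.slots.get(&(name.to_string(), key.to_string()))
    }
}
```

After `settle("^4.0.0", lodash@4.17.21)`, both `get("lodash", "^4.0.0")` and `get("lodash", "4.17.21")` hit, while a spec that was never settled, such as `^3.0.0`, still misses and would fall through to the slower path.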

Previous fast_preload (v2) dispatched primary settles to rayon as separate FuturesUnordered futures. CI breakdown showed eff_parallel ~44 against the conc=96 cap — the wave-shaped transitive walk was held back by settle dispatch RTT: each fetch landed → primary settle queued → settle popped → only then did `pending` get transitive deps and fill the next dispatch wave. v3 folds the primary settle into the fetch task itself via `tokio::task::spawn_blocking`. The fetch task does the network round-trip and the primary version-extract on the same blocking pool slot, then returns with the resolved CoreVersionManifest attached. Main loop pulls one Fetched event, immediately extends `pending`, no second `next().await` to wait through the queue. Sibling specs (rare; same name, different range) still go through the rayon settle_future path so the primary path stays lean. Carries primary_spec through FastEvent so the fused path can populate both `(name, primary_spec)` and `(name, resolved_version)` cache slots — preserves the 6455852 BFS fast-path win. FetchOutcome enum replaces by-value FetchManifestResult to avoid a full FullManifest clone (HashMap+Vec) per fetch event. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…json The fast_preload hot path was paying TWO simd_json passes per manifest:

1. fetch_full_manifest's parse_json_off_runtime did a typed simd_json::serde::from_slice<FullManifest> (envelope + IgnoredAny visitor on `versions` keys, ~3-5ms on a 100KB body).
2. Primary settle re-parsed the same raw bytes with simd_json::to_borrowed_value (~5-10ms) to extract one version's subtree.

Both passes went through simd_json's Tape constructor, which is duplicated work. CI showed avg_parse 5-7ms × 2700 fetches = 14-19s of CPU sum on 2-core GHA, where the spawn_blocking pool's overlapping schedule masked some of the cost but not all.

Adds `service::manifest::fetch_full_manifest_with_settle`: same HTTP + retry + ETag machinery as `fetch_full_manifest`, but the parse step does ONE `to_borrowed_value` and extracts:

* envelope (`name`, `dist-tags`, `versions` keys) into FullManifest manually (no typed serde), and
* the resolved version's subtree as a typed CoreVersionManifest (serde-deserializing that single subtree via the borrowed value).

fast_preload's fetch task switches to this entry point, so primary settle is now a free byproduct of the fetch parse, not a separate `to_borrowed_value` pass. Sibling specs (same name, different range) still go through the rayon settle_future path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

After 671ac98's combined-parse fetch path eliminated the double simd_json pass, the spawn_blocking pool's contention ceiling rose enough that bumping concurrency past 96 no longer queues parses behind 2-core CPU. manifest-bench's most recent good-network sweep on GHA showed conc=128 hitting 1500ms vs conc=96 at 1566ms — small but real headroom for fast_preload's late-wave saturation now that initial waves fill faster. Risk: on slower-network runs (npmjs per-IP throttle), conc=128 widens p99. Earlier conc-sweep data was mixed — accepting that variance for the average-case improvement. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

542d7f1's conc=128 bench landed in a slow-network run (mb best 2010ms vs 1500ms in the prior good-network run; bun also bumped to 2.14s vs 1.83s). Adjusted gap to mb best stayed flat (~700ms either way), so conc=128 didn't beat 96 across runs. Picking 96 as the conservative default: at-or-near best on every GHA run we've measured, never the worst, and leaves headroom for npmjs's per-IP throttling to absorb without compounding p99. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…preload) Adds resolver::mb_resolve module + service::build_deps_mb entry point as a parallel-track alternative to fast_preload, structured to match manifest-bench's main-loop shape as closely as correctness allows.

Hypothesis under test: fast_preload's eff_parallel caps at ~50/96 because the FastEvent enum match + cache writes + sibling deferred bookkeeping in the main loop competes with tokio runtime workers for the 2 CPU cores on GHA, stalling socket I/O drive.

mb_fetch pushes ALL per-fetch work into the spawned future itself (including cache writes), so the main loop is reduced to:

    while let Some(deps) = futs.next().await {
        pending.extend(deps);
        refill_to_cap(...);
    }

Sibling specs (multiple ranges on same package) are NOT deferred at queue level: racing fetches for the same name both proceed. The race converges naturally: first fetch to land populates full_manifests, subsequent racers find the cache hit on entry and short-circuit to a sibling-style settle. Wastes ~5-50 network requests in real workloads but eliminates the HashMap probe + drain overhead from the hot loop.

Wired in via UTOO_RESOLVE=mb env var:

- Context::build_deps (utoo deps) routes through build_deps_mb
- pipeline::resolve_with_pipeline (utoo install) also routes through it; pipeline workers still start but don't pipeline during fetch (mb_fetch emits no PackageResolved events), so install becomes phase-sequential, useful for resolve-phase A/B.

bench script enables UTOO_RESOLVE=mb so CI measures the new path against existing baselines (utoo-next/utoo-npm/bun ignore the env var). Comment the export line to A/B back against fast_preload.

Old fast_preload + UnifiedRegistry paths untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

v1/v2 ran parse work in spawn_blocking inside each fetch future, which competed with tokio runtime workers for the 2 GHA cores. CI showed eff_parallel capped at 47/96 vs manifest-bench standalone's 75/96 on the same box. Hypothesis: parse CPU starves socket drive.

v3 separates the two phases:

* PHASE 1: `mb_style_pure_fetch` is a structural copy of `manifest-bench`'s main loop: future body does ONLY GET + body recv, refill 1-for-1 on completion. Zero per-future CPU work, so tokio runtime workers retain full CPU for socket drive.
* PHASE 2: bulk rayon par_iter parse: for each body, parse `FullManifest` envelope via simd_json::to_borrowed_value, resolve every queued spec for this name against the just-parsed manifest, populate cache slots, collect transitive deps. Runs off the tokio runtime entirely (spawn_blocking → rayon par_iter).

Phases alternate until pending exhausted. Typical project: 3-5 iterations as the dep tree fans out wave by wave.

The point of the split is the `phase1_http_wall` trace: measured in isolation from any parse work, it should match manifest-bench's standalone wall (~1.5-2.0s for 2733 names @ conc=96). If it does, the remaining gap to mb is concentrated in phase 2 work, which is inherent to discovering transitive deps from a non-flat name list.

Tracing per iteration:

    p1-breakdown mb_fetch iter=N phase1_http_wall=Xms n=Y bytes=Z
    p1-breakdown mb_fetch iter=N phase2_parse_wall=Xms settles=Y new_transitives=Z
    p1-breakdown mb_fetch total_wall=Xms iters=Y

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

v3 dropped the (name, spec) HashSet from v1/v2 thinking name-level dedup via done_names was sufficient. It wasn't: sibling-settle's extract_transitive can re-introduce specs we've already settled (peer/optional dep cycles trivially trigger this), so the outer while-loop never terminated. CI 25589397823 hung on `Run phase-isolated benchmark · npmjs` for ~25 min before being cancelled — the bench's first utoo p1_resolve hyperfine run got stuck in an infinite settle loop. Fix: maintain `seen_specs: HashSet<(String, String)>` across all iterations; filter both initial seed and every wave of new transitives through it before extending pending_specs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
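The termination argument can be shown with a small wave-by-wave walk over an in-memory dependency map. Everything here is a stand-in: `deps_of` replaces the fetch + settle step, and `walk` and `spec` are hypothetical names. Only the `seen_specs`-style filter on both the seed and every new wave mirrors the fix.

```rust
use std::collections::{HashMap, HashSet};

/// (name, range) pair, e.g. ("lodash", "^4.0.0").
pub type Spec = (String, String);

/// Convenience constructor for the examples.
pub fn spec(name: &str, range: &str) -> Spec {
    (name.to_string(), range.to_string())
}

/// Walk transitive deps wave by wave. Returns (unique specs, waves).
/// `deps_of` stands in for "fetch + settle + extract transitive deps".
pub fn walk(seed: Vec<Spec>, deps_of: &HashMap<Spec, Vec<Spec>>) -> (usize, usize) {
    let mut seen: HashSet<Spec> = HashSet::new();
    // Filter the seed through `seen` as well, not only later waves.
    let mut pending: Vec<Spec> = seed
        .into_iter()
        .filter(|s| seen.insert(s.clone()))
        .collect();
    let mut waves = 0;
    while !pending.is_empty() {
        waves += 1;
        let mut next = Vec::new();
        for current in pending.drain(..) {
            for dep in deps_of.get(&current).into_iter().flatten() {
                // `insert` returns false for specs already settled, so a
                // peer/optional cycle cannot re-enter the queue.
                if seen.insert(dep.clone()) {
                    next.push(dep.clone());
                }
            }
        }
        pending = next;
    }
    (seen.len(), waves)
}
```

With a two-package peer cycle (`a@^1` depends on `b@^1` and vice versa), the walk visits two specs in two waves and stops. Without the `seen` filter the same input would bounce between the two specs forever, which matches the hang described above.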

New crate `crates/preload-bench/` is a fully-standalone bench that:

* Uses the SAME HTTP setup as `manifest-bench` (own reqwest::Client built per rep with aws-lc-rs TLS, pool_max_idle_per_host(256), no proxy, default DNS, no retry, h1_only).
* Discovers names by walking transitive deps from a package.json root, instead of consuming a flat name list like manifest-bench.
* Per-future does GET + body recv + spawn_blocking parse → returns transitive deps → main loop refills on completion.
* No dependency on ruborist or any utoo internals (own simd_json, own dedup, own everything).

The point: prove (or disprove) that a fully ruborist-independent streaming preload can hit standalone manifest-bench's wall on the same workload. ruborist's path runs at ~2.18s for ant-design's ~2700 names; manifest-bench standalone runs the same workload at ~1.6s. The gap could be in any number of things: DNS layer, retry, pool config, parse-CPU contention, registry single-flight gates. preload-bench eliminates all of those simultaneously so we can read the wall directly.

Wired into bench-phases-linux: builds + uploads preload-bench binary alongside manifest-bench, then runs a conc=64/96/128 sweep against the same project after the standalone manifest-bench sweep.

bench script reverts UTOO_RESOLVE=mb so utoo runs default fast_preload, which gives a third datapoint (utoo wall on integrated path) alongside manifest-bench (HTTP-only ceiling) and preload-bench (streaming-with-walk ceiling).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…y path Step 1 of staged service-layer ablation. Rewrites mb_resolve as a fully self-contained streaming preload mirroring preload-bench's loop shape verbatim, but living inside ruborist so it can populate MemoryCache for the BFS phase.

Bypasses every other ruborist service layer:

* service::http::get_client: own reqwest::Client built per call, no global LazyLock, no shared_resolver dns layer, no connect_timeout, pool_max_idle_per_host(256).
* service::manifest::fetch_full_manifest_with_settle: own GET + body.bytes() + spawn_blocking(simd_json::to_borrowed_value), no RetryIf, no FETCH_TIMINGS.
* service::registry::UnifiedRegistry: no OnceMap, no ManifestStore, no EventReceiver.

Only service::* touched is MemoryCache writes (DashMap inserts) so BFS has data to read from.

PM is unaware: dispatch happens entirely inside service::api::build_deps when skip_preload=true and no warm cache. Removes the previous UTOO_RESOLVE=mb env-var gating from pm::helper::ruborist_context::Context::build_deps and pipeline::resolve_with_pipeline. Removes the now-unused service::api::build_deps_mb sibling entry point.

Expected: utoo p1_resolve drops from ~2.67s toward preload-bench's ~2.57s (or better since ruborist fetches fewer names than preload-bench). The remaining gap to mb's ~1.99s would isolate incremental layer effects we add back next:

- tokio runtime config / cooperative scheduling
- reqwest::Client provider differences (TLS, DNS)
- cache layer (DashMap vs DiskManifestStore reads on the cold path)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…mb_fetch Step 2 of staged service-layer ablation. Targets the two gaps left after step 1:

1. mb_fetch (in ruborist):      2300ms / 2735 = 0.84 ms/name
   manifest-bench (standalone): 2010ms / 2735 = 0.72 ms/name
   ~290ms gap on same workload, same conc.
2. BFS phase: 305ms wall against a fully-warm MemoryCache. Origin unclear: could be graph mutations, repeated cache lookups via the inflight gate, or event dispatch.

Changes:

* TLS provider: adds rustls (aws-lc-rs) + rustls-native-certs to non-wasm-non-macos targets. mb_resolve's `build_mb_client` now uses `use_preconfigured_tls(aws_lc_rs)` matching preload-bench / manifest-bench exactly. The reqwest crate's `rustls-tls-native-roots` feature on Linux still bundles ring for service::http's global client; the two providers coexist.
* mb_fetch instrumentation: per-future `wall_us` (network + parse + cache writes) and `net_us` (network only) reported in the trace line as `eff_par_full`, `eff_par_net`, `avg_wall`, `avg_net`. Same shape as manifest-bench's `avg_conc` so we can compare directly.
* BFS instrumentation: splits run_bfs_phase wall into:
  - `collect_us`: collect_unresolved_edges sum
  - `resolve_us`: process_dependency .await sum
  - `event_us`: post-resolve event dispatch (Resolved / PackagePlaced / Reused / Skipped) sum

  Plus `levels` and `edges` counters. Trace line lets us attribute the 305ms.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Step 3 of staged service-layer ablation. Targets the 305 ms BFS phase observed against a fully-warm MemoryCache, 100 % attributed to process_dependency.await sum (graph mutations) per d9fb207's new bfs instrumentation.

Adds:

* `process_dependency_with_resolved` in builder.rs: sync variant of process_dependency for the registry-resolved case. Skips spec-routing (only Registry handled), skips resolve_registry_dep (resolved is the parameter), skips override re-resolve. Reuses existing helpers (find_compatible_node, create_package_node, add_edges_from, mark_dependency_resolved, update_node_type_from_edge).
* `mb_fetch_with_graph` in mb_resolve.rs: folded streaming preload + graph build. Each fetch result triggers inline process_dependency_with_resolved for every parent edge waiting on (name, spec). New nodes' edges feed back into pending / edge_targets, so the walk continues streaming-style. CPU work (graph mutations, ~305 ms total) overlaps with network IO (mb_fetch's wall ~2.4 s).

Wires `service::api::build_deps` to use mb_fetch_with_graph for the lockfile-only path (skip_preload + cold cache). The follow-up build_deps_with_config still runs to handle any non-registry edges left unresolved (workspace / git / http / file); on registry-only workloads it's near no-op.

Install path unchanged: pipeline_deps_options keeps preload + PackageResolved early-start signal for tgz download.

Expected: utoo p1 wall drops from ~2.76 s toward mb_fetch wall + serialize ≈ 2.4-2.5 s on good network.

Tracing line:

    p1-breakdown mb_fetch_with_graph wall=Xms ok=N fetch=N settle=N sum_wall=Xms sum_net=Xms sum_graph=Xms avg_net=Xus eff_par_full=N.N eff_par_net=N.N unresolved_targets=N

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

c02bb15 had unresolved_targets=583 in trace — `enqueue_node_edges` was unconditionally pushing (parent, edge_id) into edge_targets without checking if the (name, spec) was already cached. When a later transitive's edge referenced an already-fetched (name, spec), no fetch result would land to drain that bucket — the parent edges sat unresolved, potentially missing packages from the lockfile. Fix: enqueue_node_edges now checks cache.get_version_manifest first. Cache hit → process_dependency_with_resolved inline (with a work_stack to recurse into newly-Created nodes' edges). Cache miss → original behavior (stash in edge_targets, push to pending). Side effect: more inline graph mutation work in the seed phase (workspace + root edges that hit warm cache from previous specs in the same root). Should reduce the number of fetch-result events that need to do graph mutations downstream, since orphan edges no longer accumulate. Targets the correctness bug from c02bb15 trace; perf impact TBD. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
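The hit/miss split in the fix can be sketched as below. This is a toy model, not the ruborist code: the `Walk` struct, its fields, and the use of a spec-to-deps map as the "cache" are all assumptions made for illustration. The parts taken from the message are the cache check up front, the `work_stack` recursion into newly created nodes, and parking misses in `edge_targets` plus `pending`.

```rust
use std::collections::{HashMap, HashSet};

/// (name, range) pair.
pub type Spec = (String, String);
/// (parent node id, edge index) waiting on a spec.
pub type EdgeRef = (usize, usize);

/// Toy walk state; field names loosely follow the commit message.
#[derive(Default)]
pub struct Walk {
    /// Already-settled specs, mapped to the deps of the resolved package.
    pub cached: HashMap<Spec, Vec<Spec>>,
    /// Edges parked until a fetch result for the spec lands.
    pub edge_targets: HashMap<Spec, Vec<EdgeRef>>,
    /// Specs queued for fetching.
    pub pending: Vec<Spec>,
    /// Edges resolved inline from the cache.
    pub resolved: Vec<EdgeRef>,
    placed: HashSet<Spec>,
    next_node: usize,
}

impl Walk {
    pub fn enqueue_node_edges(&mut self, parent: usize, edges: Vec<Spec>) {
        let mut work_stack = vec![(parent, edges)];
        while let Some((parent, edges)) = work_stack.pop() {
            for (edge_id, spec) in edges.into_iter().enumerate() {
                let hit = self.cached.get(&spec).cloned();
                match hit {
                    Some(child_edges) => {
                        // Cache hit: no fetch result will arrive for this
                        // spec, so resolve the edge right away.
                        self.resolved.push((parent, edge_id));
                        if self.placed.insert(spec) {
                            // Newly created node: walk its edges too.
                            self.next_node += 1;
                            work_stack.push((self.next_node, child_edges));
                        }
                    }
                    None => {
                        // Cache miss: park the edge; fetch the spec once.
                        let waiters = self.edge_targets.entry(spec.clone()).or_default();
                        if waiters.is_empty() {
                            self.pending.push(spec);
                        }
                        waiters.push((parent, edge_id));
                    }
                }
            }
        }
    }
}
```

In this model an edge pointing at a cached spec never lands in `edge_targets`, so it cannot be orphaned. The `placed` set plays the role of node reuse and keeps cached cycles from recursing forever.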

The 700ms gap between utoo p1 (folded mb_fetch_with_graph) and manifest-bench standalone needs network-layer evidence. Same workload, same conc, same network → why does utoo wall trail by 700ms when per-fetch latency is matched (avg_net=53us = mb p50=40us ish)?

Hypotheses to test via pcap diff:

* Fewer concurrent TCP streams in flight at any moment (utoo's main loop CPU steals tokio dispatch capacity → in-flight count drops below conc cap)
* More TLS handshakes (utoo's connection pool isn't reusing as effectively as mb's per-rep fresh client)
* Larger inter-packet gaps per stream (utoo's runtime pauses mid download)
* Different concurrent-stream-time profile (wave shape)

Adds two captures at end of pm-bench-pcap.sh:

    manifest-bench-c96   flat lockfile-derived names @ conc=96
    preload-bench-c96    transitive walk @ conc=96 (matches utoo's walk shape, but no graph build)

Each captured with the same tcpdump + iostat as the existing utoo / utoo-next / bun captures. analyze_pcap globs *.pcap so the new files get the same TCP signal extraction (zwin / retx / dup_ack / per-stream gap p50/p99/max / distinct streams).

Workflow: downloads manifest-bench-linux-x64 + preload-bench-linux-x64 artifacts (built by build-linux's benchmark-label conditional steps) into the pm-bench-pcap-linux job env so pm-bench-pcap.sh can find them.

Trigger: workflow_dispatch with target=pm-bench-pcap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Previous pm-bench-pcap artifact was 2GB (raw .pcap files for every PM × phase × bench), making the round-trip download impractical just to read JSON metrics. Adds a separate `pm-bench-pcap-summaries` artifact containing only the *.json / *.log / *.iostat.txt / dns.txt files — KB scale, downloads in seconds. Raw pcap artifact is preserved for cases where we want to re-run tshark with different filters. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The pm-bench-pcap artifact is ~2 GB (pcap binaries dominate). gh run download keeps timing out before completion. Two fixes:

1. New `pm-bench-pcap-summaries` artifact uploads only the JSON summaries + .log + iostat.txt + dns.txt (small, fast download). The full pcap artifact stays for deep inspection when needed.
2. End of pm-bench-pcap.sh prints a tab-separated comparison table (name, wall_s, packets, streams, zwin, retx, dup_ack, gap_p99_us, gap_max_us) to stdout, so the data is visible in the CI run log without downloading anything.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…raph

The pcap evidence (utoo-resolve zwin=71 vs mb-c96 zwin=49) confirmed main loop CPU was starving tokio runtime workers from polling sockets. Inline graph mutations (sum_graph=450ms across the fetch loop) blocked the worker between awaits, so TCP receive buffers filled and the server paused sending, directly extending wall.

This refactor:

* Spawns `graph_worker` as a separate tokio task (gets its own runtime worker thread on multi-thread runtime). Owns the DependencyGraph + edge_targets + seen_specs.
* Main loop owns FuturesUnordered + body_cache + dispatch state. No graph mutations on this path.
* mpsc channels: main → graph (FetchEventMsg, just the name; cache writes already in the future), graph → main (Vec<Dep> new pending specs to extend the fetch queue).
* `tokio::select!` with `biased` drains specs first to unblock fetch dispatch.
* `in_flight_graph` counter tracks outstanding messages to graph worker. Termination = futs empty + in_flight_graph == 0.

Function signature changed: takes `mut graph: DependencyGraph` by value, returns `(DependencyGraph, MbFetchStats)` since the worker task needs ownership of the graph (can't borrow across spawn). api.rs caller threads the graph through.

Expected: zwin drops back toward mb's ~49 (no more main loop starvation), eff_par_net climbs from 56 toward mb's 72, wall saves ~200ms.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
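The ownership split and the termination rule (nothing queued and `in_flight_graph == 0`) can be illustrated with a std-only sketch. It deliberately swaps tokio tasks, `FuturesUnordered`, and `tokio::select!` for `std::thread` and `std::sync::mpsc` so that it compiles without dependencies. `deps_of` is a hypothetical in-memory stand-in for registry fetches, and the "fetch" itself is reduced to forwarding a name.

```rust
use std::collections::{HashSet, VecDeque};
use std::sync::mpsc;
use std::thread;

/// Hypothetical dependency table standing in for registry manifests.
fn deps_of(name: &str) -> Vec<String> {
    match name {
        "root" => vec!["a".to_string(), "b".to_string()],
        "a" | "b" => vec!["c".to_string()],
        _ => Vec::new(),
    }
}

/// Graph side: owns the graph state (here just placement order and a seen
/// set). For every fetched name it replies with the specs not seen before.
fn graph_worker(rx: mpsc::Receiver<String>, tx: mpsc::Sender<Vec<String>>) -> Vec<String> {
    let mut placed = Vec::new();
    let mut seen = HashSet::new();
    for name in rx {
        seen.insert(name.clone());
        let new_specs: Vec<String> = deps_of(&name)
            .into_iter()
            .filter(|dep| seen.insert(dep.clone()))
            .collect();
        placed.push(name);
        if tx.send(new_specs).is_err() {
            break;
        }
    }
    placed
}

/// Main side: owns the fetch queue and dispatch state, never touches the graph.
fn resolve(root: &str) -> Vec<String> {
    let (to_graph, graph_rx) = mpsc::channel::<String>();
    let (to_main, main_rx) = mpsc::channel::<Vec<String>>();
    let worker = thread::spawn(move || graph_worker(graph_rx, to_main));

    let mut fetch_queue = VecDeque::from([root.to_string()]);
    let mut in_flight_graph = 0usize;
    loop {
        // "Fetch" everything queued and forward only the name to the graph side.
        while let Some(name) = fetch_queue.pop_front() {
            to_graph.send(name).expect("graph worker alive");
            in_flight_graph += 1;
        }
        // Termination: nothing queued and no outstanding graph messages.
        if in_flight_graph == 0 {
            break;
        }
        let new_specs = main_rx.recv().expect("graph worker alive");
        in_flight_graph -= 1;
        fetch_queue.extend(new_specs);
    }
    drop(to_graph); // closing the channel ends the worker loop
    worker.join().expect("graph worker panicked")
}
```

Calling `resolve("root")` returns the placement order from the worker. The counter matters because an empty fetch queue alone is not a stop condition: a reply still in flight may carry new specs.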

Plumb the PipelineReceiver through the folded mb_fetch_with_graph path so install (`utoo install`) gets the same channel-separated fetch + graph architecture as `utoo deps`, with download/clone pipelines starting as early as the legacy preload+BFS path:

- mb_fetch_with_graph now takes Arc<R: EventReceiver + 'static>; main loop emits PackageResolved on each fetch land (looked up via cache with the new FetchOutcome::primary_spec), graph_worker emits PackagePlaced on ProcessResult::Created.
- service::api::build_deps wraps the caller-supplied receiver in Arc once and shares it between mb_fetch_with_graph and build_deps_with_config; adds + 'static bound on R.
- pipeline_deps_options sets skip_preload=true so install routes through the same folded path as the lockfile-only command.

CI will validate that p1 resolve continues at/below 2.5s while p0_full_cold and p3_cold_install do not regress (download + clone pipelines remain saturated via emitted events).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Previously spawn_fetch / spawn_settle used the raw dep key as both the registry path segment and the cache lookup key. For an npm-alias dep like `"ms": "npm:raw-body@2.1.3"` this hit `registry/ms` instead of `registry/raw-body`, parsed ms's manifest against `npm:raw-body@2.1.3`, and ultimately installed the real ms into `node_modules/ms/` rather than raw-body. e2e `utoo-pm.sh:466` ("top-level ms should be raw-body") caught this on d1cf53e.

Fix:

- spawn_fetch / spawn_settle call `normalize_spec` to split out the real package name + spec; URL hits `registry/{real_name}` and the combined parse runs against `real_spec` so version resolution sees the right manifest envelope.
- Cache writes go under both keys: the original `(alias_name, alias_spec)` so `graph_worker` finds the manifest via `edge_targets`, and the normalized `(real_name, resolved_version)` for direct-dep dedup.
- Main loop dedup state (in_flight_names / deferred_by_name / body_cache) keys by real_name so two distinct aliases pointing at the same registry package share dedup; deferred entries store `(alias_name, spec)` so the drain spawns spawn_settle with the correct cache key.
- Adds `real_name` to FetchOutcome so the deferred-drain step can look up by real name without re-normalizing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
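To make the alias split concrete, here is a minimal sketch of what a `normalize_spec`-style helper might do. The real function's signature and edge-case handling are not shown in this PR, so the signature below is hypothetical, and treating a version-less alias as `*` is an assumption.

```rust
/// Splits an npm-alias spec into (real_name, real_spec).
/// Non-alias specs are returned unchanged under the dependency key.
/// Hypothetical signature; the ruborist helper may differ.
fn normalize_spec<'a>(dep_key: &'a str, spec: &'a str) -> (&'a str, &'a str) {
    let Some(target) = spec.strip_prefix("npm:") else {
        return (dep_key, spec);
    };
    // Skip a leading '@' so scoped names such as "@scope/pkg@1.0.0" split at
    // the version separator rather than at the scope marker.
    let search_from = usize::from(target.starts_with('@'));
    match target[search_from..].find('@') {
        Some(i) => {
            let at = search_from + i;
            (&target[..at], &target[at + 1..])
        }
        // Assumption: an alias without a version means "any version".
        None => (target, "*"),
    }
}
```

For the failing case from the e2e test, `normalize_spec("ms", "npm:raw-body@2.1.3")` yields `("raw-body", "2.1.3")`, so the registry URL uses `raw-body` while the cache can still be written under the alias key `ms`.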

GHA ubuntu-latest is a 2-core runner; tokio's default worker_threads = num_cpus = 2. The install hot path multiplexes four concurrent task families on those workers:

- mb_fetch_with_graph main loop (drives sockets + FuturesUnordered)
- graph_worker (`tokio::spawn`, CPU-heavy)
- pipeline download workers (PackageResolved → tarball fetch)
- pipeline clone workers (PackagePlaced → hardlink/clonefile)

Under that load, graph_worker can monopolize a worker thread for tens of ms at a stretch, starving the main loop's socket polling. The symptom is a wave-shape collapse (eff_par_full 73-77 → 40, mb_fetch wall 4-6s → 10s+) that pushes p0_full_cold tail by 3-5s on the affected run.

Floor worker_threads at `max(num_cpus, 4)` so the runtime always has headroom to keep the resolve hot path on its own worker even when the install pipeline saturates the others. No-op on 4+ core machines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
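The floor itself is a one-liner; a std-only sketch follows. The function and constant names are hypothetical, and the commit does not show how utoo detects the core count, so `std::thread::available_parallelism` is an assumption here.

```rust
use std::num::NonZeroUsize;
use std::thread;

/// Minimum number of runtime worker threads, per the commit message.
const MIN_WORKER_THREADS: usize = 4;

/// `max(num_cpus, 4)`: raises 2-core runners to 4 workers and leaves
/// machines with 4 or more cores unchanged.
fn floored_worker_threads(num_cpus: usize) -> usize {
    num_cpus.max(MIN_WORKER_THREADS)
}

/// Applies the floor to the detected core count, falling back to 1 core if
/// detection fails.
fn detected_worker_threads() -> usize {
    let cpus = thread::available_parallelism()
        .map(NonZeroUsize::get)
        .unwrap_or(1);
    floored_worker_threads(cpus)
}
```

The result would typically be handed to `tokio::runtime::Builder::new_multi_thread().worker_threads(n)` when the runtime is built. Note that extra worker threads on a 2-core machine add scheduling headroom, not CPU capacity.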

graph_worker is pure CPU + channel IO; on a multi-thread tokio runtime it sat on a worker thread that the install pipeline (download + clone + extract) and the resolve main loop were also competing for. Under the 2-core GHA ubuntu runner that contention produced the eff_par_full collapse (73-77 → 40) and 4-6s → 10s+ mb_fetch wall on the p0/p1 outlier runs.

Move it to `tokio::task::spawn_blocking`:

- Convert graph_worker from `async fn` to sync `fn`; channel IO uses `mpsc::Receiver::blocking_recv` and `mpsc::Sender::blocking_send`, which are tokio-supported when called outside an async context.
- Spawn site wraps it in a `move ||` closure so spawn_blocking owns the captured state. `graph_handle.await` keeps the same shape, since a spawn_blocking JoinHandle is awaitable.

The blocking pool defaults to 512 threads, so reserving one slot for graph_worker has no scheduling effect on download/clone/extract spawn_blocking calls.

Net effect: the resolve worker can no longer be preempted by graph mutation bursts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…client builder

PR #2916/#2920's channel architecture broke utooweb-ci wasm build:

    error[E0277]: `Rc<RefCell<wasm_bindgen_futures::Inner>>` cannot be sent between threads safely
    error[E0599]: no method named `no_proxy` found for struct `ClientBuilder` in the current scope
    error[E0277]: `*mut u8` cannot be sent between threads safely
    note: required because it appears within the type `reqwest::wasm::AbortGuard`

Two root causes:

1. `mb_resolve.rs::mb_fetch_with_graph` calls `tokio::task::spawn_blocking(move || graph_worker(...))` which requires Send on the closure. wasm32 reqwest's `AbortGuard` contains `Rc<RefCell<...>>` which is not Send. wasm tokio is single-threaded with no `spawn_blocking` semantics either.
2. `mb_resolve.rs::build_mb_client` wasm32 variant called `.no_proxy()` which doesn't exist on wasm reqwest's `ClientBuilder` (proxy settings live in browser fetch config).

Fix:

- Gate `mb_fetch_with_graph` and its caller in `service::api::build_deps` with `#[cfg(not(target_arch = "wasm32"))]`. wasm32 falls back to the legacy preload + BFS path (`build_deps_with_config`).
- Drop `.no_proxy()` from the wasm32 `build_mb_client` body.

Net effect: native keeps the channel mb_fetch_with_graph for the p1 win; wasm regains a buildable resolve path via legacy preload+BFS. The legacy code in `resolver::preload` and BFS in `resolver::builder` must remain reachable for the wasm carve-out; plans to delete legacy code post-channel must keep a wasm32-gated fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PR #2924 cfg-gated mb_fetch_with_graph but utooweb-ci still failed:

    error[E0432]: unresolved import `crate::resolver::mb_resolve::mb_fetch_with_graph`
    error[E0277]: `Rc<RefCell<wasm_bindgen_futures::Inner>>` cannot be sent between threads safely
      in `crates/ruborist/src/resolver/fast_preload.rs:225`
      in `crates/ruborist/src/resolver/mb_resolve.rs:229`

Two more places miss wasm gating:

1. `fast_preload.rs` builds `Pin<Box<dyn Future + Send>>` over reqwest in its FuturesUnordered. wasm reqwest's response futures hold `Rc<RefCell<wasm_bindgen_futures::Inner>>` which is !Send → entire module won't compile on wasm32.
2. `mb_resolve.rs::Fut` type alias is also `Pin<Box<dyn Future + Send>>` so even non-mb_fetch_with_graph functions in the file fail Send.

Fix at the module boundary in `resolver/mod.rs`:

    #[cfg(not(target_arch = "wasm32"))]
    pub mod fast_preload;
    #[cfg(not(target_arch = "wasm32"))]
    pub mod mb_resolve;

Plus gate the import in `service::api` with the same cfg. wasm callers keep using legacy preload + BFS via `build_deps_with_config`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PR #2916 added `skip_preload` field to `BuildDepsOptions`. Native callers in `pm/helper/ruborist_context.rs` were updated, but `utoo-wasm/src/deps.rs` constructs the options inline and was missed:

    error[E0063]: missing field `skip_preload` in initializer of `BuildDepsOptions<_, _>`

Set `skip_preload: false` so wasm callers stay on the legacy preload + BFS path (channel mb_fetch_with_graph requires multi-thread tokio + Send-safe types, both unavailable on wasm32).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two p1_resolve micro-opts targeting the remaining gap to manifest-bench theoretical optimum (~2.0s) on linux GHA. PR #2920's clean linux p1 landed at 2.73-2.88s; bun's best is 1.87s.

1. **MANIFESTS_CONCURRENCY 96 → 128**: 96 was chosen pre-channel-arch to avoid spawn_blocking pool contention. The channel architecture (PR #2916/#2920) moved graph_worker to the blocking pool and freed that pool for parse work. Manifest-bench h1 sweep on GHA showed 96-128 sweet spot; revisit 128 with clean A/B. Expected: lift eff_par_full from ~75 → 80+, ≈0.1-0.2s on p1.
2. **TCP/TLS warmup**: pre-issue 16 parallel HEAD requests to the registry root before the first fetch wave. Without warmup the first ~16 fetches each pay fresh TCP+TLS handshakes (50-100ms each on linux GHA), forming a visible wave-start ramp in eff_par_full instrumentation. The HEAD responses are discarded; the side effect (warm sockets in reqwest's pool_max_idle_per_host(256)) is what matters. Fire-and-forget: doesn't await; by the time the first fetch wave dispatches (~5-15ms of CPU on edge_targets walk), the HEADs are in flight. Expected: 0.05-0.1s on p1 by collapsing the ramp.

Falsification:

- if 128 regresses → revert to 96 (npmjs throttling exceeds queue benefit)
- if warmup HEADs cause registry-side rate limit / 429 → revert warmup count to 4-8 or drop entirely

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request implements several performance optimizations for dependency resolution, notably introducing a folded streaming graph build and optimizing manifest fetching by bypassing service layers in specific paths. It also adds standalone benchmarking tools (manifest-bench and preload-bench) and enhanced timing instrumentation. Reviewers suggested using the stable Rust 2021 edition instead of 2024, optimizing path parsing in the benchmark tool using rsplit, and adopting more idiomatic Arc cloning patterns for better readability.

gemini-code-assist · 2026-05-11T02:36:07Z

+[package]
+name        = "manifest-bench"
+version     = "0.0.0"
+edition     = "2024"


The Rust 2024 edition is not yet stable and using it might introduce build instabilities or require a specific toolchain version. It's generally recommended to use the latest stable edition, which is "2021", to ensure broader compatibility and stability.

Suggested change

edition = "2024"

edition = "2021"

gemini-code-assist · 2026-05-11T02:36:07Z

+fn last_module_name(key: &str) -> String {
+    let parts: Vec<&str> = key.split("node_modules/").collect();
+    let tail = parts.last().copied().unwrap_or("");
+    tail.to_string()
+}


The current implementation of last_module_name allocates an intermediate Vec which is not very efficient. You can achieve the same result more efficiently by using rsplit to find the last segment without collecting all parts into a vector.

fn last_module_name(key: &str) -> String {
    key.rsplit("node_modules/").next().unwrap_or("").to_string()
}
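A quick way to check that the suggestion is behavior-preserving is to keep both versions side by side and compare them on a few keys. The two function names below are hypothetical labels for the original and the suggested body; the sample keys are made up.

```rust
/// Original shape: collects every segment, then takes the last one.
fn last_module_name_collect(key: &str) -> String {
    let parts: Vec<&str> = key.split("node_modules/").collect();
    parts.last().copied().unwrap_or("").to_string()
}

/// Suggested shape: walks from the right and stops at the first segment,
/// so no intermediate Vec is allocated.
fn last_module_name_rsplit(key: &str) -> String {
    key.rsplit("node_modules/").next().unwrap_or("").to_string()
}
```

Both return `@scope/b` for `node_modules/a/node_modules/@scope/b`, and both return the whole key when it contains no `node_modules/` segment, because `split` and `rsplit` always yield at least one item.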

gemini-code-assist · 2026-05-11T02:36:07Z

+[package]
+name        = "preload-bench"
+version     = "0.0.0"
+edition     = "2024"


The Rust 2024 edition is not yet stable and using it might introduce build instabilities or require a specific toolchain version. It's generally recommended to use the latest stable edition, which is "2021", to ensure broader compatibility and stability.

Suggested change

edition = "2024"

edition = "2021"

gemini-code-assist · 2026-05-11T02:36:07Z

+) -> Option<Vec<(String, String)>> {
+    use simd_json::prelude::{ValueAsObject, ValueObjectAccess};
+
+    let mut buf = (**raw).clone();


While (**raw).clone() is correct, it's a bit hard to read due to multiple dereferences. A more idiomatic and readable way to achieve the same result (cloning the Vec<u8> from &Arc<Vec<u8>>) would be to use raw.to_vec(), which leverages Deref coercions.

Suggested change

let mut buf = (**raw).clone();

let mut buf = raw.to_vec();
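The equivalence claimed in this suggestion can be checked in isolation. In the sketch below the helper names are hypothetical; only the two expressions come from the review comment.

```rust
use std::sync::Arc;

/// Explicit form: deref `&Arc<Vec<u8>>` to `Vec<u8>`, then clone it.
fn copy_with_double_deref(raw: &Arc<Vec<u8>>) -> Vec<u8> {
    (**raw).clone()
}

/// Suggested form: auto-deref reaches `[u8]::to_vec`, which copies the bytes
/// into a fresh Vec.
fn copy_with_to_vec(raw: &Arc<Vec<u8>>) -> Vec<u8> {
    raw.to_vec()
}
```

Either way the result is an owned buffer that is independent of the shared `Arc`, which is presumably why the original code copies at all: `let mut buf` suggests the parse step needs a mutable byte buffer. Neither form bumps the reference count, unlike `raw.clone()`, which would only clone the `Arc` handle.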

github-actions · 2026-05-11T03:06:56Z

📊 pm-bench-phases · `d732182` · linux (`ubuntu-latest`)

Workflow run — ant-design

PMs: utoo (this branch) · utoo-npm (latest published) · bun (latest)

npmjs.org

p0_full_cold

PM	wall	±σ	user	sys	RSS	pgMinor
bun	9.46s	0.14s	10.43s	10.38s	731M	329.1K
utoo-next	8.27s	0.27s	10.71s	12.45s	963M	119.8K
utoo-npm	8.40s	0.06s	11.27s	12.91s	1.52G	187.6K
utoo	10.30s	3.53s	11.73s	12.99s	1.67G	219.7K

PM	vCtx	iCtx	netRX	netTX	cache	node_mod	lock
bun	14.9K	17.6K	1.20G	6M	1.89G	1.77G	1M
utoo-next	122.8K	91.9K	1.17G	5M	1.73G	1.73G	2M
utoo-npm	126.3K	105.4K	1.17G	5M	1.73G	1.73G	2M
utoo	137.2K	115.3K	1.18G	6M	1.73G	1.73G	3M

p1_resolve

PM	wall	±σ	user	sys	RSS	pgMinor
bun	2.22s	0.20s	3.96s	1.05s	504M	174.1K
utoo-next	3.22s	0.10s	5.50s	1.98s	608M	84.3K
utoo-npm	3.14s	0.06s	5.50s	1.97s	612M	88.7K
utoo	2.75s	0.06s	5.48s	1.12s	1.03G	149.6K

PM	vCtx	iCtx	netRX	netTX	cache	node_mod	lock
bun	9.4K	4.1K	203M	3M	107M	-	1M
utoo-next	71.3K	120.9K	201M	2M	7M	3M	3M
utoo-npm	70.9K	114.0K	201M	2M	7M	3M	2M
utoo	41.5K	9.7K	202M	3M	-	3M	2M

p3_cold_install

PM	wall	±σ	user	sys	RSS	pgMinor
bun	6.61s	0.26s	6.35s	10.09s	615M	206.8K
utoo-next	6.08s	0.12s	4.95s	10.85s	420M	59.5K
utoo-npm	6.57s	0.11s	5.47s	11.31s	929M	125.3K
utoo	7.89s	3.66s	5.20s	10.93s	695M	81.9K

PM	vCtx	iCtx	netRX	netTX	cache	node_mod	lock
bun	4.4K	6.9K	1.00G	4M	1.78G	1.78G	1M
utoo-next	93.8K	49.8K	1001M	2M	1.73G	1.73G	2M
utoo-npm	104.4K	65.8K	1001M	3M	1.73G	1.73G	2M
utoo	101.5K	70.0K	1001M	3M	1.73G	1.73G	2M

p4_warm_link

PM	wall	±σ	user	sys	RSS	pgMinor
bun	3.29s	0.06s	0.22s	2.37s	139M	34.4K
utoo-next	2.46s	0.21s	0.49s	3.79s	79M	18.1K
utoo-npm	2.37s	0.20s	0.52s	3.85s	83M	18.8K
utoo	2.30s	0.07s	0.48s	3.76s	79M	18.0K

PM	vCtx	iCtx	netRX	netTX	cache	node_mod	lock
bun	261	23	5M	30K	1.93G	1.77G	1M
utoo-next	42.3K	17.7K	1K	6K	1.73G	1.73G	2M
utoo-npm	46.2K	20.2K	16K	9K	1.73G	1.73G	2M
utoo	40.5K	17.4K	2K	11K	1.73G	1.73G	2M

npmmirror.com

p0_full_cold

PM	wall	±σ	user	sys	RSS	pgMinor
bun	19.98s	0.49s	9.54s	9.81s	557M	388.3K
utoo-next	17.48s	1.82s	7.02s	12.91s	517M	61.7K
utoo-npm	15.88s	0.50s	7.72s	13.84s	973M	118.2K
utoo	10.01s	0.10s	11.43s	12.84s	1.63G	213.2K

PM	vCtx	iCtx	netRX	netTX	cache	node_mod	lock
bun	51.7K	6.0K	1.16G	9M	1.89G	1.76G	2M
utoo-next	197.7K	113.2K	1018M	8M	1.73G	1.73G	2M
utoo-npm	207.4K	174.8K	1018M	8M	1.73G	1.73G	3M
utoo	138.9K	142.0K	1.14G	7M	1.73G	1.73G	2M

p1_resolve

PM	wall	±σ	user	sys	RSS	pgMinor
bun	1.82s	0.17s	3.96s	1.19s	563M	190.0K
utoo-next	5.57s	6.16s	1.44s	0.92s	86M	21.4K
utoo-npm	1.98s	0.04s	1.35s	0.88s	88M	21.3K
utoo	2.44s	0.02s	5.29s	1.02s	1.02G	135.0K

PM	vCtx	iCtx	netRX	netTX	cache	node_mod	lock
bun	10.3K	5.0K	155M	3M	109M	-	2M
utoo-next	42.4K	55.4K	16M	2M	-	3M	2M
utoo-npm	39.3K	73.5K	16M	2M	-	3M	2M
utoo	32.8K	9.1K	151M	2M	-	3M	3M

p3_cold_install

PM	wall	±σ	user	sys	RSS	pgMinor
bun	15.71s	0.23s	6.00s	8.98s	273M	121.2K
utoo-next	19.92s	0.62s	5.68s	11.68s	385M	47.3K
utoo-npm	33.70s	23.09s	6.31s	12.84s	633M	100.1K
utoo	12.59s	0.56s	5.72s	11.61s	457M	58.7K

PM	vCtx	iCtx	netRX	netTX	cache	node_mod	lock
bun	25.9K	3.7K	1019M	6M	1.74G	1.74G	2M
utoo-next	161.6K	58.2K	1015M	6M	1.72G	1.72G	2M
utoo-npm	177.8K	104.0K	1002M	6M	1.72G	1.72G	2M
utoo	129.4K	135.9K	1003M	5M	1.72G	1.72G	2M

p4_warm_link

PM	wall	±σ	user	sys	RSS	pgMinor
bun	3.41s	0.06s	0.22s	2.34s	138M	32.5K
utoo-next	2.43s	0.07s	0.53s	4.00s	84M	19.7K
utoo-npm	2.36s	0.10s	0.53s	4.04s	86M	19.6K
utoo	2.44s	0.22s	0.53s	4.01s	84M	18.8K

PM	vCtx	iCtx	netRX	netTX	cache	node_mod	lock
bun	388	23	2M	39K	1.87G	1.77G	2M
utoo-next	43.6K	20.9K	328K	23K	1.73G	1.73G	3M
utoo-npm	48.4K	21.4K	330K	37K	1.73G	1.73G	3M
utoo	43.6K	19.7K	330K	39K	1.73G	1.73G	3M

github-actions · 2026-05-11T03:19:19Z

📊 pm-bench-phases · `d732182` · mac (`macos-latest`)

Workflow run — ant-design

PMs: utoo (this branch) · utoo-npm (latest published) · bun (latest)

npmjs.org

p0_full_cold

PM	wall	±σ	user	sys	RSS	pgMinor
bun	13.73s	0.27s	5.51s	14.29s	758M	49.0K
utoo-npm	13.62s	0.58s	7.14s	14.98s	1.01G	105.3K
utoo	12.31s	0.49s	6.76s	14.01s	950M	68.8K

PM	vCtx	iCtx	netRX	netTX	cache	node_mod	lock
bun	16.6K	152.3K	-	-	1.80G	1.95G	1M
utoo-npm	12.8K	328.9K	-	-	1.67G	1.88G	2M
utoo	10.3K	311.0K	-	-	1.64G	1.92G	3M

p1_resolve

PM	wall	±σ	user	sys	RSS	pgMinor
bun	2.40s	0.05s	2.40s	0.96s	491M	32.0K
utoo-npm	3.04s	0.07s	3.54s	1.83s	593M	39.8K
utoo	2.79s	0.14s	3.97s	0.98s	629M	42.0K

PM	vCtx	iCtx	netRX	netTX	cache	node_mod	lock
bun	7	27.2K	-	-	112M	-	1M
utoo-npm	4	193.5K	-	-	27M	3M	2M
utoo	21	65.7K	-	-	-	3M	2M

p3_cold_install

PM	wall	±σ	user	sys	RSS	pgMinor
bun	13.75s	2.63s	3.45s	16.15s	515M	33.5K
utoo-npm	11.30s	1.85s	3.28s	12.86s	759M	76.3K
utoo	10.24s	0.87s	2.95s	12.57s	574M	38.4K

PM	vCtx	iCtx	netRX	netTX	cache	node_mod	lock
bun	4.9K	136.5K	-	-	1.73G	1.97G	1M
utoo-npm	1.7K	277.7K	-	-	1.64G	1.88G	2M
utoo	1.5K	266.7K	-	-	1.64G	1.88G	2M

p4_warm_link

PM	wall	±σ	user	sys	RSS	pgMinor
bun	5.03s	0.20s	0.12s	2.27s	52M	3.9K
utoo-npm	4.02s	0.30s	0.41s	2.88s	91M	6.8K
utoo	5.57s	1.40s	0.42s	4.09s	79M	6.0K

PM	vCtx	iCtx	netRX	netTX	cache	node_mod	lock
bun	14.8K	742	-	-	1.90G	1.94G	1M
utoo-npm	12.7K	69.1K	-	-	1.64G	1.92G	2M
utoo	14.6K	60.5K	-	-	1.64G	1.92G	2M

npmmirror.com

p0_full_cold

PM	wall	±σ	user	sys	RSS	pgMinor
bun	28.18s	1.37s	6.96s	20.59s	577M	37.3K
utoo-npm	30.98s	1.30s	7.13s	22.76s	814M	80.9K
utoo	28.77s	4.49s	10.05s	22.12s	1008M	67.0K

PM	vCtx	iCtx	netRX	netTX	cache	node_mod	lock
bun	14.8K	173.1K	-	-	1.83G	1.94G	2M
utoo-npm	892	448.4K	-	-	1.64G	1.91G	2M
utoo	1.5K	422.9K	-	-	1.64G	1.92G	2M

p1_resolve

PM	wall	±σ	user	sys	RSS	pgMinor
bun	2.89s	0.34s	3.09s	1.67s	549M	35.7K
utoo-npm	12.01s	11.03s	2.43s	1.73s	97M	6.9K
utoo	3.11s	0.46s	4.50s	1.33s	597M	40.6K

PM	vCtx	iCtx	netRX	netTX	cache	node_mod	lock
bun	5	23.9K	-	-	114M	-	2M
utoo-npm	4	182.8K	-	-	-	3M	2M
utoo	21	64.7K	-	-	-	3M	3M

p3_cold_install

PM	wall	±σ	user	sys	RSS	pgMinor
bun	18.36s	0.81s	3.91s	18.20s	327M	21.5K
utoo-npm	33.70s	3.23s	5.21s	19.42s	704M	76.8K
utoo	48.47s	23.80s	5.97s	24.20s	484M	32.7K

PM	vCtx	iCtx	netRX	netTX	cache	node_mod	lock
bun	1.9K	163.8K	-	-	1.67G	1.94G	2M
utoo-npm	1.6K	342.1K	-	-	1.64G	1.91G	3M
utoo	1.5K	367.0K	-	-	1.64G	1.91G	3M

p4_warm_link

PM	wall	±σ	user	sys	RSS	pgMinor
bun	5.41s	2.24s	0.10s	2.35s	47M	3.6K
utoo-npm	4.50s	0.60s	0.54s	3.56s	91M	6.7K
utoo	5.17s	0.21s	0.51s	3.58s	88M	6.7K

PM	vCtx	iCtx	netRX	netTX	cache	node_mod	lock
bun	14.3K	1.1K	-	-	1.81G	1.94G	2M
utoo-npm	12.0K	70.3K	-	-	1.64G	1.91G	3M
utoo	13.8K	58.6K	-	-	1.64G	1.91G	3M

elrrrrrrr · 2026-05-11T08:42:24Z

Superseded by #2929 (perf/p1-sibling-two-phase) — same p1 attack line.

elrrrrrrr and others added 30 commits May 8, 2026 21:56

elrrrrrrr and others added 8 commits May 9, 2026 23:54

elrrrrrrr added the benchmark Run pm-bench on PR label May 11, 2026

gemini-code-assist Bot reviewed May 11, 2026


elrrrrrrr closed this May 11, 2026

elrrrrrrr wants to merge 38 commits into next from perf/p1-concurrency-warmup
