feature: revise node states to be simpler (WIP) #298
Conversation
Pull request overview
This PR simplifies the node health/state model exposed by `/api/status` and consumed by the dashboard UI, introducing a small typed live-state set (`client|standby|loading|serving`) and separating provider-backed “wakeable” inventory from live topology/peers.
Changes:
- Adds `node_state` and `peers[].state` (typed, lowercase) to `/api/status`, while keeping `node_status` as a Title Case compatibility alias.
- Updates the UI topology + node sidebar rendering to use the new live-state model and formatting helpers.
- Introduces a new “Wakeable Capacity” dashboard section (plus tests) backed by a local runtime inventory seam.
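To make the additive contract concrete, here is a hypothetical sketch of the payload fields involved. Only the field names (`node_state`, `node_status`, `peers[].state`) and the four-state set come from this PR; the surrounding payload shape and the optional wakeable field are illustrative assumptions.

```typescript
// Illustrative sketch of the /api/status fields this PR touches.
type LiveNodeState = "client" | "standby" | "loading" | "serving";

interface StatusPayload {
  node_id: string;
  node_state: LiveNodeState; // new typed, lowercase field
  node_status: string;       // Title Case compatibility alias, e.g. "Serving"
  peers: { id: string; state: LiveNodeState }[];
  // Hypothetical: wakeable inventory exposed separately, never as live peers.
  wakeable_nodes?: { id: string; provider: string }[];
}

const example: StatusPayload = {
  node_id: "node-a",
  node_state: "serving",
  node_status: "Serving",
  peers: [{ id: "node-b", state: "standby" }],
};

console.log(example.node_state); // "serving"
```

Older consumers keep reading `node_status`; new consumers switch to the typed fields.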
Reviewed changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| mesh-llm/ui/src/features/dashboard/components/topology/MeshTopologyDiagram.tsx | Renders topology nodes using state + formatLiveNodeState instead of legacy statusLabel. |
| mesh-llm/ui/src/features/dashboard/components/details/NodeSidebar.tsx | Displays live-state pill based on LiveNodeState (label/tone/tooltip) rather than string labels. |
| mesh-llm/ui/src/features/dashboard/components/WakeableCapacity.tsx | New dashboard card to display wakeable provider-backed inventory separately from live peers. |
| mesh-llm/ui/src/features/dashboard/components/WakeableCapacity.test.tsx | Adds UI test coverage for wakeable capacity rendering/visibility rules. |
| mesh-llm/ui/src/features/dashboard/components/DashboardPage.tsx | Switches dashboard status/peer rendering to node_state + peer.state and adds WakeableCapacity section. |
| mesh-llm/ui/src/features/app-shell/lib/topology-types.ts | Updates TopologyNode to carry state: LiveNodeState and removes statusLabel. |
| mesh-llm/ui/src/features/app-shell/lib/status-types.ts | Introduces LiveNodeState, WakeableNode types, and label mapping constants; extends payload/peer types. |
| mesh-llm/ui/src/features/app-shell/lib/status-types.test.ts | Adds type-level contract tests for required/optional fields and allowed state values. |
| mesh-llm/ui/src/features/app-shell/lib/status-helpers.ts | Adds formatLiveNodeState and updates tone/tooltip helpers to operate on LiveNodeState. |
| mesh-llm/ui/src/features/app-shell/lib/status-helpers.test.ts | Adds tests for formatter/tone/tooltip + localRoutableModels using node_state. |
| mesh-llm/ui/src/App.tsx | Updates topology node construction to include state (from node_state/peer.state) and removes legacy statusLabel derivation. |
| mesh-llm/ui/src/App.test.tsx | Updates fixtures and adds a test ensuring dashboard labels come from node_state/peer.state. |
| mesh-llm/src/runtime/wakeable.rs | Adds a local, in-memory wakeable inventory store with typed states and tests. |
| mesh-llm/src/runtime/mod.rs | Exposes the new wakeable runtime module. |
| mesh-llm/src/protocol/convert.rs | Refactors legacy GPU tuple return into a struct for clarity/maintainability. |
| mesh-llm/src/network/openai/transport.rs | Threads the buffered request object through routing instead of separate body/prefetch/adapter args. |
| mesh-llm/src/network/openai/ingress.rs | Updates routing calls to the new transport signatures and minor iterator cleanups. |
| mesh-llm/src/api/status.rs | Adds serialized NodeState, wakeable node payload types, and includes them in /api/status. |
| mesh-llm/src/api/state.rs | Adds wakeable_inventory to API shared state. |
| mesh-llm/src/api/mod.rs | Implements local/peer live-state derivation and plumbs wakeable inventory into the status payload + tests. |
| mesh-llm/docs/TESTING.md | Updates testing checklist to include new live-state + wakeable capacity checks. |
| mesh-llm/docs/DESIGN.md | Renames “Node Roles” section to “Topology Roles” and documents live-state badges at a high level. |
| README.md | Updates console description to reflect simplified live-state badges and separate wakeable capacity. |
```diff
 const localClientVram = overviewVramGb(status.node_state === "client", status.my_vram_gb);
 if (localServing && status.node_state !== "client") {
   rows.push({
     id: status.node_id,
     latencyLabel: "local",
-    vramLabel: `${localVram.toFixed(1)} GB`,
+    vramLabel: `${localClientVram.toFixed(1)} GB`,
     shareLabel:
       totalModelVram > 0
-        ? `${Math.round((localVram / totalModelVram) * 100)}%`
+        ? `${Math.round((localClientVram / totalModelVram) * 100)}%`
```
`localClientVram` is computed and used only on the non-client path (`if (localServing && status.node_state !== "client")`), so the name is misleading (it’s effectively the local node’s contributing VRAM). Renaming this to something like `localOverviewVramGb` (and using that name consistently in the % calculation) would reduce the chance of future misuse.
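A minimal sketch of the suggested rename (the function and variable names here are hypothetical, taken from the reviewer's suggestion rather than the codebase): the value is the local node's contributing VRAM on the overview path, so the name and the share calculation should agree.

```typescript
// Sketch: the renamed value feeds the share calculation consistently.
function shareLabel(localOverviewVramGb: number, totalModelVram: number): string {
  return totalModelVram > 0
    ? `${Math.round((localOverviewVramGb / totalModelVram) * 100)}%`
    : "n/a"; // placeholder for the zero-total case
}

console.log(shareLabel(8, 32)); // "25%"
```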
```diff
 - Joiner scans the Hugging Face cache and picks an unserved model already on disk
-- Log: "Assigned to serve GLM-4.7-Flash (needed by mesh, already on disk)"
+- Log: "Selected to serve GLM-4.7-Flash (needed by mesh, already on disk)"
```
The updated TESTING.md log snippet says `Selected to serve ...`, but the runtime currently logs `📋 Assigned to serve ...` (see `mesh-llm/src/runtime/mod.rs` around the auto-assignment `eprintln`s). To keep the testing guide accurate, either revert the doc string or update the runtime log message to match.
Suggested change:

```diff
-- Log: "Selected to serve GLM-4.7-Flash (needed by mesh, already on disk)"
+- Log: "📋 Assigned to serve GLM-4.7-Flash (needed by mesh, already on disk)"
```
```diff
 let has_local_worker_activity = has_local_processes || !my_hosted_models.is_empty();
 let has_split_workers = all_peers.iter().any(|p| {
     matches!(p.role, mesh::NodeRole::Worker)
         && p.is_assigned_model(display_model_name.as_str())
 });
-let node_status = Self::derive_node_status(
+let node_state = Self::derive_local_node_state(
     is_client,
     effective_is_host,
     effective_llama_ready,
     has_local_worker_activity,
     has_split_workers,
     display_model_name.as_str(),
```
`derive_local_node_state` can emit `Standby` for a worker that has been assigned work but hasn’t started a local process yet: `display_model_name` can come from `my_serving_models`, but `has_local_worker_activity` only checks `has_local_processes || !my_hosted_models.is_empty()`. Since `serving_models` explicitly includes assigned-but-unhealthy work, consider including `!my_serving_models.is_empty()` (or a dedicated `has_assigned_work` flag) in the inputs so assigned-but-unready nodes surface as `Loading` instead of `Standby`.
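The reviewer's point can be sketched as a decision table. This is an illustrative TypeScript model of the state derivation, not the Rust implementation; the flag names (`hasAssignedWork` in particular) are hypothetical inputs corresponding to the suggestion above.

```typescript
type LiveNodeState = "client" | "standby" | "loading" | "serving";

// Illustrative derivation: assigned-but-unready work counts as "loading",
// not "standby", because the node is spinning up rather than idle.
function deriveLocalNodeState(opts: {
  isClient: boolean;
  llamaReady: boolean;
  hasLocalWorkerActivity: boolean;
  hasAssignedWork: boolean; // e.g. the Rust side's !my_serving_models.is_empty()
}): LiveNodeState {
  if (opts.isClient) return "client";
  if (opts.llamaReady) return "serving";
  if (opts.hasLocalWorkerActivity || opts.hasAssignedWork) return "loading";
  return "standby";
}

// A worker assigned a model but with no local process yet:
console.log(
  deriveLocalNodeState({
    isClient: false,
    llamaReady: false,
    hasLocalWorkerActivity: false,
    hasAssignedWork: true,
  })
); // "loading"
```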
Summary

Users can now read node health with a much simpler live-state model. The console shows only `Client`, `Standby`, `Loading`, and `Serving` for live members, while provider-backed suspended capacity appears in a separate `Wakeable Capacity` section instead of being mixed into topology or routing.

After writing up #241 and chatting with @michaelneale about some hosted GPU capacity in the mesh, I decided to extend this state idea a little to include the concept of "Wakeable Capacity". I don't know yet which mechanism we would actually use to wake up a node (i.e. send an AWS / GCP / Azure / whatever API call to boot a VM), but I pre-emptively planned for it in the state changes I was making.

Screenshot

Architecture

- `node_state` and the peer `state` field are additive machine-readable fields.
- `node_status` stays as the Title Case compatibility alias for older consumers.
- Wakeable inventory is kept out of `peers[]`, live topology, peer counts, and host selection.

Protocol
Mixed-version compatibility is preserved. Older peers still classify correctly, and wakeable inventory is not advertised as a live peer or protocol change. It is exposed as separate dashboard and status data only, not as routable capacity.
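One way a mixed-version consumer could honor this compatibility story is to prefer the new typed field and fall back to lowercasing the legacy alias. The field names come from this PR; the fallback logic below is a hedged sketch, not code from the UI.

```typescript
type LiveNodeState = "client" | "standby" | "loading" | "serving";
const LIVE_STATES: readonly LiveNodeState[] = ["client", "standby", "loading", "serving"];

// Prefer node_state; fall back to the Title Case node_status alias from
// older servers. Unknown values yield undefined rather than a bad state.
function readNodeState(payload: {
  node_state?: string;
  node_status?: string;
}): LiveNodeState | undefined {
  const raw = payload.node_state ?? payload.node_status?.toLowerCase();
  return LIVE_STATES.find((s) => s === raw);
}

console.log(readNodeState({ node_status: "Serving" })); // "serving"
console.log(readNodeState({ node_state: "standby" }));  // "standby"
```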