feature: revise node states to be simpler (WIP) #298
Conversation
Pull request overview
This PR simplifies the node health/state model exposed by `/api/status` and consumed by the dashboard UI, introducing a small typed live-state set (`client|standby|loading|serving`) and separating provider-backed “wakeable” inventory from live topology/peers.
Changes:
- Adds `node_state` and `peers[].state` (typed, lowercase) to `/api/status`, while keeping `node_status` as a Title Case compatibility alias.
- Updates the UI topology + node sidebar rendering to use the new live-state model and formatting helpers.
- Introduces a new “Wakeable Capacity” dashboard section (plus tests) backed by a local runtime inventory seam.
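To make the additive contract concrete, here is a hypothetical sketch of the payload fields involved. Only the field names (`node_state`, `node_status`, `peers[].state`) and the four-state set come from this PR; the surrounding payload shape and the optional wakeable field are illustrative assumptions.

```typescript
// Illustrative sketch of the /api/status fields this PR touches.
type LiveNodeState = "client" | "standby" | "loading" | "serving";

interface StatusPayload {
  node_id: string;
  node_state: LiveNodeState; // new typed, lowercase field
  node_status: string;       // Title Case compatibility alias, e.g. "Serving"
  peers: { id: string; state: LiveNodeState }[];
  // Hypothetical: wakeable inventory exposed separately, never as live peers.
  wakeable_nodes?: { id: string; provider: string }[];
}

const example: StatusPayload = {
  node_id: "node-a",
  node_state: "serving",
  node_status: "Serving",
  peers: [{ id: "node-b", state: "standby" }],
};

console.log(example.node_state); // "serving"
```

Older consumers keep reading `node_status`; new consumers switch to the typed fields.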
Reviewed changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| mesh-llm/ui/src/features/dashboard/components/topology/MeshTopologyDiagram.tsx | Renders topology nodes using state + formatLiveNodeState instead of legacy statusLabel. |
| mesh-llm/ui/src/features/dashboard/components/details/NodeSidebar.tsx | Displays live-state pill based on LiveNodeState (label/tone/tooltip) rather than string labels. |
| mesh-llm/ui/src/features/dashboard/components/WakeableCapacity.tsx | New dashboard card to display wakeable provider-backed inventory separately from live peers. |
| mesh-llm/ui/src/features/dashboard/components/WakeableCapacity.test.tsx | Adds UI test coverage for wakeable capacity rendering/visibility rules. |
| mesh-llm/ui/src/features/dashboard/components/DashboardPage.tsx | Switches dashboard status/peer rendering to node_state + peer.state and adds WakeableCapacity section. |
| mesh-llm/ui/src/features/app-shell/lib/topology-types.ts | Updates TopologyNode to carry state: LiveNodeState and removes statusLabel. |
| mesh-llm/ui/src/features/app-shell/lib/status-types.ts | Introduces LiveNodeState, WakeableNode types, and label mapping constants; extends payload/peer types. |
| mesh-llm/ui/src/features/app-shell/lib/status-types.test.ts | Adds type-level contract tests for required/optional fields and allowed state values. |
| mesh-llm/ui/src/features/app-shell/lib/status-helpers.ts | Adds formatLiveNodeState and updates tone/tooltip helpers to operate on LiveNodeState. |
| mesh-llm/ui/src/features/app-shell/lib/status-helpers.test.ts | Adds tests for formatter/tone/tooltip + localRoutableModels using node_state. |
| mesh-llm/ui/src/App.tsx | Updates topology node construction to include state (from node_state/peer.state) and removes legacy statusLabel derivation. |
| mesh-llm/ui/src/App.test.tsx | Updates fixtures and adds a test ensuring dashboard labels come from node_state/peer.state. |
| mesh-llm/src/runtime/wakeable.rs | Adds a local, in-memory wakeable inventory store with typed states and tests. |
| mesh-llm/src/runtime/mod.rs | Exposes the new wakeable runtime module. |
| mesh-llm/src/protocol/convert.rs | Refactors legacy GPU tuple return into a struct for clarity/maintainability. |
| mesh-llm/src/network/openai/transport.rs | Threads the buffered request object through routing instead of separate body/prefetch/adapter args. |
| mesh-llm/src/network/openai/ingress.rs | Updates routing calls to the new transport signatures and minor iterator cleanups. |
| mesh-llm/src/api/status.rs | Adds serialized NodeState, wakeable node payload types, and includes them in /api/status. |
| mesh-llm/src/api/state.rs | Adds wakeable_inventory to API shared state. |
| mesh-llm/src/api/mod.rs | Implements local/peer live-state derivation and plumbs wakeable inventory into the status payload + tests. |
| mesh-llm/docs/TESTING.md | Updates testing checklist to include new live-state + wakeable capacity checks. |
| mesh-llm/docs/DESIGN.md | Renames “Node Roles” section to “Topology Roles” and documents live-state badges at a high level. |
| README.md | Updates console description to reflect simplified live-state badges and separate wakeable capacity. |
```diff
 const localClientVram = overviewVramGb(status.node_state === "client", status.my_vram_gb);
 if (localServing && status.node_state !== "client") {
   rows.push({
     id: status.node_id,
     latencyLabel: "local",
-    vramLabel: `${localVram.toFixed(1)} GB`,
+    vramLabel: `${localClientVram.toFixed(1)} GB`,
     shareLabel:
       totalModelVram > 0
-        ? `${Math.round((localVram / totalModelVram) * 100)}%`
+        ? `${Math.round((localClientVram / totalModelVram) * 100)}%`
```
`localClientVram` is computed and used only on the non-client path (`if (localServing && status.node_state !== "client")`), so the name is misleading (it’s effectively the local node’s contributing VRAM). Renaming this to something like `localOverviewVramGb` (and using that name consistently in the % calculation) would reduce the chance of future misuse.
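A minimal sketch of the suggested rename (the function and variable names here are hypothetical, taken from the reviewer's suggestion rather than the codebase): the value is the local node's contributing VRAM on the overview path, so the name and the share calculation should agree.

```typescript
// Sketch: the renamed value feeds the share calculation consistently.
function shareLabel(localOverviewVramGb: number, totalModelVram: number): string {
  return totalModelVram > 0
    ? `${Math.round((localOverviewVramGb / totalModelVram) * 100)}%`
    : "n/a"; // placeholder for the zero-total case
}

console.log(shareLabel(8, 32)); // "25%"
```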
```diff
 - Joiner scans the Hugging Face cache and picks an unserved model already on disk
-- Log: "Assigned to serve GLM-4.7-Flash (needed by mesh, already on disk)"
+- Log: "Selected to serve GLM-4.7-Flash (needed by mesh, already on disk)"
```
The updated TESTING.md log snippet says `Selected to serve ...`, but the runtime currently logs `📋 Assigned to serve ...` (see `mesh-llm/src/runtime/mod.rs` around the auto-assignment `eprintln`s). To keep the testing guide accurate, either revert the doc string or update the runtime log message to match.
Suggested change:

```diff
-- Log: "Selected to serve GLM-4.7-Flash (needed by mesh, already on disk)"
+- Log: "📋 Assigned to serve GLM-4.7-Flash (needed by mesh, already on disk)"
```
```diff
 let has_local_worker_activity = has_local_processes || !my_hosted_models.is_empty();
 let has_split_workers = all_peers.iter().any(|p| {
     matches!(p.role, mesh::NodeRole::Worker)
         && p.is_assigned_model(display_model_name.as_str())
 });
-let node_status = Self::derive_node_status(
+let node_state = Self::derive_local_node_state(
     is_client,
     effective_is_host,
     effective_llama_ready,
     has_local_worker_activity,
     has_split_workers,
     display_model_name.as_str(),
```
`derive_local_node_state` can emit `Standby` for a worker that has been assigned work but hasn’t started a local process yet: `display_model_name` can come from `my_serving_models`, but `has_local_worker_activity` only checks `has_local_processes || !my_hosted_models.is_empty()`. Since `serving_models` explicitly includes assigned-but-unhealthy work, consider including `!my_serving_models.is_empty()` (or a dedicated `has_assigned_work` flag) in the inputs so assigned-but-unready nodes surface as `Loading` instead of `Standby`.
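The reviewer's point can be sketched as a decision table. This is an illustrative TypeScript model of the state derivation, not the Rust implementation; the flag names (`hasAssignedWork` in particular) are hypothetical inputs corresponding to the suggestion above.

```typescript
type LiveNodeState = "client" | "standby" | "loading" | "serving";

// Illustrative derivation: assigned-but-unready work counts as "loading",
// not "standby", because the node is spinning up rather than idle.
function deriveLocalNodeState(opts: {
  isClient: boolean;
  llamaReady: boolean;
  hasLocalWorkerActivity: boolean;
  hasAssignedWork: boolean; // e.g. the Rust side's !my_serving_models.is_empty()
}): LiveNodeState {
  if (opts.isClient) return "client";
  if (opts.llamaReady) return "serving";
  if (opts.hasLocalWorkerActivity || opts.hasAssignedWork) return "loading";
  return "standby";
}

// A worker assigned a model but with no local process yet:
console.log(
  deriveLocalNodeState({
    isClient: false,
    llamaReady: false,
    hasLocalWorkerActivity: false,
    hasAssignedWork: true,
  })
); // "loading"
```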
Summary

Users can now read node health with a much simpler live-state model. The console shows only `Client`, `Standby`, `Loading`, and `Serving` for live members, while provider-backed suspended capacity appears in a separate `Wakeable Capacity` section instead of being mixed into topology or routing.

After writing up #241 and chatting with @michaelneale about some hosted GPU capacity in the mesh, I decided to extend this state idea a little to include the concept of "Wakeable Capacity". I don't know yet which mechanism we would actually use to wake up a node (i.e. send an AWS / GCP / Azure / whatever API call to boot a VM), but I pre-emptively planned for it in the state changes I was making.

Screenshot

Architecture

- `node_state` and the peer `state` field are additive machine-readable fields.
- `node_status` stays as the Title Case compatibility alias for older consumers.
- Wakeable inventory is kept out of `peers[]`, live topology, peer counts, and host selection.

Protocol
Mixed-version compatibility is preserved. Older peers still classify correctly, and wakeable inventory is not advertised as a live peer or protocol change. It is exposed as separate dashboard and status data only, not as routable capacity.
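One way a mixed-version consumer could honor this compatibility story is to prefer the new typed field and fall back to lowercasing the legacy alias. The field names come from this PR; the fallback logic below is a hedged sketch, not code from the UI.

```typescript
type LiveNodeState = "client" | "standby" | "loading" | "serving";
const LIVE_STATES: readonly LiveNodeState[] = ["client", "standby", "loading", "serving"];

// Prefer node_state; fall back to the Title Case node_status alias from
// older servers. Unknown values yield undefined rather than a bad state.
function readNodeState(payload: {
  node_state?: string;
  node_status?: string;
}): LiveNodeState | undefined {
  const raw = payload.node_state ?? payload.node_status?.toLowerCase();
  return LIVE_STATES.find((s) => s === raw);
}

console.log(readNodeState({ node_status: "Serving" })); // "serving"
console.log(readNodeState({ node_state: "standby" }));  // "standby"
```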