
feature: revise node states to be simpler (WIP)#298

Open
ndizazzo wants to merge 2 commits into main from feature/issue-241-node-states

Conversation


@ndizazzo ndizazzo commented Apr 16, 2026

Users can now read node health with a much simpler live state model. The console shows only Client, Standby, Loading, and Serving for live members, while provider-backed suspended capacity appears in a separate Wakeable Capacity section instead of being mixed into topology or routing.

After writing up #241 and chatting with @michaelneale about some hosted GPU capacity in the mesh, I decided to extend this state idea a little to include the concept of "Wakeable Capacity". I don't yet know which mechanism we would actually use to wake a node (i.e., send an AWS / GCP / Azure / whatever API call to boot a VM), but I pre-emptively planned for this in the state changes I was making.

Summary

  • Simplifies the live node display to the approved four-state model.
  • Keeps wakeable provider-backed capacity visible, but separate from live peers.
  • Preserves the existing routing and peer semantics, so sleeping or waking capacity does not become routable by accident.
  • Updates the dashboard to show wakeable inventory as its own section.
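For illustration, the four-state model could be sketched roughly as follows. The names (LiveNodeState, formatLiveNodeState, the label map) follow the files this PR touches in status-types.ts and status-helpers.ts, but the exact shipped code may differ:

```typescript
// Sketch of the approved four-state live model. Names here are assumptions
// based on the PR's file summaries, not the exact shipped code.
type LiveNodeState = "client" | "standby" | "loading" | "serving";

// Title Case display labels shown as badges in the console.
const LIVE_STATE_LABELS: Record<LiveNodeState, string> = {
  client: "Client",
  standby: "Standby",
  loading: "Loading",
  serving: "Serving",
};

// Map the lowercase machine-readable state to its display label.
function formatLiveNodeState(state: LiveNodeState): string {
  return LIVE_STATE_LABELS[state];
}
```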

Screenshot

wakeable-capacity-8-hosts image (we can remove this or opt not to show it; I mocked it up here for illustration after our conversation)

Architecture

  • node_state and peer state are additive machine-readable fields.
  • node_status stays as the Title Case compatibility alias for older consumers.
  • Wakeable inventory uses a separate local seam, so it stays out of peers[], live topology, peer counts, and host selection.
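As a rough shape of the payload described above (field and type names here are assumptions drawn from the file summaries, not the exact wire format):

```typescript
type LiveNodeState = "client" | "standby" | "loading" | "serving";

// Hypothetical shape for provider-backed suspended capacity; the real type
// in status-types.ts may differ.
interface WakeableNode {
  id: string;
  provider: string; // e.g. an AWS / GCP / Azure instance source
  state: "sleeping" | "waking";
}

interface StatusPayload {
  node_state: LiveNodeState; // additive: machine-readable, lowercase
  node_status: string;       // kept: Title Case alias for older consumers
  peers: Array<{ id: string; state: LiveNodeState }>;
  wakeable_nodes: WakeableNode[]; // separate seam: never merged into peers[]
}

// Live peer counts ignore wakeable inventory entirely, so sleeping capacity
// cannot leak into topology or host selection.
function livePeerCount(payload: StatusPayload): number {
  return payload.peers.length;
}
```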

Protocol

Mixed-version compatibility is preserved. Older peers still classify correctly, and wakeable inventory is not advertised as a live peer or protocol change. It is exposed as separate dashboard and status data only, not as routable capacity.
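A minimal sketch of the mixed-version behavior, assuming new payloads carry both fields while older payloads carry only the Title Case node_status (field names come from this PR's description; the fallback default is illustrative):

```typescript
// Newer consumers prefer the typed lowercase field; when reading an older
// payload that only has the Title Case alias, fall back to lowercasing it.
function classifyNodeState(payload: { node_state?: string; node_status?: string }): string {
  return payload.node_state ?? payload.node_status?.toLowerCase() ?? "standby";
}
```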

@ndizazzo ndizazzo marked this pull request as ready for review April 17, 2026 01:37
Copilot AI review requested due to automatic review settings April 17, 2026 01:37

Copilot AI left a comment


Pull request overview

This PR simplifies the node health/state model exposed by /api/status and consumed by the dashboard UI, introducing a small typed live-state set (client|standby|loading|serving) and separating provider-backed “wakeable” inventory from live topology/peers.

Changes:

  • Adds node_state and peers[].state (typed, lowercase) to /api/status, while keeping node_status as a Title Case compatibility alias.
  • Updates the UI topology + node sidebar rendering to use the new live-state model and formatting helpers.
  • Introduces a new “Wakeable Capacity” dashboard section (plus tests) backed by a local runtime inventory seam.

Reviewed changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 3 comments.

Summary per file:

| File | Description |
| --- | --- |
| mesh-llm/ui/src/features/dashboard/components/topology/MeshTopologyDiagram.tsx | Renders topology nodes using state + formatLiveNodeState instead of legacy statusLabel. |
| mesh-llm/ui/src/features/dashboard/components/details/NodeSidebar.tsx | Displays live-state pill based on LiveNodeState (label/tone/tooltip) rather than string labels. |
| mesh-llm/ui/src/features/dashboard/components/WakeableCapacity.tsx | New dashboard card to display wakeable provider-backed inventory separately from live peers. |
| mesh-llm/ui/src/features/dashboard/components/WakeableCapacity.test.tsx | Adds UI test coverage for wakeable capacity rendering/visibility rules. |
| mesh-llm/ui/src/features/dashboard/components/DashboardPage.tsx | Switches dashboard status/peer rendering to node_state + peer.state and adds WakeableCapacity section. |
| mesh-llm/ui/src/features/app-shell/lib/topology-types.ts | Updates TopologyNode to carry state: LiveNodeState and removes statusLabel. |
| mesh-llm/ui/src/features/app-shell/lib/status-types.ts | Introduces LiveNodeState, WakeableNode types, and label mapping constants; extends payload/peer types. |
| mesh-llm/ui/src/features/app-shell/lib/status-types.test.ts | Adds type-level contract tests for required/optional fields and allowed state values. |
| mesh-llm/ui/src/features/app-shell/lib/status-helpers.ts | Adds formatLiveNodeState and updates tone/tooltip helpers to operate on LiveNodeState. |
| mesh-llm/ui/src/features/app-shell/lib/status-helpers.test.ts | Adds tests for formatter/tone/tooltip + localRoutableModels using node_state. |
| mesh-llm/ui/src/App.tsx | Updates topology node construction to include state (from node_state/peer.state) and removes legacy statusLabel derivation. |
| mesh-llm/ui/src/App.test.tsx | Updates fixtures and adds a test ensuring dashboard labels come from node_state/peer.state. |
| mesh-llm/src/runtime/wakeable.rs | Adds a local, in-memory wakeable inventory store with typed states and tests. |
| mesh-llm/src/runtime/mod.rs | Exposes the new wakeable runtime module. |
| mesh-llm/src/protocol/convert.rs | Refactors legacy GPU tuple return into a struct for clarity/maintainability. |
| mesh-llm/src/network/openai/transport.rs | Threads the buffered request object through routing instead of separate body/prefetch/adapter args. |
| mesh-llm/src/network/openai/ingress.rs | Updates routing calls to the new transport signatures and minor iterator cleanups. |
| mesh-llm/src/api/status.rs | Adds serialized NodeState, wakeable node payload types, and includes them in /api/status. |
| mesh-llm/src/api/state.rs | Adds wakeable_inventory to API shared state. |
| mesh-llm/src/api/mod.rs | Implements local/peer live-state derivation and plumbs wakeable inventory into the status payload + tests. |
| mesh-llm/docs/TESTING.md | Updates testing checklist to include new live-state + wakeable capacity checks. |
| mesh-llm/docs/DESIGN.md | Renames “Node Roles” section to “Topology Roles” and documents live-state badges at a high level. |
| README.md | Updates console description to reflect simplified live-state badges and separate wakeable capacity. |

Comment on lines +229 to +237
```diff
  const localClientVram = overviewVramGb(status.node_state === "client", status.my_vram_gb);
  if (localServing && status.node_state !== "client") {
    rows.push({
      id: status.node_id,
      latencyLabel: "local",
-     vramLabel: `${localVram.toFixed(1)} GB`,
+     vramLabel: `${localClientVram.toFixed(1)} GB`,
      shareLabel:
        totalModelVram > 0
-         ? `${Math.round((localVram / totalModelVram) * 100)}%`
+         ? `${Math.round((localClientVram / totalModelVram) * 100)}%`
```

Copilot AI Apr 17, 2026


localClientVram is computed and used only on the non-client path (if (localServing && status.node_state !== "client")), so the name is misleading (it’s effectively the local node’s contributing VRAM). Renaming this to something like localOverviewVramGb (and using that name consistently in the % calculation) would reduce the chance of future misuse.
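Applied, the rename might look roughly like this (overviewVramGb and the surrounding values are stubbed here purely for illustration; the real helper lives in the dashboard code):

```typescript
// Stub of the real helper, assumed to return the local node's contributing
// VRAM, or zero for a pure client.
function overviewVramGb(isClient: boolean, myVramGb: number): number {
  return isClient ? 0 : myVramGb;
}

const status = { node_state: "serving", my_vram_gb: 24 };
const totalModelVram = 48;

// localOverviewVramGb makes clear this is the local node's contributing VRAM
// on the non-client path, not something client-specific.
const localOverviewVramGb = overviewVramGb(status.node_state === "client", status.my_vram_gb);
const shareLabel = `${Math.round((localOverviewVramGb / totalModelVram) * 100)}%`;
```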

Comment thread mesh-llm/docs/TESTING.md

```diff
 - Joiner scans the Hugging Face cache and picks an unserved model already on disk
-- Log: "Assigned to serve GLM-4.7-Flash (needed by mesh, already on disk)"
+- Log: "Selected to serve GLM-4.7-Flash (needed by mesh, already on disk)"
```

Copilot AI Apr 17, 2026


The updated TESTING.md log snippet says Selected to serve ..., but the runtime currently logs 📋 Assigned to serve ... (see mesh-llm/src/runtime/mod.rs around the auto-assignment eprintlns). To keep the testing guide accurate, either revert the doc string or update the runtime log message to match.

Suggested change:

```diff
-- Log: "Selected to serve GLM-4.7-Flash (needed by mesh, already on disk)"
+- Log: "📋 Assigned to serve GLM-4.7-Flash (needed by mesh, already on disk)"
```

Comment thread mesh-llm/src/api/mod.rs
Comment on lines 928 to 934
```diff
         let has_local_worker_activity = has_local_processes || !my_hosted_models.is_empty();
         let has_split_workers = all_peers.iter().any(|p| {
             matches!(p.role, mesh::NodeRole::Worker)
                 && p.is_assigned_model(display_model_name.as_str())
         });
-        let node_status = Self::derive_node_status(
+        let node_state = Self::derive_local_node_state(
             is_client,
             effective_is_host,
             effective_llama_ready,
             has_local_worker_activity,
             has_split_workers,
             display_model_name.as_str(),
```

Copilot AI Apr 17, 2026


derive_local_node_state can emit Standby for a worker that has been assigned work but hasn’t started a local process yet: display_model_name can come from my_serving_models, but has_local_worker_activity only checks has_local_processes || !my_hosted_models.is_empty(). Since serving_models explicitly includes assigned-but-unhealthy work, consider including !my_serving_models.is_empty() (or a dedicated has_assigned_work flag) in the inputs so assigned-but-unready nodes surface as Loading instead of Standby.

