Local LLM hangs on CPU: config context_length ignored, 128K hardcoded #35

@MFA-X-AI

Description


The embedded llama.cpp provider ignores context_length from config and always allocates a 128K context window. On CPU this allocates ~2 GB per inference call, causing the system to thrash and the agent to appear stuck at "typing."

Steps to Reproduce

  1. Configure a local model (e.g., qwen3.5:0.8b) with context_length: 32768
  2. Run xpressclaw up
  3. Send a message in the web UI
  4. Agent shows "typing" indefinitely, eventually returns:
    Control request timeout: initializeNo response from agent.
    

Expected Behavior

The provider should use the configured context_length: 32768, allocating ~375 MB instead of ~2 GB per context.
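The ~375 MB figure is consistent with the KV cache scaling linearly with context length. A back-of-envelope sketch (not the provider's actual allocation code; the 1,536 MB baseline is the measured KV cache from the Impact table below, and linear scaling predicts ~384 MB at 32K, in line with the report's estimate):

```rust
// Back-of-envelope check: llama.cpp's KV cache grows linearly with the
// context window, so shrinking the window shrinks the allocation
// proportionally. 1,536 MB is the measured KV cache at the 131,072-token
// default; everything model-specific (layers, heads) is folded into it.
fn kv_cache_mb(ctx_len: u32) -> f64 {
    const KV_MB_AT_128K: f64 = 1536.0;
    KV_MB_AT_128K * (ctx_len as f64) / 131_072.0
}

fn main() {
    println!("128K window: {:.0} MB", kv_cache_mb(131_072)); // 1536 MB
    println!(" 32K window: {:.0} MB", kv_cache_mb(32_768));  // 384 MB
}
```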

Actual Behavior

LlamaCppProvider uses DEFAULT_CONTEXT_LENGTH = 131_072 (128K). The config value is never passed through.

Relevant code:

  • crates/xpressclaw-core/src/llm/llamacpp.rs:141 — hardcoded default:
    const DEFAULT_CONTEXT_LENGTH: u32 = 131_072;
  • crates/xpressclaw-core/src/llm/router.rs:282 — config not passed:
    LlamaCppProvider::from_path(path, model_name)
    // config.context_length is never used

Impact

Each context allocates on CPU:

  Buffer              Size
  ------------------  --------
  KV cache            1,536 MB
  Compute buffer      489 MB
  Total per context   ~2 GB

A new context is created per inference call. When the agent runs in Docker, the container's SDK calls back to the server for LLM access, creating additional contexts. This cascading allocation causes:

  • sched_reserve times escalating: 1,050ms → 1,708ms → 3,754ms (memory thrashing)
  • Docker harness timeout — the container's init request can't get a response while the server is allocating memory
  • The error Control request timeout: initializeNo response from agent. originates from the agent SDK inside the container, not the server itself

Fix

One line in crates/xpressclaw-core/src/llm/router.rs:282:

// Before
LlamaCppProvider::from_path(path, model_name)

// After
LlamaCppProvider::from_path(path, model_name)
    .map(|p| p.with_context_length(config.context_length))

with_context_length already exists on the provider.
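For context, a plausible shape for that builder method (the real definition lives in crates/xpressclaw-core/src/llm/llamacpp.rs; only the method name is confirmed by this report, so the struct layout and field name below are assumptions):

```rust
// Hypothetical sketch of the builder the fix relies on. The field name
// `context_length` is assumed; only `with_context_length` is confirmed.
pub struct LlamaCppProvider {
    context_length: u32,
}

impl LlamaCppProvider {
    /// Consume self and return a provider using the configured window,
    /// overriding the 128K default.
    pub fn with_context_length(mut self, context_length: u32) -> Self {
        self.context_length = context_length;
        self
    }
}

fn main() {
    let provider = LlamaCppProvider { context_length: 131_072 }
        .with_context_length(32_768);
    assert_eq!(provider.context_length, 32_768);
}
```

Because the method takes and returns `Self`, it chains directly off the `Result`/`Option` from `from_path` via `map`, which is why the fix is a single line.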

Environment

  • OS: Windows 10 Home
  • Laptop: ASUS TUF Gaming FX505DV
  • CPU: AMD Ryzen 7 3750H (4 cores / 8 threads)
  • RAM: 16 GB (14 GB usable)
  • GPU: NVIDIA GeForce RTX 2060 (4 GB VRAM) — unused, see note below
  • Model: qwen3.5:0.8b (Qwen3.5-0.8B-UD-Q4_K_XL.gguf)
  • Config: context_length: 32768
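For reference, a minimal config fragment matching the setup above (only `context_length: 32768` and the model name/file are taken from this report; the surrounding key names are assumptions about the config schema):

```yaml
# Hypothetical layout; only context_length and the model name are
# confirmed by this report.
model:
  name: qwen3.5:0.8b
  path: Qwen3.5-0.8B-UD-Q4_K_XL.gguf
  context_length: 32768
```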

Note: GPU not utilized

The cuda feature exists in crates/xpressclaw-core/Cargo.toml but is not enabled by default. The 0.8B model could theoretically fit entirely in the RTX 2060's 4 GB VRAM and run significantly faster than CPU inference.
