The embedded llama.cpp provider ignores `context_length` from config and always allocates a 128K context window. On CPU this allocates ~2 GB per inference call, causing the system to thrash and the agent to appear stuck at "typing."
Steps to Reproduce
- Configure a local model (e.g., `qwen3.5:0.8b`) with `context_length: 32768`
- Run `xpressclaw up`
- Send a message in the web UI
- Agent shows "typing" indefinitely, and eventually returns: `Control request timeout: initializeNo response from agent.`
Expected Behavior
The provider should use the configured `context_length: 32768`, allocating ~375 MB instead of ~2 GB per context.
Actual Behavior
`LlamaCppProvider` uses `DEFAULT_CONTEXT_LENGTH = 131_072` (128K). The config value is never passed through.
Relevant code:
`crates/xpressclaw-core/src/llm/llamacpp.rs:141` — hardcoded default:

```rust
const DEFAULT_CONTEXT_LENGTH: u32 = 131_072;
```

`crates/xpressclaw-core/src/llm/router.rs:282` — config not passed:

```rust
LlamaCppProvider::from_path(path, model_name)
// config.context_length is never used
```
Impact
Each context allocates on CPU:
| Buffer | Size |
| --- | --- |
| KV cache | 1,536 MB |
| Compute buffer | 489 MB |
| Total per context | ~2 GB |
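As a back-of-the-envelope check, assuming the KV cache scales roughly linearly with context length (a simplification; the compute buffer scales differently and is ignored here), the 1,536 MB KV cache at 128K shrinks to about 384 MB at 32K, consistent with the ~375 MB figure above:

```rust
// Linear-scaling estimate of KV-cache size (assumption: size is
// proportional to context length). 131_072 is the hardcoded
// DEFAULT_CONTEXT_LENGTH from the report.
fn scaled_kv_mb(kv_at_128k_mb: u64, ctx: u64) -> u64 {
    kv_at_128k_mb * ctx / 131_072
}

fn main() {
    // 1,536 MB KV cache at 128K context, scaled down to 32K:
    let kv = scaled_kv_mb(1_536, 32_768);
    assert_eq!(kv, 384);
    println!("KV cache at 32K context: ~{} MB", kv);
}
```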
A new context is created per inference call. When the agent runs in Docker, the container's SDK calls back to the server for LLM access, creating additional contexts. This cascading allocation causes:
- `sched_reserve` times escalating: 1,050 ms → 1,708 ms → 3,754 ms (memory thrashing)
- Docker harness timeout — the container's init request can't get a response while the server is allocating memory
- The error `Control request timeout: initializeNo response from agent.` originates from the agent SDK inside the container, not the server itself
Fix
One line in `crates/xpressclaw-core/src/llm/router.rs:282`:

```rust
// Before
LlamaCppProvider::from_path(path, model_name)

// After
LlamaCppProvider::from_path(path, model_name)
    .map(|p| p.with_context_length(config.context_length))
```
`with_context_length` already exists on the provider.
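For reference, the builder-style setter this fix relies on looks roughly like the following. This is a minimal sketch: the struct fields, the single-argument `from_path`, and the error type are illustrative assumptions, not the real `LlamaCppProvider` internals.

```rust
// Sketch of a consuming builder method like the existing
// `with_context_length` (internals are hypothetical).
#[allow(dead_code)]
struct LlamaCppProvider {
    model_path: String,
    context_length: u32,
}

impl LlamaCppProvider {
    const DEFAULT_CONTEXT_LENGTH: u32 = 131_072;

    fn from_path(path: &str) -> Result<Self, String> {
        Ok(Self {
            model_path: path.to_string(),
            // Default applies unless the caller overrides it below.
            context_length: Self::DEFAULT_CONTEXT_LENGTH,
        })
    }

    // Consuming setter: takes ownership and returns the modified
    // provider, so it chains naturally after `from_path(..)` via `map`.
    fn with_context_length(mut self, ctx: u32) -> Self {
        self.context_length = ctx;
        self
    }
}

fn main() {
    let provider = LlamaCppProvider::from_path("model.gguf")
        .map(|p| p.with_context_length(32_768))
        .unwrap();
    assert_eq!(provider.context_length, 32_768);
    println!("context_length = {}", provider.context_length);
}
```

Because `with_context_length` consumes and returns `Self`, chaining it inside `.map(..)` preserves the `Result` from `from_path` without changing the call site's error handling.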
Environment
- OS: Windows 10 Home
- Laptop: ASUS TUF Gaming FX505DV
- CPU: AMD Ryzen 7 3750H (4 cores / 8 threads)
- RAM: 16 GB (14 GB usable)
- GPU: NVIDIA GeForce RTX 2060 (4 GB VRAM) — unused, see note below
- Model: `qwen3.5:0.8b` (Qwen3.5-0.8B-UD-Q4_K_XL.gguf)
- Config: `context_length: 32768`
Note: GPU not utilized
The `cuda` feature exists in `crates/xpressclaw-core/Cargo.toml` but is not enabled by default. The 0.8B model could theoretically fit entirely in the RTX 2060's 4 GB VRAM and run significantly faster than CPU inference.