The embedded llama.cpp provider ignores `context_length` from config and always allocates a 128K context window. On CPU this allocates ~2 GB per inference call, causing the system to thrash and the agent to appear stuck at "typing."
Steps to Reproduce
- Configure a local model (e.g., `qwen3.5:0.8b`) with `context_length: 32768`
- Run `xpressclaw up`
- Send a message in the web UI
- Agent shows "typing" indefinitely, and eventually returns: `Control request timeout: initializeNo response from agent.`
Expected Behavior
The provider should use the configured `context_length: 32768`, allocating ~375 MB instead of ~2 GB per context.
Actual Behavior
`LlamaCppProvider` uses `DEFAULT_CONTEXT_LENGTH = 131_072` (128K). The config value is never passed through.
Relevant code:
`crates/xpressclaw-core/src/llm/llamacpp.rs:141` — hardcoded default:

```rust
const DEFAULT_CONTEXT_LENGTH: u32 = 131_072;
```

`crates/xpressclaw-core/src/llm/router.rs:282` — config not passed:

```rust
LlamaCppProvider::from_path(path, model_name)
// config.context_length is never used
```
Impact
Each context allocates on CPU:
| Buffer | Size |
| --- | --- |
| KV cache | 1,536 MB |
| Compute buffer | 489 MB |
| Total per context | ~2 GB |
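As a back-of-the-envelope check, assuming the KV cache scales roughly linearly with context length (a simplification; the compute buffer scales differently and is ignored here), the 1,536 MB KV cache at 128K shrinks to about 384 MB at 32K, consistent with the ~375 MB figure above:

```rust
// Linear-scaling estimate of KV-cache size (assumption: size is
// proportional to context length). 131_072 is the hardcoded
// DEFAULT_CONTEXT_LENGTH from the report.
fn scaled_kv_mb(kv_at_128k_mb: u64, ctx: u64) -> u64 {
    kv_at_128k_mb * ctx / 131_072
}

fn main() {
    // 1,536 MB KV cache at 128K context, scaled down to 32K:
    let kv = scaled_kv_mb(1_536, 32_768);
    assert_eq!(kv, 384);
    println!("KV cache at 32K context: ~{} MB", kv);
}
```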
A new context is created per inference call. When the agent runs in Docker, the container's SDK calls back to the server for LLM access, creating additional contexts. This cascading allocation causes:
- `sched_reserve` times escalating: 1,050 ms → 1,708 ms → 3,754 ms (memory thrashing)
- Docker harness timeout — the container's init request can't get a response while the server is allocating memory
- The error `Control request timeout: initializeNo response from agent.` originates from the agent SDK inside the container, not the server itself
Fix
One line in `crates/xpressclaw-core/src/llm/router.rs:282`:

```rust
// Before
LlamaCppProvider::from_path(path, model_name)

// After
LlamaCppProvider::from_path(path, model_name)
    .map(|p| p.with_context_length(config.context_length))
```
`with_context_length` already exists on the provider.
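For reference, the builder-style setter this fix relies on looks roughly like the following. This is a minimal sketch: the struct fields, the single-argument `from_path`, and the error type are illustrative assumptions, not the real `LlamaCppProvider` internals.

```rust
// Sketch of a consuming builder method like the existing
// `with_context_length` (internals are hypothetical).
#[allow(dead_code)]
struct LlamaCppProvider {
    model_path: String,
    context_length: u32,
}

impl LlamaCppProvider {
    const DEFAULT_CONTEXT_LENGTH: u32 = 131_072;

    fn from_path(path: &str) -> Result<Self, String> {
        Ok(Self {
            model_path: path.to_string(),
            // Default applies unless the caller overrides it below.
            context_length: Self::DEFAULT_CONTEXT_LENGTH,
        })
    }

    // Consuming setter: takes ownership and returns the modified
    // provider, so it chains naturally after `from_path(..)` via `map`.
    fn with_context_length(mut self, ctx: u32) -> Self {
        self.context_length = ctx;
        self
    }
}

fn main() {
    let provider = LlamaCppProvider::from_path("model.gguf")
        .map(|p| p.with_context_length(32_768))
        .unwrap();
    assert_eq!(provider.context_length, 32_768);
    println!("context_length = {}", provider.context_length);
}
```

Because `with_context_length` consumes and returns `Self`, chaining it inside `.map(..)` preserves the `Result` from `from_path` without changing the call site's error handling.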
Environment
- OS: Windows 10 Home
- Laptop: ASUS TUF Gaming FX505DV
- CPU: AMD Ryzen 7 3750H (4 cores / 8 threads)
- RAM: 16 GB (14 GB usable)
- GPU: NVIDIA GeForce RTX 2060 (4 GB VRAM) — unused, see note below
- Model: `qwen3.5:0.8b` (Qwen3.5-0.8B-UD-Q4_K_XL.gguf)
- Config: `context_length: 32768`
Note: GPU not utilized
The `cuda` feature exists in `crates/xpressclaw-core/Cargo.toml` but is not enabled by default. The 0.8B model could theoretically fit entirely in the RTX 2060's 4 GB VRAM and run significantly faster than CPU inference.