Add local MLX backend support and benchmark tooling #79

Closed

i386 wants to merge 6 commits into main from codex/mlx-backend-benchmarking
Conversation

@i386 (Collaborator) commented Mar 30, 2026

What this adds for users

This PR adds local mlx backend support alongside the existing llama.cpp path.

Backend behavior

  • Local MLX model directories now run through the mlx backend automatically.
  • GGUF files continue to run through the llama backend.
  • mesh-llm can launch mlx_lm.server either from PATH, via python -m mlx_lm.server, or through --mlx-server-bin / MESH_LLM_MLX_SERVER_BIN.
  • Hugging Face snapshot paths now normalize to stable repo-style model ids instead of snapshot hashes.
  • Local MLX requests rewrite the upstream model field to the actual model directory so the mesh-facing alias stays stable while the runtime still accepts the request.
  • Backend election and startup now treat MLX as solo-only for now, skipping RPC split mode when the backend does not support it.
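The path-based backend selection described above can be sketched roughly as follows. This is an illustrative Python sketch, not the actual mesh-llm (Rust) implementation; the function name and the exact directory heuristics are assumptions.

```python
from pathlib import Path

def detect_backend(model_path: str) -> str:
    """Illustrative sketch of path-based backend selection.

    GGUF files route to the llama backend; directories that look like a
    local MLX export (config.json plus safetensors weights) route to mlx.
    The exact heuristics used by mesh-llm may differ.
    """
    p = Path(model_path)
    if p.is_file() and p.suffix == ".gguf":
        return "llama"
    if p.is_dir() and (p / "config.json").exists() and any(p.glob("*.safetensors")):
        return "mlx"
    raise ValueError(f"unrecognized local model path: {model_path}")
```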

Docs and operator workflow

  • The README now explains that local MLX model folders are supported directly.
  • evals/backend-benchmark.py was expanded into a generic local backend benchmark so the same harness can compare matched exports across different backends.
  • evals/README.md now documents both the simple llama/MLX workflow and the JSON-spec workflow for arbitrary backend combinations.

Benchmarking

  • The benchmark now accepts paired llama and MLX exports for the same model family.
  • Model identity normalization was tightened so quantized suffixes such as Q4_K_M and 4bit collapse to the same comparison key.
  • The current benchmark results were collected locally on macOS and should be treated as machine-specific rather than universal.
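The suffix-collapsing normalization above can be sketched like this. The regex and function name are illustrative guesses at the idea, not the harness's actual rule:

```python
import re

# Quantization/precision suffixes that should collapse to one comparison key,
# e.g. "-Q4_K_M" (GGUF) and "-4bit" (MLX). Illustrative, not exhaustive.
_QUANT_SUFFIX = re.compile(
    r"[-_.](q\d[_a-z0-9]*|[0-9]+bit|f16|bf16|fp16)$", re.IGNORECASE
)

def comparison_key(model_id: str) -> str:
    """Strip a trailing quantization suffix so matched exports pair up."""
    return _QUANT_SUFFIX.sub("", model_id).lower()
```

With this, `Qwen2.5-0.5B-Instruct-Q4_K_M` and `Qwen2.5-0.5B-Instruct-4bit` both normalize to `qwen2.5-0.5b-instruct` and are benchmarked as the same model family.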

Current benchmark results

Paired-export benchmark using Qwen2.5-0.5B-Instruct:

  • mlx: local 4bit MLX export
  • llama: local Q4_K_M GGUF export
  • concurrency sweep: 1,4,8
  • iterations: 3
  • warmup: 1
  • max tokens: 64
backend=llama  startup_s=6.706  model_id=Qwen2.5-0.5B-Instruct-Q4_K_M
concurrency  avg_ttft_s  avg_total_s  avg_batch_wall_s  avg_decode_tok_s  avg_overall_tok_s  avg_batch_overall_tok_s
          1       0.016        0.684             0.687            95.915             93.630                   93.161
          4       0.045        1.012             1.017            66.228             63.308                  251.875
          8       0.512        1.470             1.963            66.842             48.946                  260.839

backend=mlx  startup_s=6.031  model_id=Qwen2.5-0.5B-Instruct-4bit
concurrency  avg_ttft_s  avg_total_s  avg_batch_wall_s  avg_decode_tok_s  avg_overall_tok_s  avg_batch_overall_tok_s
          1       0.103        0.520             0.520           153.386            123.120                  123.067
          4       0.394        1.471             1.477            59.463             43.544                  173.495
          8       0.881        2.380             2.488            46.009             29.898                  206.845
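As a sanity check, the batch-level throughput column is consistent with roughly concurrency × max_tokens divided by the batch wall time. A quick back-of-envelope (assuming each request decoded close to the 64-token cap; actual per-run token counts may vary slightly):

```python
def batch_tok_s(concurrency: int, tokens_per_req: int, batch_wall_s: float) -> float:
    """Approximate aggregate tokens/s for one batched iteration."""
    return concurrency * tokens_per_req / batch_wall_s

# llama at concurrency 8: 8 requests x ~64 tokens over a 1.963 s batch wall
print(round(batch_tok_s(8, 64, 1.963), 1))  # ~260.8, close to the table's 260.839
```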

Key takeaways from that run:

  • MLX wins single-request throughput and total time on this machine.
  • llama keeps a much lower TTFT at low concurrency.
  • llama scales better once concurrency increases.

Validation

  • just build
  • cargo test -p mesh-llm
  • cargo check -p mesh-llm
  • python3 -m py_compile evals/backend-benchmark.py
  • Local paired-export benchmark using Qwen2.5-0.5B-Instruct in f16/bf16 and Q4_K_M/4bit forms

@i386 i386 changed the title [codex] Add local MLX backend support and benchmark tooling [experimental] Add local MLX backend support and benchmark tooling Mar 30, 2026
@i386 i386 changed the title [experimental] Add local MLX backend support and benchmark tooling Add local MLX backend support and benchmark tooling Mar 30, 2026
@michaelneale (Collaborator):
super excited for this one!

@i386 i386 force-pushed the codex/mlx-backend-benchmarking branch from a86fdfc to f8ae4b4 Compare March 30, 2026 09:43
@i386 i386 changed the base branch from main to feat/model-management March 30, 2026 09:43
@i386 i386 requested a review from michaelneale March 30, 2026 10:45
@i386 i386 changed the base branch from feat/model-management to main March 31, 2026 03:33
@i386 i386 force-pushed the codex/mlx-backend-benchmarking branch from 02d186c to a86fdfc Compare March 31, 2026 03:33
@michaelneale (Collaborator):
I think we can retire this one in favour of a native one: #103
