Add local MLX backend support and benchmark tooling #79

Closed

i386 wants to merge 6 commits into main from codex/mlx-backend-benchmarking
Conversation

@i386 (Collaborator) commented Mar 30, 2026

What this adds for users

This PR adds local mlx backend support alongside the existing llama.cpp path.

Backend behavior

  • Local MLX model directories now run through the mlx backend automatically.
  • GGUF files continue to run through the llama backend.
  • mesh-llm can launch mlx_lm.server either from PATH, via python -m mlx_lm.server, or through --mlx-server-bin / MESH_LLM_MLX_SERVER_BIN.
  • Hugging Face snapshot paths now normalize to stable repo-style model ids instead of snapshot hashes.
  • Local MLX requests rewrite the upstream model field to the actual model directory so the mesh-facing alias stays stable while the runtime still accepts the request.
  • Backend election and startup now treat MLX as solo-only for now, skipping RPC split mode when the backend does not support it.
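The path-based backend selection described above can be sketched roughly as follows. This is an illustrative Python sketch, not the actual mesh-llm (Rust) implementation; the function name and the exact directory heuristics are assumptions.

```python
from pathlib import Path

def detect_backend(model_path: str) -> str:
    """Illustrative sketch of path-based backend selection.

    GGUF files route to the llama backend; directories that look like a
    local MLX export (config.json plus safetensors weights) route to mlx.
    The exact heuristics used by mesh-llm may differ.
    """
    p = Path(model_path)
    if p.is_file() and p.suffix == ".gguf":
        return "llama"
    if p.is_dir() and (p / "config.json").exists() and any(p.glob("*.safetensors")):
        return "mlx"
    raise ValueError(f"unrecognized local model path: {model_path}")
```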

Docs and operator workflow

  • The README now explains that local MLX model folders are supported directly.
  • evals/backend-benchmark.py was expanded into a generic local backend benchmark so the same harness can compare matched exports across different backends.
  • evals/README.md now documents both the simple llama/MLX workflow and the JSON-spec workflow for arbitrary backend combinations.

Benchmarking

  • The benchmark now accepts paired llama and MLX exports for the same model family.
  • Model identity normalization was tightened so quantized suffixes such as Q4_K_M and 4bit collapse to the same comparison key.
  • The current benchmark results were collected locally on macOS and should be treated as machine-specific rather than universal.
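The suffix-collapsing normalization above can be sketched like this. The regex and function name are illustrative guesses at the idea, not the harness's actual rule:

```python
import re

# Quantization/precision suffixes that should collapse to one comparison key,
# e.g. "-Q4_K_M" (GGUF) and "-4bit" (MLX). Illustrative, not exhaustive.
_QUANT_SUFFIX = re.compile(
    r"[-_.](q\d[_a-z0-9]*|[0-9]+bit|f16|bf16|fp16)$", re.IGNORECASE
)

def comparison_key(model_id: str) -> str:
    """Strip a trailing quantization suffix so matched exports pair up."""
    return _QUANT_SUFFIX.sub("", model_id).lower()
```

With this, `Qwen2.5-0.5B-Instruct-Q4_K_M` and `Qwen2.5-0.5B-Instruct-4bit` both normalize to `qwen2.5-0.5b-instruct` and are benchmarked as the same model family.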

Current benchmark results

Paired-export benchmark using Qwen2.5-0.5B-Instruct:

  • mlx: local 4bit MLX export
  • llama: local Q4_K_M GGUF export
  • concurrency sweep: 1,4,8
  • iterations: 3
  • warmup: 1
  • max tokens: 64
backend=llama  startup_s=6.706  model_id=Qwen2.5-0.5B-Instruct-Q4_K_M
concurrency  avg_ttft_s  avg_total_s  avg_batch_wall_s  avg_decode_tok_s  avg_overall_tok_s  avg_batch_overall_tok_s
          1       0.016        0.684             0.687            95.915             93.630                   93.161
          4       0.045        1.012             1.017            66.228             63.308                  251.875
          8       0.512        1.470             1.963            66.842             48.946                  260.839

backend=mlx  startup_s=6.031  model_id=Qwen2.5-0.5B-Instruct-4bit
concurrency  avg_ttft_s  avg_total_s  avg_batch_wall_s  avg_decode_tok_s  avg_overall_tok_s  avg_batch_overall_tok_s
          1       0.103        0.520             0.520           153.386            123.120                  123.067
          4       0.394        1.471             1.477            59.463             43.544                  173.495
          8       0.881        2.380             2.488            46.009             29.898                  206.845
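As a sanity check, the batch-level throughput column is consistent with roughly concurrency × max_tokens divided by the batch wall time. A quick back-of-envelope (assuming each request decoded close to the 64-token cap; actual per-run token counts may vary slightly):

```python
def batch_tok_s(concurrency: int, tokens_per_req: int, batch_wall_s: float) -> float:
    """Approximate aggregate tokens/s for one batched iteration."""
    return concurrency * tokens_per_req / batch_wall_s

# llama at concurrency 8: 8 requests x ~64 tokens over a 1.963 s batch wall
print(round(batch_tok_s(8, 64, 1.963), 1))  # ~260.8, close to the table's 260.839
```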

Key takeaways from that run:

  • MLX wins single-request throughput and total time on this machine.
  • llama keeps a much lower TTFT at low concurrency.
  • llama scales better once concurrency increases.

Validation

  • just build
  • cargo test -p mesh-llm
  • cargo check -p mesh-llm
  • python3 -m py_compile evals/backend-benchmark.py
  • Local paired-export benchmark using Qwen2.5-0.5B-Instruct in f16/bf16 and Q4_K_M/4bit forms

@i386 i386 changed the title [codex] Add local MLX backend support and benchmark tooling [experimental] Add local MLX backend support and benchmark tooling Mar 30, 2026
@i386 i386 changed the title [experimental] Add local MLX backend support and benchmark tooling Add local MLX backend support and benchmark tooling Mar 30, 2026
@michaelneale (Collaborator):
super excited for this one!

@i386 i386 force-pushed the codex/mlx-backend-benchmarking branch from a86fdfc to f8ae4b4 Compare March 30, 2026 09:43
@i386 i386 changed the base branch from main to feat/model-management March 30, 2026 09:43
@i386 i386 requested a review from michaelneale March 30, 2026 10:45
@i386 i386 changed the base branch from feat/model-management to main March 31, 2026 03:33
@i386 i386 force-pushed the codex/mlx-backend-benchmarking branch from 02d186c to a86fdfc Compare March 31, 2026 03:33
@michaelneale (Collaborator):
I think we can retire this one in favour of a native one: #103
