
Expose presence_penalty, frequency_penalty, and per-penalty context_size on the server API#1023

Open
esaruoho wants to merge 1 commit into Blaizzy:main from esaruoho:expose-presence-frequency-penalty

Conversation

@esaruoho

Closes #1021.

What

Surface presence_penalty, frequency_penalty, and per-penalty *_context_size fields on the /chat/completions and /responses endpoints, and forward them to make_logits_processors. mlx_lm.sample_utils.make_logits_processors already supports these — they were just never plumbed through.

No default behaviour changes; every new field defaults to None, so make_logits_processors falls back to its existing defaults if a caller omits them. The PR is purely additive surface area.
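For reference, a request exercising the new surface might look like the sketch below. The penalty field names come from this PR; the model name and values are illustrative, and omitted fields are simply not forwarded:

```python
import json

# Hypothetical /chat/completions payload using the fields added by this PR.
# Unset penalty fields are omitted entirely, so the server falls back to
# make_logits_processors' own defaults.
payload = {
    "model": "mlx-community/GLM-OCR-bf16",
    "messages": [{"role": "user", "content": "Transcribe this page."}],
    "temperature": 0,
    "max_tokens": 4096,
    "repetition_penalty": 1.1,
    "frequency_penalty": 1.0,       # new: additive, scales with token count
    "presence_penalty": 0.5,        # new: additive, fires once per repeated token
    "frequency_context_size": 256,  # new: window the penalty looks back over
}
print(json.dumps(payload, indent=2))
```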

Why

While debugging GLM-OCR-bf16 (#1021), I confirmed end-to-end with debug instrumentation that repetition_penalty is applied through the sampler — but the multiplicative penalty discounts a token once per unique appearance in the recent window (the Keskar 2019 formulation), not per occurrence. For VLMs whose logits are sharply peaked (OCR pages, classifier-style outputs), repetition_penalty=1.1 (the typical text-LLM default) is below the threshold needed to move the argmax at temperature=0, and the model gets stuck looping a short heading + paragraph dozens of times until max_tokens cuts it off.
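The argmax argument can be seen with a toy calculation. The sketch below implements the Keskar-style multiplicative penalty on hand-picked peaked logits (the numbers are illustrative, not taken from the model):

```python
def apply_repetition_penalty(logits, seen, p):
    # Keskar et al. 2019 formulation: divide positive logits (multiply
    # negative ones) by p, once per *unique* seen token, regardless of
    # how many times it occurred in the window.
    out = list(logits)
    for t in seen:
        out[t] = out[t] / p if out[t] > 0 else out[t] * p
    return out

# Sharply peaked logits of the kind OCR decoding produces: the looping
# token (index 0) leads the runner-up by a wide margin.
logits = [10.0, 6.0, 1.0]
seen = {0}

for p in (1.1, 2.0):
    penalized = apply_repetition_penalty(logits, seen, p)
    argmax = max(range(len(penalized)), key=penalized.__getitem__)
    print(f"p={p}: penalized={penalized}, argmax={argmax}")
```

With p=1.1 the penalized leader (10/1.1 ≈ 9.09) still beats the runner-up, so greedy decoding keeps looping; only at p=2.0 (10/2 = 5 < 6) does the argmax move.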

Empirical reproduction on a single page of English prose (image-only PDF, 200 DPI), temperature=0, max_tokens=4096:

| repetition_penalty | extra penalty          | output (chars) | repeated heading occurrences |
|--------------------|------------------------|----------------|------------------------------|
| 1.1                | (none)                 | 16,870         | 54                           |
| 1.5                | (none)                 | 16,594         | 80                           |
| 2.0                | (none)                 | 1,188          | 2 ✅                         |
| 1.1                | frequency_penalty=0.5  | 4,444          | 14                           |
| 1.1                | frequency_penalty=1.0  | 4,164          | 7                            |
| 1.1                | presence_penalty=0.5   | 8,446          | 27                           |

Ground-truth Ollama/llama.cpp on the same image: ~1,600 chars, 1 occurrence.

The point is not to argue for a particular default — it's that frequency_penalty and presence_penalty are the standard knobs for this failure mode, and they're already supported by the underlying sampler. Surfacing them lets callers tune for VLM/OCR workloads without resorting to extreme repetition_penalty values that suppress too aggressively.
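The additive behaviour of these two knobs can be sketched in the OpenAI-style formulation (counts and logits below are illustrative, not mlx_lm's internal implementation):

```python
def apply_openai_penalties(logits, counts, presence=0.0, frequency=0.0):
    # OpenAI-style additive penalties: presence subtracts a flat amount
    # once a token has appeared at all; frequency scales linearly with
    # its occurrence count, so long loops are punished hard.
    return [
        logit - presence * (counts.get(i, 0) > 0) - frequency * counts.get(i, 0)
        for i, logit in enumerate(logits)
    ]

logits = [10.0, 6.0, 1.0]
counts = {0: 54}  # heading token already emitted 54 times

penalized = apply_openai_penalties(logits, counts, presence=0.5, frequency=0.5)
argmax = max(range(len(penalized)), key=penalized.__getitem__)
print(f"penalized={penalized}, argmax={argmax}")
```

Because the frequency term grows with the count, even a modest 0.5 eventually overwhelms a peaked logit that a multiplicative 1.1 would never dislodge.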

Changes

  • mlx_vlm/server.py — extend GenerationParams with repetition_context_size, presence_penalty, presence_context_size, frequency_penalty, frequency_context_size (all Optional[…] = None), include them in shared_generation_kwargs. Updated the repetition_penalty description to flag the rp ≥ 2.0 caveat for VLMs. The new fields are inherited by both ChatRequest and GenerationRequest (used by /responses).
  • mlx_vlm/generate.py — extend generate_step keyword-only signature with the four new parameters, plumb them through to make_logits_processors. Built the processor kwargs as an explicit dict so unset fields are omitted rather than passed as None (preserving mlx_lm's positional defaults). Updated docstrings on both generate_step and generate.
  • mlx_vlm/tests/test_server.py — extend test_chat_completions_endpoint_forwards_explicit_sampling_args to assert the new fields reach generate(). (Could also extend test_responses_endpoint_forwards_new_sampling_args for symmetry — happy to add if you'd prefer.)
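The "explicit dict so unset fields are omitted" pattern from the generate.py change can be sketched as follows (the helper name is hypothetical; only the filtering idea mirrors the PR):

```python
def build_processor_kwargs(**maybe_set):
    # Only keep parameters the caller actually provided, so the downstream
    # make_logits_processors retains its own defaults for everything else
    # instead of receiving an explicit None.
    return {k: v for k, v in maybe_set.items() if v is not None}

kwargs = build_processor_kwargs(
    repetition_penalty=1.1,
    repetition_context_size=None,  # unset: dropped, not forwarded as None
    frequency_penalty=1.0,
    presence_penalty=None,         # unset: dropped
)
print(kwargs)
```

This is safer than forwarding every field unconditionally, since a library that treats `None` as a literal value (rather than "use the default") would otherwise change behaviour for callers who never set the field.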

Verification

Tested on Apple Silicon (Mac Mini, mlx-vlm installed from the patched branch) against mlx-community/GLM-OCR-bf16. With debug logging, all five new params are received correctly by make_logits_processors. Different combinations of the new params produce visibly different outputs at temperature=0, confirming they reach the sampler.

repetition_penalty=2.0 alone gives a clean (1,188 chars) transcription on the previously-looping page. frequency_penalty combined with a milder repetition_penalty measurably reduces but does not fully eliminate looping at the values tested — so the new knob is a genuine tool, not a free lunch, but it's the one users currently can't reach.

Compatibility

Pure addition. Existing requests that don't set the new fields behave identically. The only observable change is the new fields appearing in the OpenAPI schema. No version bump required.

Notes

Happy to iterate on:

  • Whether to bump repetition_penalty's default for known VLM model types (probably not — the failure surface differs by model).
  • Whether to also add a small loop-detector / N-gram check (separate change, mentioned for completeness).
  • Whether the repetition_penalty description copy is too opinionated — feel free to trim.

Thanks for maintaining this project. The Apple Silicon VLM ecosystem benefits enormously from it.

`mlx_lm.sample_utils.make_logits_processors` already supports presence_penalty,
frequency_penalty and the per-penalty context_size parameters, but the mlx-vlm
server only forwarded repetition_penalty + logit_bias. This made it impossible
to control the only knobs that reliably break high-frequency repetition loops
through the public API.

Concretely, on GLM-OCR-bf16 a full page of prose at temperature=0 with
repetition_penalty=1.1 (the typical text-LLM default) loops on the same short
heading 54 times within max_tokens=4096 — see Blaizzy#1021 for the full reproduction
and traces. The multiplicative repetition penalty applies once per *unique*
token in the recent window regardless of frequency, so for sharply-peaked VLM
logits values <~1.7 do not move the argmax. frequency_penalty (additive,
scales with count) and presence_penalty (additive, fires on first repeat)
are the standard remedies and were already supported one layer down; this
change just plumbs them through.

Changes:

* `server.py`: add `repetition_context_size`, `presence_penalty`,
  `presence_context_size`, `frequency_penalty`, `frequency_context_size`
  fields to `GenerationParams` and include them in `shared_generation_kwargs`.
  All default to None so existing client behaviour is unchanged. Updated
  `repetition_penalty` description to flag the VLM rp>=2.0 caveat.
* `generate.py`: extend `generate_step` signature with the new parameters and
  forward them to `make_logits_processors`. Build the kwargs dict explicitly
  so unset fields fall back to mlx_lm defaults instead of being overridden by
  None. Update both docstrings.
* `tests/test_server.py`: extend the chat completions forwarding test to
  assert the new fields reach `generate()`.

No default behaviour changes; this only widens the API surface so callers can
opt in. Tested locally on Apple Silicon with GLM-OCR-bf16: the new params
plumb through correctly and produce visibly different outputs at the same
temperature, confirming they reach the sampler.

Refs: Blaizzy#1021


Development

Successfully merging this pull request may close these issues.

GLM-OCR: /chat/completions produces repetition loops on prose; repetition_penalty does not prevent them
