
Expose presence_penalty, frequency_penalty, and per-penalty context_size on the server API#1023

Open
esaruoho wants to merge 1 commit into Blaizzy:main from esaruoho:expose-presence-frequency-penalty

Conversation

@esaruoho

Closes #1021.

What

Surface presence_penalty, frequency_penalty, and per-penalty *_context_size fields on the /chat/completions and /responses endpoints, and forward them to make_logits_processors. mlx_lm.sample_utils.make_logits_processors already supports these — they were just never plumbed through.

No default behaviour changes; every new field defaults to None, so make_logits_processors falls back to its existing defaults if a caller omits them. The PR is purely additive surface area.
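For reference, a request exercising the new surface might look like the sketch below. The penalty field names come from this PR; the model name and values are illustrative, and omitted fields are simply not forwarded:

```python
import json

# Hypothetical /chat/completions payload using the fields added by this PR.
# Unset penalty fields are omitted entirely, so the server falls back to
# make_logits_processors' own defaults.
payload = {
    "model": "mlx-community/GLM-OCR-bf16",
    "messages": [{"role": "user", "content": "Transcribe this page."}],
    "temperature": 0,
    "max_tokens": 4096,
    "repetition_penalty": 1.1,
    "frequency_penalty": 1.0,       # new: additive, scales with token count
    "presence_penalty": 0.5,        # new: additive, fires once per repeated token
    "frequency_context_size": 256,  # new: window the penalty looks back over
}
print(json.dumps(payload, indent=2))
```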

Why

While debugging GLM-OCR-bf16 (#1021), I confirmed end-to-end with debug instrumentation that repetition_penalty is applied through the sampler — but the multiplicative penalty discounts a token once per unique appearance in the recent window (the Keskar 2019 formulation), not per occurrence. For VLMs whose logits are sharply peaked (OCR pages, classifier-style outputs), repetition_penalty=1.1 (the typical text-LLM default) is below the threshold needed to move the argmax at temperature=0, and the model gets stuck looping a short heading + paragraph dozens of times until max_tokens cuts it off.
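The argmax argument can be seen with a toy calculation. The sketch below implements the Keskar-style multiplicative penalty on hand-picked peaked logits (the numbers are illustrative, not taken from the model):

```python
def apply_repetition_penalty(logits, seen, p):
    # Keskar et al. 2019 formulation: divide positive logits (multiply
    # negative ones) by p, once per *unique* seen token, regardless of
    # how many times it occurred in the window.
    out = list(logits)
    for t in seen:
        out[t] = out[t] / p if out[t] > 0 else out[t] * p
    return out

# Sharply peaked logits of the kind OCR decoding produces: the looping
# token (index 0) leads the runner-up by a wide margin.
logits = [10.0, 6.0, 1.0]
seen = {0}

for p in (1.1, 2.0):
    penalized = apply_repetition_penalty(logits, seen, p)
    argmax = max(range(len(penalized)), key=penalized.__getitem__)
    print(f"p={p}: penalized={penalized}, argmax={argmax}")
```

With p=1.1 the penalized leader (10/1.1 ≈ 9.09) still beats the runner-up, so greedy decoding keeps looping; only at p=2.0 (10/2 = 5 < 6) does the argmax move.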

Empirical reproduction on a single page of English prose (image-only PDF, 200 DPI), temperature=0, max_tokens=4096:

| repetition_penalty | extra penalty          | output (chars) | repeated heading occurrences |
|--------------------|------------------------|----------------|------------------------------|
| 1.1                | (none)                 | 16,870         | 54                           |
| 1.5                | (none)                 | 16,594         | 80                           |
| 2.0                | (none)                 | 1,188          | 2 ✅                         |
| 1.1                | frequency_penalty=0.5  | 4,444          | 14                           |
| 1.1                | frequency_penalty=1.0  | 4,164          | 7                            |
| 1.1                | presence_penalty=0.5   | 8,446          | 27                           |

Ground-truth Ollama/llama.cpp on the same image: ~1,600 chars, 1 occurrence.

The point is not to argue for a particular default — it's that frequency_penalty and presence_penalty are the standard knobs for this failure mode, and they're already supported by the underlying sampler. Surfacing them lets callers tune for VLM/OCR workloads without resorting to extreme repetition_penalty values that suppress too aggressively.
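The additive behaviour of these two knobs can be sketched in the OpenAI-style formulation (counts and logits below are illustrative, not mlx_lm's internal implementation):

```python
def apply_openai_penalties(logits, counts, presence=0.0, frequency=0.0):
    # OpenAI-style additive penalties: presence subtracts a flat amount
    # once a token has appeared at all; frequency scales linearly with
    # its occurrence count, so long loops are punished hard.
    return [
        logit - presence * (counts.get(i, 0) > 0) - frequency * counts.get(i, 0)
        for i, logit in enumerate(logits)
    ]

logits = [10.0, 6.0, 1.0]
counts = {0: 54}  # heading token already emitted 54 times

penalized = apply_openai_penalties(logits, counts, presence=0.5, frequency=0.5)
argmax = max(range(len(penalized)), key=penalized.__getitem__)
print(f"penalized={penalized}, argmax={argmax}")
```

Because the frequency term grows with the count, even a modest 0.5 eventually overwhelms a peaked logit that a multiplicative 1.1 would never dislodge.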

Changes

  • mlx_vlm/server.py — extend GenerationParams with repetition_context_size, presence_penalty, presence_context_size, frequency_penalty, frequency_context_size (all Optional[…] = None), include them in shared_generation_kwargs. Updated the repetition_penalty description to flag the rp ≥ 2.0 caveat for VLMs. The new fields are inherited by both ChatRequest and GenerationRequest (used by /responses).
  • mlx_vlm/generate.py — extend generate_step keyword-only signature with the four new parameters, plumb them through to make_logits_processors. Built the processor kwargs as an explicit dict so unset fields are omitted rather than passed as None (preserving mlx_lm's positional defaults). Updated docstrings on both generate_step and generate.
  • mlx_vlm/tests/test_server.py — extend test_chat_completions_endpoint_forwards_explicit_sampling_args to assert the new fields reach generate(). (Could also extend test_responses_endpoint_forwards_new_sampling_args for symmetry — happy to add if you'd prefer.)
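The "explicit dict so unset fields are omitted" pattern from the generate.py change can be sketched as follows (the helper name is hypothetical; only the filtering idea mirrors the PR):

```python
def build_processor_kwargs(**maybe_set):
    # Only keep parameters the caller actually provided, so the downstream
    # make_logits_processors retains its own defaults for everything else
    # instead of receiving an explicit None.
    return {k: v for k, v in maybe_set.items() if v is not None}

kwargs = build_processor_kwargs(
    repetition_penalty=1.1,
    repetition_context_size=None,  # unset: dropped, not forwarded as None
    frequency_penalty=1.0,
    presence_penalty=None,         # unset: dropped
)
print(kwargs)
```

This is safer than forwarding every field unconditionally, since a library that treats `None` as a literal value (rather than "use the default") would otherwise change behaviour for callers who never set the field.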

Verification

Tested on Apple Silicon (Mac Mini, mlx-vlm installed from the patched branch) against mlx-community/GLM-OCR-bf16. With debug logging, all five new params are received correctly by make_logits_processors. Different combinations of the new params produce visibly different outputs at temperature=0, confirming they reach the sampler.

repetition_penalty=2.0 alone gives a clean (1,188 chars) transcription on the previously-looping page. frequency_penalty combined with a milder repetition_penalty measurably reduces but does not fully eliminate looping at the values tested — so the new knob is a genuine tool, not a free lunch, but it's the one users currently can't reach.

Compatibility

Pure addition. Existing requests that don't set the new fields behave identically. The only observable change is the new fields appearing in the OpenAPI schema. No version bump required.

Notes

Happy to iterate on:

  • Whether to bump repetition_penalty's default for known VLM model types (probably not — the failure surface differs by model).
  • Whether to also add a small loop-detector / N-gram check (separate change, mentioned for completeness).
  • Whether the repetition_penalty description copy is too opinionated — feel free to trim.

Thanks for maintaining this project. The Apple Silicon VLM ecosystem benefits enormously from it.

`mlx_lm.sample_utils.make_logits_processors` already supports presence_penalty,
frequency_penalty and the per-penalty context_size parameters, but the mlx-vlm
server only forwarded repetition_penalty + logit_bias. This made it impossible
to control the only knobs that reliably break high-frequency repetition loops
through the public API.

Concretely, on GLM-OCR-bf16 a full page of prose at temperature=0 with
repetition_penalty=1.1 (the typical text-LLM default) loops on the same short
heading 54 times within max_tokens=4096 — see Blaizzy#1021 for the full reproduction
and traces. The multiplicative repetition penalty applies once per *unique*
token in the recent window regardless of frequency, so for sharply-peaked VLM
logits values <~1.7 do not move the argmax. frequency_penalty (additive,
scales with count) and presence_penalty (additive, fires on first repeat)
are the standard remedies and were already supported one layer down; this
change just plumbs them through.

Changes:

* `server.py`: add `repetition_context_size`, `presence_penalty`,
  `presence_context_size`, `frequency_penalty`, `frequency_context_size`
  fields to `GenerationParams` and include them in `shared_generation_kwargs`.
  All default to None so existing client behaviour is unchanged. Updated
  `repetition_penalty` description to flag the VLM rp>=2.0 caveat.
* `generate.py`: extend `generate_step` signature with the new parameters and
  forward them to `make_logits_processors`. Build the kwargs dict explicitly
  so unset fields fall back to mlx_lm defaults instead of being overridden by
  None. Update both docstrings.
* `tests/test_server.py`: extend the chat completions forwarding test to
  assert the new fields reach `generate()`.

No default behaviour changes; this only widens the API surface so callers can
opt in. Tested locally on Apple Silicon with GLM-OCR-bf16: the new params
plumb through correctly and produce visibly different outputs at the same
temperature, confirming they reach the sampler.

Refs: Blaizzy#1021


Development

Successfully merging this pull request may close these issues.

GLM-OCR: /chat/completions produces repetition loops on prose; repetition_penalty does not prevent them
