Expose presence_penalty, frequency_penalty, and per-penalty context_size on the server API#1023
Open
esaruoho wants to merge 1 commit into Blaizzy:main
Conversation
`mlx_lm.sample_utils.make_logits_processors` already supports `presence_penalty`, `frequency_penalty` and the per-penalty `context_size` parameters, but the mlx-vlm server only forwarded `repetition_penalty` + `logit_bias`. This made it impossible to control, through the public API, the only knobs that reliably break high-frequency repetition loops.

Concretely, on GLM-OCR-bf16 a full page of prose at `temperature=0` with `repetition_penalty=1.1` (the typical text-LLM default) loops on the same short heading 54 times within `max_tokens=4096`; see Blaizzy#1021 for the full reproduction and traces. The multiplicative repetition penalty applies once per *unique* token in the recent window regardless of frequency, so for sharply-peaked VLM logits, values below ~1.7 do not move the argmax. `frequency_penalty` (additive, scales with count) and `presence_penalty` (additive, fires on first repeat) are the standard remedies and were already supported one layer down; this change just plumbs them through.

Changes:

* `server.py`: add `repetition_context_size`, `presence_penalty`, `presence_context_size`, `frequency_penalty`, `frequency_context_size` fields to `GenerationParams` and include them in `shared_generation_kwargs`. All default to None so existing client behaviour is unchanged. Updated the `repetition_penalty` description to flag the VLM rp>=2.0 caveat.
* `generate.py`: extend the `generate_step` signature with the new parameters and forward them to `make_logits_processors`. Build the kwargs dict explicitly so unset fields fall back to mlx_lm defaults instead of being overridden by None. Update both docstrings.
* `tests/test_server.py`: extend the chat completions forwarding test to assert the new fields reach `generate()`.

No default behaviour changes; this only widens the API surface so callers can opt in. Tested locally on Apple Silicon with GLM-OCR-bf16: the new params plumb through correctly and produce visibly different outputs at the same temperature, confirming they reach the sampler.
Refs: Blaizzy#1021
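The "build the kwargs dict explicitly" pattern described above can be sketched roughly as follows. This is a minimal illustration, not the actual mlx-vlm code; the helper name `build_processor_kwargs` is invented for the example.

```python
# Only include parameters the caller actually set, so unset (None) values
# fall back to mlx_lm's own defaults instead of overriding them with None.

def build_processor_kwargs(
    repetition_penalty=None,
    repetition_context_size=None,
    presence_penalty=None,
    presence_context_size=None,
    frequency_penalty=None,
    frequency_context_size=None,
):
    candidates = {
        "repetition_penalty": repetition_penalty,
        "repetition_context_size": repetition_context_size,
        "presence_penalty": presence_penalty,
        "presence_context_size": presence_context_size,
        "frequency_penalty": frequency_penalty,
        "frequency_context_size": frequency_context_size,
    }
    # Drop unset entries entirely rather than forwarding them as None.
    return {k: v for k, v in candidates.items() if v is not None}
```

A caller that sets only `frequency_penalty` thus forwards a one-entry dict, and every other knob keeps its downstream default.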
Closes #1021.
What
Surface `presence_penalty`, `frequency_penalty`, and the per-penalty `*_context_size` fields on the `/chat/completions` and `/responses` endpoints, and forward them to `make_logits_processors`. `mlx_lm.sample_utils.make_logits_processors` already supports these; they were just never plumbed through.

No default behaviour changes; every new field defaults to `None`, so `make_logits_processors` falls back to its existing defaults if a caller omits them. The PR is purely additive surface area.

Why
While debugging GLM-OCR-bf16 (#1021), I confirmed end-to-end with debug instrumentation that `repetition_penalty` is applied through the sampler, but the multiplicative penalty discounts a token once per unique appearance in the recent window (the Keskar 2019 formulation), not per occurrence. For VLMs whose logits are sharply peaked (OCR pages, classifier-style outputs), `repetition_penalty=1.1` (the typical text-LLM default) is below the threshold needed to move the argmax at `temperature=0`, and the model gets stuck looping a short heading + paragraph dozens of times until `max_tokens` cuts it off.

Empirical reproduction on a single page of English prose (image-only PDF, 200 DPI) at `temperature=0`, `max_tokens=4096`, comparing `repetition_penalty` alone against `frequency_penalty=0.5`, `frequency_penalty=1.0`, and `presence_penalty=0.5`. Ground-truth Ollama/llama.cpp on the same image: ~1,600 chars, 1 occurrence.
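The arithmetic behind this failure mode can be shown with toy numbers. This is a simplified sketch of the two penalty formulations, not mlx_lm's actual implementation, and the logit values are invented for illustration.

```python
def apply_repetition_penalty(logits, repeated, penalty):
    # CTRL-style multiplicative penalty: divide positive logits by the
    # penalty, multiply negative ones; applied once per unique repeated
    # token, no matter how many times it has occurred.
    out = list(logits)
    for t in repeated:
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out

def apply_frequency_penalty(logits, counts, penalty):
    # OpenAI-style additive penalty: subtract penalty * occurrence count,
    # so the discount keeps growing as the loop repeats.
    return [l - penalty * counts.get(i, 0) for i, l in enumerate(logits)]

# Sharply peaked logits: token 0 dominates and has already looped 20 times.
logits = [12.0, 4.0, 3.5]

rp = apply_repetition_penalty(logits, repeated={0}, penalty=1.1)
# 12.0 / 1.1 is about 10.9, still far above 4.0: the argmax is unchanged
# regardless of how long the loop has run.

fp = apply_frequency_penalty(logits, counts={0: 20}, penalty=0.5)
# 12.0 - 0.5 * 20 = 2.0, now below 4.0: the loop finally breaks.
```

The multiplicative penalty is a fixed discount per unique token, so a dominant logit shrugs it off; the count-scaled additive penalty grows until it wins.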
The point is not to argue for a particular default; it's that `frequency_penalty` and `presence_penalty` are the standard knobs for this failure mode, and they're already supported by the underlying sampler. Surfacing them lets callers tune for VLM/OCR workloads without resorting to extreme `repetition_penalty` values that suppress too aggressively.

Changes
* `mlx_vlm/server.py`: extend `GenerationParams` with `repetition_context_size`, `presence_penalty`, `presence_context_size`, `frequency_penalty`, `frequency_context_size` (all `Optional[…] = None`) and include them in `shared_generation_kwargs`. Updated the `repetition_penalty` description to flag the rp >= 2.0 caveat for VLMs. The new fields are inherited by both `ChatRequest` and `GenerationRequest` (used by `/responses`).
* `mlx_vlm/generate.py`: extend the keyword-only `generate_step` signature with the four new parameters and plumb them through to `make_logits_processors`. The processor kwargs are built as an explicit dict so unset fields are omitted rather than passed as `None` (preserving mlx_lm's positional defaults). Updated docstrings on both `generate_step` and `generate`.
* `mlx_vlm/tests/test_server.py`: extend `test_chat_completions_endpoint_forwards_explicit_sampling_args` to assert the new fields reach `generate()`. (Could also extend `test_responses_endpoint_forwards_new_sampling_args` for symmetry; happy to add if you'd prefer.)

Verification
Tested on Apple Silicon (Mac Mini, mlx-vlm installed from the patched branch) against `mlx-community/GLM-OCR-bf16`. With debug logging, all five new params are received correctly by `make_logits_processors`. Different combinations of the new params produce visibly different outputs at `temperature=0`, confirming they reach the sampler. `repetition_penalty=2.0` alone gives a clean (1,188 chars) transcription on the previously-looping page. `frequency_penalty` combined with a milder `repetition_penalty` measurably reduces but does not fully eliminate looping at the values tested; the new knob is a genuine tool, not a free lunch, but it's the one users currently can't reach.

Compatibility
Pure addition. Existing requests that don't set the new fields behave identically. The only behavioural change is the new fields appearing in the OpenAPI schema. No version bump required.
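For reference, a caller opting into the new fields might send something like the following. This is a sketch using only the standard library; the host, port, context-size value, and message content are placeholders, and the request is constructed but not actually sent.

```python
import json
import urllib.request

payload = {
    "model": "mlx-community/GLM-OCR-bf16",
    "messages": [{"role": "user", "content": "Transcribe the attached page."}],
    "temperature": 0,
    "max_tokens": 4096,
    # New fields surfaced by this PR; omit any of them to keep defaults.
    "frequency_penalty": 1.0,
    "frequency_context_size": 200,
    "presence_penalty": 0.5,
}
req = urllib.request.Request(
    "http://localhost:8080/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# resp = urllib.request.urlopen(req)  # not executed in this sketch
```

Clients that never add these keys send exactly the same payload as before, which is why the change is backward compatible.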
Notes
Happy to iterate on:
* `repetition_penalty`'s default for known VLM model types (probably not; the failure surface differs by model).
* Whether the `repetition_penalty` description copy is too opinionated; feel free to trim.

Thanks for maintaining this project. The Apple Silicon VLM ecosystem benefits enormously from it.