feat: OpenAI Responses API with structured tool calling and multi-turn support#996
Open

eloe wants to merge 3 commits into Blaizzy:main from
Add full OpenAI Responses API (/v1/responses) compliance including:
- Structured function_call output items (parsed from model text)
- function_call_output input items for multi-turn tool use
- previous_response_id with LRU response store (256 entries)
- instructions field with developer-to-system role normalization
- "text" type alias accepted alongside "input_text"
- tools/tool_choice passthrough to chat template and response echo
- Streaming SSE with sequence_number and [DONE] sentinel
- incomplete_details for length-truncated responses
- parallel_tool_calls, metadata field support

New files:
- responses_models.py: Self-contained Pydantic models for Responses API
- responses_store.py: Thread-safe LRU store for response replay
- tests/test_responses_api.py: 31 tests (models, store, endpoint, streaming)

Reference: OpenAI Responses API spec and waybarrios/vllm-mlx#214

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
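As a hedged sketch of the multi-turn tool flow described in this commit (request payloads only; the model name, response id, and call id are placeholders, not values from this PR):

```python
# Sketch of a two-turn /v1/responses exchange using function_call_output.
# Payload shapes follow the OpenAI Responses API; all id values and the
# model name here are illustrative.
import json

# Turn 1: the client offers a tool; the model may answer with a
# function_call output item.
first_request = {
    "model": "some-vlm",
    "input": "What is the weather in Lisbon?",
    "tools": [{
        "type": "function",
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }],
}

# Turn 2: the client sends the tool result back as a function_call_output
# input item, linking to turn 1 via previous_response_id instead of
# resending the whole conversation history.
second_request = {
    "model": "some-vlm",
    "previous_response_id": "resp_abc123",
    "input": [{
        "type": "function_call_output",
        "call_id": "call_xyz789",
        "output": json.dumps({"temp_c": 21, "conditions": "sunny"}),
    }],
}

print(json.dumps(second_request, indent=2))
```

The `previous_response_id` link is what the in-memory LRU store exists to resolve: the server replays the stored turn-1 context before appending the tool output.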
When the model generates <tool_call>...</tool_call> markup during streaming, detect the tag and suppress those tokens from being sent as response.output_text.delta events. This prevents raw tool call XML from being displayed to users (e.g., in Telegram via OpenClaw). The tool call is still parsed and emitted as structured function_call events after generation completes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
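A minimal sketch of the tag-suppression idea this commit describes. The tag names match the commit message; the function and variable names are illustrative, and this simplification assumes a tag is never split across two chunks (the real change also handles partial-tag matches at chunk boundaries):

```python
# Split a token stream into user-visible text and withheld tool-call payload.
OPEN_TAG, CLOSE_TAG = "<tool_call>", "</tool_call>"

def split_stream(chunks):
    """Return (visible_text, tool_payload) from a sequence of token chunks.

    Text inside <tool_call>...</tool_call> is withheld from the visible
    stream and collected for later parsing into function_call events.
    """
    visible, payload = [], []
    in_tool_call = False
    for chunk in chunks:
        while chunk:
            if in_tool_call:
                end = chunk.find(CLOSE_TAG)
                if end == -1:
                    payload.append(chunk)  # still inside the tool call
                    chunk = ""
                else:
                    payload.append(chunk[:end])
                    chunk = chunk[end + len(CLOSE_TAG):]
                    in_tool_call = False  # reset on the end tag
            else:
                start = chunk.find(OPEN_TAG)
                if start == -1:
                    visible.append(chunk)  # plain text, safe to emit
                    chunk = ""
                else:
                    visible.append(chunk[:start])
                    chunk = chunk[start + len(OPEN_TAG):]
                    in_tool_call = True
    return "".join(visible), "".join(payload)
```

In the streaming server, only the visible portion would be emitted as `response.output_text.delta` events, while the captured payload feeds the structured `function_call` events after generation completes.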
Critical fixes:
- Add recursion depth limit (50) and cycle detection for
previous_response_id chains to prevent stack overflow/DoS
- Fix tool call suppression: in_tool_call now resets on end tag,
partial-match only triggers on actual trailing prefix (not any <)
- Sanitize error messages in streaming — no longer leaks internals
Security fixes:
- Validate image URLs are non-empty before appending (prevents
load_image('') → 500 errors)
Spec compliance:
- Remove data: [DONE] sentinel (not part of Responses API spec,
response.completed is the terminal event)
Code quality:
- Fix mutable default lists in Pydantic models (use default_factory)
- Log tool parsing failures instead of silently swallowing
- Collect multi-part assistant messages into single message in replay
- Shorten import aliases to avoid unnecessary verbosity
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
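A sketch of what the bounded replay walk in the first fix might look like. Only the 50-entry depth limit and the cycle check come from the commit message; the store shape, names, and error text here are assumptions:

```python
# Walk a previous_response_id chain with cycle detection and a depth cap,
# so a malicious or corrupted chain cannot recurse forever.
MAX_CHAIN_DEPTH = 50

def collect_chain(store, response_id):
    """Collect stored responses oldest-first, refusing cycles and chains
    deeper than MAX_CHAIN_DEPTH."""
    chain, seen = [], set()
    current = response_id
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle detected in previous_response_id chain at {current!r}")
        if len(chain) >= MAX_CHAIN_DEPTH:
            raise ValueError("previous_response_id chain exceeds depth limit")
        seen.add(current)
        entry = store.get(current)
        if entry is None:
            raise KeyError(f"unknown response id {current!r}")
        chain.append(entry)
        current = entry.get("previous_response_id")
    chain.reverse()  # replay oldest turn first
    return chain
```

An iterative walk with an explicit `seen` set handles both failure modes at once: the set catches cycles, and the length check caps honest-but-long chains.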
Summary
Implements the OpenAI Responses API (/v1/responses), enabling agent frameworks that target this newer API surface to work with mlx-vlm. This is a growing requirement as frameworks such as the OpenAI Agents SDK migrate from Chat Completions to the Responses API.

Features
Request handling:
- `input` as a string or structured message list (text, image, function_call_output)
- `instructions` field mapped to a system message
- `previous_response_id` for multi-turn conversation replay via an in-memory LRU store (256 entries)
- `developer` role mapped to `system` (OpenAI convention)

Response format:
- `ResponseObject` with `output[]` containing `message` and `function_call` items
- `usage` with `input_tokens`, `output_tokens`, `total_tokens`, and `input_tokens_details.cached_tokens`
- `status`: `completed` or `incomplete` (with `incomplete_details.reason`)

Streaming (SSE):
Complete event sequence per the spec:
1. `response.created`
2. `response.in_progress`
3. `response.output_item.added`
4. `response.content_part.added`
5. `response.output_text.delta` (per token)
6. `response.output_text.done`
7. `response.content_part.done`
8. `response.output_item.done`
9. `response.function_call_arguments.delta` / `.done` (when tools detected)
10. `response.completed`

Tool calling:
- Parsed from model text via existing helpers (`_infer_tool_parser`, `process_tool_calls`)
- Emitted as structured `ResponseFunctionCallItem` output items

Architecture
- `responses_models.py`: self-contained Pydantic models for the Responses API
- `responses_store.py`: thread-safe LRU store for `previous_response_id` replay
- `prompt_utils.py`: `_normalize_tools()` for Responses-to-Chat tool format bridging
- `server.py`: the `/v1/responses` endpoint

Tests
40 tests across 8 categories.
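As a sketch, a thread-safe 256-entry LRU store of the kind `responses_store.py` is described as providing might look like this (class and method names are assumptions, not the PR's actual API):

```python
# Bounded, thread-safe response store for previous_response_id replay.
# OrderedDict tracks recency; a lock guards concurrent request handlers.
import threading
from collections import OrderedDict

class ResponseStore:
    def __init__(self, max_entries=256):
        self._max = max_entries
        self._lock = threading.Lock()
        self._data = OrderedDict()

    def put(self, response_id, payload):
        with self._lock:
            self._data[response_id] = payload
            self._data.move_to_end(response_id)  # newest at the end
            while len(self._data) > self._max:
                self._data.popitem(last=False)  # evict least recently used

    def get(self, response_id):
        with self._lock:
            payload = self._data.get(response_id)
            if payload is not None:
                self._data.move_to_end(response_id)  # refresh recency
            return payload
```

The fixed capacity is what keeps `previous_response_id` replay from growing without bound: once 256 responses are stored, the least recently touched entry is silently evicted.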