
feat: OpenAI Responses API with structured tool calling and multi-turn support #996

Open
eloe wants to merge 3 commits into Blaizzy:main from eloe:upstream/responses-api

Conversation


@eloe eloe commented Apr 9, 2026

Summary

Implements the OpenAI Responses API (/v1/responses), enabling agent frameworks that target this newer API surface to work with mlx-vlm. This is a growing requirement as frameworks like OpenAI Agents SDK and others migrate from Chat Completions to the Responses API.

Features

Request handling:

  • Accepts input as a string or structured message list (text, image, function_call_output)
  • instructions field mapped to system message
  • previous_response_id for multi-turn conversation replay via in-memory LRU store (256 entries)
  • developer role mapped to system (OpenAI convention)
  • Tool definitions in flat Responses format automatically normalized to nested Chat Completions format for Jinja template compatibility
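The flat-to-nested tool normalization described above can be sketched as follows. This is an illustrative reconstruction, not the PR's actual `_normalize_tools()`: it assumes the flat Responses shape puts `name`, `description`, and `parameters` at the top level, while Chat Completions templates expect them nested under a `function` key.

```python
def normalize_tools(tools):
    """Convert flat Responses-style tool definitions to the nested
    Chat Completions shape expected by most Jinja chat templates.
    (Sketch of what the PR's _normalize_tools() likely does.)"""
    normalized = []
    for tool in tools or []:
        if tool.get("type") == "function" and "function" not in tool:
            # Flat Responses format: name/description/parameters at top level.
            normalized.append({
                "type": "function",
                "function": {
                    "name": tool.get("name"),
                    "description": tool.get("description", ""),
                    "parameters": tool.get("parameters", {}),
                },
            })
        else:
            # Already nested (Chat Completions format): pass through unchanged.
            normalized.append(tool)
    return normalized

flat = [{"type": "function", "name": "get_weather",
         "description": "Look up weather", "parameters": {"type": "object"}}]
print(normalize_tools(flat)[0]["function"]["name"])  # -> get_weather
```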

Response format:

  • Full ResponseObject with output[] containing message and function_call items
  • usage with input_tokens, output_tokens, total_tokens, and input_tokens_details.cached_tokens
  • status: completed or incomplete (with incomplete_details.reason)
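For orientation, a completed response carrying the fields listed above might look roughly like this. The literal values are made up; the field names follow the public Responses API shape, not code from this PR.

```python
# Illustrative (not from the PR): rough shape of a completed ResponseObject.
response = {
    "id": "resp_abc123",
    "object": "response",
    "status": "completed",  # or "incomplete" with incomplete_details.reason
    "output": [
        {
            "type": "message",
            "role": "assistant",
            "content": [{"type": "output_text", "text": "Hello!"}],
        },
        # function_call items appear here when the model invokes a tool
    ],
    "usage": {
        "input_tokens": 12,
        "output_tokens": 3,
        "total_tokens": 15,
        "input_tokens_details": {"cached_tokens": 0},
    },
}
assert response["usage"]["total_tokens"] == (
    response["usage"]["input_tokens"] + response["usage"]["output_tokens"])
```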

Streaming (SSE):

Complete event sequence per the spec:

  1. response.created
  2. response.in_progress
  3. response.output_item.added
  4. response.content_part.added
  5. response.output_text.delta (per token)
  6. response.output_text.done
  7. response.content_part.done
  8. response.output_item.done
  9. response.function_call_arguments.delta / .done (when tools detected)
  10. response.completed
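A client consumes that sequence by parsing `event:`/`data:` pairs off the SSE stream and concatenating the `response.output_text.delta` payloads. The sketch below runs against a hard-coded toy stream; the `delta` payload key and the abbreviated event bodies are assumptions for illustration, not taken from the PR.

```python
import json

# Toy SSE payload following the event sequence above (abbreviated).
raw_sse = """\
event: response.output_text.delta
data: {"delta": "Hel"}

event: response.output_text.delta
data: {"delta": "lo"}

event: response.completed
data: {}

"""

def iter_sse_events(text):
    """Yield (event_name, parsed_data) pairs from a raw SSE stream string.
    A blank line terminates each event, per the SSE framing rules."""
    event, data = None, ""
    for line in text.splitlines():
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data = line[len("data:"):].strip()
        elif line == "" and event:
            yield event, json.loads(data or "{}")
            event, data = None, ""

chunks = [d["delta"] for e, d in iter_sse_events(raw_sse)
          if e == "response.output_text.delta"]
print("".join(chunks))  # -> Hello
```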

Tool calling:

  • Reuses the existing tool parser infrastructure (_infer_tool_parser, process_tool_calls)
  • Streaming suppresses tool-call markup tokens from text deltas
  • Function call results emitted as ResponseFunctionCallItem output items
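The delta suppression can be pictured as a filter over the token stream: once a `<tool_call>` tag is seen, tokens are withheld until the closing tag. This is a simplified sketch; the PR's real logic also handles tags split across tokens and partial trailing prefixes.

```python
START, END = "<tool_call>", "</tool_call>"

def filter_deltas(tokens):
    """Drop tokens inside <tool_call>...</tool_call> spans so raw tool
    markup never reaches response.output_text.delta events (sketch of
    the PR's suppression logic, assuming whole tags arrive as tokens)."""
    in_tool_call = False
    for tok in tokens:
        if START in tok:
            in_tool_call = True
            continue
        if END in tok:
            in_tool_call = False  # reset on end tag (per the fix commit)
            continue
        if not in_tool_call:
            yield tok

tokens = ["Sure, ", "<tool_call>", '{"name": "get_weather"}',
          "</tool_call>", "done."]
print("".join(filter_deltas(tokens)))  # -> Sure, done.
```

The suppressed span is not lost: as the PR notes, it is parsed after generation and re-emitted as structured `function_call` events.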

Architecture

| File | Purpose |
| --- | --- |
| responses_models.py | Pydantic models for request/response/streaming events |
| responses_store.py | LRU store for previous_response_id replay |
| prompt_utils.py | _normalize_tools() for Responses-to-Chat tool format bridging |
| server.py | Endpoint handler, streaming generator |
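The store behind `previous_response_id` is a bounded, thread-safe LRU: saving or reading an id moves it to the most-recent end, and the oldest entry is evicted past 256 entries. The following is a minimal sketch under those assumptions, not the PR's `responses_store.py`.

```python
import threading
from collections import OrderedDict

class ResponseStore:
    """Bounded LRU store keyed by response id, used to replay prior
    conversation turns for previous_response_id (illustrative sketch)."""
    def __init__(self, max_entries=256):
        self._lock = threading.Lock()
        self._entries = OrderedDict()
        self._max = max_entries

    def save(self, response_id, messages):
        with self._lock:
            self._entries[response_id] = messages
            self._entries.move_to_end(response_id)      # mark most recent
            while len(self._entries) > self._max:
                self._entries.popitem(last=False)       # evict least recent

    def get(self, response_id):
        with self._lock:
            msgs = self._entries.get(response_id)
            if msgs is not None:
                self._entries.move_to_end(response_id)  # refresh recency
            return msgs

store = ResponseStore(max_entries=2)
store.save("resp_1", ["a"]); store.save("resp_2", ["b"]); store.save("resp_3", ["c"])
print(store.get("resp_1"))  # -> None (evicted)
```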

Tests

40 tests across 8 categories:

  • Models (A): Request/response schema validation, computed properties
  • Store (B): LRU eviction, save/get/replay, conversation reconstruction
  • Endpoints (C): Basic text, message lists, instructions, tools, developer role, function_call_output
  • Streaming (D): SSE event sequence, DONE sentinel
  • Prompt Cache (E): Cache state wiring for both endpoints
  • Concurrency (F): Semaphore acquisition in responses path
  • Finish Reason (G): Correct status for tool calls vs plain text
  • JSON Mode (H): response_format acceptance

Run with `python -m pytest mlx_vlm/tests/test_responses_api.py -v`.

eloe and others added 3 commits April 8, 2026 20:32

Add full OpenAI Responses API (/v1/responses) compliance including:

- Structured function_call output items (parsed from model text)
- function_call_output input items for multi-turn tool use
- previous_response_id with LRU response store (256 entries)
- instructions field with developer-to-system role normalization
- "text" type alias accepted alongside "input_text"
- tools/tool_choice passthrough to chat template and response echo
- Streaming SSE with sequence_number and [DONE] sentinel
- incomplete_details for length-truncated responses
- parallel_tool_calls, metadata field support

New files:
- responses_models.py: Self-contained Pydantic models for Responses API
- responses_store.py: Thread-safe LRU store for response replay
- tests/test_responses_api.py: 31 tests (models, store, endpoint, streaming)

Reference: OpenAI Responses API spec and waybarrios/vllm-mlx#214

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

When the model generates <tool_call>...</tool_call> markup during
streaming, detect the tag and suppress those tokens from being sent
as response.output_text.delta events. This prevents raw tool call
XML from being displayed to users (e.g., in Telegram via OpenClaw).

The tool call is still parsed and emitted as structured function_call
events after generation completes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Critical fixes:
- Add recursion depth limit (50) and cycle detection for
  previous_response_id chains to prevent stack overflow/DoS
- Fix tool call suppression: in_tool_call now resets on end tag,
  partial-match only triggers on actual trailing prefix (not any <)
- Sanitize error messages in streaming so internals are no longer leaked

Security fixes:
- Validate image URLs are non-empty before appending (prevents
  load_image('') → 500 errors)

Spec compliance:
- Remove data: [DONE] sentinel (not part of Responses API spec,
  response.completed is the terminal event)

Code quality:
- Fix mutable default lists in Pydantic models (use default_factory)
- Log tool parsing failures instead of silently swallowing
- Collect multi-part assistant messages into single message in replay
- Shorten import aliases to avoid unnecessary verbosity

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
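The recursion-limit and cycle-detection fix described in that commit amounts to guarding the walk over `previous_response_id` links. A sketch under assumed data shapes (a plain dict mapping id to a record with a `"previous"` pointer; the real store and field names will differ):

```python
MAX_DEPTH = 50  # matches the limit described in the fix commit

def replay_chain(store, response_id):
    """Walk previous_response_id links oldest-to-newest, rejecting
    cycles and over-deep chains (sketch of the stack-overflow/DoS fix)."""
    chain, seen = [], set()
    current = response_id
    while current is not None:
        if current in seen or len(chain) >= MAX_DEPTH:
            raise ValueError("previous_response_id chain too deep or cyclic")
        seen.add(current)
        record = store[current]
        chain.append(record)
        current = record.get("previous")
    return list(reversed(chain))  # oldest turn first

# A two-node cycle is rejected instead of recursing forever:
cyclic = {"a": {"previous": "b"}, "b": {"previous": "a"}}
try:
    replay_chain(cyclic, "a")
except ValueError as e:
    print("rejected:", e)
```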