
feat: OpenAI Responses API with structured tool calling and multi-turn support #996

Open
eloe wants to merge 3 commits into Blaizzy:main from eloe:upstream/responses-api

Conversation


@eloe eloe commented Apr 9, 2026

Summary

Implements the OpenAI Responses API (/v1/responses), enabling agent frameworks that target this newer API surface to work with mlx-vlm. This is a growing requirement as frameworks like OpenAI Agents SDK and others migrate from Chat Completions to the Responses API.

Features

Request handling:

  • Accepts input as a string or structured message list (text, image, function_call_output)
  • instructions field mapped to system message
  • previous_response_id for multi-turn conversation replay via in-memory LRU store (256 entries)
  • developer role mapped to system (OpenAI convention)
  • Tool definitions in flat Responses format automatically normalized to nested Chat Completions format for Jinja template compatibility
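The flat-to-nested tool normalization described above can be sketched as follows. This is an illustrative reconstruction, not the PR's actual `_normalize_tools()`: it assumes the flat Responses shape puts `name`, `description`, and `parameters` at the top level, while Chat Completions templates expect them nested under a `function` key.

```python
def normalize_tools(tools):
    """Convert flat Responses-style tool definitions to the nested
    Chat Completions shape expected by most Jinja chat templates.
    (Sketch of what the PR's _normalize_tools() likely does.)"""
    normalized = []
    for tool in tools or []:
        if tool.get("type") == "function" and "function" not in tool:
            # Flat Responses format: name/description/parameters at top level.
            normalized.append({
                "type": "function",
                "function": {
                    "name": tool.get("name"),
                    "description": tool.get("description", ""),
                    "parameters": tool.get("parameters", {}),
                },
            })
        else:
            # Already nested (Chat Completions format): pass through unchanged.
            normalized.append(tool)
    return normalized

flat = [{"type": "function", "name": "get_weather",
         "description": "Look up weather", "parameters": {"type": "object"}}]
print(normalize_tools(flat)[0]["function"]["name"])  # -> get_weather
```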

Response format:

  • Full ResponseObject with output[] containing message and function_call items
  • usage with input_tokens, output_tokens, total_tokens, and input_tokens_details.cached_tokens
  • status: completed or incomplete (with incomplete_details.reason)
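For orientation, a completed response carrying the fields listed above might look roughly like this. The literal values are made up; the field names follow the public Responses API shape, not code from this PR.

```python
# Illustrative (not from the PR): rough shape of a completed ResponseObject.
response = {
    "id": "resp_abc123",
    "object": "response",
    "status": "completed",  # or "incomplete" with incomplete_details.reason
    "output": [
        {
            "type": "message",
            "role": "assistant",
            "content": [{"type": "output_text", "text": "Hello!"}],
        },
        # function_call items appear here when the model invokes a tool
    ],
    "usage": {
        "input_tokens": 12,
        "output_tokens": 3,
        "total_tokens": 15,
        "input_tokens_details": {"cached_tokens": 0},
    },
}
assert response["usage"]["total_tokens"] == (
    response["usage"]["input_tokens"] + response["usage"]["output_tokens"])
```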

Streaming (SSE):

Complete event sequence per the spec:

  1. response.created
  2. response.in_progress
  3. response.output_item.added
  4. response.content_part.added
  5. response.output_text.delta (per token)
  6. response.output_text.done
  7. response.content_part.done
  8. response.output_item.done
  9. response.function_call_arguments.delta / .done (when tools detected)
  10. response.completed
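A client consumes that sequence by parsing `event:`/`data:` pairs off the SSE stream and concatenating the `response.output_text.delta` payloads. The sketch below runs against a hard-coded toy stream; the `delta` payload key and the abbreviated event bodies are assumptions for illustration, not taken from the PR.

```python
import json

# Toy SSE payload following the event sequence above (abbreviated).
raw_sse = """\
event: response.output_text.delta
data: {"delta": "Hel"}

event: response.output_text.delta
data: {"delta": "lo"}

event: response.completed
data: {}

"""

def iter_sse_events(text):
    """Yield (event_name, parsed_data) pairs from a raw SSE stream string.
    A blank line terminates each event, per the SSE framing rules."""
    event, data = None, ""
    for line in text.splitlines():
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data = line[len("data:"):].strip()
        elif line == "" and event:
            yield event, json.loads(data or "{}")
            event, data = None, ""

chunks = [d["delta"] for e, d in iter_sse_events(raw_sse)
          if e == "response.output_text.delta"]
print("".join(chunks))  # -> Hello
```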

Tool calling:

  • Reuses the existing tool parser infrastructure (_infer_tool_parser, process_tool_calls)
  • Streaming suppresses tool-call markup tokens from text deltas
  • Function call results emitted as ResponseFunctionCallItem output items
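The delta suppression can be pictured as a filter over the token stream: once a `<tool_call>` tag is seen, tokens are withheld until the closing tag. This is a simplified sketch; the PR's real logic also handles tags split across tokens and partial trailing prefixes.

```python
START, END = "<tool_call>", "</tool_call>"

def filter_deltas(tokens):
    """Drop tokens inside <tool_call>...</tool_call> spans so raw tool
    markup never reaches response.output_text.delta events (sketch of
    the PR's suppression logic, assuming whole tags arrive as tokens)."""
    in_tool_call = False
    for tok in tokens:
        if START in tok:
            in_tool_call = True
            continue
        if END in tok:
            in_tool_call = False  # reset on end tag (per the fix commit)
            continue
        if not in_tool_call:
            yield tok

tokens = ["Sure, ", "<tool_call>", '{"name": "get_weather"}',
          "</tool_call>", "done."]
print("".join(filter_deltas(tokens)))  # -> Sure, done.
```

The suppressed span is not lost: as the PR notes, it is parsed after generation and re-emitted as structured `function_call` events.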

Architecture

| File | Purpose |
| --- | --- |
| responses_models.py | Pydantic models for request/response/streaming events |
| responses_store.py | LRU store for previous_response_id replay |
| prompt_utils.py | _normalize_tools() for Responses-to-Chat tool format bridging |
| server.py | Endpoint handler, streaming generator |
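The store behind `previous_response_id` is a bounded, thread-safe LRU: saving or reading an id moves it to the most-recent end, and the oldest entry is evicted past 256 entries. The following is a minimal sketch under those assumptions, not the PR's `responses_store.py`.

```python
import threading
from collections import OrderedDict

class ResponseStore:
    """Bounded LRU store keyed by response id, used to replay prior
    conversation turns for previous_response_id (illustrative sketch)."""
    def __init__(self, max_entries=256):
        self._lock = threading.Lock()
        self._entries = OrderedDict()
        self._max = max_entries

    def save(self, response_id, messages):
        with self._lock:
            self._entries[response_id] = messages
            self._entries.move_to_end(response_id)      # mark most recent
            while len(self._entries) > self._max:
                self._entries.popitem(last=False)       # evict least recent

    def get(self, response_id):
        with self._lock:
            msgs = self._entries.get(response_id)
            if msgs is not None:
                self._entries.move_to_end(response_id)  # refresh recency
            return msgs

store = ResponseStore(max_entries=2)
store.save("resp_1", ["a"]); store.save("resp_2", ["b"]); store.save("resp_3", ["c"])
print(store.get("resp_1"))  # -> None (evicted)
```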

Tests

40 tests across 8 categories:

  • Models (A): Request/response schema validation, computed properties
  • Store (B): LRU eviction, save/get/replay, conversation reconstruction
  • Endpoints (C): Basic text, message lists, instructions, tools, developer role, function_call_output
  • Streaming (D): SSE event sequence, DONE sentinel
  • Prompt Cache (E): Cache state wiring for both endpoints
  • Concurrency (F): Semaphore acquisition in responses path
  • Finish Reason (G): Correct status for tool calls vs plain text
  • JSON Mode (H): response_format acceptance

Run with `python -m pytest mlx_vlm/tests/test_responses_api.py -v`.

eloe and others added 3 commits April 8, 2026 20:32

Add full OpenAI Responses API (/v1/responses) compliance including:

- Structured function_call output items (parsed from model text)
- function_call_output input items for multi-turn tool use
- previous_response_id with LRU response store (256 entries)
- instructions field with developer-to-system role normalization
- "text" type alias accepted alongside "input_text"
- tools/tool_choice passthrough to chat template and response echo
- Streaming SSE with sequence_number and [DONE] sentinel
- incomplete_details for length-truncated responses
- parallel_tool_calls, metadata field support

New files:
- responses_models.py: Self-contained Pydantic models for Responses API
- responses_store.py: Thread-safe LRU store for response replay
- tests/test_responses_api.py: 31 tests (models, store, endpoint, streaming)

Reference: OpenAI Responses API spec and waybarrios/vllm-mlx#214

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

When the model generates <tool_call>...</tool_call> markup during
streaming, detect the tag and suppress those tokens from being sent
as response.output_text.delta events. This prevents raw tool call
XML from being displayed to users (e.g., in Telegram via OpenClaw).

The tool call is still parsed and emitted as structured function_call
events after generation completes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Critical fixes:
- Add recursion depth limit (50) and cycle detection for
  previous_response_id chains to prevent stack overflow/DoS
- Fix tool call suppression: in_tool_call now resets on end tag,
  partial-match only triggers on actual trailing prefix (not any <)
- Sanitize error messages in streaming so internals are no longer leaked

Security fixes:
- Validate image URLs are non-empty before appending (prevents
  load_image('') → 500 errors)

Spec compliance:
- Remove data: [DONE] sentinel (not part of Responses API spec,
  response.completed is the terminal event)

Code quality:
- Fix mutable default lists in Pydantic models (use default_factory)
- Log tool parsing failures instead of silently swallowing
- Collect multi-part assistant messages into single message in replay
- Shorten import aliases to avoid unnecessary verbosity

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
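The recursion-limit and cycle-detection fix described in that commit amounts to guarding the walk over `previous_response_id` links. A sketch under assumed data shapes (a plain dict mapping id to a record with a `"previous"` pointer; the real store and field names will differ):

```python
MAX_DEPTH = 50  # matches the limit described in the fix commit

def replay_chain(store, response_id):
    """Walk previous_response_id links oldest-to-newest, rejecting
    cycles and over-deep chains (sketch of the stack-overflow/DoS fix)."""
    chain, seen = [], set()
    current = response_id
    while current is not None:
        if current in seen or len(chain) >= MAX_DEPTH:
            raise ValueError("previous_response_id chain too deep or cyclic")
        seen.add(current)
        record = store[current]
        chain.append(record)
        current = record.get("previous")
    return list(reversed(chain))  # oldest turn first

# A two-node cycle is rejected instead of recursing forever:
cyclic = {"a": {"previous": "b"}, "b": {"previous": "a"}}
try:
    replay_chain(cyclic, "a")
except ValueError as e:
    print("rejected:", e)
```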