
Add AI SDK E2E integration tests #4390

Closed

bendrucker wants to merge 13 commits into pydantic:main from bendrucker:ai-sdk-test-e2e

Conversation

@bendrucker (Contributor) commented Feb 20, 2026

E2E integration tests for AI SDK v6 against a real VercelAIAdapter server, covering text, thinking, tool calls, multi-tool, and the full tool approval lifecycle.

Three of the four approval tests reproduce #4387 and fail without #4388.

Note: This branch is based on fix-ai-sdk-v6-approval-types. I'll rebase onto main after #4388 merges to remove those changes from the diff.

Changes

  • tests/ai_sdk/server.py: Per-agent Starlette server with a single /api/chat route, selected via CLI arg
  • tests/ai_sdk/helpers.ts: TestChat wrapping AI SDK's DefaultChatTransport at /api/chat
  • tests/ai_sdk/test_*.ts: TypeScript tests using AbstractChat for each scenario
  • tests/ai_sdk/test_ai_sdk.py: pytest orchestration that starts a server per test, with glob-based test discovery and agent/test file validation
  • tests/ai_sdk/package.json: ai and @types/node dependencies

Testing

uv run pytest tests/ai_sdk/test_ai_sdk.py -xvs

References

Pre-Review Checklist

  • Any AI generated code has been reviewed line-by-line by the human PR author, who stands by it.
  • No breaking changes in accordance with the version policy.
  • Linting and type checking pass per make format and make typecheck.
  • PR title is fit for the release changelog.

Pre-Merge Checklist

  • New tests for any fix or new behavior, maintaining 100% coverage.
  • Updated documentation for new features and behaviors, including docstrings for API docs.

@github-actions bot added the "size: M (Medium PR, 101-500 weighted lines)" and "chore" labels Feb 20, 2026
@bendrucker changed the title from "Add AI SDK E2E integration tests for tool approval lifecycle" to "Add AI SDK E2E integration tests" Feb 20, 2026
@bendrucker (Contributor, Author) commented

Putting this up for discussion before I spend more time cleaning up any of the messy test lifecycle bits, especially server startup/teardown.

@github-actions bot added the "size: L (Large PR, 501-1500 weighted lines)" label and removed the "size: M (Medium PR, 101-500 weighted lines)" label Feb 24, 2026
The AI SDK v6 tool approval flow uses three additional tool part states
(approval-requested, approval-responded, output-denied) that were not
defined in request_types.py. When the SDK client sent messages with
these states, Pydantic validation rejected them.

Add 6 new models (3 static + 3 dynamic) for the missing states, update
the ToolUIPart/DynamicToolUIPart unions, register them in
_TOOL_PART_TYPES, and handle output-denied as a terminal state in
load_messages().

Narrow the isinstance check to only approval-responded part types.
Output-denied parts are already materialized by load_messages and
must not be re-processed as deferred results.
Reproduces pydantic#4387: the Pydantic AI server rejects messages with
approval-requested, approval-responded, and output-denied tool states
that the AI SDK v6 client produces during the tool approval lifecycle.

The harness starts a real Starlette server with VercelAIAdapter and
drives it from a node:test script using the AI SDK's AbstractChat.

Rename tests/js_integration/ to tests/ai_sdk/, convert to TypeScript,
and add test cases for approval, denial, and denial with reason.

…narios

Split tests into separate files per scenario with shared helpers. Use
TestModel for natural agent behavior instead of hand-crafted stream deltas.

Exclude tests/ai_sdk from normal pytest collection via norecursedirs.

- Use filter+find instead of for loop in tool approval test
- Use every() instead of filter+count in multi-tool test
- Use Array.prototype.with() in replaceMessage

Server takes agent name as CLI arg and serves it at /api/chat.
TestChat hardcodes the transport URL, so tests just use new TestChat().
- server_url fixture is now function-scoped, starts a server per test
  with the agent name derived from the test filename
- Discover test files via glob instead of hardcoded list
- Add test_agents_match_test_files to catch mismatches

Make the approval agent retry after first denial so multi-step
denial paths are exercised. Add tests for deny-retry-approve and
deny-retry-deny flows. These tests currently fail, reproducing the
bug where iter_tool_approval_responses yields output-denied parts.
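The discovery and mismatch check described above can be sketched as follows. The function names and the test_&lt;agent&gt;.ts filename convention are assumptions for illustration, not the exact test_ai_sdk.py code:

```python
# Hypothetical sketch of glob-based scenario discovery and validation.
from pathlib import Path


def discover_test_files(root: Path) -> list[Path]:
    """Find every TypeScript scenario file instead of keeping a hardcoded list."""
    return sorted(root.glob("test_*.ts"))


def agent_name(test_file: Path) -> str:
    """Derive the agent a test needs from its filename: test_approval.ts -> approval."""
    return test_file.stem.removeprefix("test_")


def check_agents_match(test_files: list[Path], agents: set[str]) -> None:
    """Fail loudly when test files and registered agents drift apart."""
    expected = {agent_name(f) for f in test_files}
    missing = expected - agents
    extra = agents - expected
    if missing or extra:
        raise AssertionError(f"missing agents: {missing}, unused agents: {extra}")
```

A function-scoped fixture can then call agent_name() on the current test file and pass the result as the server's CLI arg, so adding a scenario is just adding one test_*.ts file and one agent.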
@DouweM (Collaborator) commented Mar 4, 2026

@bendrucker Nice! I agree something like this is very much worth having.

Would it make sense to live in or use https://github.com/pydantic/ai-chat-ui, though? That's the frontend we use for https://ai.pydantic.dev/web/. We could have a CI task in this pydantic-ai repo that triggers a build on that repo pointing at the latest pydantic-ai.

But I'm not a frontend developer, and there may be a good reason why that won't work and this should really live on the Pydantic AI side, using a separate test harness rather than that real app.

@bendrucker (Contributor, Author) commented

Would it make sense to live in or use https://github.com/pydantic/ai-chat-ui, though?

Yeah I could see that. That then opens up 2 avenues for testing:

  1. Running headless tests in Node.js (this)
  2. Automated browser testing of the AI Chat UI app (https://playwright.dev/)

If this moves to the ai-chat-ui repo, 1 can slim down and focus on the more specific integration tests (e.g., various combinations of tool approval ordering) while 2 covers the overall "does it work."

@DouweM (Collaborator) commented Mar 6, 2026

@bendrucker That sounds reasonable, would you be up for implementing (some subset of) that there? :)

@bendrucker (Contributor, Author) commented

Yes, on it!

Opened pydantic/ai-chat-ui#16, and #4670 to point to it.

@bendrucker bendrucker closed this Mar 16, 2026