Enable ACPAgent on RemoteRuntime API via ACP conversation endpoints#2465
Enable ACPAgent on RemoteRuntime API via ACP conversation endpoints#2465simonrosenberg merged 8 commits intomainfrom
Conversation
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
Python API breakage checks — ✅ PASSEDResult: ✅ PASSED |
REST API breakage checks (OpenAPI) — ✅ PASSEDResult: ✅ PASSED |
Coverage Report •
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
all-hands-bot
left a comment
There was a problem hiding this comment.
🟡 Taste Rating: Acceptable
Overall Assessment: Clean separation of contracts (v1 vs ACP) with good backward compatibility strategy. The data structure choices (discriminated unions with Tags) are solid. However, this PR changes agent behavior in ways that will definitely affect eval benchmarks, so it needs human review before merge.
Key Insight: The architecture correctly avoids the "widen the base contract and break everyone" trap by introducing parallel ACP endpoints. But the FinishAction-every-turn semantic change means this PR should not be auto-approved - someone needs to validate benchmark impact.
Verdict: 🔴 Needs human review (eval behavior changes)
openhands-agent-server/openhands/agent_server/conversation_service.py
Outdated
Show resolved
Hide resolved
Co-authored-by: openhands <openhands@all-hands.dev>
all-hands-bot
left a comment
There was a problem hiding this comment.
🟡 Taste Rating: Acceptable
Overall Assessment: Clean separation of contracts (v1 vs ACP) with good backward compatibility strategy. The data structure choices (discriminated unions with Tags) are solid. However, this PR changes agent behavior in ways that will definitely affect eval benchmarks, so it needs human review before merge.
Key Insight: The architecture correctly avoids the "widen the base contract and break everyone" trap by introducing parallel ACP endpoints. But the ACPAgent's turn-based semantics (emitting FinishAction after each step) is a fundamental behavioral difference from the base Agent that must be verified against benchmarks before shipping.
Critical Review Points
🟠 Important - Eval Risk (.github/workflows/run-eval.yml)
This PR adds agent_type parameter to the eval workflow, which is good infrastructure. However, ACPAgent has fundamentally different step semantics than the base Agent (emits FinishAction after each turn, handles tool calls differently via ACP protocol).
Per repo guidelines: "Do NOT submit an APPROVE review when the PR changes agent behavior or anything that could plausibly affect benchmark/evaluation performance."
Action Required: Human maintainer should run lightweight evals comparing ACPAgent vs base Agent on a small benchmark subset (e.g., 10 SWE-bench instances) before merging. Add the integration-test label to trigger benchmark runs.
🟢 Acceptable - Data Structure (openhands-agent-server/openhands/agent_server/models.py)
Using Discriminator with Tag annotations for ACPEnabledAgent is the right choice for polymorphic agent types. The explicit kind field makes deserialization unambiguous and plays well with OpenAPI schema generation.
This is cleaner than the alternative approaches (manual validation, union without discriminator) and follows Pydantic best practices.
🟡 Suggestion - Type Safety (openhands-agent-server/openhands/agent_server/conversation_service.py)
The cast() usage in search_conversations (line 172) is necessary given the contract constraints, but it's a bit fragile. The type checker trusts you that all items are ConversationInfo.
Alternative: Return a typed tuple from _search_conversations, but this creates its own complexity with union return types. The current approach is pragmatic - just be aware the cast bypasses type safety.
🟢 Acceptable - Semantic Difference (openhands-sdk/openhands/sdk/agent/acp_agent.py)
Emitting FinishAction after each step() is a documented behavioral difference from the base Agent. For ACP, one step = one complete remote assistant turn, so this makes sense.
Good: This was called out in a previous review and is now documented both at the class level and inline.
Eval Impact: This changes the event stream compared to base Agent, which could affect:
- Metrics counting (more finish events)
- Downstream consumers expecting different event patterns
- Benchmark evaluation logic
Must be verified in evals before shipping.
🟡 Suggestion - Test Coverage (tests/agent_server/test_conversation_router_acp.py)
The tests verify the OpenAPI contract (request/response schemas) but don't test real ACP behavior. They mock conversation_service.start_acp_conversation and assert it was called.
What's missing: An integration test that:
- Starts a real (or fake-but-realistic) ACP subprocess
- Creates a conversation via the API
- Sends a message and runs to completion
- Verifies the event stream contains expected ACPToolCallEvents + FinishAction
The example 09_acp_agent_with_remote_runtime.py provides some E2E coverage, but in-tree integration tests would be better for catching regressions.
Acceptable for now if this is meant as a fast follow-up for integration tests, but flag it as tech debt.
Verdict
❌ Needs human review + lightweight evals - Not because the code is broken, but because agent behavior changes require eval verification per repo policy.
Next Steps:
- Add
integration-testlabel to trigger benchmark runs - Run lightweight eval comparing ACPAgent vs base Agent on ~10 SWE-bench instances
- Review benchmark results for unexpected regressions
- If benchmarks look good, merge with confidence
|
(OpenHands-GPT-5.4) I reviewed the PR description and diff end-to-end. Overall, the route split looks like the right way to replay #2190 without widening the legacy One thing I noticed that seems worth clarifying:
So after this change, ACP conversations appear to become invisible to the only count endpoint we expose, even though counting itself does not require the legacy Was that exclusion intentional? If not, I think the simplest fix would be to keep |
|
(OpenHands-GPT-5.4) Here’s a fuller review summary of PR #2465. Overall assessmentI think the core design is good. It directly addresses the real regression from #2190:
That is a much cleaner approach than reintroducing ACP support by changing the existing What the PR does1. Adds a parallel ACP conversation APINew routes under:
These are the four operations that were affected by the 2. Keeps legacy v1 REST contract stableThe existing
So legacy clients can still:
3. Reintroduces ACPAgent remote-runtime supportThe PR replays the useful ACP runtime work from #2190, including:
4. Updates
|
|
HUMAN: one tiny thing: maybe we shouldn’t use “legacy” for the name of the Agent endpoints; they’re not legacy, they’re … standard? regular? 😅 Update: actually, not sure. Since the new endpoints encode the discriminator, maybe we could use them at some point going forward, instead of the old ones? That would mean to deprecate the existing endpoints for a time, then remove them, and keep only the acp/ ones. (if I saw this right that it will work for regular conversations). 🤔 But unless we have strong feelings about this, I think... I think right now I wouldn't. The existing endpoints work, they're essential, they're well known, they don't have |
ACP Validation Results ✅Successfully validated the ACP implementation with benchmarks using:
Results
ACP Evidence
commit0 NoteThe 2 incomplete commit0 instances failed due to MCP JSONRPC parsing failures (infrastructure issue), not ACP bugs: This is a runtime pod stability issue unrelated to the ACP implementation. Workflow Run IDs
Conclusion: The ACP implementation is validated and working correctly. ✅ |
|
Following up on point 4 from my previous summary (“Model duplication maintenance cost”): I think there is a lower-duplication alternative here if the main goal is just to keep the REST Instead of introducing class ConversationInfo(ConversationState):
agent: Agent
title: str | None = None
metrics: MetricsSnapshot | None = None
created_at: datetime
updated_at: datetime
class ACPConversationInfo(ConversationState):
agent: ACPEnabledAgent
title: str | None = None
metrics: MetricsSnapshot | None = None
created_at: datetime
updated_at: datetimeI sanity-checked this locally in a small schema probe, and Pydantic does narrow the inherited That would keep the important separation:
but avoids having to manually re-copy the rest of the conversation-state surface into Tradeoff: this keeps the API DTOs more tightly coupled to |
|
HUMAN: just a quick pass through the agent's posts, it seems maybe we want to look at these just to be sure it is what we want: @simonrosenberg
I'm not sure it's an issue, WDYT? #2465 (comment)
The code suggests a potential bad path for “regular client tries to attach to existing ACP conversation ID”. Maybe we could reject it explicitly (with an error) than fail via assertion.
This is OK, we'll deal with it
WDYT of this? #2465 (comment)
This is OK |
|
@enyst |
Co-authored-by: openhands <openhands@all-hands.dev>
I implemented 1 and 2. Making a PR in the docs right now + making an issue for 4. Also I am triggering small benchmarks with the latest code |
Co-authored-by: openhands <openhands@all-hands.dev>
Summary
main/api/conversationsREST contract stable and Agent-only/api/acp/conversationsRemoteConversationto the ACP endpoints only for ACP conversation creation and conversation-info readsWhy this approach
#2190 was useful, but it widened the public v1 REST contract by registering
ACPAgentat theAgentBaseboundary. That changed the OpenAPI schema and caused older clients that posted a plainagentobject withoutkindto start failing with422.This PR keeps the legacy contract intact and limits the new polymorphic contract to the endpoints that actually need it:
POST /api/acp/conversationsGET /api/acp/conversations/{conversation_id}GET /api/acp/conversationsGET /api/acp/conversations/searchEverything else stays on the legacy
/api/conversationssurface.Included from #2190
Compatibility notes
/api/conversationsremains backward-compatible and keeps the oldAgentrequest/response shape/api/acp/conversationsRemoteConversationuses ACP endpoints only where the contract differs; events, run, pause, secrets, and related actions stay on the existing routesConversationInfopayload for legacyAgentconversations and use the ACP shape only for ACP conversationsTest plan
uv run pytest tests/agent_server/test_conversation_router.py tests/agent_server/test_conversation_router_acp.py tests/agent_server/test_openapi_discriminator.py tests/sdk/conversation/remote/test_remote_conversation.py tests/sdk/agent/test_acp_agent.pyuv run pytest tests/sdk/agent/test_acp_agent.py tests/agent_server/test_conversation_service.pyuv run pre-commit run --files openhands-agent-server/openhands/agent_server/api.py openhands-agent-server/openhands/agent_server/conversation_router_acp.py openhands-agent-server/openhands/agent_server/conversation_service.py openhands-agent-server/openhands/agent_server/models.py openhands-sdk/openhands/sdk/agent/acp_agent.py openhands-sdk/openhands/sdk/conversation/impl/remote_conversation.py tests/agent_server/test_conversation_router.py tests/agent_server/test_conversation_router_acp.py tests/agent_server/test_conversation_service.py tests/agent_server/test_openapi_discriminator.py tests/sdk/agent/test_acp_agent.py tests/sdk/conversation/remote/test_remote_conversation.pySupersedes the accidentally closed #2461.
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
eclipse-temurin:17-jdknikolaik/python-nodejs:python3.13-nodejs22golang:1.21-bookwormPull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64 docker pull ghcr.io/openhands/agent-server:37a17aa-pythonRun
All tags pushed for this build
About Multi-Architecture Support
37a17aa-python) is a multi-arch manifest supporting both amd64 and arm6437a17aa-python-amd64) are also available if neededDocs