WIP: Additional evals for info-gathering agent skills by agentydragon · Pull Request #855 · agentydragon/ducktape

agentydragon · 2026-03-16T02:52:27Z

Summary

This PR adds a comprehensive evaluation framework for testing information-gathering agent skills, including four distinct evaluation scenarios (movies, medical diagnosis, tech support, and apartment search) along with supporting simulator configurations.

Key Changes

Evaluation Harness Framework: Added reusable evaluation infrastructure (harness.py) that provides:
- Common argument parsing and client setup
- Token tracking and cost calculation
- Tool call resolution and response logging
- Conversation evaluation runner for multi-turn interactions
Four Evaluation Scenarios:
- Movies (movies.py): Tests recommendation optimization, where the agent recommends movies through a recommend_movie tool to maximize a discounted enjoyment score
- Medical (medical.py): Diagnostic reasoning with two variants (IIH and GERD) testing history-taking and test ordering efficiency
- Tech Anchoring (tech_anchoring.py): Tests ability to diagnose tech problems and avoid anchoring bias (ISP outage vs. browser extension)
- Apartments (apartments.py): Tests preference elicitation and ranking for apartment selection
Simulator Configurations: Added detailed simulator prompts for each scenario that:
- Define patient/user personas with hidden ground truth
- Specify symptom/preference responses and test results with costs
- Include scoring logic and success criteria
- Support natural language interaction patterns
Build Configuration: Added BUILD.bazel files marking evaluations as disabled (.py.ignore extension) pending integration
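To illustrate the token tracking and cost calculation the harness provides, here is a minimal sketch. The class name `TokenTracker`, its fields, and the per-token prices are all hypothetical and not taken from harness.py; the actual implementation in this PR may be structured differently.

```python
from dataclasses import dataclass


@dataclass
class TokenTracker:
    """Accumulates token usage across turns and converts it to a dollar cost."""

    # Hypothetical prices in USD per million tokens; not taken from the PR.
    input_price_per_mtok: float = 3.0
    output_price_per_mtok: float = 15.0
    input_tokens: int = 0
    output_tokens: int = 0

    def record(self, input_tokens: int, output_tokens: int) -> None:
        """Add one response's usage to the running totals."""
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens

    @property
    def cost_usd(self) -> float:
        """Total cost implied by the running totals."""
        return (
            self.input_tokens * self.input_price_per_mtok
            + self.output_tokens * self.output_price_per_mtok
        ) / 1_000_000


# Two simulated model responses in one conversation.
tracker = TokenTracker()
tracker.record(input_tokens=1_000, output_tokens=200)
tracker.record(input_tokens=1_500, output_tokens=300)
print(f"{tracker.input_tokens} in, {tracker.output_tokens} out, ${tracker.cost_usd:.4f}")
```

Keeping the totals on one object per conversation would let each scenario report cost alongside its own score without duplicating the arithmetic.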

Implementation Details

- All evaluation scripts follow a consistent pattern: load the skill text, build the agent system prompt, run the conversation loop with tool handling, compute results, and save logs.
- Simulators provide realistic responses with cost tracking (medical tests, tool usage) to evaluate efficiency.
- Scoring varies by scenario: discounted sum (movies), cost minimization (medical), binary correctness (tech), and preference recovery (apartments).
- All evaluations support extended thinking via the thinking_budget parameter.
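As a rough illustration of the discounted-sum scoring used for the movies scenario, the sketch below weights each recommendation's enjoyment score by a geometric discount. The function name `discounted_sum`, the default discount factor of 0.9, and the choice to leave the first recommendation undiscounted are assumptions; the PR description does not specify them.

```python
from typing import Sequence


def discounted_sum(scores: Sequence[float], gamma: float = 0.9) -> float:
    """Sum scores with a geometric discount: scores[t] is weighted by gamma**t.

    gamma is a hypothetical discount factor; the first score (t = 0) is
    counted at full weight, and later ones count for progressively less.
    """
    return sum(score * gamma**t for t, score in enumerate(scores))


# Three equally enjoyable recommendations, discounted by half each turn:
# 10 * 1 + 10 * 0.5 + 10 * 0.25
print(discounted_sum([10, 10, 10], gamma=0.5))  # 17.5
```

Under a score like this, an agent is rewarded for surfacing good recommendations early rather than spending many turns on questions first.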

https://claude.ai/code/session_01JEBrBc2m9q7b2MgL5p4dJ1

Squashed from 3 commits:
- Infogathering skill WIP
- Eval: rewrite 20Q to LiteLLM, native tool calling, type fixes (#843)
- Add async support and scratch container tools to 20 Questions eval (#848)

Infogathering skill with 20 Questions evaluation

517d4d8

agentydragon changed the title ~~Add evaluation harness for info-gathering agent skills~~ WIP: Additional evals for info-gathering agent skills Mar 16, 2026

agentydragon wants to merge 1 commit into devel from claude/squash-rebase-infogathering-hsykF

2 participants
