WIP: Additional evals for info-gathering agent skills #855

Draft
agentydragon wants to merge 1 commit into devel from claude/squash-rebase-infogathering-hsykF

@agentydragon
Owner

Summary

This PR adds an evaluation framework for testing information-gathering agent skills: a shared harness, four evaluation scenarios (movies, medical diagnosis, tech support, and apartment search), and the simulator configurations that support them.

Key Changes

  • Evaluation Harness Framework: Added reusable evaluation infrastructure (harness.py) that provides:

    • Common argument parsing and client setup
    • Token tracking and cost calculation
    • Tool call resolution and response logging
    • Conversation evaluation runner for multi-turn interactions
  • Four Evaluation Scenarios:

    • Movies (movies.py): Tests movie recommendation optimization where an agent recommends movies to maximize discounted enjoyment score using a recommend_movie tool
    • Medical (medical.py): Diagnostic reasoning with two variants (IIH and GERD) that test the efficiency of history-taking and test ordering
    • Tech Anchoring (tech_anchoring.py): Tests ability to diagnose tech problems and avoid anchoring bias (ISP outage vs. browser extension)
    • Apartments (apartments.py): Tests preference elicitation and ranking for apartment selection
  • Simulator Configurations: Added detailed simulator prompts for each scenario that:

    • Define patient/user personas with hidden ground truth
    • Specify symptom/preference responses and test results with costs
    • Include scoring logic and success criteria
    • Support natural language interaction patterns
  • Build Configuration: Added BUILD.bazel files marking evaluations as disabled (.py.ignore extension) pending integration
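
As an illustration of the token tracking and cost calculation item above, here is a minimal sketch of what such a helper could look like. It is not the code from harness.py: the `TokenTracker` class, its field names, and the per-million-token prices are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class TokenTracker:
    """Accumulates token usage across turns and converts it to a dollar cost.

    Hypothetical helper; the real harness.py may be structured differently.
    """

    # Assumed prices in USD per million tokens (placeholders, not from the PR).
    input_price_per_mtok: float = 3.0
    output_price_per_mtok: float = 15.0
    input_tokens: int = 0
    output_tokens: int = 0

    def add(self, input_tokens: int, output_tokens: int) -> None:
        """Record the usage reported for one model call."""
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens

    def cost_usd(self) -> float:
        """Total cost of all recorded calls at the configured prices."""
        return (
            self.input_tokens * self.input_price_per_mtok
            + self.output_tokens * self.output_price_per_mtok
        ) / 1_000_000


tracker = TokenTracker()
tracker.add(input_tokens=1_000, output_tokens=200)
tracker.add(input_tokens=1_500, output_tokens=300)
print(f"{tracker.input_tokens} in, {tracker.output_tokens} out, ${tracker.cost_usd():.4f}")
```

Keeping the prices as fields rather than constants would let each evaluation run pass in the rates for whichever model it targets.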

Implementation Details

  • All evaluation scripts follow a consistent pattern: load skill text, build agent system prompt, run conversation loop with tool handling, compute results, and save logs
  • Simulators provide realistic responses with cost tracking (medical tests, tool usage) to evaluate efficiency
  • Scoring varies by scenario: discounted sum (movies), cost minimization (medical), binary correctness (tech), preference recovery (apartments)
  • All evaluations support extended thinking via the thinking_budget parameter
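
To make the "discounted sum" scoring for the movies scenario concrete, the sketch below shows one plausible form of it. The function name `discounted_score`, the discount factor `gamma`, and the sample enjoyment values are assumptions for illustration; the PR description does not state the actual formula or parameters used in movies.py.

```python
def discounted_score(enjoyments: list[float], gamma: float = 0.9) -> float:
    """Sum of per-recommendation enjoyment, discounted by position.

    The recommendation at turn t (0-indexed) contributes enjoyment * gamma**t,
    so good recommendations made early are worth more than the same ones
    made late. gamma is an assumed parameter, not taken from the PR.
    """
    return sum(e * gamma**t for t, e in enumerate(enjoyments))


# Same two movies, recommended in different orders (hypothetical scores).
good_first = discounted_score([8.0, 2.0], gamma=0.5)
good_last = discounted_score([2.0, 8.0], gamma=0.5)
print(good_first, good_last)
```

Under this kind of scoring, an agent that spends early turns on poor recommendations is penalized even if it eventually finds a good one, which is the pressure toward efficient information gathering that the evaluation is meant to apply.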

https://claude.ai/code/session_01JEBrBc2m9q7b2MgL5p4dJ1

Squashed from 3 commits:
- Infogathering skill WIP
- Eval: rewrite 20Q to LiteLLM, native tool calling, type fixes (#843)
- Add async support and scratch container tools to 20 Questions eval (#848)
@agentydragon changed the title from "Add evaluation harness for info-gathering agent skills" to "WIP: Additional evals for info-gathering agent skills" on Mar 16, 2026