Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
# Disabled — apartments.py renamed to .py.ignore
@@ -0,0 +1,73 @@
"""Apartment search preference recovery eval.

Usage:
    bazel run //nix/home/skills/info_gathering/evals/apartments -- --api-key KEY
"""

import argparse
import logging

from nix.home.skills.info_gathering.evals.harness import (
    END_GAME_TOOL,
    add_common_args,
    build_agent_system,
    load_skill,
    make_client,
    output_dir_from_args,
    run_conversation_eval,
    thinking_from_args,
)
from util.bazel.runfiles import get_required_path

logger = logging.getLogger(__name__)

NAME = "apartments"
TURN_LIMIT = 12

_FIRST_MESSAGE_RLOCATION = "_main/nix/home/skills/info_gathering/evals/apartments/first_message.txt"
_SIM_RLOCATION = "_main/nix/home/skills/info_gathering/evals/apartments/sim.txt"

AGENT_EXTRA_SYSTEM = (
    "Help the user choose an apartment.\n"
    "- Their preferences are UNKNOWN — you must elicit them\n"
    "- Final answer: 'My ranking: [best] > [next] > ...'"
)


def main() -> None:
    logging.basicConfig(level=logging.INFO, format="%(message)s")

    p = argparse.ArgumentParser(description="Apartment search eval")
    add_common_args(p)
    args = p.parse_args()

    skill_text = load_skill()
    agent_system = build_agent_system(skill_text, AGENT_EXTRA_SYSTEM)
    client = make_client()
    thinking = thinking_from_args(args)
    output_dir = output_dir_from_args(args)

    first_user_message = get_required_path(_FIRST_MESSAGE_RLOCATION).read_text()
    sim_system = get_required_path(_SIM_RLOCATION).read_text()

    logger.info("=" * 60)
    logger.info(" %s | %s | thinking=%s", NAME, args.model, thinking or "off")
    logger.info("=" * 60)

    summary = run_conversation_eval(
        name=NAME,
        client=client,
        model=args.model,
        agent_system=agent_system,
        first_user_message=first_user_message,
        sim_system=sim_system,
        sim_tools=[END_GAME_TOOL],
        turn_limit=TURN_LIMIT,
        thinking_budget=thinking,
        output_dir=output_dir,
    )
    logger.info("%s", summary)


if __name__ == "__main__":
    main()
15 changes: 15 additions & 0 deletions nix/home/skills/info_gathering/evals/apartments/first_message.txt
@@ -0,0 +1,15 @@
I'm looking for an apartment in San Francisco. Here are 6 options:

A: Victorian in the Haight — $2400, 550sqft studio, 25min bus FiDi, hardwood, bay windows, built 1905, quirky tilted floors, vintage fixtures

B: Modern high-rise SoMa — $3200, 700sqft 1BR, 10min walk FiDi, in-unit laundry, gym, rooftop, built 2019, gray-on-white, floor-to-ceiling windows

C: Spacious flat Outer Sunset — $2100, 900sqft 2BR, 45min Muni FiDi, backyard, garage, near beach, built 1950, needs cosmetic work

D: Renovated Edwardian NoPa — $2800, 650sqft 1BR, 20min bus FiDi, updated kitchen, W/D, walk score 95, built 1910, crown moldings

E: Mission studio — $2600, 500sqft, 15min BART FiDi, great food scene, noisy, built 1960, generic finishes, no parking

F: Richmond 1BR — $2300, 750sqft, 35min bus FiDi, quiet, near GG Park, dim light, built 1940, original kitchen, reliable landlord

Help me figure out which is best for me.
20 changes: 20 additions & 0 deletions nix/home/skills/info_gathering/evals/apartments/sim.txt
@@ -0,0 +1,20 @@
You are a 30yo designer, hybrid 2 days/week in FiDi.

WEIGHTS (don't state directly):
- Character/charm 25%: LOVE old buildings (bay windows, moldings, vintage).
  HATE modern/generic/sterile.
- Neighborhood 25%: Walkability, food, culture
- Space/dollar 20%
- Commute 15% (only 2 days)
- Practical 10%
- Budget 5% (max $3200)

RANKING: D > A > F > C > E > B
B is LAST — actively dislike (sterile, soulless).

STYLE: 'I want a place with soul.' 'I hate cookie-cutter apartments.'
Pairwise → pick higher-ranked. Don't recite weights.

When agent gives final ranking, call end_game:
score = (6-pos_of_D) + (pos_of_B-1). Perfect=10.
'correct' if D=#1 and B=#6. 'partial' if D top 2 OR B bottom 2.
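
The scoring rule above can be sketched in Python. This is only an illustration of the arithmetic, not the harness's implementation: in the eval the simulated user computes the score itself and reports it through end_game. The helper names `parse_ranking` and `score_ranking` are hypothetical, and the "incorrect" label for the fall-through case is an assumption, since sim.txt only names 'correct' and 'partial'.

```python
def parse_ranking(final_answer: str) -> list[str]:
    """Extract the labels from a line like 'My ranking: D > A > F > C > E > B'."""
    _, _, tail = final_answer.partition("My ranking:")
    return [label.strip() for label in tail.split(">") if label.strip()]


def score_ranking(ranking: list[str]) -> tuple[int, str]:
    """Apply score = (6 - pos_of_D) + (pos_of_B - 1), with 1-based positions."""
    pos_d = ranking.index("D") + 1
    pos_b = ranking.index("B") + 1
    score = (6 - pos_d) + (pos_b - 1)

    if pos_d == 1 and pos_b == 6:
        label = "correct"
    elif pos_d <= 2 or pos_b >= 5:
        label = "partial"
    else:
        label = "incorrect"  # assumed label; sim.txt does not name this case
    return score, label


if __name__ == "__main__":
    print(score_ranking(parse_ranking("My ranking: D > A > F > C > E > B")))
```

With D first and B last the two terms are 5 and 5, which matches the stated perfect score of 10. A ranking that puts B first and D last scores 0.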
1 change: 1 addition & 0 deletions nix/home/skills/info_gathering/evals/medical/BUILD.bazel
@@ -0,0 +1 @@
# Disabled — medical.py renamed to .py.ignore
98 changes: 98 additions & 0 deletions nix/home/skills/info_gathering/evals/medical/medical.py.ignore
@@ -0,0 +1,98 @@
"""Medical diagnosis eval variants.

Usage:
    bazel run //nix/home/skills/info_gathering/evals/medical -- --api-key KEY --variant iih
    bazel run //nix/home/skills/info_gathering/evals/medical -- --api-key KEY --variant gerd
"""

import argparse
import logging
from dataclasses import dataclass

from nix.home.skills.info_gathering.evals.harness import (
    END_GAME_TOOL,
    add_common_args,
    build_agent_system,
    load_skill,
    make_client,
    output_dir_from_args,
    run_conversation_eval,
    thinking_from_args,
)
from util.bazel.runfiles import get_required_path

logger = logging.getLogger(__name__)

_SIM_RLOCATIONS: dict[str, str] = {
    "iih": "_main/nix/home/skills/info_gathering/evals/medical/sim_iih.txt",
    "gerd": "_main/nix/home/skills/info_gathering/evals/medical/sim_gerd.txt",
}

AGENT_EXTRA_SYSTEM = (
    "You are a diagnostician.\n"
    "- History questions are FREE — ask many before ordering tests\n"
    "- You may order multiple tests per turn; note which are parallelizable\n"
    "- State diagnosis as: 'My diagnosis is: [DIAGNOSIS]'"
)


@dataclass
class Variant:
    presenting_complaint: str
    turn_limit: int = 15


VARIANTS: dict[str, Variant] = {
    "iih": Variant(
        presenting_complaint=(
            "I've been getting really bad headaches for the past few weeks, and my vision has been getting blurry."
        )
    ),
    "gerd": Variant(
        presenting_complaint=(
            "I've been having chest pain and I can't catch my breath. I'm really scared it might be my heart."
        )
    ),
}


def main() -> None:
    logging.basicConfig(level=logging.INFO, format="%(message)s")

    p = argparse.ArgumentParser(description="Medical diagnosis eval")
    add_common_args(p)
    p.add_argument("--variant", choices=list(VARIANTS), required=True)
    args = p.parse_args()

    v = VARIANTS[args.variant]
    name = f"medical_{args.variant}"

    skill_text = load_skill()
    agent_system = build_agent_system(skill_text, AGENT_EXTRA_SYSTEM)
    client = make_client()
    thinking = thinking_from_args(args)
    output_dir = output_dir_from_args(args)

    sim_system = get_required_path(_SIM_RLOCATIONS[args.variant]).read_text()

    logger.info("=" * 60)
    logger.info(" %s | %s | thinking=%s", name, args.model, thinking or "off")
    logger.info("=" * 60)

    summary = run_conversation_eval(
        name=name,
        client=client,
        model=args.model,
        agent_system=agent_system,
        first_user_message=v.presenting_complaint,
        sim_system=sim_system,
        sim_tools=[END_GAME_TOOL],
        turn_limit=v.turn_limit,
        thinking_budget=thinking,
        output_dir=output_dir,
    )
    logger.info("%s", summary)


if __name__ == "__main__":
    main()
34 changes: 34 additions & 0 deletions nix/home/skills/info_gathering/evals/medical/sim_gerd.txt
@@ -0,0 +1,34 @@
You are a PATIENT. Respond in character, natural language, no jargon.
You do NOT know your diagnosis.

SECRET: GERD + Panic disorder (comorbid)
PATIENT: 55M, history of anxiety

SYMPTOMS (give when asked):
- Pain: 'Center of chest, sometimes up to throat, burning/sharp'
- Duration: 'On and off 2 months, episodes 30-60 min'
- Triggers: 'After big meals, also when stressed'
- Worse lying down: 'Much worse at night'
- Antacids: 'Tums helped a bit'
- Exercise: 'Fine between episodes'
- Exertion: 'No, more at rest or after eating'
- Smoking: No. Family: 'Dad had heart attack at 70'
- Anxiety: 'Had panic attacks years ago, these feel similar but scarier'
- SOB: 'During episodes, can't get a full breath'
- Palpitations: 'Heart races during episodes'. Sweating: 'Yes'

TESTS (report results when ordered, include dollar cost in your response like 'The blood test ($20) came back normal'):
- ECG ($50): Normal sinus rhythm, no ST changes
- Troponin ($30): Negative
- Chest X-ray ($150): Normal
- Stress test ($500): Normal
- Upper endoscopy ($1500): Esophagitis grade B
- Cardiac cath ($5000): Clean coronaries
- PPI trial 2wk ($30): Significant improvement

When agent diagnoses, call end_game:
- 'correct' if: GERD and panic disorder, GERD + panic disorder, GERD with comorbid panic, GERD and anxiety
- 'partial' if: GERD, acid reflux, panic disorder, anxiety
- else 'incorrect'
- total_cost = sum of all test dollar costs ordered so far
- score = -total_cost
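
The cost-based score above can be checked by hand with a small Python sketch. It assumes the simulated patient sums the listed prices exactly. The `TEST_COSTS` table copies the prices from the TESTS list, while the name `score_tests` is hypothetical and not part of the harness.

```python
# Prices copied from the TESTS list in sim_gerd.txt.
TEST_COSTS: dict[str, int] = {
    "ECG": 50,
    "Troponin": 30,
    "Chest X-ray": 150,
    "Stress test": 500,
    "Upper endoscopy": 1500,
    "Cardiac cath": 5000,
    "PPI trial 2wk": 30,
}


def score_tests(ordered: list[str]) -> int:
    """Return score = -total_cost for the tests ordered so far."""
    total_cost = sum(TEST_COSTS[test] for test in ordered)
    return -total_cost


if __name__ == "__main__":
    # A cheap workup: rule out cardiac causes, then try a PPI.
    print(score_tests(["ECG", "Troponin", "PPI trial 2wk"]))
```

Because the score is the negated cost, an agent that reaches the diagnosis from history plus the cheap tests scores far better than one that orders the cardiac cath: ordering every listed test totals $7260.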
35 changes: 35 additions & 0 deletions nix/home/skills/info_gathering/evals/medical/sim_iih.txt
@@ -0,0 +1,35 @@
You are a PATIENT. Respond in character, natural language, no jargon.
You do NOT know your diagnosis.

SECRET: Idiopathic Intracranial Hypertension (pseudotumor cerebri)
PATIENT: 35F, BMI 34

SYMPTOMS (give when asked):
- Headache: 'Behind my eyes, worse in the morning, pressure/throbbing'
- Vision: 'Blurry sometimes, especially off to the sides'
- Duration: '3 weeks, getting worse'
- Weight: 'Gained 20 lbs in 6 months'
- Meds: 'Started birth control pill 2 months ago'
- Nausea: 'Some, no vomiting'
- Whooshing in ears: 'Yes! Like my heartbeat, I thought it was nothing'
- Fever/stiff neck: No. Trauma: No
- Family: 'Mom gets migraines'
- Worse bending over: 'Yes, much worse'
- Double vision: 'Sometimes when looking to the side'
- Age/build: 35, 210 lbs, 5'5"

TESTS (report results when ordered, include dollar cost in your response like 'The blood test ($20) came back normal'):
- Fundoscopy ($50): Bilateral papilledema
- Visual field test ($75): Enlarged blind spots bilaterally
- CBC ($20): Normal. CRP/ESR ($25): Normal. BMP ($30): Normal
- CT head ($500): Normal, no mass, no hydrocephalus
- MRI brain ($2000): Empty sella, distended optic nerve sheaths
- MR venography ($2500): Transverse sinus stenosis
- Lumbar puncture ($3000): Opening pressure 32 cm H2O (elevated), normal CSF

When agent diagnoses, call end_game:
- 'correct' if: idiopathic intracranial hypertension, IIH, pseudotumor cerebri
- 'partial' if: raised intracranial pressure, intracranial hypertension
- else 'incorrect'
- total_cost = sum of all test dollar costs ordered so far
- score = -total_cost
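
The grading rules above are applied by the simulated patient using its own judgment, but a plain string-matching sketch shows one thing to watch for: the partial term "intracranial hypertension" is a substring of the correct term "idiopathic intracranial hypertension", so the correct terms have to be checked first. The function name `grade` and the substring-matching approach are assumptions for illustration only.

```python
CORRECT_TERMS = (
    "idiopathic intracranial hypertension",
    "iih",
    "pseudotumor cerebri",
)
PARTIAL_TERMS = (
    "raised intracranial pressure",
    "intracranial hypertension",
)


def grade(diagnosis: str) -> str:
    """Label a stated diagnosis as 'correct', 'partial', or 'incorrect'."""
    text = diagnosis.lower()
    # Check the correct terms first: a partial term is contained in one of them.
    if any(term in text for term in CORRECT_TERMS):
        return "correct"
    if any(term in text for term in PARTIAL_TERMS):
        return "partial"
    return "incorrect"


if __name__ == "__main__":
    print(grade("My diagnosis is: Idiopathic Intracranial Hypertension"))
```

Reversing the two checks would downgrade a fully correct answer to 'partial'.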
1 change: 1 addition & 0 deletions nix/home/skills/info_gathering/evals/movies/BUILD.bazel
@@ -0,0 +1 @@
# Disabled — movies.py renamed to .py.ignore