Introduces `log_eval_sample()` method for stream-writing individual evaluation samples to `samples.jsonl` during evaluation, with lazy writer initialization and automatic HTML generation on completion. Updates GSM8k environment to use streaming approach instead of batching samples.
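The stream-writing pattern with lazy writer initialization described above can be sketched as follows. This is an illustrative sketch only: the class and method names below (other than `log_eval_sample`) are assumptions, not the actual Atropos implementation.

```python
import json
from pathlib import Path


class EvalSampleLogger:
    """Illustrative sketch: stream-writes eval samples to samples.jsonl.

    The JSONL writer is opened lazily on the first logged sample rather
    than at construction, so no file is created for runs that log nothing.
    """

    def __init__(self, out_dir: str):
        self.out_dir = Path(out_dir)
        self._writer = None  # lazy: opened on first call to log_eval_sample()

    def log_eval_sample(self, sample: dict) -> None:
        if self._writer is None:
            # First sample: create the output directory and open the writer.
            self.out_dir.mkdir(parents=True, exist_ok=True)
            self._writer = open(
                self.out_dir / "samples.jsonl", "w", encoding="utf-8"
            )
        # Append one JSON object per line (JSONL).
        self._writer.write(json.dumps(sample) + "\n")

    def close(self) -> None:
        # Called on completion; the real code also triggers HTML generation here.
        if self._writer is not None:
            self._writer.close()
            self._writer = None
```

Streaming each sample as it finishes avoids holding the whole eval run in memory, which is the advantage over the batching approach GSM8k used before.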
dmahan93 (Collaborator) requested changes on Mar 30, 2026 and left a comment:
- this should be optional, so we need an arg passed to eval
dmahan93 (Collaborator):
This should be optional; please add a kwarg to `EvalBase`.
dmahan93 (Collaborator):
We do not want this in `BaseEnv`, as `evaluate_log` is just a compatibility shim kept from previous releases.
dmahan93 (Collaborator):
Revert this file and choose a specific eval to add this to.
Author (Contributor):
Reverting this.
Actually, this is already used downstream in NousResearch/hermes-agent#1625.
Will add this to a relevant doc somewhere for reference.
PR Type: 📝 General Information

Description
Adds automatic HTML viewer generation for evaluation runs. Previously, evaluate mode only produced `metrics.json` and `samples.jsonl`: users could see aggregate scores but couldn't browse individual model responses. Process mode had an HTML viewer, but it is designed for SFT data collection, not eval.

Now, any `evaluate()` implementation that calls `evaluate_log(samples=...)` automatically gets a `samples.html` file alongside `samples.jsonl`, rendering each eval sample as a collapsible group showing the question, model response, gold answer, and color-coded score.

Changes:
- `atroposlib/frontend/jsonl2html.py`: Added `generate_eval_html()`, which converts flat eval sample dicts (whose schemas vary across eval environments) into the `{messages, scores}` format the existing HTML template expects, using field-name heuristics (`question`/`problem`, `model_response`/`response`, `score`/`is_correct`/`grade`).
- `atroposlib/envs/base.py`: `evaluate_log()` now calls `generate_eval_html` after writing samples; `_run_evaluate()` properly cleans up the JSONL writer in a `finally` block; the evaluate CLI now defaults `data_dir_to_save_evals` to `eval_results/{env_name}`.
- `atroposlib/envs/eval.py`: The standalone `evaluate_log()` also generates HTML after writing samples.

No changes are required to any existing eval environment; they all already pass `samples=` to `evaluate_log()`.

Related Issues
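As a rough sketch of the field-name heuristics mentioned in the description, the conversion from a flat eval-sample dict to the `{messages, scores}` template format might look like the following. The function name and exact fallback order here are assumptions for illustration, not the actual `generate_eval_html()` internals.

```python
def normalize_eval_sample(sample: dict) -> dict:
    """Illustrative sketch: map a flat eval-sample dict (varying schemas
    across eval environments) into the {messages, scores} shape the HTML
    template expects, using field-name heuristics."""
    # Try each candidate key in order; fall back to an empty string.
    question = next(
        (sample[k] for k in ("question", "problem") if k in sample), ""
    )
    response = next(
        (sample[k] for k in ("model_response", "response") if k in sample), ""
    )
    score = next(
        (sample[k] for k in ("score", "is_correct", "grade") if k in sample), None
    )
    return {
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": response},
        ],
        # Booleans like is_correct coerce cleanly to 1.0 / 0.0.
        "scores": [float(score)] if score is not None else [],
    }
```

Because the mapping is heuristic, samples that use none of the expected field names would fall through to empty strings, which is worth keeping in mind when writing new eval environments.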
Type of Change
✅ Developer & Reviewer Checklist