
add trajectory saving to eval mode.#422

Open
alt-glitch wants to merge 4 commits into main from sid/traj-saving-eval-mode

Conversation

@alt-glitch
Contributor

PR Type

  • Non-Environment PR - Complete Description, Related Issues & Type of Change sections

📝 General Information

Description

Adds automatic HTML viewer generation for evaluation runs. Previously, evaluate mode only produced metrics.json and samples.jsonl — users could see aggregate scores but couldn't browse individual model responses. Process mode had an HTML viewer, but it's designed for SFT data collection, not eval.

Now, any evaluate() implementation that calls evaluate_log(samples=...) automatically gets a samples.html file alongside samples.jsonl, rendering each eval sample as a collapsible group showing the question, model response, gold answer, and score (color-coded).
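As a rough illustration of the flow described above, here is a hypothetical sketch of how an `evaluate()` implementation might build samples for `evaluate_log()`; the field names (`question` / `model_response` / `gold_answer` / `score`) follow the heuristics mentioned below, and the helper name is illustrative, not the actual atroposlib API.

```python
# Hypothetical sketch: build flat eval-sample dicts for evaluate_log().
# Field names are assumptions drawn from the PR description.

def build_eval_samples(dataset, model):
    samples = []
    for item in dataset:
        response = model(item["question"])  # query the model under evaluation
        samples.append({
            "question": item["question"],
            "model_response": response,
            "gold_answer": item["answer"],
            "score": float(response.strip() == item["answer"]),
        })
    return samples

# An evaluate() implementation would then hand these to evaluate_log,
# which (per this PR) writes samples.jsonl and renders samples.html:
#   evaluate_log(samples=build_eval_samples(dataset, model))
```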

Changes:

  • atroposlib/frontend/jsonl2html.py: Added generate_eval_html() which converts flat eval sample dicts (varying schemas across eval environments) into the {messages, scores} format the existing HTML template expects, using field-name heuristics (question/problem, model_response/response, score/is_correct/grade)
  • atroposlib/envs/base.py: evaluate_log() now calls generate_eval_html after writing samples; _run_evaluate() properly cleans up the JSONL writer in a finally block; evaluate CLI auto-defaults data_dir_to_save_evals to eval_results/{env_name}
  • atroposlib/envs/eval.py: Standalone evaluate_log() also generates HTML after writing samples
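The field-name heuristic in `generate_eval_html()` can be sketched roughly as follows; the candidate key names come from the PR description, while the function names and exact `{messages, scores}` layout here are illustrative assumptions, not the actual implementation.

```python
# Minimal sketch of the heuristic: map a flat eval-sample dict
# (schemas vary across eval environments) into the {messages, scores}
# shape the existing HTML template consumes. Names are illustrative.

QUESTION_KEYS = ("question", "problem")
RESPONSE_KEYS = ("model_response", "response")
SCORE_KEYS = ("score", "is_correct", "grade")

def _first(sample, keys, default=""):
    """Return the value of the first key present in the sample."""
    for key in keys:
        if key in sample:
            return sample[key]
    return default

def eval_sample_to_group(sample):
    question = str(_first(sample, QUESTION_KEYS))
    response = str(_first(sample, RESPONSE_KEYS))
    score = float(_first(sample, SCORE_KEYS, default=0.0))
    return {
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": response},
        ],
        "scores": [score],
    }
```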

No changes required to any existing eval environment — they all already pass samples= to evaluate_log().

Related Issues

Type of Change

  • New feature (non-breaking change which adds functionality)

✅ Developer & Reviewer Checklist

  • Code follows project style (black, isort, flake8 pass with pre-commit)
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • New and existing unit tests pass locally with my changes
  • Docstrings added for all new public classes / functions
  • If .env vars are required, did you add them to the .env.example in the repo root?

alt-glitch and others added 4 commits March 27, 2026 16:04
Introduces `log_eval_sample()` method for stream-writing individual
evaluation samples to `samples.jsonl` during evaluation, with lazy
writer initialization and automatic HTML generation on completion.
Updates GSM8k environment to use streaming approach instead of batching
samples.
Collaborator

@dmahan93 dmahan93 left a comment


  • this should be optional, so we need an arg passed to eval

Comment thread atroposlib/envs/eval.py
Collaborator


this should be optional, please add a kwarg to EvalBase

Comment thread atroposlib/envs/base.py
Collaborator


we do not want this in baseenv as evaluate_log is just compatibility from previous releases

Collaborator


revert this file and choose an eval to add this to

Contributor Author


Reverting this.

Actually, this is already in use downstream in NousResearch/hermes-agent#1625.

Will add this to a relevant doc somewhere for reference.

