Introduces `log_eval_sample()` method for stream-writing individual evaluation samples to `samples.jsonl` during evaluation, with lazy writer initialization and automatic HTML generation on completion. Updates GSM8k environment to use streaming approach instead of batching samples.
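The stream-writing pattern with lazy writer initialization described above can be sketched as follows. This is an illustrative sketch only: the class and method names below (other than `log_eval_sample`) are assumptions, not the actual Atropos implementation.

```python
import json
from pathlib import Path


class EvalSampleLogger:
    """Illustrative sketch: stream-writes eval samples to samples.jsonl.

    The JSONL writer is opened lazily on the first logged sample rather
    than at construction, so no file is created for runs that log nothing.
    """

    def __init__(self, out_dir: str):
        self.out_dir = Path(out_dir)
        self._writer = None  # lazy: opened on first call to log_eval_sample()

    def log_eval_sample(self, sample: dict) -> None:
        if self._writer is None:
            # First sample: create the output directory and open the writer.
            self.out_dir.mkdir(parents=True, exist_ok=True)
            self._writer = open(
                self.out_dir / "samples.jsonl", "w", encoding="utf-8"
            )
        # Append one JSON object per line (JSONL).
        self._writer.write(json.dumps(sample) + "\n")

    def close(self) -> None:
        # Called on completion; the real code also triggers HTML generation here.
        if self._writer is not None:
            self._writer.close()
            self._writer = None
```

Streaming each sample as it finishes avoids holding the whole eval run in memory, which is the advantage over the batching approach GSM8k used before.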
dmahan93 (Collaborator) requested changes on Mar 30, 2026 and left a comment:
- this should be optional, so we need an arg passed to eval
dmahan93 (Collaborator):
This should be optional; please add a kwarg to `EvalBase`.
dmahan93 (Collaborator):
We do not want this in `BaseEnv`, as `evaluate_log` is just a compatibility shim kept from previous releases.
dmahan93 (Collaborator):
Revert this file and choose a specific eval to add this to.
Author (Contributor):
Reverting this.
Actually, this is already used downstream in NousResearch/hermes-agent#1625.
Will add this to a relevant doc somewhere for reference.
PR Type: 📝 General Information

Description
Adds automatic HTML viewer generation for evaluation runs. Previously, evaluate mode only produced `metrics.json` and `samples.jsonl`: users could see aggregate scores but couldn't browse individual model responses. Process mode had an HTML viewer, but it is designed for SFT data collection, not eval.

Now, any `evaluate()` implementation that calls `evaluate_log(samples=...)` automatically gets a `samples.html` file alongside `samples.jsonl`, rendering each eval sample as a collapsible group showing the question, model response, gold answer, and color-coded score.

Changes:
- `atroposlib/frontend/jsonl2html.py`: Added `generate_eval_html()`, which converts flat eval sample dicts (whose schemas vary across eval environments) into the `{messages, scores}` format the existing HTML template expects, using field-name heuristics (`question`/`problem`, `model_response`/`response`, `score`/`is_correct`/`grade`).
- `atroposlib/envs/base.py`: `evaluate_log()` now calls `generate_eval_html` after writing samples; `_run_evaluate()` properly cleans up the JSONL writer in a `finally` block; the evaluate CLI now defaults `data_dir_to_save_evals` to `eval_results/{env_name}`.
- `atroposlib/envs/eval.py`: The standalone `evaluate_log()` also generates HTML after writing samples.

No changes are required to any existing eval environment; they all already pass `samples=` to `evaluate_log()`.

Related Issues
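As a rough sketch of the field-name heuristics mentioned in the description, the conversion from a flat eval-sample dict to the `{messages, scores}` template format might look like the following. The function name and exact fallback order here are assumptions for illustration, not the actual `generate_eval_html()` internals.

```python
def normalize_eval_sample(sample: dict) -> dict:
    """Illustrative sketch: map a flat eval-sample dict (varying schemas
    across eval environments) into the {messages, scores} shape the HTML
    template expects, using field-name heuristics."""
    # Try each candidate key in order; fall back to an empty string.
    question = next(
        (sample[k] for k in ("question", "problem") if k in sample), ""
    )
    response = next(
        (sample[k] for k in ("model_response", "response") if k in sample), ""
    )
    score = next(
        (sample[k] for k in ("score", "is_correct", "grade") if k in sample), None
    )
    return {
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": response},
        ],
        # Booleans like is_correct coerce cleanly to 1.0 / 0.0.
        "scores": [float(score)] if score is not None else [],
    }
```

Because the mapping is heuristic, samples that use none of the expected field names would fall through to empty strings, which is worth keeping in mind when writing new eval environments.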
Type of Change
✅ Developer & Reviewer Checklist