feat: synthetic data bootstrapping to go from an env spec to calibrated samples by ilijalichkovski · Pull Request #1080 · PrimeIntellect-ai/verifiers

ilijalichkovski · 2026-03-31T11:00:57Z

Description

Adds synthetic data bootstrapping: given an environment, it extracts the environment's specification (system prompt, tools, reward functions, dataset schema) and generates training samples.

This is meant to be useful when creating new environments on the fly, where few samples are available. Environments are already used to create high-quality rollouts from the environment's own dataset, but the aim here is different: this PR bootstraps new tasks from arbitrary seeds (a few illustrative examples, knowledge blobs, SKILL.md files, conversations, etc.). In other words, the environment abstraction is used to condition the bootstrapping of a new synthetic data distribution, rather than to get great rollouts for a pre-provided task distribution.

The pipeline has four steps:

1. A single orchestrator sees a bounded seed sample (default: 3 examples), or only the env spec if no dataset exists, and infers subtopics plus generation guidance.
2. Generation fans out across the subtopics and produces samples via back-translation, giving the synthesizer bounded reference material.
3. Generated rows are validated against the env dataset schema and must match it exactly, including keys and nested structure/shape.
4. Samples are filtered for quality via learnability (filter_threshold) and optional novelty (filter_ceiling), and coverage (C@Q) is then reported per subtopic.
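As a rough illustration of the schema-matching step, here is a minimal sketch of a strict structural check. It is not the code in this PR: `matches_schema` and the example rows are hypothetical, and the rule that a list is validated against the first element of the reference list is an assumption made for the sketch.

```python
from typing import Any


def matches_schema(row: Any, reference: Any) -> bool:
    """Return True if `row` has the same keys, nesting, and leaf types as `reference`.

    Hypothetical helper for illustration; `reference` is one known-good dataset row.
    """
    if isinstance(reference, dict):
        # Keys must match exactly: no missing keys and no extra keys.
        return (
            isinstance(row, dict)
            and row.keys() == reference.keys()
            and all(matches_schema(row[key], reference[key]) for key in reference)
        )
    if isinstance(reference, list):
        if not isinstance(row, list):
            return False
        # An empty reference list gives no element template, so only the
        # container type can be checked.
        if not reference:
            return True
        return all(matches_schema(item, reference[0]) for item in row)
    # Leaves must have the exact same type, so a bool does not pass as an int.
    return type(row) is type(reference)


# A seed row acts as the template; generated rows are compared against it.
seed_row = {"question": "2 + 2?", "answer": "4", "info": {"tags": ["math"], "level": 1}}
generated = {"question": "3 + 5?", "answer": "8", "info": {"tags": ["sums", "easy"], "level": 2}}
is_valid = matches_schema(generated, seed_row)
```

Comparing against a concrete reference row keeps the check independent of any particular schema library; a real implementation might instead raise with the offending path so that mismatches fail loudly.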

This is mostly an illustrative implementation at the moment; a more complete one would need further work:

env-faithful filtering beyond the current task-field / answer-field abstraction for more complex row formats
more principled task/answer extraction for non-question/answer schemas
broader validation against more environments

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Test improvement

Testing

All existing tests pass when running uv run pytest locally.
New tests have been added to cover the changes

Checklist

My code follows the style guidelines of this project as outlined in AGENTS.md
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
Any dependent changes have been merged and published

Additional Notes

Note

Medium Risk
Introduces a new LLM-driven generation/filtering pipeline with concurrency and strict schema validation; risk is mainly around correctness/cost and provider/key configuration, but it’s additive and covered by unit tests.

Overview
Adds synthetic data bootstrapping via new verifiers.synth module, centered on SynthDataBuilder.build() which (1) introspects an Environment into an EnvSpec, (2) plans subtopics from bounded seeds, (3) fans out LLM generation of new rows that must match the environment dataset schema exactly, and (4) filters samples by learnability (filter_threshold) with an optional novelty ceiling (filter_ceiling), while computing per-subtopic coverage stats.

Exposes the builder through verifiers.__init__ and adds a BuildResult writer that emits data.json plus a dataset_card.md including coverage failures and the filter mode used. Also adds a runnable GSM8K synthesis script and a comprehensive tests/test_synth.py suite covering seed normalization, JSON parsing, schema validation, prompt rendering, and result saving.
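The filtering step in the overview can be pictured with a small sketch. Everything here is an assumption made for illustration: the PR text does not define how a sample is scored or how C@Q is computed, so `Candidate.score` stands in for some per-sample score in [0, 1] from the filter model, and the per-subtopic statistic is simply the fraction of candidates kept.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    subtopic: str
    score: float  # assumed: a filter-model score in [0, 1]


def keep(candidate: Candidate, filter_threshold: float,
         filter_ceiling: Optional[float] = None) -> bool:
    """Keep samples at or above the threshold and, if a ceiling is set, at or below it."""
    if candidate.score < filter_threshold:
        return False  # below the learnability threshold
    if filter_ceiling is not None and candidate.score > filter_ceiling:
        return False  # above the optional novelty ceiling
    return True


def kept_fraction_by_subtopic(candidates: list[Candidate], filter_threshold: float,
                              filter_ceiling: Optional[float] = None) -> dict[str, float]:
    """Stand-in coverage statistic: fraction of candidates kept in each subtopic."""
    kept: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for candidate in candidates:
        total[candidate.subtopic] += 1
        kept[candidate.subtopic] += keep(candidate, filter_threshold, filter_ceiling)
    return {subtopic: kept[subtopic] / total[subtopic] for subtopic in total}


candidates = [
    Candidate("fractions", 0.5),
    Candidate("fractions", 1.0),
    Candidate("percentages", 0.0),
    Candidate("percentages", 0.25),
]
threshold_only = kept_fraction_by_subtopic(candidates, filter_threshold=0.25)
with_ceiling = kept_fraction_by_subtopic(candidates, filter_threshold=0.25, filter_ceiling=0.75)
```

Making the ceiling optional mirrors the description of novelty filtering as opt-in: with `filter_ceiling=None` only the lower bound applies.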

Written by Cursor Bugbot for commit 0009bcf. This will update automatically on new commits.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

cursor · 2026-03-31T11:06:15Z

verifiers/synth/builder.py

+        self,
+        env: Environment,
+        generator_model: str = "gpt-4.1",
+        filter_model: str = "gpt-4.1",


Inconsistent default model between builder and config

Medium Severity

SynthDataBuilder.__init__ defaults generator_model and filter_model to "gpt-4.1", while SynthConfig defaults them to "gpt-5.4-mini" and the example script uses "openai/gpt-5.4-mini". Since SynthDataBuilder.build() always overrides SynthConfig fields from self.generator_model/self.filter_model, the SynthConfig defaults are effectively dead, and programmatic users get a different model than the script or config suggest.

Additional Locations (1)

verifiers/synth/models.py#L35-L37
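To make the reported behavior concrete, the sketch below reproduces the pattern with stand-in classes. `Config`, `Builder`, and `FixedBuilder` are hypothetical and are not the PR's `SynthConfig` or `SynthDataBuilder`; only the default strings are taken from the comment above. The second builder shows one possible remedy, assuming `None` is acceptable as a "not specified" marker.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Config:
    # Stand-in for the config defaults named in the review comment.
    generator_model: str = "gpt-5.4-mini"
    filter_model: str = "gpt-5.4-mini"


class Builder:
    """Reproduces the issue: constructor defaults always overwrite the config."""

    def __init__(self, generator_model: str = "gpt-4.1", filter_model: str = "gpt-4.1"):
        self.generator_model = generator_model
        self.filter_model = filter_model

    def resolve(self, config: Optional[Config] = None) -> Config:
        config = config or Config()
        # Unconditional override: the Config defaults can never take effect.
        return replace(config, generator_model=self.generator_model,
                       filter_model=self.filter_model)


class FixedBuilder:
    """One possible fix: override only what the caller explicitly passed."""

    def __init__(self, generator_model: Optional[str] = None,
                 filter_model: Optional[str] = None):
        self.generator_model = generator_model
        self.filter_model = filter_model

    def resolve(self, config: Optional[Config] = None) -> Config:
        config = config or Config()
        overrides = {
            name: value
            for name, value in (("generator_model", self.generator_model),
                                ("filter_model", self.filter_model))
            if value is not None
        }
        return replace(config, **overrides)


buggy = Builder().resolve()
fixed = FixedBuilder().resolve()
partial = FixedBuilder(generator_model="gpt-4.1").resolve()
```

With this shape the config remains the single source of default values, and the builder arguments act purely as overrides.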

cursor · 2026-03-31T11:06:15Z

verifiers/__init__.py

    "GRPOConfig",
    "grpo_defaults",
    "lora_defaults",
+    "SynthDataBuilder",


New public API lacks documentation updates

Low Severity

SynthDataBuilder is added to __all__ and _LAZY_IMPORTS, making it a core user-facing export, but none of the documentation files (docs/reference.md, docs/overview.md, docs/environments.md, docs/faqs.md) are updated to describe this new class or the synthetic data generation workflow.

Additional Locations (1)

verifiers/__init__.py#L124-L125

Triggered by project rule: BugBot Instructions

ilijalichkovski added 4 commits March 31, 2026 12:47

feat: synthetic data builder to turn an env spec into calibrated samples

bfeb865

fix: ensure keys and fail loudly

6118bdb

chore: refactor, make orchestration more principled, stricter adherence to dataset schema

8f02e6e

chore: use verifiers abstractions for client and messages

0009bcf

ilijalichkovski changed the title ~~Ilija/synth~~ feat: synthetic data bootstrapping to go from an env spec to calibrated samples Mar 31, 2026

cursor bot reviewed Mar 31, 2026

ilijalichkovski wants to merge 4 commits into PrimeIntellect-ai:main from ilijalichkovski:ilija/synth
