feat: synthetic data bootstrapping to go from an env spec to calibrated samples #1080

ilijalichkovski wants to merge 4 commits into PrimeIntellect-ai:main from
Conversation
Cursor Bugbot has reviewed your changes and found 2 potential issues.
```python
    self,
    env: Environment,
    generator_model: str = "gpt-4.1",
    filter_model: str = "gpt-4.1",
```
Inconsistent default model between builder and config
Medium Severity
SynthDataBuilder.__init__ defaults generator_model and filter_model to "gpt-4.1", while SynthConfig defaults them to "gpt-5.4-mini" and the example script uses "openai/gpt-5.4-mini". Since SynthDataBuilder.build() always overrides SynthConfig fields from self.generator_model/self.filter_model, the SynthConfig defaults are effectively dead, and programmatic users get a different model than the script or config suggest.
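One way to resolve the dead-default issue is to make the config's model fields optional and have the builder fill in only what the caller left unset. This is a hypothetical sketch, not the PR's actual code: the real `SynthConfig` and `SynthDataBuilder` live in `verifiers.synth` and carry more fields, and the `resolve` helper is invented here for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SynthConfig:
    # None means "inherit from the builder", instead of carrying a second,
    # conflicting default (the PR's config defaults to "gpt-5.4-mini"
    # while the builder defaults to "gpt-4.1").
    generator_model: Optional[str] = None
    filter_model: Optional[str] = None

class SynthDataBuilder:
    def __init__(self, generator_model: str = "gpt-4.1", filter_model: str = "gpt-4.1"):
        self.generator_model = generator_model
        self.filter_model = filter_model

    def resolve(self, config: SynthConfig) -> SynthConfig:
        # Only fill fields the caller left unset, so explicit config wins
        # over builder defaults instead of being silently overridden.
        return SynthConfig(
            generator_model=config.generator_model or self.generator_model,
            filter_model=config.filter_model or self.filter_model,
        )

builder = SynthDataBuilder()
resolved = builder.resolve(SynthConfig(filter_model="gpt-4o-mini"))
print(resolved.generator_model, resolved.filter_model)  # gpt-4.1 gpt-4o-mini
```

With this shape, there is a single source of truth for defaults (the builder), and programmatic users see the same model names the example script uses.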
```python
    "GRPOConfig",
    "grpo_defaults",
    "lora_defaults",
    "SynthDataBuilder",
```
New public API lacks documentation updates
Low Severity
SynthDataBuilder is added to __all__ and _LAZY_IMPORTS, making it a core user-facing export, but none of the documentation files (docs/reference.md, docs/overview.md, docs/environments.md, docs/faqs.md) are updated to describe this new class or the synthetic data generation workflow.


Description
Adds synthetic data bootstrapping functionality that takes an environment as input, extracts its specification (system prompt, tools, reward funcs, dataset schema), and generates training samples.

This is envisioned to be useful when creating new environments on the fly, where minimal samples are available. Of course, environments are already used to create high-quality rollouts given the environment's dataset, but our aim here is different. This PR instead focuses on bootstrapping new tasks from arbitrary seeds (a few illustrative examples, knowledge blobs, SKILL.md files, conversations, etc.). In other words, we are trying to use the environment abstraction as a way to condition the process of bootstrapping a new synthetic data distribution, rather than getting great rollouts for a pre-provided task distribution.

The pipeline has 4 steps:

1. Introspecting the Environment into an EnvSpec
2. Planning subtopics from bounded seeds
3. Generating new rows via LLM that must match the environment's dataset schema exactly
4. Filtering samples by learnability (filter_threshold) and optional novelty (filter_ceiling), then reporting coverage (C@Q) per subtopic

This is mostly an illustrative implementation atm; a more complete implementation would need some more work:
- question/answer schemas

Type of Change
Testing
Ran `uv run pytest` locally.

Checklist
Additional Notes
Note
Medium Risk
Introduces a new LLM-driven generation/filtering pipeline with concurrency and strict schema validation; risk is mainly around correctness/cost and provider/key configuration, but it’s additive and covered by unit tests.
Overview
Adds synthetic data bootstrapping via a new verifiers.synth module, centered on SynthDataBuilder.build(), which (1) introspects an Environment into an EnvSpec, (2) plans subtopics from bounded seeds, (3) fans out LLM generation of new rows that must match the environment dataset schema exactly, and (4) filters samples by learnability (filter_threshold) with an optional novelty ceiling (filter_ceiling), while computing per-subtopic coverage stats.

Exposes the builder through verifiers.__init__ and adds a BuildResult writer that emits data.json plus a dataset_card.md including coverage failures and the filter mode used. Also adds a runnable GSM8K synthesis script and a comprehensive tests/test_synth.py suite covering seed normalization, JSON parsing, schema validation, prompt rendering, and result saving.

Written by Cursor Bugbot for commit 0009bcf. This will update automatically on new commits.
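The learnability/novelty filter of step (4) can be sketched in plain Python. This is a hypothetical illustration: the PR's real pipeline derives the score from an LLM judge, and the exact C@Q coverage semantics are assumed here to mean "fraction of planned subtopics with at least one surviving sample".

```python
from dataclasses import dataclass

@dataclass
class Sample:
    subtopic: str
    learnability: float  # in the real pipeline, an LLM-derived score

def filter_samples(samples, filter_threshold=0.3, filter_ceiling=None):
    """Keep samples whose learnability clears the floor and, when a novelty
    ceiling is set, stays below it (too-easy samples are dropped too)."""
    kept = []
    for s in samples:
        if s.learnability < filter_threshold:
            continue  # not learnable enough
        if filter_ceiling is not None and s.learnability > filter_ceiling:
            continue  # trivially solvable, little training signal
        kept.append(s)
    return kept

def coverage_per_subtopic(planned, kept):
    """Report which planned subtopics still have at least one kept sample."""
    covered = {s.subtopic for s in kept}
    return {t: (t in covered) for t in planned}

samples = [Sample("algebra", 0.9), Sample("geometry", 0.1), Sample("ratios", 0.5)]
kept = filter_samples(samples, filter_threshold=0.3, filter_ceiling=0.8)
print([s.subtopic for s in kept])  # ['ratios']
print(coverage_per_subtopic(["algebra", "geometry", "ratios"], kept))
```

The two-sided band (floor plus optional ceiling) is what distinguishes the "threshold" and "threshold+ceiling" filter modes the dataset card records: the ceiling removes samples the model already solves reliably, which otherwise inflate coverage without adding signal.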
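The BuildResult writer's output shape can likewise be sketched. Hypothetical names throughout (`save_result` and its parameters are invented for illustration); the only grounded facts are the two emitted files, data.json and dataset_card.md, and that the card records coverage failures and the filter mode used.

```python
import json
import tempfile
from pathlib import Path

def save_result(out_dir, rows, coverage_failures, filter_mode):
    """Write generated rows to data.json and a small dataset_card.md
    recording the filter mode and any subtopics that failed coverage."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "data.json").write_text(json.dumps(rows, indent=2))
    card = ["# Dataset card",
            f"- filter mode: {filter_mode}",
            f"- rows: {len(rows)}"]
    card += [f"- coverage failure: {t}" for t in coverage_failures]
    (out / "dataset_card.md").write_text("\n".join(card) + "\n")
    return out / "data.json", out / "dataset_card.md"

with tempfile.TemporaryDirectory() as d:
    data_path, card_path = save_result(
        d, [{"question": "2+2?", "answer": "4"}], ["geometry"], "threshold+ceiling")
    print(card_path.read_text().splitlines()[1])  # - filter mode: threshold+ceiling
```

Keeping the card as plain Markdown next to the JSON rows means the provenance (filter mode, coverage gaps) travels with the dataset wherever it is copied.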