feat: synthetic data bootstrapping to go from an env spec to calibrated samples #1080

Open
ilijalichkovski wants to merge 4 commits into PrimeIntellect-ai:main from ilijalichkovski:ilija/synth

@ilijalichkovski ilijalichkovski commented Mar 31, 2026

Description

Adds synthetic data bootstrapping: given an environment, it extracts the environment's specification (system prompt, tools, reward functions, dataset schema) and generates training samples.

This is intended for creating new environments on the fly, when few samples are available. Environments are already used to produce high-quality rollouts from an existing dataset, but the aim here is different: this PR bootstraps new tasks from arbitrary seeds (a few illustrative examples, knowledge blobs, SKILL.md files, conversations, etc.). In other words, the environment abstraction is used to condition the bootstrapping of a new synthetic data distribution, rather than to get great rollouts for a pre-provided task distribution.

The pipeline has 4 steps:

  • have a single orchestrator inspect a bounded seed sample (default: 3 examples), or only the env spec if no dataset exists, and infer subtopics plus generation guidance
  • fan out across the subtopics and generate samples via back-translation, giving the synthesizer bounded reference material
  • enforce that generated rows match the env dataset schema exactly, including keys and nested structure/shape
  • filter for quality via learnability (filter_threshold) and optional novelty (filter_ceiling), then report coverage (C@Q) per subtopic
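
As a rough illustration of the schema-matching step, the sketch below compares the nested "shape" of a generated row against a reference row. The helper names `shape_of` and `matches_schema` are hypothetical and are not taken from this PR; the actual validation in `verifiers.synth` may differ, for example in how it treats empty lists or optional fields.

```python
from typing import Any


def shape_of(value: Any) -> Any:
    """Reduce a value to a nested description of its structure (hypothetical helper)."""
    if isinstance(value, dict):
        # Keys are part of the shape, so a missing or extra key changes it.
        return {key: shape_of(val) for key, val in value.items()}
    if isinstance(value, list):
        # Assumption: a list is described by its first element; empty lists are opaque.
        return [shape_of(value[0])] if value else []
    # Leaves are described by their Python type name only.
    return type(value).__name__


def matches_schema(row: dict, reference_row: dict) -> bool:
    """True when both rows have the same keys, nesting, and leaf types."""
    return shape_of(row) == shape_of(reference_row)


reference = {"question": "2+2?", "answer": "4", "info": {"tags": ["math"]}}
candidate = {"question": "3+3?", "answer": "6", "info": {"tags": ["arith"]}}
print(matches_schema(candidate, reference))
```

A strict equality check like this rejects rows with a wrong leaf type (say, an integer `answer` where the reference has a string), which is one way to read "match the env dataset schema exactly".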

This is mostly an illustrative implementation at the moment; a more complete implementation would need further work:

  • env-faithful filtering beyond the current task-field / answer-field abstraction for more complex row formats
  • more principled task/answer extraction for non-question/answer schemas
  • broader validation against more environments

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes


Note

Medium Risk
Introduces a new LLM-driven generation/filtering pipeline with concurrency and strict schema validation; risk is mainly around correctness/cost and provider/key configuration, but it’s additive and covered by unit tests.

Overview
Adds synthetic data bootstrapping via a new verifiers.synth module, centered on SynthDataBuilder.build(), which (1) introspects an Environment into an EnvSpec, (2) plans subtopics from bounded seeds, (3) fans out LLM generation of new rows that must match the environment dataset schema exactly, and (4) filters samples by learnability (filter_threshold) with an optional novelty ceiling (filter_ceiling), while computing per-subtopic coverage stats.
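
One plausible reading of the filtering step is a score band: keep a sample when its score is at least filter_threshold and, if a ceiling is set, at most filter_ceiling. The sketch below follows that reading. The `ScoredSample` type, the idea that the score is a value in [0, 1] from the filter model, and the definition of coverage as the kept fraction per subtopic are all assumptions for illustration, not the PR's confirmed semantics.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScoredSample:
    subtopic: str
    row: dict
    score: float  # assumption: filter-model score for this row, in [0, 1]


def passes_filter(
    score: float, filter_threshold: float, filter_ceiling: Optional[float] = None
) -> bool:
    """Keep scores inside [filter_threshold, filter_ceiling]; no ceiling means no upper bound."""
    if score < filter_threshold:
        return False
    if filter_ceiling is not None and score > filter_ceiling:
        return False
    return True


def filter_and_report(samples, filter_threshold, filter_ceiling=None):
    """Return the kept samples and the kept fraction per subtopic."""
    kept = []
    totals = defaultdict(int)
    passed = defaultdict(int)
    for sample in samples:
        totals[sample.subtopic] += 1
        if passes_filter(sample.score, filter_threshold, filter_ceiling):
            kept.append(sample)
            passed[sample.subtopic] += 1
    coverage = {topic: passed[topic] / totals[topic] for topic in totals}
    return kept, coverage
```

Reporting the kept fraction per subtopic makes it visible when a whole subtopic is filtered away, which an aggregate pass rate would hide.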

Exposes the builder through verifiers.__init__ and adds a BuildResult writer that emits data.json plus a dataset_card.md including coverage failures and the filter mode used. Also adds a runnable GSM8K synthesis script and a comprehensive tests/test_synth.py suite covering seed normalization, JSON parsing, schema validation, prompt rendering, and result saving.
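
To make the output layout concrete, here is a minimal sketch of a writer that emits data.json and dataset_card.md. The function name `save_result_sketch`, the `min_coverage` cutoff used to decide what counts as a coverage failure, and the card layout are hypothetical; the PR's BuildResult writer may structure both files differently.

```python
import json
from pathlib import Path


def save_result_sketch(out_dir, rows, coverage, filter_mode, min_coverage=0.5):
    """Write rows to data.json and a short summary to dataset_card.md (illustrative only)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "data.json").write_text(json.dumps(rows, indent=2), encoding="utf-8")

    # Assumption: a subtopic "fails" coverage when its kept fraction is below min_coverage.
    failures = sorted(topic for topic, frac in coverage.items() if frac < min_coverage)
    lines = [
        "# Dataset card",
        "",
        f"- samples: {len(rows)}",
        f"- filter mode: {filter_mode}",
        "",
        "## Coverage failures",
        "",
    ]
    lines += [f"- {topic} (coverage {coverage[topic]:.2f})" for topic in failures] or ["- none"]
    (out / "dataset_card.md").write_text("\n".join(lines) + "\n", encoding="utf-8")
    return out
```

Calling it with a temporary directory, a list of row dicts, and a per-subtopic coverage mapping produces the two files side by side, so the card can be reviewed without opening the data.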

Written by Cursor Bugbot for commit 0009bcf.

@ilijalichkovski ilijalichkovski changed the title from "Ilija/synth" to "feat: synthetic data bootstrapping to go from an env spec to calibrated samples" Mar 31, 2026

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

self,
env: Environment,
generator_model: str = "gpt-4.1",
filter_model: str = "gpt-4.1",

Inconsistent default model between builder and config

Medium Severity

SynthDataBuilder.__init__ defaults generator_model and filter_model to "gpt-4.1", while SynthConfig defaults them to "gpt-5.4-mini" and the example script uses "openai/gpt-5.4-mini". Since SynthDataBuilder.build() always overrides SynthConfig fields from self.generator_model/self.filter_model, the SynthConfig defaults are effectively dead, and programmatic users get a different model than the script or config suggest.
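
One way to avoid the dead defaults described above is to keep the default in a single place and have the builder override the config only when the caller passed a value explicitly. The sketch below uses stand-in classes (`SynthConfigSketch`, `BuilderSketch`) rather than the PR's real ones, and the default string is simply the value quoted in this comment; it illustrates the pattern and is not a proposed patch.

```python
from dataclasses import dataclass, replace
from typing import Optional

DEFAULT_MODEL = "gpt-5.4-mini"  # value quoted in the review comment above


@dataclass(frozen=True)
class SynthConfigSketch:
    # Hypothetical stand-in for SynthConfig: the only place defaults live.
    generator_model: str = DEFAULT_MODEL
    filter_model: str = DEFAULT_MODEL


class BuilderSketch:
    # Hypothetical stand-in for SynthDataBuilder: None means "not specified".
    def __init__(
        self,
        generator_model: Optional[str] = None,
        filter_model: Optional[str] = None,
    ) -> None:
        self.generator_model = generator_model
        self.filter_model = filter_model

    def resolve_config(
        self, config: Optional[SynthConfigSketch] = None
    ) -> SynthConfigSketch:
        """Apply only the explicitly provided overrides on top of the config."""
        config = config or SynthConfigSketch()
        candidates = {
            "generator_model": self.generator_model,
            "filter_model": self.filter_model,
        }
        overrides = {name: value for name, value in candidates.items() if value is not None}
        return replace(config, **overrides)
```

With this shape, a builder created without arguments resolves to the config defaults, and passing only `filter_model` changes that one field while leaving `generator_model` at its default.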

"GRPOConfig",
"grpo_defaults",
"lora_defaults",
"SynthDataBuilder",

New public API lacks documentation updates

Low Severity

SynthDataBuilder is added to __all__ and _LAZY_IMPORTS, making it a core user-facing export, but none of the documentation files (docs/reference.md, docs/overview.md, docs/environments.md, docs/faqs.md) are updated to describe this new class or the synthetic data generation workflow.

Triggered by project rule: BugBot Instructions