feat: Multi-turn agentic architecture by GeorgeWingg · Pull Request #56 · SakanaAI/ShinkaEvolve

GeorgeWingg · 2025-12-19T15:31:58Z

Summary

- Adds multi-turn agentic editing and evaluation backends (Codex CLI, Shinka Agent)
- Implements multi-file workspace support with embedding corpus for novelty detection
- Adds bandit sampling integration for agentic mode
- Includes new example configs for boids_flocking_agentic and circle_packing_agentic variants

Key Changes

- Agentic backends: shinka/edit/agentic.py, codex_cli.py, shinka_agent.py - pluggable CLI harnesses that own system prompts and stream events
- Agentic evaluator: shinka/eval/agentic.py - runs evaluation in agent sessions with metrics extraction
- Multi-file corpus: shinka/core/embedding_corpus.py - builds embedding text from multiple workspace files
- Runner integration: Full async job pipeline with thread-safe parallelism for agentic mode
- Configs: New evolution/agentic.yaml base config and variant configs

Test plan

- Unit tests for agentic editor and evaluator (tests/test_agentic_*.py)
- Manual test with circle_packing_agentic config
- Manual test with boids_flocking_agentic config

🤖 Generated with Claude Code

- Add bandit model selection before agentic sessions (parity with legacy)
- Track bandit-selected model for proper reward updates
- Fix Codex backend to respect extra_cli_config model override
- Fix apply_full_patch parameter names in agentic path
- Fix boids_flocking variant config (add variant_suffix, remove n_pop)
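To illustrate what "select a model, then credit the reward to that same model" can look like, here is a minimal UCB1 sketch. It is not ShinkaEvolve's bandit implementation: the class name `UCB1`, the exploration constant `c`, and the arm names are assumptions made for the example. Only the term "UCB" comes from the commit messages in this PR.

```python
import math


class UCB1:
    """Plain UCB1 bandit over model names (illustrative, not the Shinka class)."""

    def __init__(self, arms, c=1.0):
        self.counts = {arm: 0 for arm in arms}
        self.totals = {arm: 0.0 for arm in arms}
        self.c = c  # hypothetical exploration weight

    def select(self):
        # Try every arm once before applying the UCB formula.
        for arm, n in self.counts.items():
            if n == 0:
                return arm
        t = sum(self.counts.values())

        def ucb(arm):
            mean = self.totals[arm] / self.counts[arm]
            bonus = self.c * math.sqrt(2 * math.log(t) / self.counts[arm])
            return mean + bonus

        return max(self.counts, key=ucb)

    def update(self, arm, reward):
        # The reward must go to the arm that was actually selected,
        # which is why the selected model has to be tracked per job.
        self.counts[arm] += 1
        self.totals[arm] += reward
```

A caller would keep the value returned by `select()` alongside the job and pass it back to `update()` once the score is known.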

- Add agentic variant config for boids multi-file task
- Fix Hydra config override using @_global_ package syntax
- Fix boids task config to nest evo_config properly for merging
- Change default agentic model from gpt-5.2 to gpt-4.1
- Fix display.py NoneType subscript bug in patch_name

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add gpt-5.2 to OPENAI_MODELS pricing and REASONING_OAI_MODELS
- Update agentic.yaml default model to gpt-5.2
- Add EXECPLAN_PR_READY.md for PR validation tracking

Run quality bar checks (V8) on PR-modified Python files only.
- black with default config
- isort with --profile black

The PromptSampler was sending DIFF-format prompts to agentic sessions, causing agents to output <DIFF> XML instead of using shell commands.

Root cause: PromptSampler had no awareness of agentic_mode.

Fix:
- AGENTIC_SYS_FORMAT is now empty (harness provides its own)
- PromptSampler._sample_agentic() puts task context in user prompt
- runner.py passes agentic_mode to PromptSampler

Also fixed:
- boids_flocking_agentic variant now correctly sets init_program_path
- display.py handles None metadata gracefully

V1.1 E2E test now passes:
- Agent explores workspace with shell commands (ls, sed, etc.)
- Files appear in gen_1/
- patch_type correctly set to "agentic"

The redact_immutable function returned empty string when code had no EVOLVE-BLOCK markers, causing embedding API to fail with 400 error.

Now returns full text for embedding when no markers are present. This affects tasks like boids_flocking that don't use markers.
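The fallback described in this commit message can be sketched as follows. This is not the project's `redact_immutable`: the function name `redact_immutable_sketch` and the exact marker strings are assumptions for illustration, and the real function may treat the text outside the markers differently.

```python
# Assumed marker strings; check the ShinkaEvolve source for the real ones.
START = "EVOLVE-BLOCK-START"
END = "EVOLVE-BLOCK-END"


def redact_immutable_sketch(text: str) -> str:
    """Keep only lines between marker pairs; fall back to the full text."""
    kept = []
    inside = False
    saw_marker = False
    for line in text.splitlines():
        if START in line:
            inside, saw_marker = True, True
        elif END in line:
            inside = False
        elif inside:
            kept.append(line)
    if not saw_marker:
        # No markers at all: return everything instead of an empty string,
        # so the embedding request never receives empty input.
        return text
    return "\n".join(kept)
```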

BREAKING: Removed silent fallback to gpt-4.1-mini in agentic backends.

Before: If no model configured, silently used gpt-4.1-mini (old model)
After: Raises clear error with instructions on how to configure

Changes:
- shinka_agent.py: Raises ShinkaExecutionError if no model
- codex_cli.py: Raises CodexExecutionError if no model
- agentic.yaml: Now explicitly sets model: "gpt-4.1" (required field)

Also fixed: Inconsistent precedence order between backends. Now both use: extra_cli_config["model"] > profile > FAIL

Error message example: "No model configured for ShinkaAgent. Set evo_config.agentic.extra_cli_config.model..."
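The precedence order stated above (explicit override, then profile, then fail) could be expressed roughly as below. The names `resolve_model`, `profile_model`, and `ModelNotConfiguredError` are hypothetical stand-ins, not the PR's actual identifiers, and the error text is a placeholder rather than the message used in the code.

```python
from typing import Any, Mapping, Optional


class ModelNotConfiguredError(RuntimeError):
    """Stand-in for ShinkaExecutionError / CodexExecutionError."""


def resolve_model(
    extra_cli_config: Mapping[str, Any],
    profile_model: Optional[str] = None,
) -> str:
    """Return the model to use, or raise instead of silently falling back."""
    override = extra_cli_config.get("model")
    if override:
        return str(override)  # highest precedence: explicit override
    if profile_model:
        return profile_model  # next: whatever the profile supplies
    # No implicit default: fail loudly so misconfiguration is visible.
    raise ModelNotConfiguredError(
        "No model configured; set extra_cli_config['model'] or a profile."
    )
```

Raising here, rather than returning a default, is the design choice the commit calls BREAKING: existing configs without a model stop working until one is set explicitly.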

Changes:
- cost_utils.py: Log WARNING when model not in pricing table, use higher fallback rate ($10/M tokens) to make unknown models noticeable
- credentials.py: Log DEBUG showing which credential source was used (env var vs credential file vs nested structure)
- embedding.py: Consistent WARNING-level logging for both Gemini and OpenAI embedding failures; warn when model not in pricing table

These changes help users diagnose configuration issues instead of silently using wrong values.
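A minimal sketch of the "warn and use a conspicuous fallback rate" idea from the cost_utils.py bullet. The pricing table contents, the function name `estimate_cost`, and the per-direction rate layout are assumptions for illustration. Only the $10 per million tokens figure comes from the commit message.

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical pricing table (USD per token); not the real cost_utils.py data.
PRICING = {
    "known-model": {"input": 1.0 / 1_000_000, "output": 2.0 / 1_000_000},
}
FALLBACK_RATE = 10.0 / 1_000_000  # $10 per million tokens


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate cost, warning loudly when the model has no pricing entry."""
    rates = PRICING.get(model)
    if rates is None:
        logger.warning(
            "Model %r not in pricing table; using fallback rate", model
        )
        rates = {"input": FALLBACK_RATE, "output": FALLBACK_RATE}
    return input_tokens * rates["input"] + output_tokens * rates["output"]
```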

The agentic mode was running jobs sequentially because _run_full_agentic_job called self.db.sample() inside worker threads, causing race conditions (SQLite connections are not thread-safe).

Changes:
- Move db.sample() to main thread in _submit_agentic_job_async()
- Pass parent_program, archive_programs, top_k_programs to worker thread
- Worker threads only do edit + eval (no database access)
- Main loop uses while-loop to fill job queue for agentic mode
- Add ThreadPoolExecutor for parallel agentic job execution

Performance improvement:
- Before: ~1 generation per 10 minutes (sequential)
- After: ~3 programs per minute with 4 parallel jobs
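The pattern described above (sample on the main thread, hand plain data to workers, write results back on the main thread) can be sketched with the standard library. `FakeDB`, `edit_and_eval`, and `run_generation` are hypothetical names for this example. The real runner uses a SQLite-backed database and agent sessions.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


class FakeDB:
    """Stand-in for the program database; only the main thread touches it."""

    def __init__(self):
        self.programs = [{"id": 0, "score": 1.0}]

    def sample(self):
        return max(self.programs, key=lambda p: p["score"])

    def add(self, program):
        self.programs.append(program)


def edit_and_eval(parent, job_id):
    """Worker: receives plain data and never accesses the database."""
    return {"id": job_id, "score": parent["score"] + job_id}


def run_generation(db, num_jobs=4):
    with ThreadPoolExecutor(max_workers=num_jobs) as pool:
        # db.sample() runs here, on the main thread, before each submit.
        futures = [
            pool.submit(edit_and_eval, db.sample(), job_id)
            for job_id in range(1, num_jobs + 1)
        ]
        for future in as_completed(futures):
            db.add(future.result())  # writes also stay on the main thread
    return db
```

Because workers only see the dictionaries passed to them, no database connection ever crosses a thread boundary.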

Variant configuration for Circle Packing task with agentic editing:
- Uses gemini-2.5-flash (OpenAI quota issues)
- 4 parallel jobs for full parallelism testing
- UCB bandit model selection

Changes:
- Check agentic_mode (not evaluator_mode) for parallel job submission
- Add _run_legacy_evaluation_sync() for thread-safe legacy eval via subprocess
- _run_full_agentic_job now supports both legacy and agentic evaluation
- Thread pool created when agentic_mode is enabled (regardless of evaluator)

This allows: agentic editing (parallel) + legacy evaluation (deterministic)

Circle packing now runs with parallel editing and real sum-of-radii scoring.

Two bugs fixed:
1. metrics_path in agentic evaluator was relative but checked against Python's CWD instead of repo_root - converted to absolute path
2. Exception handler in runner hardcoded correct=False even when metrics.json existed with correct=True - now reads from metrics

Both fixes verified working: boids reached score 80.0 with correct=1
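Both fixes can be illustrated with two small helpers. The function names `resolve_metrics_path` and `read_correct` are hypothetical, and the metrics file is assumed to be a JSON object with a `correct` key. The PR's actual code may be structured differently.

```python
import json
from pathlib import Path


def resolve_metrics_path(metrics_path, repo_root):
    """Anchor a relative metrics path at repo_root, not the process CWD."""
    path = Path(metrics_path)
    return path if path.is_absolute() else Path(repo_root) / path


def read_correct(metrics_file, default=False):
    """Return the `correct` flag from a metrics file, or `default`."""
    try:
        data = json.loads(Path(metrics_file).read_text())
    except (OSError, ValueError):
        # Missing or unparsable file: fall back to the default.
        return default
    if not isinstance(data, dict):
        return default
    return bool(data.get("correct", default))
```

An exception handler would then call `read_correct(...)` instead of hardcoding `False`, so a run that wrote valid metrics before failing is still recorded as correct.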

- Changed shinka_agent to execute ALL bash blocks in a response, not just the first one (some models like Gemini output multiple)
- Updated system prompt to reflect this change
- Added reasoning_efforts="auto" default to avoid empty responses
- Updated evaluator prompt to be more explicit about output path
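Collecting every bash block from a response, rather than only the first, might look like the sketch below. It covers extraction only; running the commands is deliberately left out. The function name `extract_bash_blocks` and the accepted fence labels (`bash`, `sh`) are assumptions, not the behavior of shinka_agent.py.

```python
import re

FENCE = "`" * 3  # built programmatically to avoid a literal fence in this example

# Match fenced blocks labelled bash or sh and capture their contents lazily.
BASH_BLOCK = re.compile(
    FENCE + r"(?:bash|sh)\s*\n(.*?)" + FENCE,
    re.DOTALL,
)


def extract_bash_blocks(response: str) -> list[str]:
    """Return the contents of all bash blocks, in order of appearance."""
    return [block.strip() for block in BASH_BLOCK.findall(response)]
```

Using `findall` instead of `search` is the whole change in spirit: `search` stops at the first match, which is what drops later blocks from models that emit several.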

- Add max_events attribute to AgenticConfig (was missing, caused AttributeError)
- Fix agentic.py to use max_events instead of max_turns for Codex event limit
- Increase default max_events from 80 to 240 (3x) for longer sessions
- Add _to_primitive() helper to convert OmegaConf DictConfig to JSON-serializable types
- Extract session_id parsing to shared event_utils.py module
- Handle Codex CLI non-zero exit gracefully when events were processed
- Consolidate CodexAuthError into codex_cli.py (was in deleted codex_device_auth.py)

These fixes enable Codex backend to complete full evolution runs without crashes.

Remove unused build_embedding_corpus() function and supporting code:
- EmbeddingCorpus dataclass (unused)
- _is_text_bytes(), _sha256_prefix(), _matches_any() helpers (unused)
- 195 lines of dead code that was never integrated

Only extract_file_content() is actually used in the codebase.

The codex_session_registry.py module was write-only dead code:
- Created JSON files in ~/.codex/shinka_sessions/ tracking active sessions
- But nothing ever read these files back

Delete the module and remove all usages from codex_cli.py and shinka_agent.py.

Import from credentials.py instead of duplicating the mapping. Simplifies ensure_shinka_available() from 35 to 17 lines.

- Add comprehensive test coverage for agentic components:
  - test_agentic_editor.py (28 tests)
  - test_agentic_evaluator.py (13 tests)
  - test_shinka_agent.py (16 tests)
- Update configs for boids/circle_packing tasks and variants
- Update LLM models (gemini, openai, pricing, query)
- Add gitignore for boids runtime artifacts
- Remove deprecated codex_device_auth module
- Remove unused boids initial.py (refactored to modular structure)
- Fix database islands null-check for patch_name
- Update scheduler and viz_tree for robustness

Replace placeholder model 'gemini-3-flash-preview' with existing 'gemini-2.5-flash' model in boids and circle packing agentic configs.

- Add EmbeddingCorpus dataclass to represent multi-file corpora
- Implement build_embedding_corpus() for deterministic directory scanning
- Add configurable glob patterns, size limits, and binary file handling
- Refactor get_code_embedding() to support corpus mode with changed file prioritization
- Maintain backward compatibility with existing single-file embedding mode
- Add comprehensive logging for debugging corpus building

This enables the novelty detection system to consider changes across multiple related files, improving semantic understanding for the agentic multi-turn editing architecture.

zia1138 · 2025-12-30T22:58:34Z

@GeorgeWingg
Curious how well your fork worked. What challenges or advantages did using codex cli have that you noticed? Thanks!

GeorgeWingg · 2025-12-31T12:59:52Z

Hi, the reason I built this was to add the ability to take shinka's evolutionary approach and apply it to real engineering projects, where logic is in multiple files rather than a single file.

I also think it could be valuable for an AI scientist tool to be able to call tools the way modern agentic models do, which was not possible with the previous shinka system.

One downside I would look out for is that the agent can run commands and change files on your file system, and in this fork the commands it runs are fairly abstracted away, so I would be careful running this on a computer with files you care about.

RobertTLange and others added 30 commits September 25, 2025 06:38

Update README.md with arxiv

1b4c179

add google gemini embeding model

2fb7548

fix: Fix database summary when patch_name metadata is missing

27af71c

Update README.md

9586cdb

Merge pull request SakanaAI#2 from dexhunter/fix/display

396c66a

fix: Fix database summary when patch_name metadata is missing

docs: change repo name on the onboarding doc

a60bc9e

Update README

0003552

Added a doc explaining how to add suport for a local LLM and embedding model

be2e203

Add rust to supported languages

bf0c1d4

Ensure setuptools discovers subpackages

77d1819

Mark shinka.webui as a package

929f072

Merge pull request SakanaAI#18 from SakanaAI/fix-packaging

59a338c

Fix packaging so pip install ships full shinka module tree

fix apply_full.py when the patch has incomplete (0,1) markers instead of expected 2 (end and start) markers

23ace36

Merge pull request SakanaAI#21 from 51616/fix-full-patch-no-markers-bug

06209a2

fix `apply_full.py` when the patch has incomplete markers

Merge pull request SakanaAI#12 from vicruz99/feature/local-models

c9c468b

Doc explaining how to add suport for a local LLM and embedding model

Update README.md

c5b1abe

Merge branch 'main' into lia/add-support-for-rust

ccc1326

Merge pull request SakanaAI#15 from LiaCastaneda/lia/add-support-for-rust

e8ef6de

Add rust to supported languages

Merge pull request SakanaAI#7 from Koki-Kazaore/docs/change_repo_name

d2211b2

docs: change repo name on the onboarding doc

Update inspirations.py - archive

ded4576

Merge pull request SakanaAI#1 from takeruhukushima/main

7ceea8c

add google gemini embeding model

Update dependencies gemini embed

ee6e8a5

Update dbase.py path default

a759778

Fix reasoning token sampling

c097a88

Fix anthropic budget sampling

6d5e208

fix shinka_launch --help

9b4d7c7

fix wrap_eval catch

d7a3f7e

add documentation for resuming experiments

397e0fd

fix OAI dependency db for visualization

f6896dc

Merge pull request SakanaAI#28 from SakanaAI/fix_minor

94a2805

Enhance docs, robustify wrap_eval, Visualization w/o API key

GeorgeWingg and others added 27 commits December 14, 2025 15:01

fix: correct embedding corpus args for agentic files

ec6307e

feat: propagate multi-file workspace between generations

810e318

fix: hydrate workspace for legacy multi-file patches

1fda8e3

docs: update EXECPLAN with silent fallback fixes

0946ee4

chore: remove PR planning document

d80bff2

This was internal planning notes, not meant for the final PR.

chore: remove unused TerminalRenderer from boids example

36c448d

ASCII art rendering adds no value for headless evolution runs. Return None in headless mode instead.

fix: correct import order in codex_cli.py

92dbada

Move logger initialization after all imports to follow PEP 8 conventions.

RobertTLange force-pushed the main branch from 305e365 to 87f6a77 Compare March 3, 2026 10:37

GeorgeWingg wants to merge 86 commits into SakanaAI:main from GeorgeWingg:feat/multi-turn-architecture-clean

16 participants
