feat: Multi-turn agentic architecture #56

Open
GeorgeWingg wants to merge 86 commits into SakanaAI:main from GeorgeWingg:feat/multi-turn-architecture-clean

GeorgeWingg commented Dec 19, 2025

Summary

  • Adds multi-turn agentic editing and evaluation backends (Codex CLI, Shinka Agent)
  • Implements multi-file workspace support with embedding corpus for novelty detection
  • Adds bandit sampling integration for agentic mode
  • Includes new example configs for boids_flocking_agentic and circle_packing_agentic variants

Key Changes

  • Agentic backends: shinka/edit/agentic.py, codex_cli.py, shinka_agent.py - pluggable CLI harnesses that own system prompts and stream events
  • Agentic evaluator: shinka/eval/agentic.py - runs evaluation in agent sessions with metrics extraction
  • Multi-file corpus: shinka/core/embedding_corpus.py - builds embedding text from multiple workspace files
  • Runner integration: Full async job pipeline with thread-safe parallelism for agentic mode
  • Configs: New evolution/agentic.yaml base config and variant configs

Test plan

  • Unit tests for agentic editor and evaluator (tests/test_agentic_*.py)
  • Manual test with circle_packing_agentic config
  • Manual test with boids_flocking_agentic config

🤖 Generated with Claude Code

RobertTLange and others added 30 commits September 25, 2025 06:38
fix: Fix database summary when patch_name metadata is missing
Fix packaging so pip install ships full shinka module tree
fix `apply_full.py` when the patch has incomplete markers
Doc explaining how to add support for a local LLM and embedding model
docs: change repo name on the onboarding doc
add google gemini embedding model
Enhance docs, robustify wrap_eval, Visualization w/o API key
GeorgeWingg and others added 27 commits December 14, 2025 15:01
- Add bandit model selection before agentic sessions (parity with legacy)

- Track bandit-selected model for proper reward updates

- Fix Codex backend to respect extra_cli_config model override

- Fix apply_full_patch parameter names in agentic path

- Fix boids_flocking variant config (add variant_suffix, remove n_pop)
- Add agentic variant config for boids multi-file task
- Fix Hydra config override using @_global_ package syntax
- Fix boids task config to nest evo_config properly for merging
- Change default agentic model from gpt-5.2 to gpt-4.1
- Fix display.py NoneType subscript bug in patch_name

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add gpt-5.2 to OPENAI_MODELS pricing and REASONING_OAI_MODELS
- Update agentic.yaml default model to gpt-5.2
- Add EXECPLAN_PR_READY.md for PR validation tracking

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Run quality bar checks (V8) on PR-modified Python files only.
- black with default config
- isort with --profile black

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The PromptSampler was sending DIFF-format prompts to agentic sessions,
causing agents to output <DIFF> XML instead of using shell commands.

Root cause: PromptSampler had no awareness of agentic_mode.

Fix:
- AGENTIC_SYS_FORMAT is now empty (harness provides its own)
- PromptSampler._sample_agentic() puts task context in user prompt
- runner.py passes agentic_mode to PromptSampler

Also fixed:
- boids_flocking_agentic variant now correctly sets init_program_path
- display.py handles None metadata gracefully

V1.1 E2E test now passes:
- Agent explores workspace with shell commands (ls, sed, etc.)
- Files appear in gen_1/
- patch_type correctly set to "agentic"

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The redact_immutable function returned empty string when code had no
EVOLVE-BLOCK markers, causing embedding API to fail with 400 error.

Now returns full text for embedding when no markers are present.
This affects tasks like boids_flocking that don't use markers.
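
The fallback described in this commit can be sketched roughly as follows. This is a minimal illustration, not the project's `redact_immutable` itself: the function name `mutable_text_for_embedding` and the exact marker strings are assumptions made for the example.

```python
# Assumed marker strings; the real project may spell or match them differently.
START = "EVOLVE-BLOCK-START"
END = "EVOLVE-BLOCK-END"


def mutable_text_for_embedding(code: str) -> str:
    """Hypothetical helper: keep only text inside marker regions.

    If the code has no markers at all, return the full text so the
    embedding request never receives an empty string.
    """
    if START not in code or END not in code:
        return code  # fallback: embed everything
    kept = []
    inside = False
    for line in code.splitlines():
        if START in line:
            inside = True
            continue
        if END in line:
            inside = False
            continue
        if inside:
            kept.append(line)
    return "\n".join(kept)
```

With this shape, a marker-free task file is embedded in full, while a marked file contributes only its evolvable region.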

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
BREAKING: Removed silent fallback to gpt-4.1-mini in agentic backends.

Before: If no model configured, silently used gpt-4.1-mini (old model)
After: Raises clear error with instructions on how to configure

Changes:
- shinka_agent.py: Raises ShinkaExecutionError if no model
- codex_cli.py: Raises CodexExecutionError if no model
- agentic.yaml: Now explicitly sets model: "gpt-4.1" (required field)

Also fixed: Inconsistent precedence order between backends
Now both use: extra_cli_config["model"] > profile > FAIL
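
A rough sketch of that precedence order, for illustration only. `resolve_model`, `profile_model`, and `NoModelConfiguredError` are hypothetical names; the actual backends raise `ShinkaExecutionError` or `CodexExecutionError`, and how a profile supplies its model is assumed here.

```python
from typing import Any, Mapping, Optional


class NoModelConfiguredError(RuntimeError):
    """Hypothetical stand-in for the backend-specific execution errors."""


def resolve_model(
    extra_cli_config: Mapping[str, Any],
    profile_model: Optional[str] = None,
) -> str:
    """Resolve the model as: extra_cli_config["model"] > profile > fail."""
    model = extra_cli_config.get("model")
    if model:
        return str(model)
    if profile_model:  # assumption: the profile yields a model name
        return profile_model
    # No silent default: fail loudly and say which setting to change.
    raise NoModelConfiguredError(
        "No model configured. "
        "Set evo_config.agentic.extra_cli_config.model"
    )
```

Sharing one resolver between backends is one way to keep the precedence from drifting apart again.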

Error message example:
"No model configured for ShinkaAgent. Set evo_config.agentic.extra_cli_config.model..."

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Changes:
- cost_utils.py: Log WARNING when model not in pricing table, use higher
  fallback rate ($10/M tokens) to make unknown models noticeable
- credentials.py: Log DEBUG showing which credential source was used
  (env var vs credential file vs nested structure)
- embedding.py: Consistent WARNING-level logging for both Gemini and
  OpenAI embedding failures; warn when model not in pricing table

These changes help users diagnose configuration issues instead of
silently using wrong values.
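
The pricing fallback could look something like the sketch below. The table contents, the `estimate_cost` name, and the per-token rate layout are assumptions for illustration; only the $10/M fallback rate and the WARNING log level come from the commit message.

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical pricing table in USD per token; the values are placeholders.
PRICING = {
    "known-model": {"input": 1.0 / 1_000_000, "output": 2.0 / 1_000_000},
}
# Deliberately high so unknown models stand out in cost reports.
FALLBACK_RATE = 10.0 / 1_000_000  # $10 per million tokens


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate cost, warning instead of failing on unknown models."""
    rates = PRICING.get(model)
    if rates is None:
        logger.warning(
            "Model %r not in pricing table; using fallback rate", model
        )
        rates = {"input": FALLBACK_RATE, "output": FALLBACK_RATE}
    return input_tokens * rates["input"] + output_tokens * rates["output"]
```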

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The agentic mode was running jobs sequentially because _run_full_agentic_job
called self.db.sample() inside worker threads, causing race conditions
(SQLite connections are not thread-safe).

Changes:
- Move db.sample() to main thread in _submit_agentic_job_async()
- Pass parent_program, archive_programs, top_k_programs to worker thread
- Worker threads only do edit + eval (no database access)
- Main loop uses while-loop to fill job queue for agentic mode
- Add ThreadPoolExecutor for parallel agentic job execution

Performance improvement:
- Before: ~1 generation per 10 minutes (sequential)
- After: ~3 programs per minute with 4 parallel jobs
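
The threading pattern described above (sample on the main thread, pass plain data to workers, keep workers away from the database) can be sketched as follows. `FakeDatabase`, `edit_and_eval`, and `run_generation` are hypothetical stand-ins, not the runner's real classes or methods.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


class FakeDatabase:
    """Hypothetical stand-in for the SQLite-backed program database."""

    def __init__(self):
        self.programs = ["seed"]

    def sample(self):
        return self.programs[-1]

    def add(self, program):
        self.programs.append(program)


def edit_and_eval(parent: str, job_id: int) -> str:
    # Worker thread: uses only the data it was handed, never the database.
    return f"{parent}->child{job_id}"


def run_generation(db: FakeDatabase, num_jobs: int = 4):
    with ThreadPoolExecutor(max_workers=num_jobs) as pool:
        # db.sample() runs here, on the main thread, before each submit.
        futures = [
            pool.submit(edit_and_eval, db.sample(), i)
            for i in range(num_jobs)
        ]
        for fut in as_completed(futures):
            db.add(fut.result())  # writes also stay on the main thread
    return db.programs
```

Because only the main thread touches the database, no connection is shared across threads, and the jobs themselves can run in parallel.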

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Variant configuration for Circle Packing task with agentic editing:
- Uses gemini-2.5-flash (OpenAI quota issues)
- 4 parallel jobs for full parallelism testing
- UCB bandit model selection

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Changes:
- Check agentic_mode (not evaluator_mode) for parallel job submission
- Add _run_legacy_evaluation_sync() for thread-safe legacy eval via subprocess
- _run_full_agentic_job now supports both legacy and agentic evaluation
- Thread pool created when agentic_mode is enabled (regardless of evaluator)

This allows: agentic editing (parallel) + legacy evaluation (deterministic)
Circle packing now runs with parallel editing and real sum-of-radii scoring.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Two bugs fixed:
1. metrics_path in agentic evaluator was relative but checked against
   Python's CWD instead of repo_root - converted to absolute path
2. Exception handler in runner hardcoded correct=False even when
   metrics.json existed with correct=True - now reads from metrics

Both fixes verified working: boids reached score 80.0 with correct=1
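
Both fixes can be illustrated with a small sketch. The helper names `resolve_metrics_path` and `correct_from_metrics` are hypothetical, and the layout of `metrics.json` beyond a `correct` field is assumed.

```python
import json
from pathlib import Path


def resolve_metrics_path(repo_root: Path, metrics_path: str) -> Path:
    """Anchor a relative metrics path at repo_root, not the process CWD."""
    p = Path(metrics_path)
    return p if p.is_absolute() else (repo_root / p).resolve()


def correct_from_metrics(path: Path) -> bool:
    """Read `correct` from metrics.json; missing or invalid files give False."""
    try:
        data = json.loads(path.read_text())
    except (OSError, json.JSONDecodeError):
        return False
    if not isinstance(data, dict):
        return False
    return bool(data.get("correct", False))
```

An exception handler can then call `correct_from_metrics` instead of hardcoding `False`, so a run that wrote valid metrics before failing is still scored from them.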

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Changed shinka_agent to execute ALL bash blocks in a response,
  not just the first one (some models like Gemini output multiple)
- Updated system prompt to reflect this change
- Added reasoning_efforts="auto" default to avoid empty responses
- Updated evaluator prompt to be more explicit about output path
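
As a rough illustration of running every bash block rather than only the first, the sketch below extracts all fenced blocks with a regular expression and runs them in order. The function names, the accepted fence labels, and the timeout are assumptions, not the agent's actual implementation.

```python
import re
import subprocess
from typing import List

FENCE = "`" * 3  # built indirectly so this example nests inside a fence
BASH_BLOCK = re.compile(
    FENCE + r"(?:bash|sh)\n(.*?)" + FENCE, re.DOTALL
)


def extract_bash_blocks(response: str) -> List[str]:
    """Return every bash/sh fenced block in the response, in order."""
    return [body.strip() for body in BASH_BLOCK.findall(response)]


def run_all_blocks(response: str, cwd: str = ".") -> List[str]:
    """Run each extracted block in turn and collect its stdout."""
    outputs = []
    for script in extract_bash_blocks(response):
        result = subprocess.run(
            ["bash", "-c", script],
            cwd=cwd,
            capture_output=True,
            text=True,
            timeout=60,  # assumed limit for the sketch
        )
        outputs.append(result.stdout)
    return outputs
```

Since this executes model-written shell commands, it is best run inside a disposable workspace.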

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add max_events attribute to AgenticConfig (was missing, caused AttributeError)
- Fix agentic.py to use max_events instead of max_turns for Codex event limit
- Increase default max_events from 80 to 240 (3x) for longer sessions
- Add _to_primitive() helper to convert OmegaConf DictConfig to JSON-serializable types
- Extract session_id parsing to shared event_utils.py module
- Handle Codex CLI non-zero exit gracefully when events were processed
- Consolidate CodexAuthError into codex_cli.py (was in deleted codex_device_auth.py)

These fixes enable Codex backend to complete full evolution runs without crashes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove unused build_embedding_corpus() function and supporting code:
- EmbeddingCorpus dataclass (unused)
- _is_text_bytes(), _sha256_prefix(), _matches_any() helpers (unused)
- 195 lines of dead code that was never integrated

Only extract_file_content() is actually used in the codebase.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The codex_session_registry.py module was write-only dead code:
- Created JSON files in ~/.codex/shinka_sessions/ tracking active sessions
- But nothing ever read these files back

Delete the module and remove all usages from codex_cli.py and shinka_agent.py.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This was internal planning notes, not meant for the final PR.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
ASCII art rendering adds no value for headless evolution runs.
Return None in headless mode instead.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Import from credentials.py instead of duplicating the mapping.
Simplifies ensure_shinka_available() from 35 to 17 lines.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add comprehensive test coverage for agentic components:
  - test_agentic_editor.py (28 tests)
  - test_agentic_evaluator.py (13 tests)
  - test_shinka_agent.py (16 tests)
- Update configs for boids/circle_packing tasks and variants
- Update LLM models (gemini, openai, pricing, query)
- Add gitignore for boids runtime artifacts
- Remove deprecated codex_device_auth module
- Remove unused boids initial.py (refactored to modular structure)
- Fix database islands null-check for patch_name
- Update scheduler and viz_tree for robustness

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Move logger initialization after all imports to follow PEP 8 conventions.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace placeholder model 'gemini-3-flash-preview' with existing
'gemini-2.5-flash' model in boids and circle packing agentic configs.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add EmbeddingCorpus dataclass to represent multi-file corpora
- Implement build_embedding_corpus() for deterministic directory scanning
- Add configurable glob patterns, size limits, and binary file handling
- Refactor get_code_embedding() to support corpus mode with changed file prioritization
- Maintain backward compatibility with existing single-file embedding mode
- Add comprehensive logging for debugging corpus building

This enables the novelty detection system to consider changes across
multiple related files, improving semantic understanding for the agentic
multi-turn editing architecture.
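
A minimal sketch of deterministic corpus building under glob, size, and binary-file constraints. `CorpusSketch` and `build_corpus_sketch` are hypothetical names, and the defaults (`*.py`, 50 kB, NUL-byte binary check) are assumptions; the PR's `build_embedding_corpus()` may differ in all of these and also prioritizes changed files, which is omitted here.

```python
import fnmatch
from dataclasses import dataclass, field
from pathlib import Path
from typing import List, Sequence


@dataclass
class CorpusSketch:
    text: str
    included: List[str] = field(default_factory=list)
    skipped: List[str] = field(default_factory=list)


def build_corpus_sketch(
    root: Path,
    include: Sequence[str] = ("*.py",),
    max_bytes: int = 50_000,
) -> CorpusSketch:
    parts, included, skipped = [], [], []
    # Sorted traversal keeps the corpus identical across runs.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        rel = path.relative_to(root).as_posix()
        if not any(fnmatch.fnmatch(rel, pat) for pat in include):
            continue
        data = path.read_bytes()
        # Skip oversized files and files that look binary (contain NUL).
        if len(data) > max_bytes or b"\x00" in data:
            skipped.append(rel)
            continue
        body = data.decode("utf-8", errors="replace")
        parts.append(f"# FILE: {rel}\n{body}")
        included.append(rel)
    return CorpusSketch("\n\n".join(parts), included, skipped)
```

Prefixing each file with its relative path lets the embedding reflect which file a change came from.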

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

zia1138 commented Dec 30, 2025

@GeorgeWingg
Curious how well your fork worked. What challenges or advantages did using codex cli have that you noticed? Thanks!

@GeorgeWingg
Author

Hi, I built this so that Shinka's evolutionary approach can be applied to real engineering projects, where the logic is spread across multiple files rather than a single file.

I also think it could be valuable for an AI scientist tool to call tools the way modern agentic models do, which was not possible with the previous Shinka system.

One drawback to watch out for: the agent can run commands and change files on your file system, and with this fork the commands it runs are fairly abstracted away, so I would be careful running it on a computer with files you care about.
