feat: Multi-turn agentic architecture #56
fix: Fix database summary when patch_name metadata is missing
Fix packaging so pip install ships full shinka module tree
… of expected 2 (end and start) markers
fix `apply_full.py` when the patch has incomplete markers
Doc explaining how to add support for a local LLM and embedding model
…rust Add rust to supported languages
docs: change repo name on the onboarding doc
add google gemini embedding model
Enhance docs, robustify wrap_eval, Visualization w/o API key
- Add bandit model selection before agentic sessions (parity with legacy)
- Track bandit-selected model for proper reward updates
- Fix Codex backend to respect extra_cli_config model override
- Fix apply_full_patch parameter names in agentic path
- Fix boids_flocking variant config (add variant_suffix, remove n_pop)
- Add agentic variant config for boids multi-file task
- Fix Hydra config override using @_global_ package syntax
- Fix boids task config to nest evo_config properly for merging
- Change default agentic model from gpt-5.2 to gpt-4.1
- Fix display.py NoneType subscript bug in patch_name

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add gpt-5.2 to OPENAI_MODELS pricing and REASONING_OAI_MODELS
- Update agentic.yaml default model to gpt-5.2
- Add EXECPLAN_PR_READY.md for PR validation tracking
Run quality bar checks (V8) on PR-modified Python files only.
- black with default config
- isort with --profile black
The PromptSampler was sending DIFF-format prompts to agentic sessions, causing agents to output <DIFF> XML instead of using shell commands.

Root cause: PromptSampler had no awareness of agentic_mode.

Fix:
- AGENTIC_SYS_FORMAT is now empty (harness provides its own)
- PromptSampler._sample_agentic() puts task context in user prompt
- runner.py passes agentic_mode to PromptSampler

Also fixed:
- boids_flocking_agentic variant now correctly sets init_program_path
- display.py handles None metadata gracefully

V1.1 E2E test now passes:
- Agent explores workspace with shell commands (ls, sed, etc.)
- Files appear in gen_1/
- patch_type correctly set to "agentic"
The redact_immutable function returned an empty string when code had no EVOLVE-BLOCK markers, causing the embedding API to fail with a 400 error.

Now returns full text for embedding when no markers are present. This affects tasks like boids_flocking that don't use markers.
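The fallback described above can be illustrated with a minimal sketch. The marker strings `EVOLVE-BLOCK-START` and `EVOLVE-BLOCK-END` and the function name `redact_immutable_sketch` are assumptions made for illustration; the actual `redact_immutable` in the repository may parse markers differently.

```python
# Assumed marker names; the commit message only says "EVOLVE-BLOCK markers".
START_MARKER = "EVOLVE-BLOCK-START"
END_MARKER = "EVOLVE-BLOCK-END"


def redact_immutable_sketch(text: str) -> str:
    """Keep only the lines between markers, or the full text if no marker exists."""
    kept = []
    inside = False
    seen_marker = False
    for line in text.splitlines():
        if START_MARKER in line:
            inside = True
            seen_marker = True
        elif END_MARKER in line:
            inside = False
        elif inside:
            kept.append(line)
    if not seen_marker:
        # No markers at all: return everything so the embedding input is never empty.
        return text
    return "\n".join(kept)
```

With this shape, a marker-free task file passes through unchanged rather than producing an empty embedding request.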
BREAKING: Removed silent fallback to gpt-4.1-mini in agentic backends.

Before: If no model was configured, silently used gpt-4.1-mini (old model)
After: Raises a clear error with instructions on how to configure

Changes:
- shinka_agent.py: Raises ShinkaExecutionError if no model
- codex_cli.py: Raises CodexExecutionError if no model
- agentic.yaml: Now explicitly sets model: "gpt-4.1" (required field)

Also fixed: Inconsistent precedence order between backends
Now both use: extra_cli_config["model"] > profile > FAIL

Error message example:
"No model configured for ShinkaAgent. Set evo_config.agentic.extra_cli_config.model..."
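The precedence order above (explicit override, then profile, then fail) could be sketched roughly as follows. The function name `resolve_model`, the `profile_model` parameter, and the `ModelNotConfiguredError` class are hypothetical stand-ins; the real backends raise `ShinkaExecutionError` or `CodexExecutionError` and may resolve profiles differently.

```python
from typing import Any, Mapping, Optional


class ModelNotConfiguredError(RuntimeError):
    """Hypothetical stand-in for the backend-specific execution errors."""


def resolve_model(
    extra_cli_config: Mapping[str, Any],
    profile_model: Optional[str],
) -> str:
    """Return the model to use, failing loudly instead of falling back silently."""
    override = extra_cli_config.get("model")
    if override:
        return str(override)  # 1. explicit override wins
    if profile_model:
        return profile_model  # 2. model taken from the profile (assumed shape)
    # 3. no silent default
    raise ModelNotConfiguredError(
        "No model configured. Set evo_config.agentic.extra_cli_config.model"
    )
```

Failing at resolution time keeps a misconfigured run from quietly spending budget on an unintended model.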
Changes:
- cost_utils.py: Log WARNING when model not in pricing table, use higher fallback rate ($10/M tokens) to make unknown models noticeable
- credentials.py: Log DEBUG showing which credential source was used (env var vs credential file vs nested structure)
- embedding.py: Consistent WARNING-level logging for both Gemini and OpenAI embedding failures; warn when model not in pricing table

These changes help users diagnose configuration issues instead of silently using wrong values.
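A minimal sketch of the warn-and-fallback pricing behaviour, assuming a flat per-million-token rate. The `PRICING_PER_MILLION` table, its single entry, and the `estimate_cost` name are invented for illustration and do not reflect the contents of cost_utils.py; only the $10/M fallback rate comes from the commit message.

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical pricing table in USD per million tokens.
PRICING_PER_MILLION = {"example-model": 2.0}

# Deliberately high fallback rate so unknown models stand out in cost reports.
FALLBACK_RATE_PER_MILLION = 10.0


def estimate_cost(model: str, tokens: int) -> float:
    """Estimate cost in USD, warning when the model has no pricing entry."""
    rate = PRICING_PER_MILLION.get(model)
    if rate is None:
        logger.warning(
            "Model %r not in pricing table; using fallback rate $%.2f/M tokens",
            model,
            FALLBACK_RATE_PER_MILLION,
        )
        rate = FALLBACK_RATE_PER_MILLION
    return tokens / 1_000_000 * rate
```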
The agentic mode was running jobs sequentially because _run_full_agentic_job called self.db.sample() inside worker threads, causing race conditions (SQLite connections are not thread-safe).

Changes:
- Move db.sample() to main thread in _submit_agentic_job_async()
- Pass parent_program, archive_programs, top_k_programs to worker thread
- Worker threads only do edit + eval (no database access)
- Main loop uses while-loop to fill job queue for agentic mode
- Add ThreadPoolExecutor for parallel agentic job execution

Performance improvement:
- Before: ~1 generation per 10 minutes (sequential)
- After: ~3 programs per minute with 4 parallel jobs
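The division of labour described above (database sampling on the main thread, edit and evaluation in workers) can be sketched with the standard library. The names `run_parallel_jobs`, `sample_parent`, and `edit_and_eval` are hypothetical; this is not the runner's actual code.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, TypeVar

P = TypeVar("P")
R = TypeVar("R")


def run_parallel_jobs(
    sample_parent: Callable[[], P],
    edit_and_eval: Callable[[P], R],
    num_jobs: int,
    max_workers: int = 4,
) -> List[R]:
    """Sample on the calling thread, then hand plain data to worker threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = []
        for _ in range(num_jobs):
            # Database access stays on the main thread only.
            parent = sample_parent()
            # Workers receive the sampled data and never touch the database.
            futures.append(pool.submit(edit_and_eval, parent))
        # Collect results in submission order.
        return [future.result() for future in futures]
```

Passing already-sampled programs into the worker avoids sharing a SQLite connection across threads while still letting the slow edit and evaluation steps overlap.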
Variant configuration for Circle Packing task with agentic editing:
- Uses gemini-2.5-flash (OpenAI quota issues)
- 4 parallel jobs for full parallelism testing
- UCB bandit model selection
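For readers unfamiliar with UCB bandit selection, the following is a generic UCB1 sketch, not the project's bandit implementation. The class name `UCB1ModelSelector`, the exploration constant, and the reward scale are assumptions.

```python
import math
from typing import Dict, List


class UCB1ModelSelector:
    """Generic UCB1: pick the model with the best mean reward plus exploration bonus."""

    def __init__(self, models: List[str], exploration: float = math.sqrt(2)):
        self.models = list(models)
        self.exploration = exploration
        self.counts: Dict[str, int] = {m: 0 for m in self.models}
        self.means: Dict[str, float] = {m: 0.0 for m in self.models}

    def select(self) -> str:
        # Try every model once before applying the UCB formula.
        for model in self.models:
            if self.counts[model] == 0:
                return model
        total = sum(self.counts.values())

        def score(model: str) -> float:
            bonus = self.exploration * math.sqrt(
                math.log(total) / self.counts[model]
            )
            return self.means[model] + bonus

        return max(self.models, key=score)

    def update(self, model: str, reward: float) -> None:
        # Incremental mean update for the model that was actually used.
        self.counts[model] += 1
        self.means[model] += (reward - self.means[model]) / self.counts[model]
```

The reward update must be applied to the same model that `select()` returned, which is why tracking the bandit-selected model per job matters.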
Changes:
- Check agentic_mode (not evaluator_mode) for parallel job submission
- Add _run_legacy_evaluation_sync() for thread-safe legacy eval via subprocess
- _run_full_agentic_job now supports both legacy and agentic evaluation
- Thread pool created when agentic_mode is enabled (regardless of evaluator)

This allows: agentic editing (parallel) + legacy evaluation (deterministic)

Circle packing now runs with parallel editing and real sum-of-radii scoring.
Two bugs fixed:
1. metrics_path in agentic evaluator was relative but checked against Python's CWD instead of repo_root - converted to absolute path
2. Exception handler in runner hardcoded correct=False even when metrics.json existed with correct=True - now reads from metrics

Both fixes verified working: boids reached score 80.0 with correct=1
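Both fixes can be pictured with a small sketch, assuming metrics.json is a JSON object with a `correct` field. The helper names `resolve_metrics_path` and `read_correct_flag` are hypothetical and not taken from the evaluator or runner.

```python
import json
from pathlib import Path
from typing import Union


def resolve_metrics_path(metrics_path: Union[str, Path], repo_root: Union[str, Path]) -> Path:
    """Anchor a relative metrics path at repo_root instead of the process CWD."""
    path = Path(metrics_path)
    if path.is_absolute():
        return path
    return (Path(repo_root) / path).resolve()


def read_correct_flag(metrics_file: Path) -> bool:
    """Read `correct` from the metrics file; treat a missing or invalid file as False."""
    try:
        data = json.loads(metrics_file.read_text())
    except (OSError, ValueError):
        return False
    if not isinstance(data, dict):
        return False
    return bool(data.get("correct", False))
```

An error handler that calls something like `read_correct_flag` before defaulting to False keeps a successful evaluation from being recorded as a failure.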
- Changed shinka_agent to execute ALL bash blocks in a response, not just the first one (some models like Gemini output multiple)
- Updated system prompt to reflect this change
- Added reasoning_efforts="auto" default to avoid empty responses
- Updated evaluator prompt to be more explicit about output path
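Extracting every bash block rather than only the first one could look like the sketch below. The regular expression, the accepted language tags, and the function name `extract_bash_blocks` are assumptions; shinka_agent.py may parse responses differently.

```python
import re
from typing import List

# Built programmatically so this sketch contains no literal code fence.
FENCE = "`" * 3

# Match fenced blocks tagged bash or sh; DOTALL lets the body span lines.
BASH_BLOCK = re.compile(rf"{FENCE}(?:bash|sh)\n(.*?){FENCE}", re.DOTALL)


def extract_bash_blocks(response: str) -> List[str]:
    """Return the body of every fenced bash block, in order of appearance."""
    return [body.strip() for body in BASH_BLOCK.findall(response)]
```

Using `findall` instead of a single `search` is the whole change in spirit: each block is returned in order so the caller can run them one after another.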
- Add max_events attribute to AgenticConfig (was missing, caused AttributeError)
- Fix agentic.py to use max_events instead of max_turns for Codex event limit
- Increase default max_events from 80 to 240 (3x) for longer sessions
- Add _to_primitive() helper to convert OmegaConf DictConfig to JSON-serializable types
- Extract session_id parsing to shared event_utils.py module
- Handle Codex CLI non-zero exit gracefully when events were processed
- Consolidate CodexAuthError into codex_cli.py (was in deleted codex_device_auth.py)

These fixes enable Codex backend to complete full evolution runs without crashes.
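The idea behind a helper like _to_primitive() can be sketched without OmegaConf by treating config objects as generic mappings and sequences. This is an illustrative approximation named `to_primitive_sketch`, not the helper from the PR, and it assumes the config containers behave like standard mappings and sequences.

```python
import json
from collections.abc import Mapping, Sequence
from typing import Any


def to_primitive_sketch(value: Any) -> Any:
    """Recursively convert mapping/sequence containers to plain dicts and lists."""
    if isinstance(value, Mapping):
        return {str(k): to_primitive_sketch(v) for k, v in value.items()}
    if isinstance(value, Sequence) and not isinstance(value, (str, bytes)):
        return [to_primitive_sketch(v) for v in value]
    if value is None or isinstance(value, (str, int, float, bool)):
        return value
    # Anything else (paths, enums, ...) is stringified so json.dumps cannot fail.
    return str(value)
```

The result can be passed straight to `json.dumps`, which rejects most custom container types.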
Remove unused build_embedding_corpus() function and supporting code:
- EmbeddingCorpus dataclass (unused)
- _is_text_bytes(), _sha256_prefix(), _matches_any() helpers (unused)
- 195 lines of dead code that was never integrated

Only extract_file_content() is actually used in the codebase.
The codex_session_registry.py module was write-only dead code:
- Created JSON files in ~/.codex/shinka_sessions/ tracking active sessions
- But nothing ever read these files back

Delete the module and remove all usages from codex_cli.py and shinka_agent.py.
This was internal planning notes, not meant for the final PR.
ASCII art rendering adds no value for headless evolution runs. Return None in headless mode instead.
Import from credentials.py instead of duplicating the mapping. Simplifies ensure_shinka_available() from 35 to 17 lines.
- Add comprehensive test coverage for agentic components:
  - test_agentic_editor.py (28 tests)
  - test_agentic_evaluator.py (13 tests)
  - test_shinka_agent.py (16 tests)
- Update configs for boids/circle_packing tasks and variants
- Update LLM models (gemini, openai, pricing, query)
- Add gitignore for boids runtime artifacts
- Remove deprecated codex_device_auth module
- Remove unused boids initial.py (refactored to modular structure)
- Fix database islands null-check for patch_name
- Update scheduler and viz_tree for robustness
Move logger initialization after all imports to follow PEP 8 conventions.
Replace placeholder model 'gemini-3-flash-preview' with existing 'gemini-2.5-flash' model in boids and circle packing agentic configs.
- Add EmbeddingCorpus dataclass to represent multi-file corpora
- Implement build_embedding_corpus() for deterministic directory scanning
- Add configurable glob patterns, size limits, and binary file handling
- Refactor get_code_embedding() to support corpus mode with changed file prioritization
- Maintain backward compatibility with existing single-file embedding mode
- Add comprehensive logging for debugging corpus building

This enables the novelty detection system to consider changes across multiple related files, improving semantic understanding for the agentic multi-turn editing architecture.
@GeorgeWingg
Hi, the reason I built this was to take shinka's evolutionary approach and apply it to real engineering projects, where logic is spread across multiple files rather than a single file. I also think it might be valuable for an AI scientist tool to be able to call tools the way modern agentic models do, which was not possible with the previous shinka system. One con to look out for is that the agent can run commands and change files in your file system, and with this fork the commands it runs are fairly abstracted away, so I would be careful running this on a computer with files you care about.
Summary
Key Changes
- shinka/edit/agentic.py, codex_cli.py, shinka_agent.py - pluggable CLI harnesses that own system prompts and stream events
- shinka/eval/agentic.py - runs evaluation in agent sessions with metrics extraction
- shinka/core/embedding_corpus.py - builds embedding text from multiple workspace files
- evolution/agentic.yaml base config and variant configs
Test plan
tests/test_agentic_*.py)