Evolutionary search over SAT solver restart heuristics using LLMs. The pipeline generates candidate heuristic function implementations via Gemini, evaluates them on SAT Competition 2025 benchmarks via SLURM, promotes the best, and evolves new candidates through LLM-guided genetic crossover.
```shell
PYTHONPATH=src N_LEADERS=10 M_VARIANTS=10 ./run_loop_a.sh cc experiment 3 --init
# Outputs: experiment_iter0, experiment_iter1, experiment_iter2, experiment_iter3,
# which get fed into genetic evolution next
PYTHONPATH=src TOP_K=5 PAR2_KEEP_TOP_N=10 ./run_bridge.sh cc experiment_iter3 experiment_gen1
# Now our best algorithms (ideally) are in experiment_gen1
# Go back to the iterative member loop:
PYTHONPATH=src M_VARIANTS=5 ./run_loop_a.sh cc experiment_gen1 3
# Outputs: experiment_gen1_iter1, experiment_gen1_iter2, experiment_gen1_iter3
# More genetic evolution:
PYTHONPATH=src ./run_bridge.sh cc experiment_gen1_iter3 experiment_gen2
```
```shell
git clone https://github.com/ShiminZhang/LLM-SAT.git
cd LLM-SAT
pip install -r requirements.txt
pip install -e .
```

Copy the template and fill in your cluster-specific paths:
```shell
cp path_config.template.yaml path_config.yaml
```

Edit path_config.yaml:

```yaml
base_solver: "/home/you/scratch/LLM-SAT/solvers/base"
python_activate: "~/your-venv/bin/activate"
```

This file is gitignored. All scripts and Python modules read paths from here, so you only set them once.
Create a .env file in the project root:

```shell
DB_PASS="<postgres password>"
OPENAI_API_KEY="<openai key>"
GOOGLE_API_KEY="<google/gemini key>"
```
DB_PASS is for the shared PostgreSQL database that stores algorithms, code, and scores. OPENAI_API_KEY is used by the genetic evolution pipeline. GOOGLE_API_KEY is used by data generation (Gemini batch API).
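For illustration, a minimal sketch of how such `KEY="value"` lines can be loaded using only the Python standard library (`load_env` is a hypothetical helper; the pipeline's actual loading mechanism may differ, e.g. python-dotenv):

```python
import os

def load_env(path: str = ".env") -> None:
    """Parse simple KEY="value" lines from a .env file into os.environ."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blanks, comments, and lines without an assignment
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ[key.strip()] = value.strip().strip('"')
```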
The base solver is AE_kissatMAB. Place the AE_kissat2025_MAB.tar.xz tarball in the repo root and run:
```shell
bash setup_aemab.sh
```

This extracts the solver to your configured base_solver path, copies the function registry, and runs ./configure.
To compare results across different clusters, run the baseline solver and record its PAR2:
```shell
# Submit baseline evaluation jobs
python scripts/evaluate_baseline.py [--quick-eval]
# After SLURM jobs complete, collect results
python scripts/evaluate_baseline.py --collect [--quick-eval]
```

The script will print the baseline PAR2. Add it to your path_config.yaml:

```yaml
baseline_par2: 1234.56  # Replace with your actual value
```

This enables normalized PAR2 scores (1.0 = same as baseline, lower is better).
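Assuming normalization is the simple ratio implied by "1.0 = same as baseline", a hypothetical sketch:

```python
def normalized_par2(raw_par2: float, baseline_par2: float) -> float:
    """Normalize a raw PAR2 score against the cluster baseline (1.0 = baseline)."""
    return raw_par2 / baseline_par2

# A solver exactly at the baseline scores 1.0; anything below 1.0 beats it.
```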
Place track_main_2025.uri in the repo root (the SAT Competition 2025 URI list), then:
```shell
bash scripts/download_satcomp2025.sh
```

This downloads and extracts ~400 CNF files to data/benchmarks/satcomp2025/.
The file solvers/base/function_registry.yaml tells the evaluation pipeline which C function to replace and where it lives in the source. It currently targets kissat_restarting and restart_mab in src/restart.c:

```yaml
functions:
  kissat_restarting:
    file: "src/restart.c"
    start_line: 15
    end_line: 38
    signature: "bool kissat_restarting(kissat *solver)"
```

To target a different function, use configure_target.py:
```shell
# Switch to a different function (auto-detects source file)
python scripts/configure_target.py kissat_restarting
# Specify the source file explicitly
python scripts/configure_target.py restart_mab --file src/restart.c
# Use a different solver path
python scripts/configure_target.py my_func --solver /path/to/solver
```

This updates the function registry and rewrites the prompt templates (leader_prompt_testing.txt and coder_prompt_testing.txt) with the correct target name, signature, and embedded source file. Without this script you'd need to manually edit both prompts and the registry.
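Conceptually, the registry's `start_line`/`end_line` range is where a candidate implementation gets spliced into the solver source. A hypothetical sketch of that splice (the real logic lives in the pipeline code; `splice_function` is illustrative):

```python
def splice_function(source: str, start_line: int, end_line: int, new_body: str) -> str:
    """Replace lines start_line..end_line (1-indexed, inclusive) with new_body."""
    lines = source.splitlines()
    spliced = lines[:start_line - 1] + new_body.splitlines() + lines[end_line:]
    return "\n".join(spliced)
```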
Update the SLURM settings in src/llmsat/pipelines/evaluation.py for your cluster:
The pipeline has two main phases that alternate:
- Loop A (`run_loop_a.sh`) — Leader refinement: generates mutant variants of leaders, evaluates them, and promotes the best
- Bridge (`run_bridge.sh`) — Genetic evolution: combines top leaders via LLM-guided crossover to produce new offspring
A complete evolution cycle is just 3 commands:
```shell
./run_loop_a.sh cc my_tag 3 --init      # Initialize + 3 refinement iterations
./run_bridge.sh cc my_tag_iter3 my_gen1 # Genetic crossover
./run_loop_a.sh cc my_gen1 3            # Continue refining the offspring
```

Runs N iterations of: generate mutant variants → SLURM evaluation → collect results → promote best member to leader.
Usage:
```shell
./run_loop_a.sh <cc|nersc> <base_tag> <n_iterations> [source_tag] [--init]
```

Arguments:

| Argument | Description |
|---|---|
| `cc\|nersc` | Cluster: `cc` uses evaluation.py (Compute Canada), `nersc` uses evaluation_nersc.py (Perlmutter, packs 128 solver runs per node) |
| `base_tag` | Base name for iteration tags (`{base_tag}_iter1`, `_iter2`, ...) |
| `n_iterations` | Number of mutate→evaluate→promote cycles |
| `source_tag` | (Optional) Tag to load initial leaders from. Defaults to `{base_tag}` |
| `--init` | (Optional) Generate initial leaders + members + code before starting iterations |
Environment variables:
| Variable | Default | Description |
|---|---|---|
| `M_VARIANTS` | 3 | Number of mutant variants per leader |
| `MODEL` | `gemini-3-flash-preview` | LLM model for generation |
| `POLL_INTERVAL` | 120 | Seconds between SLURM job status checks |
| `N_LEADERS` | 5 | (Init mode only) Number of leaders to generate |
| `DESIGNER_PROMPT` | `data/prompts/leader_prompt_testing.txt` | (Init mode only) Path to leader prompt |
Examples:
```shell
# Initialize a new population and run 3 refinement iterations
./run_loop_a.sh cc gemini_trial5 3 --init
# Continue refining from existing leaders (no init)
./run_loop_a.sh cc gemini_trial5 2 gemini_trial5_iter3
# Use custom settings
N_LEADERS=10 M_VARIANTS=5 ./run_loop_a.sh cc gemini_trial5 3 --init
```

What happens with --init:
- Generates `N_LEADERS` leader algorithms (natural language)
- Generates `M_VARIANTS` member variants per leader
- Translates all algorithms to C code
- Builds solvers and submits SLURM evaluation jobs
- Polls SLURM until all jobs complete
- Collects PAR2 scores
- Promotes the best member to leader in each team
- Proceeds to mutation iterations 1..N
What happens in each iteration:
- Generates `M_VARIANTS` mutant variants for each leader
- Builds and submits SLURM evaluation (skips already-evaluated leaders)
- Polls until jobs complete
- Collects PAR2 scores
- Promotes the best member to leader
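The promotion step boils down to a min-by-PAR2 selection over a team's members; a hypothetical sketch (tie-breaking and the database update in the real pipeline are omitted):

```python
def promote_best(member_scores: dict) -> str:
    """Return the member ID with the lowest PAR2 score (lower is better)."""
    return min(member_scores, key=member_scores.get)

scores = {"member_1": 2310.4, "member_2": 1987.2, "member_3": 2544.9}
# promote_best(scores) picks "member_2", the lowest PAR2
```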
Promotes top offspring from a refined population, runs LLM-guided genetic crossover to combine leaders, evaluates the offspring, and collects results.
Usage:
```shell
./run_bridge.sh <cc|nersc> <input_tag> <output_tag>
```

Arguments:

| Argument | Description |
|---|---|
| `cc\|nersc` | Cluster: `cc` for Compute Canada, `nersc` passes `--nersc` to genetic_evolution.py to use the NERSC evaluation backend |
| `input_tag` | Tag to read evaluated leaders from (scans `_iter1`, `_iter2`, ... automatically) |
| `output_tag` | Tag for the offspring population |
Environment variables:
| Variable | Default | Description |
|---|---|---|
| `TOP_K` | 5 | LLM combination proposals per minibatch |
| `MINIBATCH_SIZE` | 10 | Leaders per LLM proposal call |
| `RUBRIC_MIN` | 6.0 | Minimum proposal score to proceed |
| `RUBRIC_KEEP_TOP_N` | 10 | Keep top-N proposals after score filter |
| `SHUFFLE_PASSES` | 1 | Number of shuffled minibatch passes |
| `MODEL` | `gemini-3-flash-preview` | LLM model |
| `PAR2_KEEP_TOP_N` | 7 | Keep top-N offspring by PAR2 score |
| `POLL_INTERVAL` | 120 | Seconds between SLURM job status checks |
Example:
```shell
# Run genetic evolution on refined leaders from iter3
TOP_K=3 MINIBATCH_SIZE=5 ./run_bridge.sh cc my_tag_iter3 my_gen1
```

What happens:
- Scans all `{input_tag}_iter*` directories for evaluated leaders
- Selects top leaders by PAR2 score
- LLM analyzes each leader (strengths, weaknesses, key mechanisms)
- LLM proposes leader pairs to combine
- LLM generates offspring algorithms via crossover
- Translates offspring to C code
- Builds and submits SLURM evaluation
- Polls until all jobs complete
- Collects PAR2 scores and keeps top offspring
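The rubric filter implied by `RUBRIC_MIN` and `RUBRIC_KEEP_TOP_N` can be sketched as follows (data shapes and names are illustrative; the actual implementation is in genetic_evolution.py):

```python
def filter_proposals(proposals, rubric_min=6.0, keep_top_n=10):
    """Drop proposals scored below rubric_min, then keep the top-N by score."""
    kept = [p for p in proposals if p["score"] >= rubric_min]
    kept.sort(key=lambda p: p["score"], reverse=True)
    return kept[:keep_top_n]
```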
Here's a complete end-to-end run evolving SAT solver heuristics:
```shell
# 1. Initialize population with 5 leaders, 3 variants each, run 3 refinement iterations
N_LEADERS=5 M_VARIANTS=3 ./run_loop_a.sh cc experiment1 3 --init
# Output: experiment1_iter0 (initial), experiment1_iter1, _iter2, _iter3 (refined)
# Best leaders are in experiment1_iter3

# 2. Run genetic evolution to combine top leaders into new offspring
TOP_K=5 PAR2_KEEP_TOP_N=7 ./run_bridge.sh cc experiment1_iter3 experiment1_gen1
# Output: experiment1_gen1 (offspring from crossover)

# 3. Continue refining the new generation
M_VARIANTS=3 ./run_loop_a.sh cc experiment1_gen1 3
# Output: experiment1_gen1_iter1, _iter2, _iter3

# 4. Run another round of genetic evolution
./run_bridge.sh cc experiment1_gen1_iter3 experiment1_gen2
# ... and so on
```

Results location:
- Solver binaries and code: `solvers/<TAG>/{leaders,members}/algorithm_<ID>/code_<ID>/`
- Per-instance times: `results/solving_times_<code_id>.json`
- PAR2 breakdown: `results/par2_breakdown_<code_id>.json`
- Solver statistics: `results/solver_stats_<code_id>.json`
PAR2 scoring:
The PAR2 score is the average solving time across all benchmark instances. Unsolved instances (timeout, crash, OOM) receive a penalty of 2× the timeout:
- Quick eval (`--quick-eval`): 50 CNFs, 600s timeout → 1200s penalty
- Full eval: 400 CNFs, 5000s timeout → 10000s penalty
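As a sketch, the scoring above amounts to the following (assuming each run record carries a solved flag and a time; the field names are hypothetical, not the pipeline's actual JSON schema):

```python
def par2(runs, timeout):
    """Average solve time, charging 2x the timeout for each unsolved instance."""
    return sum(r["time"] if r["solved"] else 2 * timeout for r in runs) / len(runs)

# Quick-eval example: 600 s timeout, one unsolved instance out of three
runs = [{"solved": True, "time": 100.0},
        {"solved": True, "time": 200.0},
        {"solved": False, "time": 600.0}]
# par2(runs, 600) -> (100 + 200 + 1200) / 3 = 500.0
```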
Prompt files live in data/prompts/:
| File | Purpose |
|---|---|
| `leader_prompt_testing.txt` | Designer prompt for generating leader algorithms |
| `variant_prompt.txt` | Template for generating member variants (uses `{leader_algorithm}` and `{target_step_num}` placeholders) |
| `coder_prompt_testing.txt` | Template for translating algorithms to C code (uses `ALGORITHM_PLACEHOLDER`) |
To run evaluation and validation on existing best solvers from iterations:
```shell
PYTHONPATH=src ./run_loop_eval_success.sh <tag> <cluster> --no-clean
```
To run validation only:
```shell
python scripts/verify_iteration_proofs.py \
    --submit-slurm \
    --generation_tag "$generation_tag" \
    --benchmark_path data/benchmarks/satcomp2025 \
    --drat_trim "$DRAT_TRIM_CMD" \
    --check_timeout "$PROOF_CHECK_TIMEOUT" \
    --slurm-mem "$PROOF_VERIFY_MEM" \
    --slurm-time "$PROOF_VERIFY_TIME" \
    --slurm-max-concurrent "$PROOF_VERIFY_MAX_CONCURRENT"
```
run_loop_a.sh also calls validation automatically after each iteration (only tested in the Rescale environment; further testing is needed).
After all validation jobs finish, there will be multiple proof_verification_xxx.json files in the outputs/<tag>/ directory. All invalid (solver, formula) runs for an iteration are recorded in proof_verification_invalid.json, which should be empty for valid solvers.
Complete results, including verification-timeout and verification memory-kill warnings, are in proof_verification.json.
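A quick hypothetical sanity check that an iteration's invalid-run list is empty, assuming proof_verification_invalid.json holds a JSON list of (solver, formula) records:

```python
import json

def all_proofs_valid(path):
    """True iff the invalid-run list stored at path is empty."""
    with open(path) as f:
        return len(json.load(f)) == 0
```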