Evolutionary search over SAT solver restart heuristics using LLMs. The pipeline generates candidate heuristic function implementations via Gemini, evaluates them on SAT Competition 2025 benchmarks via SLURM, promotes the best, and evolves new candidates through LLM-guided genetic crossover.
```shell
PYTHONPATH=src N_LEADERS=10 M_VARIANTS=10 ./run_loop_a.sh cc experiment 3 --init
# Outputs: experiment_iter0, experiment_iter1, experiment_iter2, experiment_iter3,
# which get fed into genetic evolution next
PYTHONPATH=src TOP_K=5 PAR2_KEEP_TOP_N=10 ./run_bridge.sh cc experiment_iter3 experiment_gen1
# Now our best algorithms (ideally) are in experiment_gen1
# Go back to the iterative member loop:
PYTHONPATH=src M_VARIANTS=5 ./run_loop_a.sh cc experiment_gen1 3
# Outputs: experiment_gen1_iter1, experiment_gen1_iter2, experiment_gen1_iter3
# More genetic evolution:
PYTHONPATH=src ./run_bridge.sh cc experiment_gen1_iter3 experiment_gen2
```
```shell
git clone https://github.com/ShiminZhang/LLM-SAT.git
cd LLM-SAT
pip install -r requirements.txt
pip install -e .
```

Copy the template and fill in your cluster-specific paths:
```shell
cp path_config.template.yaml path_config.yaml
```

Edit path_config.yaml:

```yaml
base_solver: "/home/you/scratch/LLM-SAT/solvers/base"
python_activate: "~/your-venv/bin/activate"
```

This file is gitignored. All scripts and Python modules read paths from here, so you only set them once.
Create a .env file in the project root:

```shell
DB_PASS="<postgres password>"
OPENAI_API_KEY="<openai key>"
GOOGLE_API_KEY="<google/gemini key>"
```
DB_PASS is for the shared PostgreSQL database that stores algorithms, code, and scores. OPENAI_API_KEY is used by the genetic evolution pipeline. GOOGLE_API_KEY is used by data generation (Gemini batch API).
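For illustration, a minimal sketch of how such `KEY="value"` lines can be loaded using only the Python standard library (`load_env` is a hypothetical helper; the pipeline's actual loading mechanism may differ, e.g. python-dotenv):

```python
import os

def load_env(path: str = ".env") -> None:
    """Parse simple KEY="value" lines from a .env file into os.environ."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blanks, comments, and lines without an assignment
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ[key.strip()] = value.strip().strip('"')
```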
The base solver is AE_kissatMAB. Place the AE_kissat2025_MAB.tar.xz tarball in the repo root and run:
```shell
bash setup_aemab.sh
```

This extracts the solver to your configured base_solver path, copies the function registry, and runs ./configure.
To compare results across different clusters, run the baseline solver and record its PAR2:
```shell
# Submit baseline evaluation jobs
python scripts/evaluate_baseline.py [--quick-eval]
# After SLURM jobs complete, collect results
python scripts/evaluate_baseline.py --collect [--quick-eval]
```

The script will print the baseline PAR2. Add it to your path_config.yaml:

```yaml
baseline_par2: 1234.56  # Replace with your actual value
```

This enables normalized PAR2 scores (1.0 = same as baseline, lower is better).
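Assuming normalization is the simple ratio implied by "1.0 = same as baseline", a hypothetical sketch:

```python
def normalized_par2(raw_par2: float, baseline_par2: float) -> float:
    """Normalize a raw PAR2 score against the cluster baseline (1.0 = baseline)."""
    return raw_par2 / baseline_par2

# A solver exactly at the baseline scores 1.0; anything below 1.0 beats it.
```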
Place track_main_2025.uri in the repo root (the SAT Competition 2025 URI list), then:
```shell
bash scripts/download_satcomp2025.sh
```

This downloads and extracts ~400 CNF files to data/benchmarks/satcomp2025/.
The file solvers/base/function_registry.yaml tells the evaluation pipeline which C function to replace and where it lives in the source. It currently targets kissat_restarting and restart_mab in src/restart.c:

```yaml
functions:
  kissat_restarting:
    file: "src/restart.c"
    start_line: 15
    end_line: 38
    signature: "bool kissat_restarting(kissat *solver)"
```

To target a different function, use configure_target.py:
```shell
# Switch to a different function (auto-detects source file)
python scripts/configure_target.py kissat_restarting
# Specify the source file explicitly
python scripts/configure_target.py restart_mab --file src/restart.c
# Use a different solver path
python scripts/configure_target.py my_func --solver /path/to/solver
```

This updates the function registry and rewrites the prompt templates (leader_prompt_testing.txt and coder_prompt_testing.txt) with the correct target name, signature, and embedded source file. Without this script you'd need to manually edit both prompts and the registry.
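Conceptually, the registry's `start_line`/`end_line` range is where a candidate implementation gets spliced into the solver source. A hypothetical sketch of that splice (the real logic lives in the pipeline code; `splice_function` is illustrative):

```python
def splice_function(source: str, start_line: int, end_line: int, new_body: str) -> str:
    """Replace lines start_line..end_line (1-indexed, inclusive) with new_body."""
    lines = source.splitlines()
    spliced = lines[:start_line - 1] + new_body.splitlines() + lines[end_line:]
    return "\n".join(spliced)
```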
Update the SLURM settings in src/llmsat/pipelines/evaluation.py for your cluster:
The pipeline has two main phases that alternate:
- Loop A (`run_loop_a.sh`) — Leader refinement: generates mutant variants of leaders, evaluates them, and promotes the best
- Bridge (`run_bridge.sh`) — Genetic evolution: combines top leaders via LLM-guided crossover to produce new offspring
A complete evolution cycle is just 3 commands:
```shell
./run_loop_a.sh cc my_tag 3 --init      # Initialize + 3 refinement iterations
./run_bridge.sh cc my_tag_iter3 my_gen1 # Genetic crossover
./run_loop_a.sh cc my_gen1 3            # Continue refining the offspring
```

Runs N iterations of: generate mutant variants → SLURM evaluation → collect results → promote best member to leader.
Usage:
```shell
./run_loop_a.sh <cc|nersc> <base_tag> <n_iterations> [source_tag] [--init]
```

Arguments:

| Argument | Description |
|---|---|
| `cc\|nersc` | Cluster: `cc` uses evaluation.py (Compute Canada), `nersc` uses evaluation_nersc.py (Perlmutter, packs 128 solver runs per node) |
| `base_tag` | Base name for iteration tags (`{base_tag}_iter1`, `_iter2`, ...) |
| `n_iterations` | Number of mutate→evaluate→promote cycles |
| `source_tag` | (Optional) Tag to load initial leaders from. Defaults to `{base_tag}` |
| `--init` | (Optional) Generate initial leaders + members + code before starting iterations |
Environment variables:
| Variable | Default | Description |
|---|---|---|
| `M_VARIANTS` | 3 | Number of mutant variants per leader |
| `MODEL` | `gemini-3-flash-preview` | LLM model for generation |
| `POLL_INTERVAL` | 120 | Seconds between SLURM job status checks |
| `N_LEADERS` | 5 | (Init mode only) Number of leaders to generate |
| `DESIGNER_PROMPT` | `data/prompts/leader_prompt_testing.txt` | (Init mode only) Path to leader prompt |
Examples:
```shell
# Initialize a new population and run 3 refinement iterations
./run_loop_a.sh cc gemini_trial5 3 --init
# Continue refining from existing leaders (no init)
./run_loop_a.sh cc gemini_trial5 2 gemini_trial5_iter3
# Use custom settings
N_LEADERS=10 M_VARIANTS=5 ./run_loop_a.sh cc gemini_trial5 3 --init
```

What happens with --init:
- Generates `N_LEADERS` leader algorithms (natural language)
- Generates `M_VARIANTS` member variants per leader
- Translates all algorithms to C code
- Builds solvers and submits SLURM evaluation jobs
- Polls SLURM until all jobs complete
- Collects PAR2 scores
- Promotes the best member to leader in each team
- Proceeds to mutation iterations 1..N
What happens in each iteration:
- Generates `M_VARIANTS` mutant variants for each leader
- Builds and submits SLURM evaluation (skips already-evaluated leaders)
- Polls until jobs complete
- Collects PAR2 scores
- Promotes the best member to leader
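The promotion step boils down to a min-by-PAR2 selection over a team's members; a hypothetical sketch (tie-breaking and the database update in the real pipeline are omitted):

```python
def promote_best(member_scores: dict) -> str:
    """Return the member ID with the lowest PAR2 score (lower is better)."""
    return min(member_scores, key=member_scores.get)

scores = {"member_1": 2310.4, "member_2": 1987.2, "member_3": 2544.9}
# promote_best(scores) picks "member_2", the lowest PAR2
```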
Promotes top offspring from a refined population, runs LLM-guided genetic crossover to combine leaders, evaluates the offspring, and collects results.
Usage:
```shell
./run_bridge.sh <cc|nersc> <input_tag> <output_tag>
```

Arguments:

| Argument | Description |
|---|---|
| `cc\|nersc` | Cluster: `cc` for Compute Canada, `nersc` passes `--nersc` to genetic_evolution.py to use the NERSC evaluation backend |
| `input_tag` | Tag to read evaluated leaders from (scans `_iter1`, `_iter2`, ... automatically) |
| `output_tag` | Tag for the offspring population |
Environment variables:
| Variable | Default | Description |
|---|---|---|
| `TOP_K` | 5 | LLM combination proposals per minibatch |
| `MINIBATCH_SIZE` | 10 | Leaders per LLM proposal call |
| `RUBRIC_MIN` | 6.0 | Minimum proposal score to proceed |
| `RUBRIC_KEEP_TOP_N` | 10 | Keep top-N proposals after score filter |
| `SHUFFLE_PASSES` | 1 | Number of shuffled minibatch passes |
| `MODEL` | `gemini-3-flash-preview` | LLM model |
| `PAR2_KEEP_TOP_N` | 7 | Keep top-N offspring by PAR2 score |
| `POLL_INTERVAL` | 120 | Seconds between SLURM job status checks |
Example:
```shell
# Run genetic evolution on refined leaders from iter3
TOP_K=3 MINIBATCH_SIZE=5 ./run_bridge.sh cc my_tag_iter3 my_gen1
```

What happens:
- Scans all `{input_tag}_iter*` directories for evaluated leaders
- Selects top leaders by PAR2 score
- LLM analyzes each leader (strengths, weaknesses, key mechanisms)
- LLM proposes leader pairs to combine
- LLM generates offspring algorithms via crossover
- Translates offspring to C code
- Builds and submits SLURM evaluation
- Polls until all jobs complete
- Collects PAR2 scores and keeps top offspring
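The rubric filter implied by `RUBRIC_MIN` and `RUBRIC_KEEP_TOP_N` can be sketched as follows (data shapes and names are illustrative; the actual implementation is in genetic_evolution.py):

```python
def filter_proposals(proposals, rubric_min=6.0, keep_top_n=10):
    """Drop proposals scored below rubric_min, then keep the top-N by score."""
    kept = [p for p in proposals if p["score"] >= rubric_min]
    kept.sort(key=lambda p: p["score"], reverse=True)
    return kept[:keep_top_n]
```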
Here's a complete end-to-end run evolving SAT solver heuristics:
```shell
# 1. Initialize population with 5 leaders, 3 variants each, run 3 refinement iterations
N_LEADERS=5 M_VARIANTS=3 ./run_loop_a.sh cc experiment1 3 --init
# Output: experiment1_iter0 (initial), experiment1_iter1, _iter2, _iter3 (refined)
# Best leaders are in experiment1_iter3

# 2. Run genetic evolution to combine top leaders into new offspring
TOP_K=5 PAR2_KEEP_TOP_N=7 ./run_bridge.sh cc experiment1_iter3 experiment1_gen1
# Output: experiment1_gen1 (offspring from crossover)

# 3. Continue refining the new generation
M_VARIANTS=3 ./run_loop_a.sh cc experiment1_gen1 3
# Output: experiment1_gen1_iter1, _iter2, _iter3

# 4. Run another round of genetic evolution
./run_bridge.sh cc experiment1_gen1_iter3 experiment1_gen2
# ... and so on
```

Results location:
- Solver binaries and code: `solvers/<TAG>/{leaders,members}/algorithm_<ID>/code_<ID>/`
- Per-instance times: `results/solving_times_<code_id>.json`
- PAR2 breakdown: `results/par2_breakdown_<code_id>.json`
- Solver statistics: `results/solver_stats_<code_id>.json`
PAR2 scoring:
The PAR2 score is the average solving time across all benchmark instances. Unsolved instances (timeout, crash, OOM) receive a penalty of 2× the timeout:
- Quick eval (`--quick-eval`): 50 CNFs, 600s timeout → 1200s penalty
- Full eval: 400 CNFs, 5000s timeout → 10000s penalty
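As a sketch, the scoring above amounts to the following (assuming each run record carries a solved flag and a time; the field names are hypothetical, not the pipeline's actual JSON schema):

```python
def par2(runs, timeout):
    """Average solve time, charging 2x the timeout for each unsolved instance."""
    return sum(r["time"] if r["solved"] else 2 * timeout for r in runs) / len(runs)

# Quick-eval example: 600 s timeout, one unsolved instance out of three
runs = [{"solved": True, "time": 100.0},
        {"solved": True, "time": 200.0},
        {"solved": False, "time": 600.0}]
# par2(runs, 600) -> (100 + 200 + 1200) / 3 = 500.0
```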
Prompt files live in data/prompts/:
| File | Purpose |
|---|---|
| `leader_prompt_testing.txt` | Designer prompt for generating leader algorithms |
| `variant_prompt.txt` | Template for generating member variants (uses `{leader_algorithm}` and `{target_step_num}` placeholders) |
| `coder_prompt_testing.txt` | Template for translating algorithms to C code (uses `ALGORITHM_PLACEHOLDER`) |
To run evaluation and validation on existing best solvers from iterations:
```shell
PYTHONPATH=src ./run_loop_eval_success.sh <tag> <cluster> --no-clean
```
To run validation only:
```shell
python scripts/verify_iteration_proofs.py \
    --submit-slurm \
    --generation_tag "$generation_tag" \
    --benchmark_path data/benchmarks/satcomp2025 \
    --drat_trim "$DRAT_TRIM_CMD" \
    --check_timeout "$PROOF_CHECK_TIMEOUT" \
    --slurm-mem "$PROOF_VERIFY_MEM" \
    --slurm-time "$PROOF_VERIFY_TIME" \
    --slurm-max-concurrent "$PROOF_VERIFY_MAX_CONCURRENT"
```
run_loop_a.sh also calls validation automatically after each iteration (only tested in the Rescale environment; further testing is needed).
After all validation jobs finish, there will be multiple proof_verification_xxx.json files in the outputs/<tag>/ directory. All invalid (solver, formula) runs for an iteration are recorded in proof_verification_invalid.json, which should be empty for valid solvers.
Complete results, including verification-timeout and verification memory-kill warnings, are in proof_verification.json.
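A quick hypothetical sanity check that an iteration's invalid-run list is empty, assuming proof_verification_invalid.json holds a JSON list of (solver, formula) records:

```python
import json

def all_proofs_valid(path):
    """True iff the invalid-run list stored at path is empty."""
    with open(path) as f:
        return len(json.load(f)) == 0
```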