No. 05 · Aegis Suite · Neurosymbolic Reasoning
Working Paper · 2026

Solver-grounded
verifiable reasoning
with a candidate-first
LLM + SMT protocol.

A five-method linear study asking whether solver-grounded feedback — not just more attempts — measurably improves an LLM's verified performance on combinatorial search.

§ 01 · Abstract

The central question.

Not whether an LLM can emit a candidate, but whether solver-grounded feedback measurably improves verified performance over mere repeated attempts under matched compute budgets.

We present a candidate-first linear evaluation protocol for LLM + SMT verifiable reasoning under explicit feedback-granularity control. The main study compares five methods — a one-shot reference, three compute-matched retry baselines, and a new conflict-directed verifier-guided search method (cd_vgs_core_rank) that uses unsat-core structure for candidate ranking and targeted repair.

The candidate contract is strict: the generator either returns a full satisfying assignment or an UNSAT claim, both of which the solver certifies or refutes. The flagship method exploits the solver's conflict core as natural-language hints the LLM can consume in the next round — the feedback is the supervision signal.

The implementation preserves a legacy trace-based architecture for transfer evidence (Sudoku 4×4 appendix) while moving the main paper path to the stricter candidate-verification interface. All experiments are reproducible locally with a single command.
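The strict candidate contract above can be made concrete with a small validator. This is an illustrative sketch, not the pipeline's actual schema code — the function name and field checks are our assumptions; only the `status`/`assignment` shape comes from the paper's examples.

```python
import json

# Minimal sketch of the strict candidate contract (illustrative; helper name
# and error handling are assumptions). The generator must return either a full
# satisfying assignment or an UNSAT claim — anything else is rejected before
# the solver is ever called.

def parse_candidate(raw: str, variables: list[str]) -> dict:
    """Validate a generator reply against the candidate contract."""
    obj = json.loads(raw)
    status = obj.get("status")
    if status == "unsat":
        return {"status": "unsat"}            # UNSAT claim: solver must confirm or refute
    if status != "sat":
        raise ValueError(f"invalid status: {status!r}")
    assignment = obj.get("assignment", {})
    missing = [v for v in variables if v not in assignment]
    if missing:                               # partial assignments violate the contract
        raise ValueError(f"incomplete assignment, missing: {missing}")
    return {"status": "sat",
            "assignment": {v: int(assignment[v]) for v in variables}}

cand = parse_candidate(
    '{"status": "sat", "assignment": {"x1": 4, "x2": 2, "x3": 3, "x4": 7}}',
    ["x1", "x2", "x3", "x4"],
)
# → {"status": "sat", "assignment": {"x1": 4, "x2": 2, "x3": 3, "x4": 7}}
```

Either branch hands the solver something it can certify or refute outright, which is what makes the contract strict.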

Arms · 5 — one_shot, three retry baselines, cd_vgs_core_rank
Budget · K=4 × R=4 — compute-matched across arms; thinking off
Dataset · 500 — linear feasibility problems with SAT/UNSAT labels
§ 02 · The Loop

Counterexample-grounded refinement.

A local LLM emits a structured candidate. Z3 certifies the assignment or returns an unsat core — a minimal conflicting subset of the constraints the candidate violates. That core becomes a prompt-level hint for the next round; the search is steered by the solver, not the sampler.
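The certify-or-hint step of the loop can be sketched in plain Python. In the pipeline Z3 does this work; here direct evaluation stands in for the solver, which is sound because a full candidate assignment can simply be checked constraint by constraint. The helper name and hint wording are our assumptions.

```python
# Sketch of the certify-or-hint step (illustrative; Z3 performs this in the
# actual pipeline). Given a full assignment, each constraint is evaluated;
# the violated set plays the role of the conflict core, rendered as
# natural-language hints for the next round's prompt.

def check_candidate(constraints, assignment):
    """Return ("certified", []) or ("refuted", hints) for a full assignment."""
    violated = [name for name, pred in constraints if not pred(assignment)]
    if not violated:
        return "certified", []
    hints = [f"constraint {name} is violated; revise the variables it mentions"
             for name in violated]
    return "refuted", hints

constraints = [
    ("c1", lambda a: a["x1"] + 4*a["x2"] + a["x3"] + 2*a["x4"] >= 23),
    ("c2", lambda a: 3*a["x1"] + a["x2"] == 43),
]
status, hints = check_candidate(constraints, {"x1": 4, "x2": 2, "x3": 3, "x4": 7})
# status == "refuted"; hints name c2 as the conflict
```

The returned hints are plain strings precisely so they can be spliced into the next prompt without any tool-use schema change.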

[Figure 1 diagram: problem constraints → LLM (Qwen) candidate assignment → Z3 SMT verifier → sat: certified solve ✓ / unsat: unsat-core → natural-language hints → next round. Legend: candidate flow, solver feedback (core), certification.]
Figure 1 The CEGVR loop. A candidate structural output from the LLM is handed to Z3 for certification. On refutation, the solver's unsat-core — the minimal set of conflicting constraints — is rendered as natural-language hints and fed back into the next round's prompt. Budget is fixed at K × R; feedback granularity is the experimental knob.
§ 03 · Methods

Five arms, matched compute.

The primary causal comparison is among the four compute-matched multi-round methods. They share identical K, R, timeouts, parsing rules, and generation settings — the retry baselines differ only in carried-forward verifier feedback, while cd_vgs_core_rank reallocates the same budget via conflict-directed search.

# · Method · Type · Description

01 · one_shot · baseline — Single attempt, no retries. Retained as a repair-gain reference point for the multi-round arms.

02 · multi_no_feedback · retry — Compute-matched repeated attempts with independent sampling. No verifier signal is carried forward — tests whether more tries alone close the gap.

03 · multi_generic_feedback · retry — Same budget, but the prompt carries a generic hint ("the previous assignment did not satisfy all constraints"). Isolates the effect of knowing a failure occurred.

04 · multi_unsat_core_feedback · retry — The prompt carries the unsat-core — the concrete set of violated constraints returned by Z3 — as natural-language hints. Isolates the effect of solver-grounded diagnostic richness.

05 · cd_vgs_core_rank · flagship (CEGVR) — Conflict-Directed Verifier-Guided Search with CoreRank candidate ordering. The same K × R budget is redistributed: the solver's conflict core drives which candidate variables to revise vs. keep fixed, and unsat-core overlap ranks subsequent candidates.
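The CoreRank ordering used by the flagship arm can be sketched as follows. This is our reading of the description above, not the paper's API: candidates that revise the core variables while keeping the non-core variables fixed are ranked first.

```python
# Sketch of CoreRank candidate ordering (function and variable names are ours).
# Given the previous candidate and the unsat-core variables, prefer candidates
# that revise inside the core and penalise churn outside it.

def core_rank(candidates, previous, core_vars):
    def score(cand):
        revised = {v for v in cand if cand[v] != previous[v]}
        # reward revisions of core variables, penalise revisions elsewhere
        return len(revised & core_vars) - len(revised - core_vars)
    return sorted(candidates, key=score, reverse=True)

prev = {"x1": 4, "x2": 2, "x3": 3, "x4": 7}
cands = [
    {"x1": 4, "x2": 2, "x3": 5, "x4": 9},   # revises only non-core variables
    {"x1": 14, "x2": 1, "x3": 3, "x4": 7},  # revises exactly the core variables
]
ranked = core_rank(cands, prev, core_vars={"x1", "x2"})
# ranked[0] is the candidate that edits {x1, x2} and keeps {x3, x4} fixed
```

The ordering spends the fixed K × R budget on repairs the solver has implicated, rather than on uniform resampling.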
§ 04 · Worked Example

A single problem, two rounds.

A representative 4-variable linear system. The LLM's first candidate violates one constraint; the unsat core narrows the revision to a specific variable subset, and the next round produces a certified satisfying assignment.

Input
Problem · linear feasibility lin-0127
# variables: x1,x2,x3,x4 ∈ {0..20}
# constraints:
c1: x1 + 4·x2 + x3 + 2·x4 ≥ 23
c2: 3·x1 + x2 = 43
c3: x2 + 2·x3 - x4 ≤ 6
c4: x1 - x3 + x4 ≥ 1
Round 1
LLM candidate · Qwen · K=1
{ "status": "sat", "assignment": { "x1": 4, "x2": 2, "x3": 3, "x4": 7 } }
Z3 verification unsat-core
✗ counterexample
# candidate violates:
#   c2: 3·4 + 2 = 14 ≠ 43
unsat_core = [c2]
# core variables: {x1, x2} → revise; {x3, x4} → keep
Round 2
LLM candidate (CoreRank) · Qwen · K=1
# hint: revise x1,x2 so 3·x1 + x2 = 43
# candidates ranked by core-overlap:
{ "status": "sat", "assignment": { "x1": 14, "x2": 1, "x3": 3, "x4": 7 } }
Z3 verification certified
✓ verified solve
# all constraints satisfied:
#   c1: 14 + 4 + 3 + 14 = 35 ≥ 23 ✓
#   c2: 3·14 + 1 = 43 ✓
#   c3: 1 + 6 - 7 = 0 ≤ 6 ✓
#   c4: 14 - 3 + 7 = 18 ≥ 1 ✓
solver_latency = 1.2 ms
llm_attempts = 2
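The two rounds above can be replayed in plain Python (constraint checks only; in the pipeline Z3 performs the certification). The helper name is ours; the constraints and candidates are exactly those of problem lin-0127.

```python
# Replay of the worked example's certification arithmetic in plain Python
# (illustrative; the pipeline certifies with Z3).
def violated(a):
    checks = {
        "c1": a["x1"] + 4*a["x2"] + a["x3"] + 2*a["x4"] >= 23,
        "c2": 3*a["x1"] + a["x2"] == 43,
        "c3": a["x2"] + 2*a["x3"] - a["x4"] <= 6,
        "c4": a["x1"] - a["x3"] + a["x4"] >= 1,
    }
    return [name for name, ok in checks.items() if not ok]

round1 = {"x1": 4,  "x2": 2, "x3": 3, "x4": 7}
round2 = {"x1": 14, "x2": 1, "x3": 3, "x4": 7}
print(violated(round1))  # → ['c2']  (the conflict fed back as a hint)
print(violated(round2))  # → []     (certified solve)
```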
§ 05 · Key Findings

Four takeaways.

Observations from the five-method linear study. The metrics panel covers verified solve rate, SAT certification rate, solver calls per certified solve, and McNemar-paired significance across the compute-matched arms.

i Feedback granularity
Unsat-core hints are natural-language signals the LLM can consume directly in the next prompt — no retraining, no tool-use schema change.
The richer the diagnostic (generic → unsat-core → structured core+rank), the faster the search converges under a fixed budget.
ii Repair > retry
At matched compute, solver-grounded repair beats naive retry: cd_vgs_core_rank and unsat_core_feedback Pareto-dominate the no-feedback arm on verified solve rate.
The effect is measurable under McNemar-paired tests across the four compute-matched arms. More attempts alone do not close the gap.
iii Latency is predictable
Latency scales linearly with rounds — the solver overhead is bounded (Z3 in ms on this dataset). The expensive path is LLM inference.
The efficiency frontier plots verified-solve-rate against mean solver calls, exposing where the flagship saves compute vs. the retry baselines.
iv Transfer is plausible
The preserved trace-based architecture demonstrates transfer to Sudoku 4×4 as appendix evidence, hinting the protocol generalises beyond linear feasibility.
Main claims remain tied to the linear benchmark; the Sudoku appendix is descriptive transfer evidence only.
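The McNemar pairing invoked in finding ii compares two arms on the same problems via their discordant counts. A minimal exact-binomial sketch follows; the counts in the usage line are made-up illustration, not numbers from the study.

```python
from math import comb

# Exact two-sided McNemar test on paired per-problem outcomes (sketch; the
# study's exact statistical machinery may differ). b = problems solved by
# arm A but not arm B; c = the reverse. Under H0 each discordant pair is
# a fair coin flip, so the p-value is a two-sided binomial tail.

def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts."""
    n, k = b + c, min(b, c)
    if n == 0:
        return 1.0                      # no discordant pairs: no evidence either way
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)

# hypothetical counts: arm A repairs 30 problems arm B misses, B repairs 10 A misses
p = mcnemar_exact_p(b=30, c=10)
```

Because the arms are compute-matched, a small p here attributes the gap to feedback granularity rather than to extra attempts.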
§ 06 · Experiment Results

500 problems × 5 arms × 3 seeds — verified.

Final numbers from the Qwen3-30B-A3B run on an A100-80 GB Colab, each arm evaluated on 500 linear-arithmetic SMT problems over 3 seeds. Per-arm results report certified accuracy, the SAT/UNSAT breakdown, and the budget-exceeded count.

[Results panel: certified accuracy, all-arms comparison]
§ 07 · Robustness

How stable is the gain?

Delta-plots under problem-level perturbations: coefficient scaling, constraint permutation, and variable relabeling. A method is robust if its verified solve rate stays put under superficial re-encodings of the same problem.

Figure 2 Robustness deltas across perturbation families. Each panel shows verified-solve-rate change under a family of input perturbations; bars are per-method mean shifts with bootstrap intervals. Solver-grounded arms absorb permutations and renamings with substantially smaller degradations than the no-feedback retry baseline.
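The three perturbation families of Figure 2 can be sketched on a toy problem encoding. The dict layout below is our illustration, not the benchmark's format; each transformation preserves the feasible set up to renaming/scaling, so a robust method's verified solve rate should be unchanged.

```python
import random

# Sketches of the three perturbation families (toy encoding; layout is ours).

def scale_coefficients(problem, k=2):
    """Multiply both sides of every constraint by a positive integer k."""
    return {**problem, "constraints": [
        {**c, "coeffs": {v: k * w for v, w in c["coeffs"].items()}, "rhs": k * c["rhs"]}
        for c in problem["constraints"]]}

def permute_constraints(problem, seed=0):
    """Shuffle constraint order — the feasible set is order-independent."""
    cs = problem["constraints"][:]
    random.Random(seed).shuffle(cs)
    return {**problem, "constraints": cs}

def relabel_variables(problem, mapping):
    """Rename variables consistently across all constraints."""
    return {**problem, "constraints": [
        {**c, "coeffs": {mapping[v]: w for v, w in c["coeffs"].items()}}
        for c in problem["constraints"]]}

toy = {"constraints": [
    {"name": "c2", "coeffs": {"x1": 3, "x2": 1}, "op": "==", "rhs": 43}]}
scaled = scale_coefficients(toy, k=2)
# scaled constraint: 6·x1 + 2·x2 == 86 — same solutions as 3·x1 + x2 == 43
```

A method whose prompts key on surface form (coefficient magnitudes, constraint order, variable names) will show large deltas under these maps; solver-grounded feedback is invariant to them by construction.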
§ 08 · Reproducibility

One command, one paper.

The main paper path is executed by a single shell script. It bootstraps Python 3.11, (re)generates the linear benchmark, runs the five-method arm matrix, and rebuilds the figures, tables, and LaTeX PDF end-to-end. The paper fails closed if the local llama.cpp endpoint is unavailable — no silent stub fallback.

scripts/run_all.sh — one-command paper pipeline
# 1 · bootstrap a local Python 3.11 environment
$ bash scripts/bootstrap_local_mac.sh .venv
$ source .venv/bin/activate

# 2 · serve Qwen3-30B-A3B locally via llama.cpp on :8000
$ llama-server -m /path/to/qwen3-30b-a3b.gguf --port 8000 --ctx-size 8192

# 3 · run the full paper pipeline (linear main + sudoku appendix + figures + tables + PDF)
$ bash scripts/run_all.sh --seeds 3 --max-rounds 4 --budget 4 --timeout-ms 1500

# or a single arm:
$ cegvr eval \
    --pipeline candidate \
    --arm cd_vgs_core_rank \
    --problems data/linear/problems.jsonl \
    --out runs/linear_main/cd_vgs_core_rank \
    --seeds 3 --max-rounds 4 --budget 4 \
    --llm-endpoint http://127.0.0.1:8000/v1/chat/completions \
    --llm-model qwen3-30b-a3b

# 4 · inspect aggregated tables, figures, and paper assets
$ cegvr summarize --runs runs/linear_main
$ cegvr make-figures --runs runs/linear_main --out examples/paper_assets/figures_linear
$ open examples/paper_assets/latex/main.pdf
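The fail-closed behaviour described above — abort rather than silently stub when the llama.cpp endpoint is down — can be sketched as a single probe. The function name and the idea of probing a base URL are our assumptions about the pipeline's internals.

```python
import urllib.error
import urllib.request

# Sketch of a fail-closed endpoint probe (illustrative; names are ours).
# The pipeline should refuse to run with a stub if the local LLM server
# does not answer, rather than falling back silently.

def endpoint_available(base_url: str, timeout: float = 2.0) -> bool:
    """Return True only if the local LLM server answers an HTTP request."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True   # server responded, just not 200 — it is up
    except (urllib.error.URLError, OSError):
        return False

# probe a port with nothing listening: the pipeline would abort here
ok = endpoint_available("http://127.0.0.1:9")
# in run_all.sh terms: if not ok, exit non-zero instead of stubbing the LLM
```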