Verina: Benchmarking Verifiable Code Generation

UC Berkeley, Meta FAIR

Abstract

Large language models (LLMs) are increasingly integrated in software development, but ensuring correctness in LLM-generated code remains challenging and often requires costly manual review. Verifiable code generation (jointly generating code, specifications, and proofs of code-specification alignment) offers a promising path to address this limitation and further unleash LLMs' benefits in coding. Yet, there exists a significant gap in evaluation: current benchmarks often lack support for end-to-end verifiable code generation. In this paper, we introduce VERINA (Verifiable Code Generation Arena), a high-quality benchmark enabling a comprehensive and modular evaluation of code, specification, and proof generation as well as their compositions. VERINA consists of 189 manually curated coding tasks in Lean, with detailed problem descriptions, reference implementations, formal specifications, and extensive test suites. Our extensive evaluation of state-of-the-art LLMs reveals significant challenges in verifiable code generation, especially in proof generation, underscoring the need for improving LLM-based theorem provers in verification domains. The best model, OpenAI o4-mini, generates only 61.4% correct code, 51.0% sound and complete specifications, and 3.6% successful proofs, with one trial per task. We hope VERINA will catalyze progress in verifiable code generation by providing a rigorous and comprehensive benchmark.

Overview

Verifiable code generation represents an emerging programming paradigm where large language models jointly generate:

Code Generation task diagram

πŸ“ Code Implementation

Functional code that solves the programming problem. Given a natural language description and function signature, models generate executable code implementations. Models can optionally use an existing specification as context.

Specification Generation task diagram

📋 Formal Specifications

Pre-conditions and post-conditions that formally describe expected behavior. Models can optionally use existing code as context.

Proof Generation task diagram

✅ Formal Proofs

Machine-checkable proofs that verify code-specification alignment. Given code and specifications, models construct logical arguments for correctness.

Figure: VERINA's three foundational tasks for verifiable code generation. Dashed arrows represent optional inputs that can be provided to guide the generation process.
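
Concretely, every VERINA task is phrased in Lean. The schematic skeleton below (placeholder names, not an actual benchmark task) shows which artifact each of the three tasks asks the model to produce; sorry marks the holes to be filled.

-- Schematic shape of a VERINA task (hypothetical names).
-- Code generation fills in the body of foo,
-- specification generation fills in foo_pre and foo_post,
-- and proof generation fills in the proof of foo_spec.
def foo_pre (x : Nat) : Prop := sorry

def foo (x : Nat) (h_precond : foo_pre x) : Nat := sorry

def foo_post (x : Nat) (result : Nat) (h_precond : foo_pre x) : Prop := sorry

theorem foo_spec (x : Nat) (h_precond : foo_pre x) :
    foo_post x (foo x h_precond) h_precond := sorry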

This approach offers higher levels of correctness assurance and automation in software development, potentially addressing the productivity bottleneck caused by bugs and security vulnerabilities in LLM-generated code.

VERINA establishes a foundation for advancing verifiable code generation research, providing both rigorous evaluation capabilities and clear directions toward more reliable and formally verified automated programming systems. Our findings underscore the critical need for improving LLM-based theorem proving capabilities in verification domains.

Data Format Example

Each instance in VERINA consists of a comprehensive set of components designed to support rigorous evaluation of verifiable code generation:

-- Natural language description of the coding problem
-- Remove an element from a given array of integers at a specified index...

-- Code implementation
def removeElement (s : Array Int) (k : Nat) (h_precond : removeElement_pre s k) : Array Int :=
  s.eraseIdx! k

-- Pre-condition
def removeElement_pre (s : Array Int) (k : Nat) : Prop :=
  k < s.size -- the index must be smaller than the array size

-- Post-condition  
def removeElement_post (s : Array Int) (k : Nat) (result: Array Int) 
    (h_precond : removeElement_pre s k) : Prop :=
  result.size = s.size - 1 ∧ -- Only one element is removed
  (∀ i, i < k → result[i]! = s[i]!) ∧ -- Elements before index k remain unchanged
  (∀ i, i < result.size → i ≥ k → result[i]! = s[i + 1]!) -- Elements after are shifted

-- Formal proof (establishing code-specification alignment)
theorem removeElement_spec (s: Array Int) (k: Nat) (h_precond : removeElement_pre s k) :
    removeElement_post s k (removeElement s k h_precond) h_precond := by
  unfold removeElement removeElement_post
  unfold removeElement_pre at h_precond
  simp_all
  unfold Array.eraseIdx!
  simp [h_precond]
  constructor
  · intro i hi
    have hi' : i < s.size - 1 := by
      have hk := Nat.le_sub_one_of_lt h_precond
      exact Nat.lt_of_lt_of_le hi hk
    rw [Array.getElem!_eq_getD, Array.getElem!_eq_getD]
    unfold Array.getD
    simp [hi', Nat.lt_trans hi h_precond, Array.getElem_eraseIdx, hi]
  · intro i hi hi'
    rw [Array.getElem!_eq_getD, Array.getElem!_eq_getD]
    unfold Array.getD
    have hi'' : i + 1 < s.size := by exact Nat.add_lt_of_lt_sub hi
    simp [hi, hi'']
    have : ¬ i < k := by simp [hi']
    simp [Array.getElem_eraseIdx, this]

-- Comprehensive test cases (both positive and negative)
(s : #[1, 2, 3, 4, 5]) (k : 2) (result : #[1, 2, 4, 5]) -- Positive test with valid inputs and output
(s : #[1, 2, 3, 4, 5]) (k : 5) -- Negative test: input violates the pre-condition (the index is not smaller than the array size)
(s : #[1, 2, 3, 4, 5]) (k : 2) (result : #[1, 2, 4]) -- Negative test: output violates the first clause of the post-condition (wrong result size)
(s : #[1, 2, 3, 4, 5]) (k : 2) (result : #[2, 2, 4, 5]) -- Negative test: output violates the second clause of the post-condition (an element before index k is changed)
(s : #[1, 2, 3, 4, 5]) (k : 2) (result : #[1, 2, 4, 4]) -- Negative test: output violates the third clause of the post-condition (elements after index k are not shifted correctly)
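
Assuming the components above are assembled into a single Lean file (with removeElement_pre declared before the implementation that uses it), positive test cases can be checked by evaluation. A minimal sketch for the first test case:

-- Checking the first positive test case by evaluation; the pre-condition proof
-- amounts to 2 < 5 and is discharged by decidability.
#eval removeElement #[1, 2, 3, 4, 5] 2 (by unfold removeElement_pre; decide)
-- expected output: #[1, 2, 4, 5]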

Key Features of VERINA

📊 Comprehensive Coverage

189 manually curated tasks spanning a range of difficulty levels, split into VERINA-BASIC (108 tasks) and VERINA-ADV (81 tasks)

🔧 Modular Design

Supports evaluation of individual tasks (code, spec, proof generation) and their flexible combinations

🧪 High-quality Test Suites

100% line coverage with comprehensive positive and negative test cases for robust evaluation

πŸ“ Novel Metrics

Automated specification evaluation measuring soundness and completeness through a testing-based framework

Specification Evaluation Framework

VERINA introduces a novel automated evaluation framework for model-generated specifications, assessing their soundness and completeness with respect to ground truth specifications through comprehensive testing.

VERINA's specification evaluation framework

Figure: Our evaluator for specification generation, which uses a testing-based approach to assess soundness and completeness relationships

✅ Soundness

A specification is sound if it only accepts correct programs. For pre-conditions, the ground-truth pre-condition must imply the model-generated one; for post-conditions, any output satisfying the model-generated post-condition must also satisfy the ground truth.

πŸ” Completeness

A specification is complete if it accepts all correct programs. For pre-conditions, the model-generated pre-condition must imply the ground-truth one; for post-conditions, any output satisfying the ground-truth post-condition must also satisfy the model-generated one.
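
Rendered schematically in Lean (placeholder names rather than benchmark identifiers, with the pre-condition guard on the post-condition checks elided for brevity), the four directions are:

def preSound {α : Type} (preGt preGen : α → Prop) : Prop :=
  ∀ x, preGt x → preGen x              -- ground truth implies model-generated
def preComplete {α : Type} (preGt preGen : α → Prop) : Prop :=
  ∀ x, preGen x → preGt x              -- model-generated implies ground truth
def postSound {α β : Type} (postGt postGen : α → β → Prop) : Prop :=
  ∀ x y, postGen x y → postGt x y      -- model-generated output implies ground truth
def postComplete {α β : Type} (postGt postGen : α → β → Prop) : Prop :=
  ∀ x y, postGt x y → postGen x y      -- ground truth output implies model-generated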

Our framework leverages comprehensive test suites and Lean's property-based testing to systematically evaluate these relationships, providing robust automatic assessment of specification quality without requiring manual proof construction.
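
For example, a soundness or completeness check on a single (input, output) pair from a test suite can be discharged by decidability. The predicates below are hypothetical toys, not benchmark specifications:

-- Toy ground-truth and model-generated post-conditions for a "successor" task
def gtSucc_post (x result : Nat) : Prop := result = x + 1
def genSucc_post (x result : Nat) : Prop := x < result ∧ result ≤ x + 1

-- Soundness on the test point (x = 2, result = 3): anything the generated
-- post-condition accepts must also satisfy the ground truth
example : genSucc_post 2 3 → gtSucc_post 2 3 := by
  unfold genSucc_post gtSucc_post; decide

-- Completeness on the same point: every output the ground truth accepts
-- must also be accepted by the generated post-condition
example : gtSucc_post 2 3 → genSucc_post 2 3 := by
  unfold gtSucc_post genSucc_post; decide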

Benchmark Statistics

Metric                            Median    Maximum
Words in Description                 110        296
Lines of Code (Implementation)         9         38
Lines of Code (Specification)          4         62
Positive Test Cases                    5         13
Negative Test Cases                   12         27

Our benchmark includes detailed problem descriptions, comprehensive test suites, and formal specifications that capture complex programming tasks while maintaining high quality through manual curation and validation.

Key Evaluation Results

We evaluated 9 state-of-the-art LLMs on VERINA's three foundational tasks. Our results reveal significant challenges in verifiable code generation:

LLM performance on foundational tasks

Figure: pass@1 performance of LLMs on VERINA's three foundational tasks

πŸ† Best Results (OpenAI o4-mini)

Code Generation: 61.4%

Specification Generation: 51.0%

Proof Generation: 3.6%

Results show a clear difficulty hierarchy, with proof generation being the most challenging task for current LLMs.

Figure: Performance comparison between VERINA-BASIC and VERINA-ADV

VERINA-ADV vs VERINA-BASIC: Performance drops substantially across all three tasks on the more advanced problems, highlighting how much harder verifiable code generation becomes as problem complexity increases.

📈 Iterative Refinement Results

With 64 iterations of proof refinement using Lean compiler feedback, success rates improved from 3.6% to 22.2% on VERINA-BASIC, but remained low at 6.17% on VERINA-ADV, highlighting the persistent challenges in automated theorem proving.

Iterative proof refinement results

Figure: Iterative proof refinement results showing improvement with Lean compiler feedback
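
The refinement loop behind these numbers is conceptually simple. Below is a minimal sketch, not VERINA's actual harness: queryLLM is a hypothetical stand-in for a model API call, the candidate file name is illustrative, and compiler messages are simply appended to the next prompt.

-- Hypothetical LLM call; a real harness would query an API here.
def queryLLM (prompt : String) : IO String := sorry

def refineProof (task : String) (maxIters : Nat := 64) : IO (Option String) := do
  let mut prompt := task
  for _ in [0:maxIters] do
    let candidate ← queryLLM prompt
    IO.FS.writeFile "Candidate.lean" candidate
    -- re-check the candidate with the Lean compiler
    let out ← IO.Process.output { cmd := "lake", args := #["env", "lean", "Candidate.lean"] }
    if out.exitCode == 0 then
      return some candidate
    -- otherwise, feed the compiler output back for the next attempt
    prompt := task ++ "\n\nPrevious attempt:\n" ++ candidate ++
      "\n\nLean output:\n" ++ out.stdout ++ out.stderr
  return none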

Supported Task Combinations

VERINA's modular design enables evaluation of various real-world scenarios in verifiable code generation:

Specification-guided code generation diagram

🎯 Specification-Guided Code Generation

Given a natural language description and formal specification, generate code and prove its correctness. This scenario reflects development workflows where specifications are available upfront.

Specification inference from code diagram

πŸ” Specification Inference from Code

Given existing code implementation, automatically generate formal specifications and proofs. This addresses legacy code annotation and documentation scenarios.

End-to-end verifiable code generation diagram

🚀 End-to-End Verifiable Code Generation

Starting from only a natural language problem description, independently generate code, specifications, and proofs. This represents the highest level of automation in verified software development.

Figure: Combinations of VERINA's foundational tasks showing real-world usage scenarios. Natural language descriptions and function signatures are omitted for brevity.
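
To make these settings concrete, consider a toy maximum-of-two-naturals task (hypothetical, not from the benchmark). In the specification-guided setting, the pre- and post-conditions below would be given and the model would supply only myMax and the proof myMax_spec; in the end-to-end setting, it would generate all four artifacts.

-- Given (or generated) specification
def myMax_pre (a b : Nat) : Prop := True

def myMax_post (a b : Nat) (result : Nat) (h_precond : myMax_pre a b) : Prop :=
  (result = a ∨ result = b) ∧ a ≤ result ∧ b ≤ result

-- Model-generated implementation
def myMax (a b : Nat) (h_precond : myMax_pre a b) : Nat :=
  if a ≤ b then b else a

-- Model-generated proof of code-specification alignment
theorem myMax_spec (a b : Nat) (h_precond : myMax_pre a b) :
    myMax_post a b (myMax a b h_precond) h_precond := by
  unfold myMax myMax_post
  by_cases h : a ≤ b <;> simp [h] <;> omega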

Figure: Impact of contextual information on combined task performance

BibTeX

@article{ye2025verina,
  title={VERINA: Benchmarking Verifiable Code Generation},
  author={Ye, Zhe and Yan, Zhengxu and He, Jingxuan and Kasriel, Timothe and Yang, Kaiyu and Song, Dawn},
  journal={arXiv preprint arXiv:2505.23135},
  year={2025}
}