What vericoding means, and why it matters now
“Vericoding” is the idea that software is no longer judged only by tests, code review, and runtime monitoring. It is judged by whether the intended behavior can be proved.
The term was introduced in a September 2025 paper by Bursuc et al. (BAIF + MIT, including Max Tegmark): “A benchmark for vericoding: formally verified program synthesis”. They define it as LLM-generation of formally verified code from formal specifications, as the counterpart to “vibe coding,” which generates potentially broken code from natural language.
That framing matters because it captures what has changed. AI systems are now generating a meaningful share of production code. At the same time, formal verification tools have become more practical, more integrated, and more visible to engineers outside academia. The shift is not that proofs suddenly got easier. The shift is that LLMs can now sit in the middle of the proof process and reduce the cost of writing and maintaining proofs.
Martin Kleppmann published a clear prediction on 8 December 2025: AI will make formal verification go mainstream. His argument is threefold. First, formal verification will become drastically cheaper. Second, AI-generated code needs formal verification so that human review can be skipped safely. Third, the precision of formal verification complements the probabilistic nature of LLMs. He makes the point explicitly: it does not matter if LLMs hallucinate, because the proof checker will reject invalid proofs.
That is stronger than “worth taking seriously.” It is a direct prediction that these two fields are converging into a practical engineering discipline.
This is the opposite of vibe coding. Vibe coding accepts uncertainty and moves fast. Vericoding accepts that when software is high stakes, speed without guarantees is debt.
Why formal verification never became mainstream
The honest answer is simple. It was too expensive, too slow, and too academic.
| Barrier | What happened |
|---|---|
| Manual proof burden | Engineers had to express precise specs, invariants, pre/postconditions, and then prove the implementation matches. More work than writing the code itself |
| Wrong target audience | Tools were optimized for mathematicians and niche high-assurance teams. Product teams needed speed, not theorem proving |
| Fragmented toolchain | Teams had to choose between language features, separate proof assistants, custom annotations, and special-purpose frameworks that sat outside the normal developer loop |
The tools were always powerful
Dafny is a verification-aware programming language with native support for specifications and a static program verifier. It compiles to C#, Java, JavaScript, Go, and Python, and ships with IDE plugins, a language server, and tooling designed to fit development workflows.
Lean is an open-source programming language and proof assistant. Its core design emphasizes a minimal trusted kernel for absolute correctness, powerful automation, and extensibility for mathematical and software verification.
Nagini is an automated verifier for statically-typed, concurrent Python 3 programs, built on the Viper verification infrastructure. It can verify memory safety, functional properties, termination, deadlock freedom, and input/output behavior. Source: Eilers and Mueller, CAV 2018.
The value was real. The adoption path was steep.
The turning point: LLMs as proof assistants
The most important change is not that LLMs are smarter than proofs. They are not. The change is that they can absorb the repetitive, error-prone part of proof construction.
That matters because most verification work is not about deep insight every minute. It is about getting the specification shape right, decomposing the theorem, finding the right lemma, and iterating when the verifier rejects the first attempt.
A March 2025 paper from JetBrains Research (“Can LLMs Enable Verification in Mainstream Programming?”) tested LLMs on verified code synthesis across three verification languages: Dafny, Nagini, and Verus. Claude 3.5 Sonnet massively surpassed the other models tested (GPT-3.5, GPT-4o, Claude 3 Opus), leading the researchers to focus exclusively on it. The practical implication is straightforward: LLMs are becoming proof assistants, not proof substitutes. They can propose invariants, fill in boilerplate, and help translate programmer intent into verifier-friendly structure. The human still owns the specification, the review of the model’s output, and the final judgment.
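The workflow this implies can be sketched as a small loop. The sketch below is an assumption about the shape of the process, not any tool’s real API: `draft_proof` stands in for an LLM call and `check` for a verifier such as Dafny or Lean.

```python
# Minimal sketch of the propose-and-check loop described above.
# `draft_proof` and `check` are hypothetical stand-ins, not a real API.

def vericoding_loop(spec, draft_proof, check, max_attempts=5):
    """Iterate untrusted LLM proposals until the trusted checker accepts."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        candidate = draft_proof(spec, feedback)  # untrusted proposal
        ok, feedback = check(spec, candidate)    # trusted judgment
        if ok:
            return candidate, attempt
    return None, max_attempts

# Toy stand-ins: the "verifier" accepts only the exact string "qed"
# and always hints at it, so the loop converges on the second attempt.
def fake_llm(spec, feedback):
    return "qed" if feedback == "try qed" else "sorry"

def fake_verifier(spec, candidate):
    return (candidate == "qed", "try qed")

proof, attempts = vericoding_loop("s == SumSeq(xs)", fake_llm, fake_verifier)
```

The division of labor is the whole point: the model may emit nonsense on any given attempt, but only the checker decides when the loop terminates successfully.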
What changed technically
Three things changed at once:
| Change | Why it matters |
|---|---|
| LLMs got better at syntax and idiom | They can now write code in proof-oriented languages without complete nonsense in every other line |
| Verification tools improved their UX | Modern Lean, Dafny, and related tools are more scriptable, better documented, and more integrated into editor workflows |
| The failure cost of normal software got higher | AI-generated code scales output but also scales mistakes, especially in authorization, state transitions, concurrency, and data handling |
Once those three curves intersect, proof tooling becomes economically interesting.
The current stack: Lean, Dafny, and Nagini
A useful way to think about vericoding is to separate the proof language from the production language.
| Tool | Strength | Best for | Source |
|---|---|---|---|
| Lean | Powerful theorem proving, minimal trusted kernel, strong automation | Core logic, algorithms, protocol properties, mathematical verification | lean-lang.org |
| Dafny | Code and specs in one place, compiles to C#/Java/JS/Go/Python, IDE integration | Application-level verification, parsers, critical subsystems | dafny.org |
| Nagini | Python verification, concurrency support, Viper-based | Python codebases needing stronger guarantees in specific modules | GitHub |
Lean is especially strong when the specification problem is central and you want powerful theorem proving infrastructure. It is not the first tool most application teams will adopt for everyday service code, but it may become the strongest tool for verifying core logic, algorithms, and protocol properties.
Dafny is closer to the application engineer’s world. It lets you write code and specifications in one place, then verify that the implementation meets them. That lowers the distance between proof and production. A team can verify an algorithm, a parser, or a critical subsystem without living entirely inside proof scripts.
Nagini verifies Python programs with a focus on memory safety, functional properties, termination, and deadlock freedom for concurrent code. That makes it relevant for teams that want to stay in Python but need stronger guarantees in specific parts of the system.
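To give a feel for the contract style Nagini checks, here is a plain-Python analogue using runtime asserts; the function and its conditions are hypothetical examples, not Nagini syntax. Nagini’s contribution is proving conditions like these statically, for all inputs, rather than spot-checking the inputs you happen to run.

```python
# Plain-Python illustration of pre/postconditions (hypothetical example).
# A verifier like Nagini proves these hold for every input at analysis
# time; asserts only check the inputs that actually execute.

def withdraw(balance: int, amount: int) -> int:
    assert amount >= 0          # precondition
    assert balance >= amount    # precondition: cannot overdraw
    new_balance = balance - amount
    assert new_balance >= 0     # postcondition: balance never negative
    return new_balance
```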
The startups making this real
Academic progress matters, but mainstream adoption usually needs product pressure.
| Company | What they do | Why it matters | Source |
|---|---|---|---|
| Harmonic | Built Aristotle, an AI system that translates natural-language math problems into formally verifiable proofs in Lean 4. Claimed gold-medal-level IMO performance. $1.45B valuation, Series C ($120M+). Co-founded by Vlad Tenev (Robinhood CEO) | Proof that formal reasoning scales commercially. Positions “hallucination-free” mathematical AI as a product category | harmonic.fun, Yahoo Finance |
| Logical Intelligence | Building energy-based reasoning models (not LLM-based) that are deterministic and mathematically verifiable. Products: Aleph (formal verification + code generation with machine-checkable proofs), Kona (energy-based reasoning platform). Focus: autonomy, finance, defense. Yann LeCun on advisory board | A different approach: deterministic AI for critical systems, explicitly positioned as an alternative to probabilistic LLMs | logicalintelligence.com, BusinessWire |
| DeepSeek (Prover-V2) | Open-source LLM for formal theorem proving in Lean 4. 671B parameter variant achieves 88.9% on MiniF2F-test and solves 49/658 PutnamBench problems. Uses reinforcement learning with subgoal decomposition | Proof generation treated as a dedicated model capability. When model families are explicitly trained for proof tasks, the old argument that proof writing will always be too manual starts to weaken | arXiv:2504.21801, GitHub |
The pattern across all three: verification is not just a research demo anymore. It is being productized.
The caveat is obvious. A prover model is not the same thing as a verified program. It can propose proof steps, but the checker still decides whether the proof is valid. That separation is exactly why vericoding is viable. The model can assist. The verifier can judge.
A small example: unit test versus proof
The easiest way to see the difference is to compare a normal test with a property proof.
Unit test
A test checks one example.
```python
def test_parse_empty_list():
    assert parse_int_list("[]") == []
```
Useful, but limited. It tells you one case works.
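To make that limitation concrete, here is a hypothetical buggy implementation (the body is invented for illustration) that passes the single test above while being wrong on nearly every other input:

```python
import json

# Hypothetical buggy implementation: the example-based test above
# passes, so the bug goes undetected.
def parse_int_list(s: str) -> list[int]:
    if s == "[]":
        return []
    # Bug: silently drops the last element of non-empty lists.
    return json.loads(s)[:-1]

assert parse_int_list("[]") == []       # the test we wrote: passes
assert parse_int_list("[1, 2]") == [1]  # the bug we never tested for
```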
Proof-oriented thinking
In Dafny, you can express a property, not just an example:
```dafny
// The specification: what "sum of a sequence" means.
function SumSeq(xs: seq<int>): int
  decreases |xs|
{
  if |xs| == 0 then 0
  else xs[0] + SumSeq(xs[1..])
}

// The implementation: an ordinary loop, verified against the spec.
method Sum(xs: seq<int>) returns (s: int)
  ensures s == SumSeq(xs)
{
  var i := 0;
  s := 0;
  while i < |xs|
    invariant 0 <= i <= |xs|
    // What we have plus what remains equals the total.
    invariant s + SumSeq(xs[i..]) == SumSeq(xs)
  {
    s := s + xs[i];
    i := i + 1;
  }
}
```
The function SumSeq defines what the sum means. The method Sum computes it. The verifier checks that they agree for all possible inputs. If the loop invariant or postcondition is wrong, the verifier rejects the code at compile time, not at runtime.
This is not about making the code shorter. It is about making the intended behavior explicit enough that the verifier can check it.
For engineering teams, the key shift is this: tests tell you what you checked. Proofs tell you what cannot happen, within the model you specified.
That is a different quality gate.
What this means for engineering teams
If vericoding goes mainstream, it will not mean every repository becomes a proof project. It will mean teams start separating code into tiers of assurance.
Assurance tiers
| Tier | Examples | Verification approach |
|---|---|---|
| Ordinary product code | UI code, glue code, internal scripts, non-critical services | Test-driven. Formal verification is overkill |
| Important invariants | Auth/authz logic, payment state transitions, concurrency-critical code, parsers, protocol logic, boundary checks, algorithms with costly edge cases | Tests are valuable but incomplete. This is where adoption will happen first |
| Safety and correctness critical | Compilers, cryptography, medical devices, industrial control, infrastructure automation | Proofs should become standard practice |
Tier 2 is the interesting zone. These are the places where AI-generated code can silently introduce subtle errors, and where tests are valuable but do not cover the full state space.
Practical adoption path
If you are an engineering lead, the first change is not “hire proof engineers.” The first change is to identify the code paths where a proof would pay for itself.
| Step | Action |
|---|---|
| 1 | Pick one invariant that already hurts you. “Balance never goes negative,” “state machine only advances forward,” “permission checks happen before side effects” |
| 2 | Rewrite that invariant in a verification-friendly form. Preconditions, postconditions, or a small algebraic model |
| 3 | Let the LLM draft the proof skeleton. Treat it like a senior assistant, not an oracle |
| 4 | Make the verifier the final authority. If it does not prove, it is not done |
| 5 | Track the cost. If the proof costs more than the bug it prevents, do not expand the scope yet |
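As a sketch of step 2, here is the invariant “state machine only advances forward” restated as a small checkable model. The order states and transition table are hypothetical, chosen only to illustrate the shape of a verification-friendly rewrite:

```python
from enum import IntEnum

# Hypothetical order lifecycle, used only to illustrate step 2.
class OrderState(IntEnum):
    CREATED = 0
    PAID = 1
    SHIPPED = 2
    DELIVERED = 3

# Small algebraic model: the complete set of legal transitions.
ALLOWED = {
    OrderState.CREATED: {OrderState.PAID},
    OrderState.PAID: {OrderState.SHIPPED},
    OrderState.SHIPPED: {OrderState.DELIVERED},
    OrderState.DELIVERED: set(),
}

def advance(current: OrderState, target: OrderState) -> OrderState:
    # The property a verifier would prove once and for all: every
    # transition in ALLOWED strictly increases the state, so the
    # machine can only move forward.
    assert target in ALLOWED[current], "illegal transition"
    return target
```

Once the invariant lives in a table like `ALLOWED`, “the machine only advances” becomes a fact about that table, which is exactly the kind of property a tool like Dafny or Nagini can prove for all inputs rather than test by example.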
That is how formal verification moves from ideology to engineering economics.
Why this may happen sooner than people expect
The historical blocker was not philosophical disagreement. It was developer time.
That blocker is changing because AI lowers the marginal cost of proof iteration. A human engineer still has to understand the property, but the mechanical work of trying lemmas, massaging invariants, and recovering from failed proof attempts can now be offloaded in part.
This is exactly why Kleppmann’s framing matters. The conversation is no longer “Can formal verification replace testing?” That was never the right question. The real question is “Can AI make formal verification cheap enough for selected, high-value parts of ordinary software?” The answer is increasingly yes.
The reason this can go mainstream is not that proofs got less hard. It is that proof assistance got good enough to compress the hard part.
Honest take: what is still missing
There is a lot of hype around formal verification right now, and some of it should be resisted.
| Limitation | Why it matters |
|---|---|
| Proofs do not save bad specifications | A verifier proves the spec you wrote, not the intent you had in your head. If the spec is wrong, the proof is faithfully wrong. AI does not remove this problem |
| LLMs are not trusted proof engines | LLMs hallucinate. They can produce plausible proof text that fails immediately under the checker. The model is a productivity tool, not an authority |
| Adoption will be uneven | Infrastructure, security, compilers, and finance teams will adopt first. Many teams will never need it. That is normal |
| Tooling still needs ergonomics work | Even with Lean, Dafny, and related systems, onboarding cost is high. Better IDE feedback, proof debugging, and model assistance will matter more than grand theory |
| The mainstream story is partial, not total | The likely future is not “all software is formally verified.” It is “critical parts of many systems are verified, and LLMs make that practical enough to spread” |
That last point is still a major shift.
Closing thought
Vericoding is not a slogan. It is a likely next step in how serious software teams manage risk in an AI-heavy development world.
Tests will stay. Reviews will stay. Observability will stay. But for the parts of software where failure is expensive and invariants matter, proof-assisted development is becoming plausible in a way it simply was not a few years ago.
The first teams to adopt it will not be the ones trying to prove everything. They will be the ones who realize that a verifier plus an LLM is sometimes cheaper than a long tail of incidents, regressions, and defensive code.
And that is when formal verification stops being academic and starts being operational.
Sources
| Source | URL |
|---|---|
| Bursuc et al., “A benchmark for vericoding” (Sept 2025) | arxiv.org |
| Martin Kleppmann, “AI will make formal verification go mainstream” (Dec 2025) | martin.kleppmann.com |
| JetBrains Research, “Can LLMs Enable Verification in Mainstream Programming?” (March 2025) | arxiv.org |
| Harmonic / Aristotle | harmonic.fun, aristotle.harmonic.fun |
| Logical Intelligence | logicalintelligence.com |
| DeepSeek-Prover-V2 (April 2025) | arxiv.org |
| Dafny | dafny.org |
| Lean | lean-lang.org |
| Nagini (Eilers and Mueller, CAV 2018) | GitHub, ETH Zurich |