What vericoding means, and why it matters now

“Vericoding” is the idea that software is no longer judged only by tests, code review, and runtime monitoring. It is judged by whether the intended behavior can be proved.

The term was introduced in a September 2025 paper by Bursuc et al. (BAIF + MIT, including Max Tegmark): “A benchmark for vericoding: formally verified program synthesis”. They define it as LLM-generation of formally verified code from formal specifications, as the counterpart to “vibe coding,” which generates potentially broken code from natural language.

That framing matters because it captures what has changed. AI systems are now generating a meaningful share of production code. At the same time, formal verification tools have become more practical, more integrated, and more visible to engineers outside academia. The shift is not that proofs suddenly got easier. The shift is that LLMs can now sit in the middle of the proof process and reduce the cost of writing and maintaining proofs.

Martin Kleppmann published a clear prediction on 8 December 2025: AI will make formal verification go mainstream. His argument is threefold. First, formal verification will become drastically cheaper. Second, AI-generated code needs formal verification so that human review can be skipped safely. Third, the precision of formal verification complements the probabilistic nature of LLMs. He makes the point explicitly: it does not matter if LLMs hallucinate, because the proof checker will reject invalid proofs.

That is stronger than “worth taking seriously.” It is a direct prediction that these two fields are converging into a practical engineering discipline.

This is the opposite of vibe coding. Vibe coding accepts uncertainty and moves fast. Vericoding accepts that when software is high stakes, speed without guarantees is debt.

Why formal verification never became mainstream

The honest answer is simple. It was too expensive, too slow, and too academic.

| Barrier | What happened |
| --- | --- |
| Manual proof burden | Engineers had to express precise specs, invariants, pre/postconditions, and then prove the implementation matches. More work than writing the code itself |
| Wrong target audience | Tools were optimized for mathematicians and niche high-assurance teams. Product teams needed speed, not theorem proving |
| Fragmented toolchain | Teams had to choose between language features, separate proof assistants, custom annotations, and special-purpose frameworks that sat outside the normal developer loop |

The tools were always powerful

Dafny is a verification-aware programming language with native support for specifications and a static program verifier. It compiles to C#, Java, JavaScript, Go, and Python, and ships with IDE plugins, a language server, and tooling designed to fit development workflows.

Lean is an open-source programming language and proof assistant. Its core design emphasizes a minimal trusted kernel for absolute correctness, powerful automation, and extensibility for mathematical and software verification.
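To show what "the kernel checks everything" means in practice, here is a minimal Lean 4 sketch. The `double` function and its theorem are invented for illustration; `Nat.add_comm` is a standard library lemma. The point is the workflow: if a proof term is wrong, the file simply does not compile.

```lean
-- A trivial property, stated and proved by appealing to a library lemma.
-- If the proof term were wrong, Lean's kernel would reject the file.
example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- Properties of your own definitions work the same way.
def double (n : Nat) : Nat := n + n

-- Holds by definitional unfolding, so `rfl` closes the goal.
theorem double_eq (n : Nat) : double n = n + n := rfl
```

There is no "run the tests and hope" step: either the proof checks for all naturals, or there is no proof.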

Nagini is an automated verifier for statically-typed, concurrent Python 3 programs, built on the Viper verification infrastructure. It can verify memory safety, functional properties, termination, deadlock freedom, and input/output behavior. Source: Eilers and Mueller, CAV 2018.

The value was real. The adoption path was steep.

The turning point: LLMs as proof assistants

The most important change is not that LLMs reason more reliably than proof engineers. They do not. The change is that they can absorb the repetitive, error-prone part of proof construction.

That matters because most verification work is not about deep insight every minute. It is about getting the specification shape right, decomposing the theorem, finding the right lemma, and iterating when the verifier rejects the first attempt.

A March 2025 paper from JetBrains Research (“Can LLMs Enable Verification in Mainstream Programming?”) tested LLMs on verified code synthesis across three verification languages: Dafny, Nagini, and Verus. Claude 3.5 Sonnet substantially outperformed the other models tested (GPT-3.5, GPT-4o, Claude 3 Opus), leading the researchers to focus exclusively on it.

The practical implication is straightforward: LLMs are becoming proof assistants, not proof substitutes. They can propose invariants, fill in boilerplate, and help translate programmer intent into verifier-friendly structure. The human still owns the specification, the review of the model’s output, and the final judgment.

What changed technically

Three things changed at once:

| Change | Why it matters |
| --- | --- |
| LLMs got better at syntax and idiom | They can now write code in proof-oriented languages without complete nonsense in every other line |
| Verification tools improved their UX | Modern Lean, Dafny, and related tools are more scriptable, better documented, and more integrated into editor workflows |
| The failure cost of normal software got higher | AI-generated code scales output but also scales mistakes, especially in authorization, state transitions, concurrency, and data handling |

Once those three curves intersect, proof tooling becomes economically interesting.

The current stack: Lean, Dafny, and Nagini

A useful way to think about vericoding is to separate the proof language from the production language.

| Tool | Strength | Best for | Source |
| --- | --- | --- | --- |
| Lean | Powerful theorem proving, minimal trusted kernel, strong automation | Core logic, algorithms, protocol properties, mathematical verification | lean-lang.org |
| Dafny | Code and specs in one place, compiles to C#/Java/JS/Go/Python, IDE integration | Application-level verification, parsers, critical subsystems | dafny.org |
| Nagini | Python verification, concurrency support, Viper-based | Python codebases needing stronger guarantees in specific modules | GitHub |

Lean is especially strong when the specification problem is central and you want powerful theorem proving infrastructure. It is not the first tool most application teams will adopt for everyday service code, but it may become the strongest tool for verifying core logic, algorithms, and protocol properties.

Dafny is closer to the application engineer’s world. It lets you write code and specifications in one place, then verify that the implementation meets them. That lowers the distance between proof and production. A team can verify an algorithm, a parser, or a critical subsystem without living entirely inside proof scripts.

Nagini verifies Python programs with a focus on memory safety, functional properties, termination, and deadlock freedom for concurrent code. That makes it relevant for teams that want to stay in Python but need stronger guarantees in specific parts of the system.
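To show the shape of contract-style specification without requiring the Nagini toolchain, here is a plain-Python sketch that checks a precondition and postcondition at runtime. The `contract` decorator is hand-rolled for illustration and is not Nagini's API; the crucial difference is that Nagini proves such conditions statically, for all inputs, rather than checking them once per call.

```python
# Plain-Python sketch of contract checking. Tools like Nagini prove
# conditions like these statically for all inputs; this decorator only
# checks them at runtime, to show the specification shape.
from functools import wraps

def contract(pre, post):
    def deco(fn):
        @wraps(fn)
        def wrapper(*args):
            assert pre(*args), "precondition violated"
            result = fn(*args)
            assert post(result, *args), "postcondition violated"
            return result
        return wrapper
    return deco

@contract(
    pre=lambda xs: all(isinstance(x, int) for x in xs),
    post=lambda result, xs: result == sum(xs),
)
def sum_list(xs):
    total = 0
    for x in xs:
        total += x
    return total
```

Runtime contracts fail on the inputs you happen to see; a static verifier fails at build time on the inputs you never thought of.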

The startups making this real

Academic progress matters, but mainstream adoption usually needs product pressure.

| Company | What they do | Why it matters | Source |
| --- | --- | --- | --- |
| Harmonic | Built Aristotle, an AI system that translates natural-language math problems into formally verifiable proofs in Lean 4. Claimed gold-medal-level IMO performance. $1.45B valuation, Series C ($120M+). Co-founded by Vlad Tenev (Robinhood CEO) | Proof that formal reasoning scales commercially. Positions “hallucination-free” mathematical AI as a product category | harmonic.fun, Yahoo Finance |
| Logical Intelligence | Building energy-based reasoning models (not LLM-based) that are deterministic and mathematically verifiable. Products: Aleph (formal verification + code generation with machine-checkable proofs), Kona (energy-based reasoning platform). Focus: autonomy, finance, defense. Yann LeCun on advisory board | A different approach: deterministic AI for critical systems, explicitly positioned as an alternative to probabilistic LLMs | logicalintelligence.com, BusinessWire |
| DeepSeek (Prover-V2) | Open-source LLM for formal theorem proving in Lean 4. 671B parameter variant achieves 88.9% on MiniF2F-test and solves 49/658 PutnamBench problems. Uses reinforcement learning with subgoal decomposition | Proof generation treated as a dedicated model capability. When model families are explicitly trained for proof tasks, the old argument that proof writing will always be too manual starts to weaken | arXiv:2504.21801, GitHub |

The pattern across all three: verification is not just a research demo anymore. It is being productized.

The caveat is obvious. A prover model is not the same thing as a verified program. It can propose proof steps, but the checker still decides whether the proof is valid. That separation is exactly why vericoding is viable. The model can assist. The verifier can judge.

A small example: unit test versus proof

The easiest way to see the difference is to compare a normal test with a property proof.

Unit test

A test checks one example.

```python
def test_parse_empty_list():
    assert parse_int_list("[]") == []
```

Useful, but limited. It tells you one case works.
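Between a single example and a full proof sits randomized property testing. The sketch below invents a toy `parse_int_list` and `format_int_list` purely for illustration (the original test's parser is not shown in this article) and checks a round-trip property over a thousand random inputs. That is far broader than one example, but it is still sampling a finite slice of an infinite input space, not a proof.

```python
# A property-based check: for many random inputs, formatting then
# parsing should round-trip. Broader than one example, but still a
# finite sample of an infinite input space, not a proof.
import random

def parse_int_list(s: str) -> list[int]:
    # Toy implementation, invented for illustration only.
    body = s.strip()[1:-1].strip()
    return [int(tok) for tok in body.split(",")] if body else []

def format_int_list(xs: list[int]) -> str:
    return "[" + ", ".join(str(x) for x in xs) + "]"

random.seed(0)  # deterministic sampling for reproducibility
for _ in range(1000):
    xs = [random.randint(-10**6, 10**6) for _ in range(random.randrange(8))]
    assert parse_int_list(format_int_list(xs)) == xs
```

A verifier, by contrast, would establish the round-trip property for every input at once, with no sampling at all.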

Proof-oriented thinking

In Dafny, you can express a property, not just an example:

```dafny
function SumSeq(xs: seq<int>): int
  decreases |xs|
{
  if |xs| == 0 then 0
  else xs[0] + SumSeq(xs[1..])
}

method Sum(xs: seq<int>) returns (s: int)
  ensures s == SumSeq(xs)
{
  var i := 0;
  s := 0;
  while i < |xs|
    invariant 0 <= i <= |xs|
    // The suffix form matches how SumSeq unfolds (head + tail),
    // so the verifier can discharge it without an induction lemma.
    invariant s + SumSeq(xs[i..]) == SumSeq(xs)
  {
    // Help the verifier relate the two slices of the suffix.
    assert xs[i..][1..] == xs[i+1..];
    s := s + xs[i];
    i := i + 1;
  }
}
```

The function SumSeq defines what the sum means. The method Sum computes it. The verifier checks that they agree for all possible inputs. If the loop invariant or postcondition is wrong, the verifier rejects the code at compile time, not at runtime.

This is not about making the code shorter. It is about making the intended behavior explicit enough that the verifier can check it.

For engineering teams, the key shift is this: tests tell you what you checked. Proofs tell you what cannot happen, within the model you specified.

That is a different quality gate.

What this means for engineering teams

If vericoding goes mainstream, it will not mean every repository becomes a proof project. It will mean teams start separating code into tiers of assurance.

Assurance tiers

| Tier | Examples | Verification approach |
| --- | --- | --- |
| 1. Ordinary product code | UI code, glue code, internal scripts, non-critical services | Test-driven. Formal verification is overkill |
| 2. Important invariants | Auth/authz logic, payment state transitions, concurrency-critical code, parsers, protocol logic, boundary checks, algorithms with costly edge cases | Tests are valuable but incomplete. This is where adoption will happen first |
| 3. Safety and correctness critical | Compilers, cryptography, medical devices, industrial control, infrastructure automation | Proofs should become standard practice |

Tier 2 is the interesting zone. These are the places where AI-generated code can silently introduce subtle errors, and where tests are valuable but do not cover the full state space.

Practical adoption path

If you are an engineering lead, the first change is not “hire proof engineers.” The first change is to identify the code paths where a proof would pay for itself.

| Step | Action |
| --- | --- |
| 1 | Pick one invariant that already hurts you. “Balance never goes negative,” “state machine only advances forward,” “permission checks happen before side effects” |
| 2 | Rewrite that invariant in a verification-friendly form. Preconditions, postconditions, or a small algebraic model |
| 3 | Let the LLM draft the proof skeleton. Treat it like a senior assistant, not an oracle |
| 4 | Make the verifier the final authority. If it does not prove, it is not done |
| 5 | Track the cost. If the proof costs more than the bug it prevents, do not expand the scope yet |
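Steps 1 and 2 can be made concrete before any verifier is chosen. Here "balance never goes negative" becomes a precondition plus a postcondition. The `Account` class is invented for this example, and the asserts are runtime stand-ins for conditions a tool like Dafny or Nagini would instead discharge statically, for all possible call sequences.

```python
# "Balance never goes negative" written as explicit pre/postconditions.
# The asserts are runtime stand-ins for obligations a verifier would
# prove statically; Account is an invented example class.

class Account:
    def __init__(self, balance: int):
        assert balance >= 0          # establish the invariant
        self.balance = balance

    def withdraw(self, amount: int) -> None:
        # Precondition: the request must be positive and covered.
        assert 0 < amount <= self.balance
        self.balance -= amount
        # Postcondition: the invariant is preserved.
        assert self.balance >= 0
```

Writing the invariant this explicitly is most of the work of step 2; the verification-language version is largely a translation of these conditions.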

That is how formal verification moves from ideology to engineering economics.

Why this may happen sooner than people expect

The historical blocker was not philosophical disagreement. It was developer time.

That blocker is changing because AI lowers the marginal cost of proof iteration. A human engineer still has to understand the property, but the mechanical work of trying lemmas, massaging invariants, and recovering from failed proof attempts can now be offloaded in part.

This is exactly why Kleppmann’s framing matters. The conversation is no longer “Can formal verification replace testing?” That was never the right question. The real question is “Can AI make formal verification cheap enough for selected, high-value parts of ordinary software?” The answer is increasingly yes.

The reason this can go mainstream is not that proofs got less hard. It is that proof assistance got good enough to compress the hard part.

Honest take: what is still missing

There is a lot of hype around formal verification right now, and some of it should be resisted.

| Limitation | Why it matters |
| --- | --- |
| Proofs do not save bad specifications | A verifier proves the spec you wrote, not the intent you had in your head. If the spec is wrong, the proof is faithfully wrong. AI does not remove this problem |
| LLMs are not trusted proof engines | LLMs hallucinate. They can produce plausible proof text that fails immediately under the checker. The model is a productivity tool, not an authority |
| Adoption will be uneven | Infrastructure, security, compilers, and finance teams will adopt first. Many teams will never need it. That is normal |
| Tooling still needs ergonomics work | Even with Lean, Dafny, and related systems, onboarding cost is high. Better IDE feedback, proof debugging, and model assistance will matter more than grand theory |
| The mainstream story is partial, not total | The likely future is not “all software is formally verified.” It is “critical parts of many systems are verified, and LLMs make that practical enough to spread” |

That last point is still a major shift.

Closing thought

Vericoding is not a slogan. It is a likely next step in how serious software teams manage risk in an AI-heavy development world.

Tests will stay. Reviews will stay. Observability will stay. But for the parts of software where failure is expensive and invariants matter, proof-assisted development is becoming plausible in a way it simply was not a few years ago.

The first teams to adopt it will not be the ones trying to prove everything. They will be the ones who realize that a verifier plus an LLM is sometimes cheaper than a long tail of incidents, regressions, and defensive code.

And that is when formal verification stops being academic and starts being operational.

Sources

| Source | URL |
| --- | --- |
| Bursuc et al., “A benchmark for vericoding” (Sept 2025) | arxiv.org |
| Martin Kleppmann, “AI will make formal verification go mainstream” (Dec 2025) | martin.kleppmann.com |
| JetBrains Research, “Can LLMs Enable Verification in Mainstream Programming?” (March 2025) | arxiv.org |
| Harmonic / Aristotle | harmonic.fun, aristotle.harmonic.fun |
| Logical Intelligence | logicalintelligence.com |
| DeepSeek-Prover-V2 (April 2025) | arxiv.org |
| Dafny | dafny.org |
| Lean | lean-lang.org |
| Nagini (Eilers and Mueller, CAV 2018) | GitHub, ETH Zurich |