What vericoding means, and why it matters now
“Vericoding” is the idea that software is no longer judged only by tests, code review, and runtime monitoring. It is judged by whether the intended behavior can be proved.
The term was introduced in a September 2025 paper by Bursuc et al. (BAIF + MIT, including Max Tegmark): “A benchmark for vericoding: formally verified program synthesis”. They define it as LLM-generation of formally verified code from formal specifications, as the counterpart to “vibe coding,” which generates potentially broken code from natural language.
That framing matters because it captures what has changed. AI systems are now generating a meaningful share of production code. At the same time, formal verification tools have become more practical, more integrated, and more visible to engineers outside academia. The shift is not that proofs suddenly got easier. The shift is that LLMs can now sit in the middle of the proof process and reduce the cost of writing and maintaining proofs.
Martin Kleppmann published a clear prediction on 8 December 2025: AI will make formal verification go mainstream. His argument is threefold. First, formal verification will become drastically cheaper. Second, AI-generated code needs formal verification so that human review can be skipped safely. Third, the precision of formal verification complements the probabilistic nature of LLMs. He makes the point explicitly: it does not matter if LLMs hallucinate, because the proof checker will reject invalid proofs.
That is stronger than “worth taking seriously.” It is a direct prediction that these two fields are converging into a practical engineering discipline.
This is the opposite of vibe coding. Vibe coding accepts uncertainty and moves fast. Vericoding accepts that when software is high stakes, speed without guarantees is debt.
Why formal verification never became mainstream
The honest answer is simple. It was too expensive, too slow, and too academic.
| Barrier | What happened |
|---|---|
| Manual proof burden | Engineers had to express precise specs, invariants, pre/postconditions, and then prove the implementation matches. More work than writing the code itself |
| Wrong target audience | Tools were optimized for mathematicians and niche high-assurance teams. Product teams needed speed, not theorem proving |
| Fragmented toolchain | Teams had to choose between language features, separate proof assistants, custom annotations, and special-purpose frameworks that sat outside the normal developer loop |
The tools were always powerful
Dafny is a verification-aware programming language with native support for specifications and a static program verifier. It compiles to C#, Java, JavaScript, Go, and Python, and ships with IDE plugins, a language server, and tooling designed to fit development workflows.
Lean is an open-source programming language and proof assistant. Its core design emphasizes a minimal trusted kernel for absolute correctness, powerful automation, and extensibility for mathematical and software verification.
Nagini is an automated verifier for statically-typed, concurrent Python 3 programs, built on the Viper verification infrastructure. It can verify memory safety, functional properties, termination, deadlock freedom, and input/output behavior. Source: Eilers and Mueller, CAV 2018.
The value was real. The adoption path was steep.
The turning point: LLMs as proof assistants
The most important change is not that LLMs are smarter than proofs. They are not. The change is that they can absorb the repetitive, error-prone part of proof construction.
That matters because most verification work is not about deep insight every minute. It is about getting the specification shape right, decomposing the theorem, finding the right lemma, and iterating when the verifier rejects the first attempt.
A March 2025 paper from JetBrains Research (“Can LLMs Enable Verification in Mainstream Programming?”) tested LLMs on verified code synthesis across three verification languages: Dafny, Nagini, and Verus. Claude 3.5 Sonnet massively surpassed the other models tested (GPT-3.5, GPT-4o, Claude 3 Opus), leading the researchers to focus exclusively on it. The practical implication is straightforward: LLMs are becoming proof assistants, not proof substitutes. They can propose invariants, fill in boilerplate, and help translate programmer intent into verifier-friendly structure. The human still owns the specification, the review of the model’s output, and the final judgment.
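The workflow this implies can be sketched as a small loop. The sketch below is an assumption about the shape of the process, not any tool’s real API: `draft_proof` stands in for an LLM call and `check` for a verifier such as Dafny or Lean.

```python
# Minimal sketch of the propose-and-check loop described above.
# `draft_proof` and `check` are hypothetical stand-ins, not a real API.

def vericoding_loop(spec, draft_proof, check, max_attempts=5):
    """Iterate untrusted LLM proposals until the trusted checker accepts."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        candidate = draft_proof(spec, feedback)  # untrusted proposal
        ok, feedback = check(spec, candidate)    # trusted judgment
        if ok:
            return candidate, attempt
    return None, max_attempts

# Toy stand-ins: the "verifier" accepts only the exact string "qed"
# and always hints at it, so the loop converges on the second attempt.
def fake_llm(spec, feedback):
    return "qed" if feedback == "try qed" else "sorry"

def fake_verifier(spec, candidate):
    return (candidate == "qed", "try qed")

proof, attempts = vericoding_loop("s == SumSeq(xs)", fake_llm, fake_verifier)
```

The division of labor is the whole point: the model may emit nonsense on any given attempt, but only the checker decides when the loop terminates successfully.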
What changed technically
Three things changed at once:
| Change | Why it matters |
|---|---|
| LLMs got better at syntax and idiom | They can now write code in proof-oriented languages without complete nonsense in every other line |
| Verification tools improved their UX | Modern Lean, Dafny, and related tools are more scriptable, better documented, and more integrated into editor workflows |
| The failure cost of normal software got higher | AI-generated code scales output but also scales mistakes, especially in authorization, state transitions, concurrency, and data handling |
Once those three curves intersect, proof tooling becomes economically interesting.
The current stack: Lean, Dafny, and Nagini
A useful way to think about vericoding is to separate the proof language from the production language.
| Tool | Strength | Best for | Source |
|---|---|---|---|
| Lean | Powerful theorem proving, minimal trusted kernel, strong automation | Core logic, algorithms, protocol properties, mathematical verification | lean-lang.org |
| Dafny | Code and specs in one place, compiles to C#/Java/JS/Go/Python, IDE integration | Application-level verification, parsers, critical subsystems | dafny.org |
| Nagini | Python verification, concurrency support, Viper-based | Python codebases needing stronger guarantees in specific modules | GitHub |
Lean is especially strong when the specification problem is central and you want powerful theorem proving infrastructure. It is not the first tool most application teams will adopt for everyday service code, but it may become the strongest tool for verifying core logic, algorithms, and protocol properties.
Dafny is closer to the application engineer’s world. It lets you write code and specifications in one place, then verify that the implementation meets them. That lowers the distance between proof and production. A team can verify an algorithm, a parser, or a critical subsystem without living entirely inside proof scripts.
Nagini verifies Python programs with a focus on memory safety, functional properties, termination, and deadlock freedom for concurrent code. That makes it relevant for teams that want to stay in Python but need stronger guarantees in specific parts of the system.
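To give a feel for the contract style Nagini checks, here is a plain-Python analogue using runtime asserts; the function and its conditions are hypothetical examples, not Nagini syntax. Nagini’s contribution is proving conditions like these statically, for all inputs, rather than spot-checking the inputs you happen to run.

```python
# Plain-Python illustration of pre/postconditions (hypothetical example).
# A verifier like Nagini proves these hold for every input at analysis
# time; asserts only check the inputs that actually execute.

def withdraw(balance: int, amount: int) -> int:
    assert amount >= 0          # precondition
    assert balance >= amount    # precondition: cannot overdraw
    new_balance = balance - amount
    assert new_balance >= 0     # postcondition: balance never negative
    return new_balance
```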
The startups making this real
Academic progress matters, but mainstream adoption usually needs product pressure.
| Company | What they do | Why it matters | Source |
|---|---|---|---|
| Harmonic | Built Aristotle, an AI system that translates natural-language math problems into formally verifiable proofs in Lean 4. Claimed gold-medal-level IMO performance. $1.45B valuation, Series C ($120M+). Co-founded by Vlad Tenev (Robinhood CEO) | Proof that formal reasoning scales commercially. Positions “hallucination-free” mathematical AI as a product category | harmonic.fun, Yahoo Finance |
| Logical Intelligence | Building energy-based reasoning models (not LLM-based) that are deterministic and mathematically verifiable. Products: Aleph (formal verification + code generation with machine-checkable proofs), Kona (energy-based reasoning platform). Focus: autonomy, finance, defense. Yann LeCun on advisory board | A different approach: deterministic AI for critical systems, explicitly positioned as an alternative to probabilistic LLMs | logicalintelligence.com, BusinessWire |
| DeepSeek (Prover-V2) | Open-source LLM for formal theorem proving in Lean 4. 671B parameter variant achieves 88.9% on MiniF2F-test and solves 49/658 PutnamBench problems. Uses reinforcement learning with subgoal decomposition | Proof generation treated as a dedicated model capability. When model families are explicitly trained for proof tasks, the old argument that proof writing will always be too manual starts to weaken | arXiv:2504.21801, GitHub |
The pattern across all three: verification is not just a research demo anymore. It is being productized.
The caveat is obvious. A prover model is not the same thing as a verified program. It can propose proof steps, but the checker still decides whether the proof is valid. That separation is exactly why vericoding is viable. The model can assist. The verifier can judge.
A small example: unit test versus proof
The easiest way to see the difference is to compare a normal test with a property proof.
Unit test
A test checks one example.
```python
def test_parse_empty_list():
    assert parse_int_list("[]") == []
```
Useful, but limited. It tells you one case works.
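To make that limitation concrete, here is a hypothetical buggy implementation (the body is invented for illustration) that passes the single test above while being wrong on nearly every other input:

```python
import json

# Hypothetical buggy implementation: the example-based test above
# passes, so the bug goes undetected.
def parse_int_list(s: str) -> list[int]:
    if s == "[]":
        return []
    # Bug: silently drops the last element of non-empty lists.
    return json.loads(s)[:-1]

assert parse_int_list("[]") == []       # the test we wrote: passes
assert parse_int_list("[1, 2]") == [1]  # the bug we never tested for
```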
Proof-oriented thinking
In Dafny, you can express a property, not just an example:
```dafny
// The specification: what "sum of a sequence" means.
function SumSeq(xs: seq<int>): int
  decreases |xs|
{
  if |xs| == 0 then 0
  else xs[0] + SumSeq(xs[1..])
}

// The implementation: an ordinary loop, verified against the spec.
method Sum(xs: seq<int>) returns (s: int)
  ensures s == SumSeq(xs)
{
  var i := 0;
  s := 0;
  while i < |xs|
    invariant 0 <= i <= |xs|
    // What we have plus what remains equals the total.
    invariant s + SumSeq(xs[i..]) == SumSeq(xs)
  {
    s := s + xs[i];
    i := i + 1;
  }
}
```
The function SumSeq defines what the sum means. The method Sum computes it. The verifier checks that they agree for all possible inputs. If the loop invariant or postcondition is wrong, the verifier rejects the code at compile time, not at runtime.
This is not about making the code shorter. It is about making the intended behavior explicit enough that the verifier can check it.
For engineering teams, the key shift is this: tests tell you what you checked. Proofs tell you what cannot happen, within the model you specified.
That is a different quality gate.
What this means for engineering teams
If vericoding goes mainstream, it will not mean every repository becomes a proof project. It will mean teams start separating code into tiers of assurance.
Assurance tiers
| Tier | Examples | Verification approach |
|---|---|---|
| Ordinary product code | UI code, glue code, internal scripts, non-critical services | Test-driven. Formal verification is overkill |
| Important invariants | Auth/authz logic, payment state transitions, concurrency-critical code, parsers, protocol logic, boundary checks, algorithms with costly edge cases | Tests are valuable but incomplete. This is where adoption will happen first |
| Safety and correctness critical | Compilers, cryptography, medical devices, industrial control, infrastructure automation | Proofs should become standard practice |
Tier 2 is the interesting zone. These are the places where AI-generated code can silently introduce subtle errors, and where tests are valuable but do not cover the full state space.
Practical adoption path
If you are an engineering lead, the first change is not “hire proof engineers.” The first change is to identify the code paths where a proof would pay for itself.
| Step | Action |
|---|---|
| 1 | Pick one invariant that already hurts you. “Balance never goes negative,” “state machine only advances forward,” “permission checks happen before side effects” |
| 2 | Rewrite that invariant in a verification-friendly form. Preconditions, postconditions, or a small algebraic model |
| 3 | Let the LLM draft the proof skeleton. Treat it like a senior assistant, not an oracle |
| 4 | Make the verifier the final authority. If it does not prove, it is not done |
| 5 | Track the cost. If the proof costs more than the bug it prevents, do not expand the scope yet |
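As a sketch of step 2, here is the invariant “state machine only advances forward” restated as a small checkable model. The order states and transition table are hypothetical, chosen only to illustrate the shape of a verification-friendly rewrite:

```python
from enum import IntEnum

# Hypothetical order lifecycle, used only to illustrate step 2.
class OrderState(IntEnum):
    CREATED = 0
    PAID = 1
    SHIPPED = 2
    DELIVERED = 3

# Small algebraic model: the complete set of legal transitions.
ALLOWED = {
    OrderState.CREATED: {OrderState.PAID},
    OrderState.PAID: {OrderState.SHIPPED},
    OrderState.SHIPPED: {OrderState.DELIVERED},
    OrderState.DELIVERED: set(),
}

def advance(current: OrderState, target: OrderState) -> OrderState:
    # The property a verifier would prove once and for all: every
    # transition in ALLOWED strictly increases the state, so the
    # machine can only move forward.
    assert target in ALLOWED[current], "illegal transition"
    return target
```

Once the invariant lives in a table like `ALLOWED`, “the machine only advances” becomes a fact about that table, which is exactly the kind of property a tool like Dafny or Nagini can prove for all inputs rather than test by example.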
That is how formal verification moves from ideology to engineering economics.
Why this may happen sooner than people expect
The historical blocker was not philosophical disagreement. It was developer time.
That blocker is changing because AI lowers the marginal cost of proof iteration. A human engineer still has to understand the property, but the mechanical work of trying lemmas, massaging invariants, and recovering from failed proof attempts can now be offloaded in part.
This is exactly why Kleppmann’s framing matters. The conversation is no longer “Can formal verification replace testing?” That was never the right question. The real question is “Can AI make formal verification cheap enough for selected, high-value parts of ordinary software?” The answer is increasingly yes.
The reason this can go mainstream is not that proofs got less hard. It is that proof assistance got good enough to compress the hard part.
Honest take: what is still missing
There is a lot of hype around formal verification right now, and some of it should be resisted.
| Limitation | Why it matters |
|---|---|
| Proofs do not save bad specifications | A verifier proves the spec you wrote, not the intent you had in your head. If the spec is wrong, the proof is faithfully wrong. AI does not remove this problem |
| LLMs are not trusted proof engines | LLMs hallucinate. They can produce plausible proof text that fails immediately under the checker. The model is a productivity tool, not an authority |
| Adoption will be uneven | Infrastructure, security, compilers, and finance teams will adopt first. Many teams will never need it. That is normal |
| Tooling still needs ergonomics work | Even with Lean, Dafny, and related systems, onboarding cost is high. Better IDE feedback, proof debugging, and model assistance will matter more than grand theory |
| The mainstream story is partial, not total | The likely future is not “all software is formally verified.” It is “critical parts of many systems are verified, and LLMs make that practical enough to spread” |
That last point is still a major shift.
Closing thought
Vericoding is not a slogan. It is a likely next step in how serious software teams manage risk in an AI-heavy development world.
Tests will stay. Reviews will stay. Observability will stay. But for the parts of software where failure is expensive and invariants matter, proof-assisted development is becoming plausible in a way it simply was not a few years ago.
The first teams to adopt it will not be the ones trying to prove everything. They will be the ones who realize that a verifier plus an LLM is sometimes cheaper than a long tail of incidents, regressions, and defensive code.
And that is when formal verification stops being academic and starts being operational.
Sources
| Source | URL |
|---|---|
| Bursuc et al., “A benchmark for vericoding” (Sept 2025) | arxiv.org |
| Martin Kleppmann, “AI will make formal verification go mainstream” (Dec 2025) | martin.kleppmann.com |
| JetBrains Research, “Can LLMs Enable Verification in Mainstream Programming?” (March 2025) | arxiv.org |
| Harmonic / Aristotle | harmonic.fun, aristotle.harmonic.fun |
| Logical Intelligence | logicalintelligence.com |
| DeepSeek-Prover-V2 (April 2025) | arxiv.org |
| Dafny | dafny.org |
| Lean | lean-lang.org |
| Nagini (Eilers and Mueller, CAV 2018) | GitHub, ETH Zurich |