For formally-verified mathematics and proof-carrying code, artificial general intelligence (AGI) arrives as the following milestones are achieved in coming years.
AI in 2025, like the top researchers, is going to:
AI in 2026, better than any human, is going to:
AI in 2027+, beyond all humans combined, is going to:
OpenAI’s o3 represents substantial progress in 🧮 general-domain reasoning with reinforcement learning. On FrontierMath 2024-11-26, o3 improved the state of the art from 2% to 25% accuracy. These are extremely difficult, well-established, held-out math problems.
Terrence Tao (Fields Medalist): These are extremely challenging… I think they will resist AIs for several years at least. 🎖️ Timothy Gowers (Fields Medalist): Getting even one question right would be well beyond what we can do now, let alone saturating them. Evan Chen (IMO Coach): These are genuinely hard problems… most of them look well above my pay grade.
Noam Brown: Saturating evals like @EpochAIResearch's Frontier Math would suggest 🦸 AI is surpassing top human intelligence in certain domains. When that happens we may see a broad acceleration in scientific research. Roughly: 25% are Tier 1 (advanced IMO/Putnam level), 50% are Tier 2 (extremely challenging grad-level), and 25% are Tier 3 (research problems). All can take hours—or days—for experts to solve.
👑 Solving [Tier 4 problems] would demonstrate capabilities on par with an entire top mathematics department. Each problem will be composed by a team of 1-3 mathematicians specialized in the same field over a 6-week period, with weekly opportunities to discuss ideas with teams in related fields. We seek broad coverage of mathematics and want all major subfields represented in Tier 4.
Can AI do maths yet? Thoughts from a mathematician.
The FrontierMath paper contains some quotes from mathematicians about the difficulty level of the problems. Tao (Fields Medal) says “These are extremely challenging” and suggests that they can only be tackled by a domain expert (and indeed the two sample questions which I could solve are in arithmetic, my area of expertise; 🧠 I failed to do all of the ones outside my area). Borcherds (also Fields Medal) however is quoted in the paper as saying that machines producing numerical answers “aren’t quite the same as coming up with original proofs”.
Machine-Assisted Proof, Terence Tao. American Mathematical Society, Jan 2025.
[AI] to generate (first drafts of) entire research papers, 💯 complete with formal verification, by explaining a result in natural language to an AI, who would try to formalize each step of the result and query the author whenever clarification is required.
… to produce the best estimate it can, given some set of initial hypotheses and methods, without first performing some pen-and-paper calculation to guess what that estimate would be [on] some integral expression involving one or more unknown functions (such as solutions to a PDE), 🧮 using bounds on such functions in various function space norms (such as Sobolev space norms), together with standard inequalities (e.g., the Hölder inequality and the Sobolev inequalities).
… to study hundreds of equations at once, perhaps working out an argument in full for just one equation and letting AI tools then adapt the arguments to large families of related equations, querying the author as necessary whenever the extension of the arguments is nonroutine… such as ⛵ automated exploration of conjectures in graph theory.
Formal Mathematical Reasoning: A New Frontier in AI (2024-12-20). Kaiyu Yang, Gabriel Poesia, Jingxuan He, Wenda Li, Kristin Lauter, Swarat Chaudhuri, Dawn Song.
Formal Mathematical Reasoning: A New Frontier in AI
🧬 AI agents might make autonomous contributions to human projects, and vice-versa, mediated by formal tools that guarantee correctness. This vision can likely borrow from previous experiences in distributed, collaborative computing platforms from other fields, like Folding@Home, where anyone can contribute computing power to protein folding. The same is possible for proving mathematical theorems.
➗ Theorem Proving. We might one day expect to have systems that go beyond what human mathematicians have been able to accomplish, solving existing open problems, or perhaps even formulating new interesting problems on their own. Currently, formal systems are typically used to formalize results that human mathematicians have first done informally. If one day AI becomes able to autonomously make mathematical discoveries, we can expect ✅ these discoveries to be made in a formal system.
🚗 Autoformalization. AI models should be able to invent novel mathematical definitions that can potentially reduce proof complexity. At this level, an AI model is closer to ✅ a “theory builder” that can reshape the proving process through better abstraction or concept formulation. For instance, filters (i.e., a set of sets satisfying certain properties) are rarely taught in standard math curriculum. However, as a convenient abstraction in mathematical analysis, they have become a widely adopted concept for formalizing limits in various proof systems.
💭 Conjecturing. The future AI systems might be able to formulate conjectures that go beyond current domains of mathematics. Some expect that a major milestone for future AI systems will be to ✅ both conjecture and prove a high-level, interesting mathematical result. While that goes far beyond current systems, such an achievement would mark an era where AIs work side-by-side with human mathematicians in expanding our collective knowledge of mathematics.
🛡️ Formal Verification and Verified Generation. AI systems make another leap by aiding users in deriving formal specifications, including specification generation, explanation, and debugging. To benchmark specification assistance, we can again leverage verified codebases and designs, but instead of generating proofs and code given the specifications, we treat ✅ the code and proofs as ground truth and use them to evaluate specifications produced by the model.
📝 Verified Reasoning in Natural Language. Real-world applications that involve complex reasoning are often not purely mathematical problems, yet they contain significant mathematical components along with other components such as common-sense and human preferences. Level 4 requires AI to recognize the mathematics in everyday tasks and apply rigorous reasoning. For example, in scenarios like travel planning or calendar scheduling, AI can potentially formulate these tasks as constraint satisfaction problems and ✅ solve them using appropriate solvers such as mixed-integer linear programming (MILP) solvers.