A defining memory from my senior year of high school was a nine-hour math exam with just six questions. Six of the top scorers won slots on the U.S. team for the International Mathematical Olympiad (IMO), the world’s longest-running math competition for high school students. I didn’t make the cut, but I became a tenured mathematics professor anyway.
This year’s olympiad, held last month on Australia’s Sunshine Coast, had an unusual sideshow. While students from 110 countries went to work on complex math problems using pen and paper, several AI companies quietly tested new models in development on a computerized approximation of the exam. Right after the closing ceremonies, OpenAI and later Google DeepMind announced that their models had earned (unofficial) gold medals for solving five of the six problems. Researchers like Sébastien Bubeck of OpenAI celebrated these models’ successes as a “moon landing moment” for the industry.
But are they? Is AI going to replace professional mathematicians? I’m still waiting for the proof.
The hype around this year’s AI results is easy to understand because the olympiad is hard. To wit: in my senior year of high school, I set aside calculus and linear algebra to focus on olympiad-style problems, which were more of a challenge. Plus, the cutting-edge models still in development did far better on the exam than the commercial models already available. In a parallel contest administered by MathArena.ai, Gemini 2.5 Pro, Grok 4, o3 high, o4-mini high and DeepSeek R1 all failed to produce a single completely correct solution. The contrast shows that AI models are getting smarter, their reasoning capabilities improving dramatically.
Yet I’m still not worried.
The latest models just got a good grade on a single test—as did many of the students—and a head-to-head comparison isn’t entirely fair. The models often employ a “best-of-n” strategy, generating multiple solutions and then grading themselves to select the strongest. This is akin to having several students work independently, then get together to pick the best solution and submit only that one. If the human contestants were allowed this option, their scores would likely improve too.
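To make that strategy concrete, here is a minimal sketch, in Python, of what a best-of-n loop might look like. The generate_solution and self_grade functions are hypothetical stand-ins for whatever the labs actually use; the real systems are far more elaborate.

import random

def generate_solution(problem: str) -> str:
    # Hypothetical stand-in: a real system would query a language model here.
    return f"candidate argument #{random.randint(0, 9999)} for: {problem}"

def self_grade(problem: str, candidate: str) -> float:
    # Hypothetical stand-in: a real system would ask the model to score its own attempt.
    return random.random()

def best_of_n(problem: str, n: int = 8) -> str:
    # Generate n independent attempts, then submit only the one the model rates most highly.
    candidates = [generate_solution(problem) for _ in range(n)]
    return max(candidates, key=lambda c: self_grade(problem, c))

print(best_of_n("Prove that the sum of two even integers is even."))

Sampling many attempts and keeping only the best one tends to raise the final score, which is part of why the comparison with a student sitting the exam once is not apples to apples.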
Other mathematicians are similarly cautioning against the hype. IMO gold medalist Terence Tao, now a mathematician at the University of California, Los Angeles, noted on Mastodon that what AI can do depends heavily on the testing methodology. And IMO president Gregor Dolinar said that the organization “cannot validate the methods [used by the AI models], including the amount of compute used or whether there was any human involvement, or whether the results can be reproduced.”
Besides, IMO exam questions don’t compare to the kinds of questions professional mathematicians try to answer, where it can take nine years, rather than nine hours, to solve a problem at the frontier of mathematical research. As Kevin Buzzard, a mathematics professor at Imperial College London, said in an online forum, “When I arrived in Cambridge UK as an undergraduate clutching my IMO gold medal I was in no position to help any of the research mathematicians there.”
These days, the expertise needed for mathematical research can take more than a single lifespan to acquire. Like many of my colleagues, I’ve been tempted to try “vibe proving”—having a math chat with an LLM as one would with a colleague, asking “Is it true that…” followed by a technical mathematical conjecture. The chatbot often then supplies a clearly articulated argument that, in my experience, tends to be correct on standard topics but subtly wrong at the cutting edge. For example, every model I’ve asked has made the same subtle mistake in assuming that the theory of idempotents behaves the same for weak infinite-dimensional categories as it does for ordinary ones, something that human experts in my field (trust me on this) know to be false.
I’ll never trust an LLM—which at its core is just predicting what text will come next in a string of words, based on patterns in its training data—to provide a mathematical proof that I can’t verify myself.
The good news is, we do have an automated mechanism for determining whether proofs can be trusted. Relatively recent tools called “proof assistants” are software programs (they don’t use AI) designed to check whether a logical argument proves the stated claim. They are increasingly attracting attention from mathematicians like Tao, Buzzard and me who want more assurance that our own proofs are correct. And they offer the potential to help democratize mathematics and even improve AI safety.
Suppose I received a letter, in unfamiliar handwriting, from Erode, a city in Tamil Nadu, India, purporting to contain a mathematical proof. Maybe its ideas are brilliant, or maybe they’re nonsensical. I’d have to spend hours carefully studying every line, making sure the argument flowed step-by-step, before I’d be able to determine whether the conclusions are true or false.
But if the mathematical text were written in an appropriate computer syntax instead of natural language, a proof assistant could check the logic for me. A human mathematician like me would then need to understand only the meaning of the technical terms in the theorem statement. In the case of Srinivasa Ramanujan, a generational mathematical genius who did hail from Erode, an expert did take the time to carefully decipher his letter. In 1913 Ramanujan wrote to the British mathematician G. H. Hardy with his ideas. Luckily, Hardy recognized Ramanujan’s brilliance and invited him to Cambridge to collaborate, launching the career of one of the all-time mathematical “greats.”
What’s interesting is that some of the AI IMO contestants submitted their answers in the language of the Lean proof assistant so that the computer program could automatically check their reasoning for errors. A start-up called Harmonic posted formal proofs generated by its model for five of the six problems, and ByteDance achieved a silver-medal-level performance by solving four of the six. But the questions had to be specially written to accommodate the models’ language limitations, and the models still needed days to produce their solutions.
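To give a flavor of what a formally checked statement looks like, here is a toy example in Lean, vastly simpler than anything on the olympiad; the statements and names are my own illustration, not taken from any submitted solution.

-- Two tiny theorems written for the Lean proof assistant. Lean accepts
-- them only because every step follows from definitions and lemmas it
-- has already verified.

theorem two_plus_two_is_four : 2 + 2 = 4 :=
  rfl  -- true by direct computation

theorem addition_is_symmetric (m n : Nat) : m + n = n + m :=
  Nat.add_comm m n  -- a lemma from Lean's standard library

-- Change either statement to something false, say 2 + 2 = 5, and Lean
-- rejects the file rather than inventing a plausible-sounding argument.

The point is not the depth of the mathematics but the guarantee: if the file compiles, the logic is sound, and the only thing left for a human to check is that the theorem statement says what we intended.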
Still, formal proofs are uniquely trustworthy. While so-called “reasoning” models are prompted to break problems down into pieces and explain their “thinking” step by step, the output is as likely to be an argument that sounds logical but isn’t as it is to be a genuine proof. By contrast, a proof assistant will not accept a proof unless it is fully precise and fully rigorous, with every step in the chain of reasoning justified. In some circumstances a hand-waving or approximate solution is good enough, but when mathematical accuracy matters, we should demand that AI-generated proofs be formally verifiable.
Not every application of generative AI is so black and white, with humans who have the right expertise able to determine whether the results are correct or incorrect. In life, there is a lot of uncertainty, and it’s easy to make mistakes. As I learned in high school, one of the best things about math is that you can prove definitively that some ideas are wrong. So I’m happy to have an AI try to solve my personal math problems, but only if the results are formally verifiable. And we aren’t quite there yet.
This is an opinion and analysis article, and the views expressed by the author or authors are not necessarily those of Scientific American.