A few months before the 2025 International Mathematical Olympiad (IMO) in July, a three-person team at OpenAI made a long-shot bet that they could use the competition's brutally tough problems to train an artificial intelligence model to reason on its own for hours, long enough to write full mathematical proofs. Their goal wasn't simply to create an AI that could do complex math but one that could handle ambiguity and nuance—skills AIs will need if they are to someday take on many challenging real-world tasks. These are precisely the skills required to create artificial general intelligence, or AGI: human-level understanding and reasoning.
The IMO, held this year on Australia's Sunshine Coast, is the world's premier math competition for high schoolers, bringing together top contenders from more than 100 countries. All are given the same six problems—three per day, each worth seven points—to solve over two days. But these problems are nothing like what you probably remember from high school. Rather than a brief numeric answer, each demands sustained reasoning and creativity in the form of a pages-long written proof: a logical, step-by-step argument that can draw on many fields of mathematics. These are exactly the kinds of problems that, until just this year, AI systems failed at spectacularly.
The OpenAI team of researchers and engineers—Alex Wei, Sheryl Hsu and Noam Brown—used a general-purpose reasoning model: an AI designed to “think” through challenging problems by breaking them into steps, checking its own work and adapting its approach as it goes. Though AI systems couldn’t officially enter the competition, the notoriously tough test served as a demonstration of what they can do: the AIs tackled this year’s questions in the same format and under the same constraints as the human contestants. Upon receiving the questions, the team’s experimental system worked for two 4.5-hour sessions, just as the student contestants did, with no access to the Internet or to external tools such as search engines or software designed for math. The proofs it produced were graded by three former IMO medalists and posted online. The AI completed five of the six problems correctly, receiving 35 out of 42 points—the minimum required for an IMO gold medal. (Google’s DeepMind AI system also achieved that score this year.) Out of 630 competitors, only 26 students, or 4 percent, outperformed the AI; five students achieved perfect 42s. Given that a year ago language-based AI systems like OpenAI’s struggled to do elementary math, the results were a dramatic leap in performance.
In the following conversation, Scientific American spoke with two members of the OpenAI team, Alex Wei and Sheryl Hsu, about how they conducted their work, why the model’s decision not to answer the sixth question was actually a major step toward addressing AI’s “hallucination” problem and how developing a system capable of writing complex proofs could help lead to artificial general intelligence.
[An edited transcript of the interview follows.]
What led you to suddenly begin preparing an AI model for the IMO just a few months before the competition? What was the spark?
WEI: I had been thinking about math proofs for quite a while. I’m on a team at OpenAI called MathGen. We had just seen the results progress a lot. We felt like we had a shot to get a model that could do really well at the IMO, and we wanted to make a mad dash to get there.
HSU: I used to do math competitions. [Wei] used to do math competitions—he was a lot better than me. The IMO is definitely well known within the [AI research] community, including among researchers at OpenAI. So it was really inspiring to push specifically for that.
Can you talk about your decision to work with a general‑purpose AI system rather than a system that was specifically designed to answer math problems?
WEI: The philosophy is that we want to build general‑purpose AI and develop methods that don’t just work for math. Math is a very good proving ground for AI because it’s fairly objective: if you have a proof, it’s easier to get consensus on whether it’s correct. That’s harder for, say, poetry—you’ll have more disagreement among readers. And IMO problems are very hard, so we wanted to tackle hard problems with general‑purpose methods in the hope that they’ll also apply to domains beyond math.
HSU: I’d also say the goal at OpenAI is to build AGI—it’s not necessarily to write papers or win competitions. It was important that everything we did for this project also be useful for the bigger goal of building AGI and better models that users can actually use.
In what ways could a reasoning model winning a gold in the IMO help lead to AGI?
WEI: One perspective is to think in terms of how long tasks take. A year ago, ChatGPT could only do very basic math problems. Two years ago—and even a year and a half ago—we were often thinking about grade‑school math problems you’d find on fifth‑grade homework. For someone really good at math, those take a second or two to read and solve. Then we started evaluating using AIME [the American Invitational Mathematics Examination, a 15-question high school math contest]. That takes around 10 minutes per problem, with about three hours for 15 problems. The IMO is four and a half hours for just three problems—that’s 90 minutes per problem. ChatGPT started off being good for quick questions. Now it’s better at longer‑running tasks, such as “Can you edit this paragraph for me?” As AI improves, you can expand the time horizon of tasks, and you can see that progression clearly in math.
HSU: Another aspect is that reasoning models were previously very good at tasks that are easy to verify. If you’re solving a non‑proof‑based math problem, there’s one numerically correct answer. It’s easy to check. But in the real world—and in the tasks people actually want help with—it’s more complex. There’s nuance: maybe it’s mostly correct but has some errors; maybe it’s correct but could be stylized better. Proof‑based math isn’t trivial to evaluate. If we think about AGI, those tasks won’t be easy to judge as correct or not; they’ll be more loosely specified and harder overall.
What was the process for training the model?
WEI: In general, reinforcement learning trains a model by rewarding good behavior and penalizing bad behavior. If you repeatedly reinforce good behavior and discourage bad behavior, the model becomes more likely to exhibit the good behavior.
HSU: Toward the end, we also scaled up test-time compute [how long the AI model was able to “think” before answering]. Previously, problems of this sort might take a human a few minutes to solve; now we were scaling the model’s thinking time to hours. That extra thinking time gave surprising gains. There was a moment when we ran evaluations on our internal test set, which took a long time because of the increased test-time compute. When we finally looked at the results—and Alex graded them—seeing the progress made me think gold might be within reach. That was pretty exciting.
On the IMO test, the model you developed got five out of six answers correct. But on the sixth question, the model didn’t try to provide an answer. Can you tell me more about the significance of this response?
WEI: The model knowing what it doesn’t know was one of the early signs of [progress] we saw. Today if you use ChatGPT, you’ll sometimes see “hallucinations”—models don’t reliably know when they don’t know. That capability isn’t specific to math. I’d love it if, for everyday questions, the model could honestly say when it doesn’t know instead of giving an answer I must verify independently.
What kind of impact could your work on this model have on future models?
HSU: Everything we did for this project is fairly general‑purpose—being able to grade outputs that aren’t single answers and to work on hard problems for a long time while making steady progress. Those contributed a lot to the success here, and now we and others at OpenAI are applying them beyond math. It’s not in GPT‑5, but in future models, we’re excited to integrate these capabilities.
WEI: If you look at the solutions we publicly posted for the IMO problems, some are very long—five to 10 pages. This model can generate long outputs that are consistent and coherent, without mistakes. Many current state‑of‑the‑art models can’t produce a totally coherent five‑page report. I’m excited that this care and precision will help in many other domains.