Could AI Supplant a Mathematician?

Professor Andrew Blumberg is part of a team that tested AI’s upper limits by asking it to answer unsolved math problems.

February 20, 2026

It’s hard to distinguish AI hype from AI’s true capabilities, but Andrew Blumberg wants to try. Blumberg is the Herbert and Florence Irving Professor of Cancer Data Research and a professor of mathematics and computer science at Columbia. He is part of a team of mathematicians who recently tested the limits of AI’s capabilities by devising 10 math problems whose solutions had not yet been published, and posing them to several leading large language models (LLMs) to see whether they could solve them. The group deliberately kept their solutions off the internet until last week to ensure that they would not be ingested into the training data that powers LLMs. “First Proof,” a new paper that describes the experiment, is published on the open-access platform arXiv, and was recently covered in The New York Times.

The group’s goal was not to see whether the models could solve a standard problem you might find on an SAT or GRE, but to see whether they could advance human knowledge at its frontier, and maybe even spot something the mathematicians themselves hadn’t.

“We’re not trying to prove that machines will never supplant people, or some other very simplistic narrative,” Blumberg said of the research. “What we’re really interested in is scientific truth as it applies to this domain, which means understanding in an honest way what these things can do.”

Columbia News spoke to Blumberg about the project and the group’s findings.

Each of your collaborators on this project submitted a math problem for the LLMs to solve. What is the problem that you submitted, and why did you choose it?

The problem aims to find a version of the slice filtration on the equivariant stable category adapted to an N-infinity operad / transfer system that's potentially incomplete.

It’s a problem arising from my ongoing work with Mike Hill and Taylor Lawson at the University of Minnesota. This is a proposition that we have written that hasn’t come out publicly yet, so it can’t be used as training data by AIs. I chose it because it felt like a natural generalization of something in the literature, but not a trivial one, so it felt like a good test of the boundary of what AIs can and can’t do.

How did the LLMs do with your problem?

It was interesting. They did OK. The answer to the problem is a formula and a proof of that formula, and they guessed the formula, which is good. There’s a version of the proof that is in the literature, so LLMs have ingested it, and it’s a bit easier than the actual proof. And the LLMs followed the outline of the easier version, which is one way you could go to prove it. There are some things about the proofs that are good, and some that have various small errors. A bunch of them hallucinated whole papers and cited them. I think if I got the answers they gave me from a graduate student, I would feel like it was a good start.

How did they do with the other problems?

We found that the LLMs were able to completely solve two of the 10 problems we sent them, and basically it’s what you would expect: The problems that are close to something that was already solved in the literature that LLMs train on are the ones that they could solve.

A lot of people have compared mathematics to computer programming, and suggested that the way that these AI tools can handle programming suggests they’re coming for math, too. But programming, for the most part, does not care about novelty: You have a problem that needs to get done, and if someone has solved it before, or solved something similar before, that’s great. Math is looking for what has yet to be solved. The best papers are the ones that introduce new ideas that no one has ever had before, new ways of looking at things. I don't mean to say that machines could never do this. That's an irresponsible position to take. It’s just to say that the example of programming is not as instructive as I think people sometimes claim it is.

Would you extrapolate from your findings to what this can tell us about how LLMs handle problems beyond mathematics?

I’d say a couple of things very broadly: 

One thing is that people are astonishingly bad at guessing what will happen when things that used to be expensive become free, or at least much cheaper. So I think this technology will have lots of impact, and we don't know what it is yet, in the same way that, although it was easy for people to guess a long time ago some aspects of what it would be like when everyone had a phone in their pocket, no one in Star Trek guessed about Instagram.

I think these tools are going to be tremendously consequential in the workflows of experienced practitioners. I don’t really use them to solve advanced math, for example, but I do use them to create my bibliographies, and that saves a ton of time. This will change many of the ways all kinds of professionals and academics handle a wide range of tasks.

I think AI will also really disrupt the training of junior people in various fields. But some of that disruption could be good. We’ve always known big lectures aren’t necessarily the best way to deliver knowledge, and who knows what AI might help us come up with next?

Is there anything the LLMs are—maybe surprisingly—good at? 

My colleague Mehtaab Sawhney, who's been visiting OpenAI for a year, gave a talk recently about how LLMs helped him solve a problem by finding a similar problem in another math subfield. Often in math and science, we’re really deep in one area. Math is big and it’s hard to know all of it. But these models can kind of know all of it. And having a tool that can tell you, “that question you're asking is well studied in a domain you've never heard of,” is a big opportunity for progress.

This month, we published our first series of test problems and solutions, and we’re inviting members of the math community and the public to probe and test them, and suggest new problems we could pose to the LLMs. We’re planning to gather input and create a second round of new problems that we pose in a similar way to the LLMs in a few months. 

I want to be clear that we don’t see our relationship with AI as adversarial. We just want to understand what they can and cannot do in a clear-eyed way. And the answer will certainly change, and probably will change quickly.