AI IMO 2025: How many AI labs announce a Gold performance at the IMO in 2025?
62 traders · Ṁ16k · closes Aug 1
0: 55%
1: 25%
2: 15%
3: 2%
4: 1.2%
5: 0.8%
6: 0.5%
Other: 0.4%

This market resolves to N, where N is the number of distinct AI labs that have an AI system that meets ALL of these criteria in 2025 (an illustrative sketch of this counting rule follows the criteria):

  1. The AI system completes the official 2025 International Mathematical Olympiad problems under standard IMO time constraints (4.5 hours per 3-problem session).

  2. The system was not trained on IMO 2025 solutions (lol). This likely means the system's training was completed before the first day of IMO 2025.

  3. Humans do not assist with solving the problems. They may, however, provide a formal proof-language version of the problems.

  4. The system provides complete mathematical proofs (either in natural language or in formal proof languages like Lean), and the natural language proofs are judged to a similar standard as human participants.

  5. The system achieves a score that meets or exceeds the 2025 IMO Gold medal cutoff.
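As an illustration only (not part of the official criteria), here is a minimal sketch of the counting rule above: a lab qualifies only if every criterion holds, and the market resolves to the number of distinct qualifying labs. The names `LabClaim` and `resolution_value` and all flags are hypothetical placeholders, not claims about any real system.

```python
# Toy sketch of the resolution rule: N = number of distinct labs whose system
# meets ALL five criteria. Names and values are hypothetical placeholders.
from dataclasses import dataclass, fields

@dataclass
class LabClaim:
    lab: str
    standard_time_limits: bool   # criterion 1: official problems, 4.5h per session
    no_imo_2025_training: bool   # criterion 2: training finished before IMO 2025
    no_human_solving_help: bool  # criterion 3: humans only formalize statements
    complete_proofs: bool        # criterion 4: full proofs, graded like humans'
    meets_gold_cutoff: bool      # criterion 5: score at or above the Gold cutoff

def resolution_value(claims: list[LabClaim]) -> int:
    """Count distinct labs for which every criterion is True."""
    qualifying = {
        c.lab
        for c in claims
        if all(getattr(c, f.name) for f in fields(c) if f.name != "lab")
    }
    return len(qualifying)
```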

  • Update 2025-01-28 (PST) (AI summary of creator comment): Update from creator

    • Market close date has been pushed back to allow for valid announcements within a month after the competition.

    • Announcements made more than a month after the competition will still be counted, potentially requiring the market to be re-resolved.

  • Update 2025-06-18 (PST) (AI summary of creator comment): The creator has specified the following submission constraints, in response to a question about pass@1 evaluation:

    • For natural language solutions, only one solution can be submitted.

    • For formal proofs (e.g., Lean), the first valid proof will be the one considered for resolution (a sketch of this selection rule appears after these updates).

  • Update 2025-07-08 (PST) (AI summary of creator comment): In response to discussion about how human IMO submissions are judged, the creator has specified a change to the submission criteria:

    • If human participants are allowed to include multiple attempts within their single submission packet, then AI systems will be judged by the same procedure.

    • This potentially modifies the previous clarification that only one solution could be submitted.

  • Update 2025-07-09 (PST) (AI summary of creator comment): In response to a discussion about submission limits, the creator has clarified the policy on multiple attempts:

    • Submitting thousands of random attempts for a human to piece together into a solution is not allowed.

    • The resolution will be guided by the spirit of the market, which is to judge performance in a way that feels fair and reasonable.
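A minimal sketch of the "first valid formal proof" rule from the 2025-06-18 update, assuming a hypothetical check_lean_proof wrapper around a Lean 4 toolchain; the verification setup any lab actually uses may differ, and first_valid_proof is an illustrative name, not an official tool.

```python
# Sketch of "first valid proof wins": candidates are checked in generation
# order and the first one that verifies is the only one considered.
import subprocess
import tempfile
from pathlib import Path
from typing import Iterable, Optional

def check_lean_proof(source: str) -> bool:
    """Return True if the Lean file elaborates without errors.

    Assumes a plain Lean 4 toolchain where `lean <file>` exits nonzero on
    failure; a project depending on Mathlib would go through `lake env lean`.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "candidate.lean"
        path.write_text(source)
        result = subprocess.run(["lean", str(path)], capture_output=True)
        return result.returncode == 0

def first_valid_proof(candidates: Iterable[str]) -> Optional[str]:
    """Scan candidate proofs in generation order; return the first that checks."""
    for source in candidates:
        if check_lean_proof(source):
            return source
    return None
```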


An important constraint on human participants is that they only get to submit ONE solution. So it would be good to have clarification that this market will resolve based on pass@1.

@pietrokc If they provide a natural language solution that has to be manually checked, only one solution can be submitted. If they provide Lean proof certificates or whatever, I don't really see the point in submitting more than one valid proof, but yeah, presumably they can select the first final proof that is valid.

@pietrokc There's nothing stopping a human participant from submitting multiple solutions; they just have to be bundled together in a single stack of paper at the end of the competition day. 50 pages of nonsense followed by 1 page of correct solution still gets you full points.

@zsig That's definitely not how it worked when I was involved with this kind of thing, many years ago. I doubt it's how it works now. I guess I can't rule out that it's changed, but it seems very unlikely. If I'm grading a human's paper and they have 3 pages of nonsense, I'm not reading the rest.

@pietrokc Grading is done through the team leaders, whose job is to convince the coordinators to give you as many points as possible based on what you've written down. Since you don't know what gives points and what doesn't, the system actively encourages submitting literally everything you've written down, including scrap paper.

It's not just one person reading and unilaterally deciding the score.

Source: I'm flying to Australia tomorrow for IMO 2025.

@zsig Interesting, thanks for sharing. Good luck at the IMO!

Going back to the AI case, what then would be the correct thing from an evaluation standpoint? Have another AI pick out what among the first AI's output should be graded? I guess it could be a human doing it, but if the solution-producing AI produces 1000s of candidate solutions, it would probably be worse to have a human do the picking, since it would be too slow.

@pietrokc Thanks :)

I think @Bayesian's interpretation is reasonable and most in line with the spirit of the question. This isn't asking if an LLM can get lucky and find a solution when given 1000 attempts; it's supposed to be a proxy for skill with consistency comparable to humans. We already know from o1-ioi that relaxing submission limits can significantly improve performance (see the pass@k sketch after this comment). I do think it wouldn't be incorrect to include the chain of thought for grading, although I don't know much about how it works and it might be prohibitively long (which is another problem with unrestricted submissions).

In general, I think attempting to give AI models the exact same conditions and possibilities that humans get is ultimately futile because the IMO regulations and contest setup are designed with humans in mind and so an uncritical reading of the rules will more likely than not lead to absurdities that circumvent actually addressing the questions this research and these markets are trying to answer. (Nobody participating in the IMO has access to formal systems in the same way AlphaProof does, but we still allow it because it's a meaningful way of solving problems that doesn't sidestep the question we're trying to answer)
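To make the point above about relaxed submission limits concrete, here is the standard unbiased pass@k estimator (Chen et al., 2021, the Codex paper): with n sampled attempts of which c are correct, it estimates the probability that at least one of k drawn attempts is correct. The numbers below are made up for illustration; this is not how o1-ioi or any IMO entry was actually scored.

```python
# pass@k estimator: 1 - C(n-c, k) / C(n, k), the chance that at least one of
# k attempts drawn from n samples (c of them correct) solves the problem.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:  # fewer incorrect samples than draws: a hit is guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# A model that solves a problem on 5% of attempts looks far stronger
# once it may submit many attempts:
print(pass_at_k(1000, 50, 1))   # 0.05  (pass@1)
print(pass_at_k(1000, 50, 50))  # ~0.93 (pass@50)
```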

tbc, if humans' one submission can contain multiple different attempts, which I wasn't aware of, the AIs will be allowed to be judged according to the same procedure. My bad if that was misleading.

@Bayesian I agree that this is a tricky case, and I agree with @Zsigmondy about blindly applying human criteria here.

However, as they point out, the IMO regulations were created with humans in mind, and a human just simply never submits more than, say, 10 attempts for one question. (And if they ever did, that would drastically reduce how many attempts they submit on other questions.) So if an AI is allowed to submit 10,000 attempts, that seems qualitatively different and "unfair". As I've mentioned before, the limit case of this is an "AI" that just enumerates theorems of ZFC until it reaches a solution to the question.

Whether it matters that it's unfair gets at why we care about this market in the first place: will AIs help humans with math any time soon? An AI that enumerates theorems of ZFC obviously does not. An AI that gets the right answer in the first 3 tries obviously does. The 10,000 case, while very impressive if achieved, still feels more like a "no".

@Bayesian This wasn't made clear, but thinking of it as discrete, separate attempts at a solution is wrong. The submission is just text with symbols and drawings, and any part that indicates progress towards the full solution can be awarded points (sometimes weaker results that don't generalize also earn points, as in 2020 P6). This also means that solutions can be written up as non-linearly as you like, provided the team leader is able to tie it all together.

So if you have a problem where one approach leads you to X and another leads you to Y and a trivial combination of X and Y gives you the full solution, then an AI which never manages to do both X and Y in the same attempt but does manage to do X and Y separately would get very close to full points if allowed to append attempts after each other.

@zsig Now picturing an AI that in response to any problem just prints 10,000 copies of the alphabet in a row. If a solution exists with less than 10,000 characters you can always "tie this together" into a solution 😹

Yeah, thank you both for the feedback. It feels hard to set good specific limits in stone, but I agree that enumerating thousands of random attempts and letting the human(s) do the work of piecing something sensible together wouldn't be allowed. The spirit of the market is largely that it should feel fair in as many ways as seems reasonable.


Plausibly should resolve a bit later than Jul 1 / there should be a specification of the timing of valid announcements

@AdamK I pushed back the close date. Hmm, I guess I could do within a month of the competition, but if someone announces gold a month and a day later, it still counts and gets re-resolved? Idk, better if there's less waiting, but I also wouldn't wanna not count a lab that did achieve this.

