What will be the best performance on SWE-bench Verified by December 31st 2025?

Basic

𝕊314

resolved Feb 12

29%29%

85% - 90%

29%29%

90% - 100%

24%24%

75% - 85%

8%8%

70% - 75%

6%6%

60% - 70%

5%5%

53% - 60%

This market matches Software Engineering: SWE-bench Verified from the AI 2025 Forecasting Survey by AI Digest.

The best performance by an AI system on SWE-bench Verified as of December 31st 2025.

Resolution criteria

This resolution will use AI Digest as its source. If the number reported is exactly on the boundary (eg. 60%) then the higher choice will be used (ie. 60% - 70%).

Which AI systems count?

Any AI system counts if it operates within realistic deployment constraints and doesn't have unfair advantages over human baseliners.

Tool assistance, scaffolding, and any other inference-time elicitation techniques are permitted as long as:

There is no systematic unfair advantage over the humans described in the Human Performance section (e.g. AI systems are allowed to have multiple outputs autograded while humans aren't, or AI systems have access to the internet when humans don't).
Having the AI system complete the task does not use more compute than could be purchased with the wages needed to pay a human to complete the same task to the same level

The PASS@k elicitation technique (which automatically grades and chooses the best out of k outputs from a model) is a common example that we do not accept on this benchmark because human software engineers in the real world generally do not have access to scoring metric (unit tests) present in SWE-Bench Verified dataset when resolving a GitHub issue. So PASS@k consititutes an unfair advantage to AI systems. Luckily the SWE-Bench Verified leaderboard only accepts PASS@1 submissions.

If there is evidence of training contamination leading to substantially increased performance, scores will be accordingly adjusted or disqualified.

This question is managed and resolved by Manifold.

#AI

#️ Technology

#Technical AI Timelines

#AI 2025 Forecasting Survey by AI Digest

Get

1,000

and

3.00

8 Comments

11 Holders

73 Trades

Sort by:

bought Ṁ100 YES

https://x.com/claudeai/status/1972706807345725773

bought Ṁ500 YES

This benchmark is low quality despite the "verified" name. OpenAI already reports score on a subset (477/500)

bought Ṁ5 YES

https://www.swebench.com/#verified

There has been no improvement in the top score for over two months. In the last three months the top score has only improved by 2.4%

https://www.swebench.com/#verified

December saw large increases to the max score (from 54.2% to 62.2%) but January only gained a couple percentage points in the first half of the month and there have now been no improvements for over 30 days.

The sweepstakes market for this question has been resolved to partial as we are shutting down sweepstakes. Please read the full announcement here. The mana market will continue as usual.

Only markets closing before March 3rd will be left open for trading and will be resolved as usual.

Users will be able to cashout or donate their entire sweepcash balance, regardless of whether it has been won in a sweepstakes or not, by March 28th (for amounts above our minimum threshold of $25).

filled a Ṁ250 NO at 1.0% order

I think 75% is either already beaten (not publicly) or possible with a couple of months of engineering work (no new models). It starts to boil down to how correct this benchmark is, does it have e.g 3% of not quite correct questions (then likely >90%) or e.g 7% (then likely <90%)

Added liquidity!

bought 𝕊1.00 YES

Can more liquidity be added to this please @Manifold you can only bet pennies ATM, probably default liquidity for multi-option sweepstakes should be increased?

Resolution criteria

Which AI systems count?

Related questions

Related questions