Humanity’s Last Exam lists grok 4 at 45%+?


Will the first HLE score listed for the reasoning version of grok 4 (or, if multiple scores are released at once, the highest) show up on

https://scale.com/leaderboard/humanitys_last_exam

as 45% or more, after rounding? This applies unless the score appears a month or more after grok 4's release.

Update 2025-07-06 (PST) (AI summary of creator comment): This market is about the model currently expected to be called grok 4, not strictly any model with that specific name.

Update 2025-07-11 (PST) (AI summary of creator comment): If the reasoning score reported on the linked website includes tool use, it will count for this market's resolution.

This question is managed and resolved by Manifold.



bought Ṁ3 YES

As far as I understand, this market may resolve YES or NO depending on whether HLE decides to allow tool use. That sounds like a coin flip, so why are the odds so low?

We know from the grok 4 livestream graph that 2.5 pro scores 25%+ with tool use, yet that score is not presently on the leaderboard. Simple inference.

It might end up wrong, but a coin flip seems more wrong.

@SimoneRomeo even with a coin flip, the number given on their slides was 44.4%. The 50.1% is a little misleading and, I think, not actually the final value. I guess there's about a 20% chance that the tool-use version somehow qualifies, within that a 50% chance that the Heavy version actually gets rated this month, and within that a 20% chance that, by some luck, HLE calculates the final value as 45% rather than 44%. So... about 2% altogether, maybe.
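For anyone who wants to check the arithmetic in that estimate, here is a minimal sketch. It multiplies the three guessed conditional probabilities from the comment (20%, 50%, 20%) and also shows why 44.4% falls short of the threshold. The function names are hypothetical, and the round-half-up rule is an assumption: the market only says "after rounding" and does not specify how the leaderboard rounds.

```python
import math


def chained_probability(*conditional_probs):
    """Multiply a chain of conditional probabilities into one joint probability."""
    result = 1.0
    for p in conditional_probs:
        result *= p
    return result


def rounds_to_at_least(score_pct, threshold_pct=45):
    """Check whether a score reaches the threshold after rounding.

    Assumption: round-half-up to the nearest whole percent.
    The leaderboard's actual rounding rule is not stated in the market.
    """
    return math.floor(score_pct + 0.5) >= threshold_pct


# Guesses taken from the comment above, not measured values:
# tool-use version qualifies, Heavy gets rated this month, HLE lands on 45%.
p_yes = chained_probability(0.20, 0.50, 0.20)
print(f"Estimated chance of YES: {p_yes:.1%}")

# The slide figure of 44.4% rounds down to 44 under this rule.
print(rounds_to_at_least(44.4))
```

Under these guesses the product comes out at about 2%, matching the comment. Changing any single factor (for example, raising the first to a true coin flip at 50%) scales the result proportionally.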

Does this include tool use or not?

@KJW_01294 it's whatever they decide to put on their website at the link provided

I suspect the website doesn't allow tools but if it does, it counts!

bought Ṁ50 YES

@Trazyn "grok 4 reasoning (or if multiple are released at once, the highest)"

https://x.com/ArtificialAnlys/status/1943166841150644622 24%

@SCS this one is base, not Heavy, and probably lines up with Grok's estimate of 25% with no tools. Ultimately this makes me slightly optimistic because it suggests there's some variability and perhaps when HLE tests it independently, they might get 45% instead of 44% for Heavy with tools