Will the first HLE score for Grok 4 reasoning (or, if multiple are released at once, the highest), unless it comes a month or more after the Grok 4 release, show up on
https://scale.com/leaderboard/humanitys_last_exam
as a score of 45% or more, after rounding, for the reasoning version of Grok 4?
Update 2025-07-06 (PST) (AI summary of creator comment): This market is about the model currently expected to be called grok 4, not strictly any model with that specific name.
Update 2025-07-11 (PST) (AI summary of creator comment): If the reasoning score reported on the linked website includes tool use, it will count for this market's resolution.
@SimoneRomeo even with coin flip, the number given was 44.4% on their slides. The 50.1% is a little misleading, and not actually the final value, I think. I guess there's like a 20% chance that the tool use version somehow qualifies, within that a 50% chance that the heavy version actually gets rated this month, and within that a 20% chance that by some luck, HLE calculates the final value as 45% rather than 44%. So... about 2% all together, maybe.
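The estimate in the comment above chains three conditional guesses, so the combined figure is just their product. A minimal sketch of that arithmetic, assuming the three guesses are treated as nested conditional probabilities (the variable names are illustrative, not from the thread):

```python
# Rough guesses taken from the comment above, expressed as probabilities.
p_tool_use_qualifies = 0.20      # the tool-use version counts for the market
p_heavy_rated_this_month = 0.50  # given that, Heavy gets rated this month
p_scored_45_not_44 = 0.20        # given that, HLE reports 45% rather than 44%

# Each guess is conditional on the previous one, so multiply them.
p_yes = p_tool_use_qualifies * p_heavy_rated_this_month * p_scored_45_not_44

print(f"{p_yes:.1%}")  # prints "2.0%"
```

This matches the "about 2%" in the comment; the inputs are rough guesses, so the result is only as good as they are.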
@SCS this one is base, not Heavy, and probably lines up with Grok's estimate of 25% with no tools. Ultimately this makes me slightly optimistic, because it suggests there's some variability, and perhaps when HLE tests it independently they might get 45% instead of 44% for Heavy with tools.