Will o1 score ≥60% on the REBUS benchmark?
5
Ṁ405
Feb 28
62% chance


I'll probably try running this this week if I can automate the web interaction (unless the API comes out before then)

Referring, of course, to the famous https://arxiv.org/abs/2401.05604

bought Ṁ100 YES

For reference, the release version of 4o scored 42%, and the human baseline is 83%.

@derikk After looking at the examples and not getting any correct, and then seeing 83% as the human baseline, I felt really bad until I read that humans were allowed to Google and use reverse image search.
