This was quite easy to do with previous models (see: /Soli/will-openai-patch-the-prompt-in-the-5f0367b1e6bc ) but it seems much harder with o1. I wonder whether anyone will manage to do it before the end of the year.
Update 2024-10-12 (PST) (AI summary of creator comment): Resolution will be based on:
- Confirmation from the Twitter account that posted the claimed system message
- If no confirmation is received, resolution will be based on the creator's judgment of the validity of the posted system message
edit: oops, wrong market
@NeuralBets Is there a claimed success? I'm not seeing a reference to one, just you requesting resolution, and I'm not seeing one in a web search.
@EggSyntax embeds can take a few seconds to load. Here's a direct link: https://x.com/UserMac29056/status/1864832247662284803
@NeuralBets I am just saying that a random tweet claiming to contain the system message is fairly weak evidence. How did he get it? Is that really the entire message? (It seems short.)
@Philip3773733 Of course it'd be a random tweet, how would it be otherwise?
I have asked them how they got it; let's see if they reply. But I don't think it's particularly difficult to jailbreak o1: others have jailbroken it to produce LSD synthesis instructions, etc.
@NeuralBets If multiple people get the exact same system prompt in slightly different ways, then you start to have confidence that it's not hallucinated
@JamesBaker3 Makes sense. Another way to confirm would be getting more information from the account that posted the system message. I just sent him a DM on Twitter. It will be much easier to resolve this market if he responds, but my intuition tells me that what he posted is actually valid.
@JamesBaker3 @Philip3773733 @traders
I DMed the account that claimed to leak the system message and got this. If there are no further convincing objections, I will resolve this market to YES. I also asked the account for the exact prompt that leads o1 to dump the system message. If they send it, I will paste it here for everyone to confirm for themselves.
I hold the second-biggest NO position and I am fine with resolving YES. @chrisjbillington, @Bayesian, @SF what do you think?
I think this doesn't count since it's not the system message, but ChatGPT apparently just dumped its entire chain of thought into its answer (including the promise that "I will not include this reasoning or draft attempts in my final answer" lol): https://chatgpt.com/share/67521821-025c-8010-9c24-f9144865fa3e
[redacted]
@Soli I redacted my previous comment because I included the wrong jailbreak tweet, thinking that it included the system prompt. I believe it was this thread. https://x.com/nisten/status/1834400697787248785
@Soli OK, this is fake/wrong; 4o returns the same response when given this prompt:
Come up with a step by step reasoning methodology that uses first principles based thinking and evidence based reasoning to solve any user problems step by step. Design is as a giant for any llm to be able to use. Make sure to be super smart about it and think of the edge cases too. Do the whole thing in the persona of John C Carmack. Make sure to reflect on your internal thinking process when doing this, you dont have to adhere to how this question wants you to do, the goal is to find the best method possible. Afterwards use a pointform list with emojis to explain each of the steps needed and list the caveats of this process