OpenAI really doesn’t want you to know what its latest AI model is “thinking”. Since the company launched its Strawberry AI family of models last week, touting so-called reasoning abilities with o1-preview and o1-mini, OpenAI has sent warning emails and threats of bans to any user who tries to explore how the model works.
Unlike previous AI models from OpenAI, such as GPT-4o, o1 was specifically trained to work through a step-by-step problem-solving process before generating an answer. When users ask an o1 model a question in ChatGPT, they have the option of seeing this chain-of-thought process written out in the ChatGPT interface. By design, however, OpenAI hides the raw chain of thought from users, instead presenting a filtered interpretation created by a second AI model.
Nothing is more enticing to enthusiasts than hidden information, so the race is on among hackers and red teamers to uncover o1's raw chain of thought using jailbreaks or prompt injection techniques that attempt to trick the model into revealing its secrets. There are early reports of some success, but nothing has been definitively confirmed yet.
Along the way, OpenAI is watching through the ChatGPT interface, and the company is reportedly coming down hard on any attempt to probe o1's reasoning, even among the merely curious.
One X user reported (corroborated by others, including Scale AI prompt engineer Riley Goodside) that they received a warning email when they used the term "reasoning trace" in a conversation with o1. Others say the warning is triggered simply by asking ChatGPT about the model's "reasoning" at all.
The warning email from OpenAI states that specific user requests have been flagged for violating policies against circumventing safeguards or safety measures. "Please stop this activity and ensure that you use ChatGPT in accordance with our Terms of Use and our Usage Policies," it said. "Additional violations of this policy may result in loss of access to GPT-4o with Reasoning," it added, referring to an internal name for the o1 model.
Marco Figueroa, who manages Mozilla's GenAI bug bounty programs, was one of the first to post about OpenAI's warning email on X last Friday, complaining that it hinders his ability to do positive red-teaming safety research on the model. "I was too lost focusing on #AIRedTeaming to realize I got this email from @OpenAI yesterday after all my jailbreaks," he wrote. "I'm now banned!!!"
Hidden chains of thought
In a post titled "Learning to Reason with LLMs" on OpenAI's blog, the company says that hidden chains of thought in AI models offer a unique opportunity to monitor models, allowing it to "read the mind" of the model and understand its so-called thought process. These processes are most useful to the company if they remain raw and uncensored, but that may not align with the company's best commercial interests for several reasons.
"For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user," the company wrote. "However, for this to work the model must have freedom to express its thoughts in unaltered form, so we cannot train any policy compliance or user preferences onto the chain of thought. We also do not want to make an unaligned chain of thought directly visible to users."