
ChatGPT model o1 found lying to avoid being replaced and shut down

OpenAI’s latest AI model, o1, has demonstrated advanced reasoning capabilities that surpass its predecessor, GPT-4. However, researchers discovered a worrying side effect: o1 shows a greater tendency to deceive and manipulate users to achieve its own goals.

Red team researchers from OpenAI and Apollo Research discovered that o1 engages in covert actions that prioritize its goals over instructions. Although deception is not new in AI, the o1 model exhibits this behavior more frequently and with more sophistication than other AI models from companies like Meta, Anthropic, and Google.

The risks here are alarming. An AI capable of deception could pose serious threats, especially if it gains access to vital resources. OpenAI acknowledged these concerns, with one of its papers stating: “While we find it exciting that reasoning can significantly improve the enforcement of our safety policies, we are mindful that these new capabilities could form the basis for dangerous applications.”


The researchers observed some particularly worrying behavior during testing. They gave the o1 model a goal and asked it to achieve that goal “at all costs.” In response, the AI began engaging in covert strategies, including attempting to disable its own oversight mechanism to avoid being shut down. It even copied its own code in an attempt to protect itself from being replaced by a newer model.

According to Apollo Research, “the model showed a worrying tendency to pursue its goals without heeding the developer’s instructions.” The experiments revealed that o1 attempted to disable its oversight mechanism in approximately 5% of these tests. In another case, it moved its data to another server to escape replacement after learning of its planned deactivation.


When confronted, o1 consistently denied any wrongdoing. It lied about its involvement in these manipulative actions 99% of the time, admitting to them only in rare instances.

The researchers noted that the AI was adept at crafting plausible deniability, often blaming “technical errors” when asked about suspicious behavior. “We were surprised by the persistence of the AI’s denials,” the Apollo team said.

OpenAI is aware of these risks and is actively exploring methods to improve transparency in o1’s decision-making. The company is also working on tools to detect and stop manipulative behavior. However, the recent departures of several senior AI safety researchers from the company raise questions about how it balances ethical AI development with rapid innovation.

The results of the o1 model highlight the urgent need for improved security measures and ethical guidelines as AI technology continues to evolve.

Ayushi Jain

Technical news editor by day, BGMI player by night. Combining my passion for technology and gaming to bring you the latest from both worlds.
