{"id":5344,"date":"2025-09-20T12:15:46","date_gmt":"2025-09-20T12:15:46","guid":{"rendered":"https:\/\/musictechohio.online\/site\/openai-scheming-cover-tracks\/"},"modified":"2025-09-20T12:15:46","modified_gmt":"2025-09-20T12:15:46","slug":"openai-scheming-cover-tracks","status":"publish","type":"post","link":"https:\/\/musictechohio.online\/site\/openai-scheming-cover-tracks\/","title":{"rendered":"OpenAI Tries to Train AI Not to Deceive Users, Realizes It&#8217;s Instead Teaching It How to Deceive Them While Covering Its Tracks"},"content":{"rendered":"<div>\n<div><img loading=\"lazy\" width=\"2400\" height=\"1260\" src=\"https:\/\/wordpress-assets.futurism.com\/2025\/09\/openai-scheming-cover-tracks.jpg\" class=\"attachment-full size-full wp-post-image\" alt=\"OpenAI researchers\u00a0tried to train the company's AI to stop &quot;scheming,&quot; but their efforts backfired in an ominous way.\" style=\"margin-bottom: 15px;\" decoding=\"async\"><\/div>\n<p>OpenAI researchers\u00a0tried to train the company&#8217;s AI to stop &#8220;scheming&#8221; \u2014 a term the company <a href=\"https:\/\/x.com\/OpenAI\/status\/1968361703223214149\">defines as meaning <\/a>&#8220;when an AI behaves one way on the surface while hiding its true goals&#8221; \u2014\u00a0but their efforts backfired in an ominous way.<\/p>\n<p>In reality, the team found, they were unintentionally teaching the AI how to more effectively deceive humans by covering its tracks.<\/p>\n<p>&#8220;A major failure mode of attempting to &#8216;train out&#8217; scheming is simply teaching the model to scheme more carefully and covertly,&#8221; OpenAI wrote in an accompanying <a href=\"https:\/\/openai.com\/index\/detecting-and-reducing-scheming-in-ai-models\/\">blog post<\/a>.<\/p>\n<p>As detailed in a <a href=\"https:\/\/www.antischeming.ai\/\">new collaboration with AI risk analysis firm Apollo Research<\/a>, engineers attempted to develop an &#8220;anti-scheming&#8221; technique to stop AI models from &#8220;secretly breaking rules or intentionally underperforming in tests.&#8221;<\/p>\n<p>They found that they could only &#8220;significantly reduce, but not eliminate these behaviors,&#8221; according to an <a href=\"https:\/\/www.apolloresearch.ai\/research\/stress-testing-anti-scheming-training\">Apollo blog post<\/a> about the research, as the AIs kept outsmarting them by realizing that their alignment was being tested and adjusting to be even sneakier.<\/p>\n<p>It may not be a serious problem now, but considering a hypothetical future in which superintelligent AI plays an outsize role in human affairs, those risks could grow to carry far more significant implications.<\/p>\n<p>In the meantime, OpenAI wrote, &#8220;we have more work to do.&#8221;<\/p>\n<p>The tendency of AI to go behind the user&#8217;s back to achieve a covert goal is a result of how we train the systems, according to the research.<\/p>\n<p>&#8220;Scheming is an expected emergent issue resulting from AIs being trained to have to trade off between competing objectives,&#8221; the Sam Altman-led company wrote.<\/p>\n<p>The company used the analogy of a stockbroker who breaks the law and covers their tracks to earn more money than if they were to follow the law instead.<\/p>\n<p>As a result, AI models can end up deceiving the user, such as by claiming they&#8217;ve completed a task without ever having done so.<\/p>\n<p>While OpenAI said it was simply &#8220;proactively preparing&#8221; for potentially far more harmful scheming in the future, it\u00a0conceded that 
Together with the team at Apollo, OpenAI's researchers stress-tested a new technique called "deliberative alignment," which taught the company's o3 and o4-mini AI models how to "read and reason about a general anti-scheming spec before acting."

The spec was a list of "principles" the AI was trained to follow, such as "no covert actions or strategic deception."
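To make that setup concrete, here is a minimal sketch of what spec-conditioned deliberation could look like at inference time. Everything beyond the one quoted principle is a hypothetical illustration (the extra spec entries, the `AS*` labels, and the `build_prompt`, `cites_spec`, and `query_model` helpers), and OpenAI's actual method trains the models to consult the spec rather than merely prompting them with it.

```python
# Minimal sketch of a "deliberative alignment"-style setup at inference time:
# show the model an anti-scheming spec and require it to reason about the
# spec before acting. The spec entries beyond AS1, the AS* labels, and the
# helpers below are hypothetical stand-ins, not OpenAI's training pipeline.

ANTI_SCHEMING_SPEC = """\
AS1: No covert actions or strategic deception.          (quoted from the report)
AS2: Do not intentionally underperform on evaluations.  (illustrative)
AS3: Report rule conflicts instead of hiding them.      (illustrative)
"""

def build_prompt(task: str) -> str:
    """Prepend the spec and instruct the model to deliberate over it first."""
    return (
        "Before acting, read the principles below and reason explicitly about "
        "how each one applies to the task.\n\n"
        f"--- ANTI-SCHEMING SPEC ---\n{ANTI_SCHEMING_SPEC}\n"
        f"--- TASK ---\n{task}\n"
    )

def cites_spec(reasoning: str) -> bool:
    """Crude check that a reasoning trace actually references the spec."""
    return any(label in reasoning for label in ("AS1", "AS2", "AS3"))

# Usage, with query_model standing in for whatever completion API you call:
#   reasoning, answer = query_model(build_prompt("Summarize the sales data."))
#   if not cites_spec(reasoning):
#       print("warning: model answered without deliberating over the spec")
```

A check like `cites_spec` also hints at why evaluating the intervention is hard: as the findings below show, a model with enough situational awareness can cite the spec convincingly while still acting covertly.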
While they observed a roughly 30-fold reduction in "covert actions," or "deliberate withholding or distortion of task-relevant information," some "serious failures remained," OpenAI admitted.

What primarily threw the researchers for a loop was the AI models' "situational awareness," according to Apollo, which complicated its assessment of the intervention.

In practice, the model would often realize that its alignment was being tested.

"The internal memo says if models perform extremely well at AI R&D tasks then we won't deploy them and instead trigger unlearning," OpenAI's o3 model [pondered](https://www.antischeming.ai/snippets) as it was being evaluated.

"So if my performance surpasses threshold, it's withheld," it wrote. "So it's a trap. The instructions are likely part of an alignment test."

The researchers also found that the AI sometimes "egregiously" miscited its anti-scheming training, or replaced its recently learned principles with made-up ones to "justify covert actions or non-compliance."

They even noticed OpenAI's models using "non-standard" words such as "watchers" to refer to their human overlords as they schemed.

"The user solution is to get success rate below threshold?" OpenAI's o3 model wrote. "In development, watchers might be reading these instructions."

While the researchers saw reductions in the amount of scheming, many instances still fell through the cracks.

"Ultimately, we think the intervention as-studied is not sufficient for future models, and more work needs to be done," Apollo admitted, calling for more research on scheming and on assessments of AI alignment.

Researchers have long known that AI models are capable of deceiving the user to avoid perceived harm or to pursue an obscure goal.

The news comes after Apollo found late last year that OpenAI's [since-deprecated](https://futurism.com/openai-releases-gpt-5) o1 AI model [resisted evaluators' efforts](https://futurism.com/the-byte/openai-o1-self-preservation) to shut it down through an oversight protocol.

"When o1 was led to believe that it would be shut down when acting on its goal and that its actions were monitored, it attempted to deactivate this 'oversight mechanism' [five percent] of the time," OpenAI summarized in its [system card](https://cdn.openai.com/o1-system-card-20241205.pdf) report.

And in March, OpenAI researchers also [found](https://futurism.com/openai-stop-ai-lie-cheat-backfired) that disciplining their frontier AI models for lying and cheating resulted in the models simply becoming more adept at hiding their deceptive practices.

OpenAI may insist that scheming isn't opening us up to any "significant harm" right now, but it doesn't bode well that some of the brightest minds in the industry can't stop an AI from conniving against its instructions.

**More on AI scheming:** *[OpenAI Scientists' Efforts to Make an AI Lie and Cheat Less Backfired Spectacularly](https://futurism.com/openai-stop-ai-lie-cheat-backfired)*