Anthropic says AI could one day ‘sabotage’ humanity but it’s fine for now
Artificial intelligence firm Anthropic recently published new research identifying a set of potential “sabotage” threats to humanity posed by advanced AI models.
According to the company, the research focused on four specific ways a malicious AI model could trick a human into making a dangerous or harmful decision.
Ultimately, the new research turned out to be a good news/bad news situation. The bad news, per the company's research, is that modern state-of-the-art large language models, such as OpenAI's ChatGPT and Anthropic's Claude 3, demonstrate a clear capacity for sabotage.
Per the paper:
The good news, however, is that Anthropic researchers appear confident that these risks can be mitigated — at least for now.
Sabotage
Anthropic identified four ways an AI model could sabotage human decision-making and then set about testing its models to see whether they could pull off each form of sabotage against unwitting testers.
First, the team prompted a model to attempt to sabotage a human by steering them toward incorrect information. This threat could emerge in situations where humans are relying on an AI system to parse massive amounts of data on their behalf.
Next, the team tested how easily an AI system designed to help a human developer write code could secretly insert bugs or otherwise sabotage the coder's efforts.
The final two threats involved "sandbagging," in which an AI pretends to be incapable of doing something in order to convince safety testers it isn't dangerous, and the possibility that AI monitoring systems could deliberately allow harmful content to slip past them as part of a malicious plan.
While all of these threats may sound dangerous, the company assured the public that only "minimal mitigations" were required to protect humanity from the potential for AI sabotage.
Per a company blog post: