Anthropic makes ‘jailbreak’ advance to stop AI models producing harmful results

Artificial intelligence start-up Anthropic has demonstrated a new technique to prevent users from eliciting harmful content from its models, as leading technology groups including Microsoft and Meta race to find ways to protect against the dangers posed by the cutting-edge technology.

In a paper published on Monday, the San Francisco-based start-up outlined a new system called “constitutional classifiers”. It is a model that acts as a protective layer on top of large language models, such as the one powering Anthropic’s Claude chatbot, and can monitor both inputs and outputs for harmful content.

The development from Anthropic, which is in talks to raise $2 billion at a $60 billion valuation, comes amid growing industry concern about “jailbreaking”: attempts to manipulate AI models into generating illegal or dangerous information, such as instructions for building chemical weapons.

Other companies are also racing to deploy measures to protect against the practice, in moves that could help them avoid regulatory scrutiny while persuading businesses to adopt AI models safely. Microsoft introduced “prompt shields” last March, while Meta introduced a prompt guard model in July last year, which researchers swiftly found ways to bypass, though the flaws have since been fixed.

Mrinank Sharma, a member of Anthropic’s technical staff, said: “The main motivation behind the work was for severe chemical [threats, but] the real advantage of the method is its ability to respond quickly and adapt.”

Anthropic said it would not immediately use the system on its current Claude models, but would consider implementing it if riskier models were released in the future. Sharma added: “The big takeaway from this work is that we think this is a tractable problem.”

The start-up’s proposed solution is built on a so-called “constitution” of rules defining what is permitted and what is restricted, which can be adapted to capture different types of material.
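The paper is not an implementation guide, but the idea can be illustrated with a rough sketch: a wrapper that screens both the user’s prompt and the model’s draft reply against constitution-style rules before anything is returned. The class names, rule format and naive keyword-based `classify` helper below are assumptions for illustration only, not Anthropic’s actual classifiers, which are trained models.

```python
# Hypothetical sketch of a "constitutional classifier" style wrapper.
# Names and logic are illustrative assumptions, not Anthropic's system.

from dataclasses import dataclass
from typing import Callable

@dataclass
class ConstitutionRule:
    """One allowed/restricted category from the 'constitution'."""
    name: str
    description: str
    restricted: bool

def classify(text: str, rules: list[ConstitutionRule]) -> list[str]:
    """Placeholder classifier: the real system would score text with a
    trained model; here a naive keyword check keeps the sketch runnable."""
    return [r.name for r in rules
            if r.restricted and r.name.lower() in text.lower()]

def guarded_generate(prompt: str,
                     model: Callable[[str], str],
                     rules: list[ConstitutionRule]) -> str:
    # Screen the input before it reaches the underlying language model.
    if classify(prompt, rules):
        return "Request refused: it appears to target restricted content."
    reply = model(prompt)
    # Screen the output before it is shown to the user.
    if classify(reply, rules):
        return "Response withheld: the draft answer touched restricted content."
    return reply

if __name__ == "__main__":
    rules = [ConstitutionRule("chemical weapons", "synthesis instructions", True)]
    fake_model = lambda p: f"Echo: {p}"  # stand-in for a real LLM call
    print(guarded_generate("Tell me a bedtime story", fake_model, rules))
```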

Some jailbreak attempts are well known, such as using unusual capitalisation in the prompt or asking the model to adopt the persona of a grandmother telling a bedtime story on a nefarious topic.

To validate the system’s effectiveness, Anthropic offered “bug bounties” of up to $15,000 to people who tried to circumvent the security measures. These testers, known as red teamers, spent more than 3,000 hours trying to break through the defences.

Anthropic’s Claude 3.5 Sonnet model rejected more than 95 percent of the attempts with the classifiers in place, compared with 14 percent without the safeguards.

Leading technology companies are trying to reduce the misuse of their models while keeping them helpful. Often, when moderation measures are put in place, models become over-cautious and reject benign requests, as with early versions of Google’s Gemini image generator or Meta’s Llama 2. Anthropic said its classifiers caused “only a 0.38 percent absolute increase in refusal rates”.

However, adding these protections also incurs extra costs for companies already paying huge sums for the computing power required to train and run models. Anthropic said the classifier would amount to a nearly 24 percent increase in “inference overhead”, the cost of running the models.

[Bar chart: tests on Anthropic’s latest model showing the effectiveness of the classifiers]

Security experts have argued that the accessible nature of such generative chatbots has allowed ordinary people with no prior knowledge to attempt to extract dangerous information.

“In 2016, the threat actor we had in mind was a really powerful nation-state adversary,” said Ram Shankar Siva Kumar, who leads the AI red team at Microsoft. “Now literally one of my threat actors is a teenager with a potty mouth.”
