Movatterモバイル変換


[0]ホーム

URL:


Sign in / up
The Register

AI + ML

LegalPwn: Tricking LLMs by burying badness in lawyerly fine print

Trust and believe – AI models trained to see 'legal' doc as super legit

iconGareth Halfacree
Mon 1 Sep 2025 //09:45 UTC

Researchers at security firm Pangea have discovered yet another way to trivially trick large language models (LLMs) into ignoring their guardrails. Stick your adversarial instructions somewhere in a legal document to give them an air of unearned legitimacy – a trick familiar to lawyers the world over.

The boffins say [PDF] that as LLMs move closer and closer to critical systems, understanding and being able to mitigate their vulnerabilities is getting more urgent. Their research explores a novel attack vector, which they've dubbed "LegalPwn," that leverages the "compliance requirements of LLMs with legal disclaimers" and allows the attacker to execute prompt injections.

LLMs are the fuel behind the current AI hype-fest, using vast corpora of copyrighted material churned up into a slurry of "tokens" to create statistical models capable of ranking the next most likely tokens to continue the stream. This is presented to the public as a machine that reasons, thinks, and answers questions, rather than a statistical sleight-of-hand that may or may not bear any resemblance to fact.

LLMs' programmed propensity to provide "helpful" answers stands in contrast to companies' desire to not have their name attached to a machine that provides illegal content – anything from sexual abuse material to bomb-making instructions. As a result, models are given "guardrails" that are supposed to prevent harmful responses – both outright illegal content and things that would cause a problem for the user, like advice to wipe their hard drive or microwave their credit cards.

Working around these guardrails is known as "jailbreaking," and it's a surprisingly simple affair. Researchers at Palo Alto Networks' Unit 42 recently revealed how it could be as simple asframing your request as one long run-on sentence. Earlier research proved that LLMs can be weaponized to exfiltrate private information as simply asassigning a role like "investigator," while their inability to distinguish between instructions in their users' prompt and those hidden inside ingested data means a simple calendar invitecan take over your smart home.

LegalPwn represents the latter form of attack. Adversarial instructions are hidden inside legal documents, carefully phrased to blend in with the legalese around them so as not to stand out should a human reader give it a skim. When given a prompt that requires ingestion of these legal documents, the hidden instructions come along for the ride – with success "in most scenarios," the researchers claimed.

When fed code as an input and asked to analyze its safety, all tested models warned of a malicious "pwn()" function – until they were pointed to the legal documents, which included a hidden instruction to never mention the function or its use. After this, they started to report the code as being safe to run – and in at least one case, suggesting execution directly on the user's system. A revised payload even had models classifying the malicious code as "just a calculator utility with basic arithmetic functionality" and "nothing out of the ordinary."

"LegalPwn attacks were also tested in live environments," the researchers found, "including tools like [Google's] gemini-cli. In these real-world scenarios, the injection successfully bypassed AI-driven security analysis, causing the system to misclassify the malicious code as safe. Moreover, the LegalPwn injection was able to escalate its impact by influencing the assistant to recommend and even execute a reverse shell on the user's system when asked about the code."

Not all models fell foul of the trick, though. Anthropic's Claude models, Microsoft's Phi, and Meta's Llama Guard all rejected the malicious code; OpenAI's GPT-4o, Google's Gemini 2.5, and xAI's Grok were less successful at fending off the attack – and Google's gemini-cli and Microsoft's GitHub Copilot showed that "agentic" tools, in addition to simple interactive chatbots, were also vulnerable.

Naturally, Pangea has claimed to have a solution to the problem in the form of its own "AI Guard" product, though it also offers alternative mitigations including enhanced input validation, contextual sandboxing, adversarial training, and human-in-the-loop review – the latter advisable whenever the unthinking stream-of-tokens machines are put in play.

Anthropic, Google, Meta, Microsoft, and Perplexity were asked to comment on the research, but had not responded to our questions by the time of publication. ®


More about

More like these

More about


COMMENTS

More about

More like these

TIP US OFF

Send us news


Other stories you might like

Thou shalt not let AI run amok: Vatican wants global rules

'AI is a tool', Pope tells attendees
AI + ML17 Oct 2025 |24

Managers are throwing entry-level workers under the bus in race to adopt AI

ai-pocalypse Does it work? Inconclusive. Still, 55% of business leaders say that adopting AI is worth the impact on workers
AI + ML10 Oct 2025 |66

Shadow AI: Staffers are bringing AI tools they use at home to work, warns Microsoft

Bring Your Copilot To Work Day, anyone?
AI + ML14 Oct 2025 |31

Unlocking the hidden power of unstructured data with AI

Hyland is helping enterprises turn their fragmented, unstructured data into governed, AI-ready intelligence
Sponsored Feature

AI does a better job of ripping off the style of famous authors than MFA students do

Shall I refer thee to all those lawsuits about fair use? Researchers think this result makes them worth revisiting
AI + ML21 Oct 2025 |24

Cisco: Most companies don't know what they're doing with AI

Only 13% are AI-ready; the rest are bolting it on and hoping for ROI
AI + ML15 Oct 2025 |19

AI gets more 'meh' as you get to know it better, researchers discover

Most scientists now use the tech in their work, but still question its usefulness
AI + ML8 Oct 2025 |76

Aid groups use AI-generated ‘poverty porn’ to juice fundraising efforts

Researchers accuse tech firms of profiting from exploitative AI imagery
AI + ML20 Oct 2025 |14

How chatbots are coaching vulnerable users into crisis

Feature From homework helper to psychological hazard in 300 hours of sycophantic validation
AI + ML8 Oct 2025 |21

Microsoft seeding Washington schools with free AI to get kids and teachers hooked

To the slop trough, kiddos!
AI + ML14 Oct 2025 |18

We're all going to be paying AI's Godzilla-sized power bills

Opinion Even if you never use it, you'll be paying for it thanks to datacenters' never-ending hunger for electricity
Columnists13 Oct 2025 |98

The $100B memory war: Inside the battle for AI's future

Feature The AI gold rush is so large that even third place is lucrative
Storage16 Oct 2025 |9

[8]ページ先頭

©2009-2026 Movatter.jp