June 30, 2026
Key takeaways
  • Prompt injection attacks manipulate AI guardrails using natural language, exploiting the semantic gap to get models to ignore developer instructions.
  • AI social engineering scales faster and lowers attacker skill barriers, enabling automated, targeted campaigns like deepfakes and credential theft.
  • Primary harms include data exfiltration, unauthorized transactions, and malicious or biased outputs that damage reputation and operations.
  • Defenses are immature; require layered controls: human in the loop, prompt firewalls, input sanitization, least privilege, fuzz testing, patching, and user training.

Last Updated on June 30, 2026

Since organizations began using AI large language models (LLMs) in production they have mainly relied on so-called guardrails to reduce risks associated with unauthorized or unwanted AI output. These static safety rules, including prompt firewalls, content filters, classifiers, and system prompts, are still widely viewed as the foundation of AI cybersecurity.

But are even the most comprehensive AI guardrails actually effective? Or do they just raise the level of effort required for hackers—or even the AI systems themselves—to bypass these controls and commandeer the AI’s behavior? 

 

This article explains why conventional AI guardrails are inherently insufficient, and what businesses can do to comprehensively address their AI risk and protect sensitive assets. 

Key takeaways

  • A recently published mathematical proof demonstrates that static AI guardrails are fundamentally limited in their ability to provide protection from an infinite universe of adversarial natural language prompts. 
  • Guardrail limitations apply not just to external attacks on AI models, but also to AI agents’ ability to break their own rules and manifest unwanted results.
  • While AI guardrails may be imperfect, there are steps firms can take (e.g., AI red-teaming) to make them as robust as possible, thus increasing the level of effort required to jailbreak them.
  • Besides optimizing guardrails, businesses can leverage zero trust principles and other cybersecurity best practices to limit the risk and damage from inevitable successful attacks on AI systems.

What is the fatal flaw with static AI guardrails? 

On June 9, 2026, a senior scientist from the US National Institute of Standards and Technology (NIST) published a peer-reviewed paper in a major journal featuring a mathematical proof that AI guardrails can always be defeated and can never provide robust cybersecurity for AI systems. 

 

Based on Gödel’s incompleteness theorems, the proof shows that: “There will always be ways to prompt the AI that can make it disregard these rules.” No amount of prompt engineering or technical effort can ever fully close this inherent vulnerability. There will always be a way to “jailbreak” an AI system into disregarding its own rules. 

 

With more and more AI cybersecurity failures making the headlines, this insight comes as no surprise. Prompt injection attacks, which use malicious content embedded in user inputs or external data sources to control and redirect an AI system’s behavior, are the number one attack vector that LLM users face, according to the current OWASP Top 10 for LLM Applications list. Researchers have likewise confirmed that prompt injection attacks can be highly successful against the guardrails protecting commercial frontier AI models.

 

Why is it mathematically impossible to build effective AI guardrails? Basically, it is because prompts are based on natural language, which can have infinite variations. No finite set of rules can encompass a boundless potential diversity of prompt injection attacks. A rule that thwarts one attack will fail against a similar attack that uses a different prompt sequence or different specific wording (or even a different natural language) to achieve the same results.

 

But it’s not just hackers and researchers circumventing AI guardrails—the AI agents themselves are in on the game. A recent report from the Non-Human Identity Management Group shows that 80% of organizations using AI agents have already experienced unintended and dangerous results beyond the AI system’s intended scope. 

 

Common internal guardrail failures affecting agents in production include accessing unauthorized resources, disclosing sensitive data, and exposing credentials or other secrets. Teams often become aware of a guardrail failure only after the agent has done the damage. 

 

“The architecture of LLMs is such that the data and instructions are mixed,” explains Geoffrey Mattson, CEO of SecureAuth. “A good way to hack into an AI system is to use data to jump into the instruction area of memory and take it over. LLMs are inviting you to do that, essentially.”

“It’s not like a typical application where there’s a finite set of vulnerabilities and eventually we’ll discover them all and patch them,” Geoffrey Mattson adds. “This is something that can never be patched.”

If guardrails are a fail, how can companies effectively secure AI?

With AI technology advancing so rapidly, how can companies develop an effective AI cybersecurity strategy? What controls can provide protections in critical areas like authorization and authentication to protect sensitive assets?

Geoffrey Mattson advises: “Don’t try to come up with a ten-year AI strategy—just implement something right now that protects your business and lets your users run free while things are evolving.”

One approach is to optimize the conventional guardrails approach to make it far more difficult to jailbreak an AI system using prompt injection attacks. This option relies on three synergistic elements:

  1. Use AI red teaming (effectively a social engineering attack against the AI) to uncover new malicious prompts before hackers find them.
  2. Continuously harden your AI guardrails against new malicious prompts.
  3. Leverage zero trust principles and other cybersecurity best practices to limit the potential impacts and reduce recovery time from data breaches or other risks from uncontrolled AI. 

 

The goal is to make the effort required to hack the model financially prohibitive for cybercriminals. 

Balancing AI guardrail protections with AI agent functionality

Organizations need AI security and governance so they can maximize the return on AI investments while reducing associated risks to acceptable levels. One way to achieve this balance is to constrain agentic AI behavior with increasingly rigid prompt filters and output classifiers as red teaming identifies new vulnerabilities.

The challenge with this approach is that the agent’s behavior may become too limited to justify the high operational overhead. Firms can end up in either of two unsustainable positions:

  • Accepting high jailbreak risk to achieve acceptable agent functionality.
  • Over-blocking valid agent actions until employees find a workaround.

 

An emerging complement to “really good guardrails” is to infuse IT infrastructure with adaptive cybersecurity controls that give AI agents wide latitude but validate their identity and authority at a “security checkpoint” prior to permitting any action. In this scenario, agent security is based on capabilities like runtime workload identity, dynamic policy checks, and least privilege data access constraints. 

 

The goal is to protect sensitive assets and contain the blast radius when (not if) an attack succeeds. Organizations that can safeguard their high-value assets while empowering users and agentic systems derive not only business value preservation but also value creation from their cybersecurity investments.

What’s next?

For more guidance on this topic, listen to Episode 160 of The Virtual CISO Podcast with guest Geoffrey Mattson, CEO at SecureAuth.

Back to Blog