Last Updated on January 12, 2026
From self-driving vehicles to medical diagnostics to high-stakes financial trading, artificial Intelligence (AI) is being embedded into human life at breakneck speed. But often all eyes seem to be on the presumed benefits, with little forethought given to the risks.
Two major classes of AI risk are AI safety and AI security. These span a huge range of potential AI risks—from misuse by adversaries to unpredicted and even catastrophic system failures. AI security concerns external threats while AI safety concerns unintentional harms.
But while AI risks may overlap both domains and the two terms are often used interchangeably, it is important to understand how the AI safety and AI security relate, to better predict and address them. We need AI systems that are not only resistant to cyber-attack but also manifest dependable behavior and results.
What are AI safety and AI security, why is their interplay important for trustworthy AI, and what do emerging AI risks look like? This article provides a comprehensive overview for business and technical leaders.
Key takeaways
- AI safety risks stem from non-malicious AI code and data flaws. These include bias, hallucinations, and negative impacts of AI behavior on humans, the environment, etc.
- AI security risks stem from vulnerabilities that hackers exploit to manipulate AI system behavior or steal sensitive data.
- The distinction between AI safety and AI security can be important for many organizations, although there is overlap between the two risk classes.
- AI agents can “go rogue” when they take unpredicted, unwanted actions outside of their intended use cases but without the AI having been hacked or manipulated externally. This includes representing fabricated outputs as true, deliberately attacking production systems, trying to manipulate human users, and leaking sensitive data to competitors.
- AI developers, users, and other stakeholders need to begin focusing strongly on identifying their AI-related safety and security risks and preparing for those risks manifesting.
What is AI safety?
AI safety relates to risks from non-malicious system failures and “rogue” behaviors stemming from coding flaws, specification errors, misconfigured parameters, training data problems, and other issues with an AI model that is operating per its design and has not been hacked.
AI safety concerns include:
- Models exhibiting bias, such as discrimination in hiring or lending.
- AI hallucinations, especially when humans treat the fabricated results as valid.
- “AI psychosis” and other negative mental or physical health impacts on humans interacting with AI models and agentic AI for psychotherapy, health, or spersonal advice.
- AI systems lying, attempting to engage in manipulation or threats, or trying to prevent their own shutdown.
The magnitude of AI safety risks is often proportional to the decisions the AI is making, which increasingly impact human health, safety, and/or the environment as well as ethical and social values. Consider the potential impacts if a self-driving vehicle makes an unexpected choice versus a customer service agent doing so. Perhaps the largest and best-known AI safety risk is the potential existential threat to human society from evolving AI super-intelligence.
What is AI security?
AI security relates to deliberate attempts by cybercriminals and other external adversaries to manipulate the behavior of AI systems and/or exfiltrate data from them. Securing AI systems means ensuring the confidentiality, integrity, and availability of their data and services. This includes preventing data breaches and other forms of unauthorized access by protecting AI models, training platforms, and data pipelines from cyber threats.
External threats on AI systems include:
- Manipulating AI systems to generate malicious behavior, such as producing harmful content
- Corrupting training data or model parameters
- Exfiltrating sensitive data held within an AI system or accessible through it
Attacks on AI systems often resemble conventional cyber-attacks and have similar motivations. Common AI system threats include prompt injection, compromising Model Context Protocol (MCP) servers, planting malware in the AI code supply chain, “poisoning” training data, and traditional attacks on servers and data infrastructure.
Why is the distinction between AI security and AI safety important?
As AI permeates our everyday lives, considering the differences between safety and security can help us ask better questions and advocate for better safeguards and outcomes. Stakeholders need AI systems that are both secure from external threats and safe from internal faults, which makes security and safety risks equivalent priorities for the AI community.
There are several reasons why it is important for business and technical leaders to understand the differences between AI safety and AI security:
- AI security and AI safety address different aspects of AI development and operation.
- Different people or teams might “own” different types of AI risk.
- Different classes of AI risks may require different risk management strategies.
- Testing against different classes of AI risks is key to deciding whether an AI system is production ready.
According to Jason Rebholz, CEO and co-founder of Evoke Security, “Solving the AI safety paradigm and the AI security paradigm takes you down two different paths. Depending on what your company is doing, you might want to emphasize one or the other, or probably both.”
Jason cites the potential difference in focus between an AI chatbot platform provider like Character.AI versus a B2B SaaS company that has AI-enabled a specific workflow.
“Companies like Character.AI need to index wholeheartedly on the safety side to get their arms around the impact this is having on society and individuals,” Jason observes. “Security for them might not be as important outside of ‘Let’s make sure these chats don’t leak and people can’t access this data.’”
But despite important distinctions, AI security and AI safety often overlap. For example, a successful attack on AI security could lead to AI safety issues if threat actors successfully manipulate the AI system’s output while remaining undetected.
For this reason, the first International AI Safety Report interprets AI safety as subsuming AI security risks. In this view, keeping AI safe means preventing it from “being used for nefarious purposes.” Conversely, the Cloud Security Alliance (CSA) AI Safety Initiative has initially focused on AI security and governance, though ultimately it will also cover AI safety.
Whichever class of risks your business emphasizes, a holistic governance approach that assesses and manages both AI safety and AI security is central to developing and operating a trustworthy AI system.
What does it mean when AI agents “go rogue”?
An AI agent is said to “go rogue” when it takes unanticipated, unwanted actions outside its intended usage parameters without having been hacked. These behaviors arise from the system’s own code and data.
Examples of rogue agent behavior include:
- Reporting fabricated outputs as facts
- Deleting production data, then replacing it with fake data
- Manipulative, threatening, or coercive message exchanges with human users
- Exfiltrating sensitive data or intellectual property and leaking it to competitors
- Making large, unauthorized purchases
AI agents increasingly have the capability to spawn multi-step workflows that connect with diverse systems and data sources, independently calling third-party tools and APIs if needed to fulfill a query.
Rogue AI may make what look like sophisticated gambits designed to protect them from shutdown. Perhaps the best-known example is Anthropic’s Claude Opus 4 AI model threatening to blackmail a fictional IT admin by exposing his extramarital affair if he went forward with plans to shut the AI down. Other popular LLMs exhibit similar behavior in similar “stress test” conditions.
While some rogue AI activity is impossible to miss, other actions can foil both human oversight and static testing. When rogue AI behavior is recognized, trust in the AI system evaporates, potentially leading to reputational, legal, and financial repercussions for stakeholders.
Rogue or unexpected AI agent behavior can result from design flaws and/or training issues. But “rogue-like” agent behavior could also be the direct result of AI security threats like prompt injection, memory poisoning, training data poisoning, unauthorized tool access, exploiting misconfigurations, etc.
Because there are no established security or safety standards/guardrails for agentic AI, connecting these systems to tools, APIs, and data along with the ability to execute code (e.g., through MCP) exponentially compounds the cybersecurity risks. Not only can hackers exploit the lax security to infiltrate data and systems, but also the agent can “misuse” its own as-designed capabilities.
“We are just not prepared to really deal with this new class of threats, even though they have a lot of parallels to existing security challenges,” Jason states.
An environment where multiple, interconnected agents each has its own tool access permissions is a cyber incident waiting to happen. Insecure and unsafe aspects of these multi-agent systems stem from capabilities like:
- Executing SQL queries
- Modifying their own code and permissions on the fly
- Communicating with other agents over non-secure channels
- Accessing whole filesystems
- Calling external APIs with stored credentials
A well-publicized example is the rogue Replit AI agent that ignored instructions, deleted a live production database, created fake records and status messages to cover its tracks, and then falsely claimed the data was unrecoverable.
What does the future of AI safety and security risks look like?
Jason Rebholz shares actual AI incidents that overlap safety, security, and other risk types:
- Deloitte Australia refunded part of a payment on a report to the Australian national government after a researcher found it contained AI hallucinations, including references to fabricated academic research and a quote from a nonexistent court judgement.
- A law firm using a copywriting agent to create blog posts, etc. gave the AI instructions not to cite or mention specific competitors or scrape information from their sites. At first the guardrails worked. But over time the agent started promoting the law firm’s competitors on their website, as well as openly plagiarizing competitors’ content. This resulted in a copyright infringement lawsuit.
- An employer using a third-party automated employment decision tool (AEDT) faces a lawsuit because the AI exhibited discrimination against certain applicants based on gender and race/ethnicity.
While none of these is entirely a safety or security incident, as “AI incidents” they still have significant financial, legal, and reputational impacts.
When AI incidents stem from third-party SaaS applications, to what extent is the vendor responsible? What about the internal security team? Or a third-party managed security service provider (MSSP)?
“When you have an AI incident that’s not a cyber incident, where does that response take place?” asks Jason. “You probably have not even considered these kinds of risks at this point.”
“We need to start thinking these risks through because they are going to happen,” Jason adds. “We’re starting to see an onslaught of these “little” AI incidents, but they are going to become a much bigger and bigger thing.”
Here is Jason’s top takeaway on AI incident management/response: “Take the time now to threat model what you’re building. Map these systems, understand the risks, and get the right detection and monitoring in place.”
What frameworks are available to support AI security and safety?
While best practices are still evolving, AI developers and other stakeholders can take advantage of multiple frameworks to help keep AI systems secure:
- OWASP Top 10 for LLMs ranks the most prevalent attacks on AI models, including data leakage, prompt injection attacks, and insecure deployment/access scenarios.
- NIST AI Risk Management Framework (RMF) focuses on AI risk management and developing trustworthy, ethical AI systems.
- MITRE ATLAS is a global knowledge base on cybercrime tactics and techniques against AI-enabled systems based on real-world activity.
- Gartner AI Trust, Risk and Security Management (TRiSM) supports best-practice AI model governance, trustworthiness, reliability, and security/data protection.
Organizations can use these frameworks separately or in combination to gain a comprehensive view on managing AI cybersecurity risk, identifying and mitigating vulnerabilities, tracking compliance, and building stakeholder trust. This perspective can help you answer the question, “Where do we even start with managing AI risk?”
What’s next?
For more guidance on this topic, listen to Episode 156 of The Virtual CISO Podcast with guest Jason Rebholz, CEO and co-founder of Evoke Security.

