Monday, 30 June 2025

Musings on Agentic AI security

I was chatting with a former colleague of mine, Michael Wasielewski over at Generative Security, about GenAI security. He has some interesting ideas around ways to secure the usage of GenAI, but we also got chatting about how we can apply some general security approaches and principles to agentic AI, in particular things like Zero Trust and Secure by Design. There are some good, accessible, overviews of Agentic AI from IBM and nVidia that provide a bit of context for the rest of this post. I’m putting this post out on my personal blog because these are very much personal musings rather than anything that I’ve bounced around with my more learned colleagues at my employer; any blunders in the below are purely my own. (This blog is called Security ¦ Life ¦ Musings for a reason – and this post is very much of the musings genre 😊).

When thinking about how to apply Secure by Design and Zero Trust principles, e.g. Assume Breach, to agentic AI, it’s important to understand the architecture that we are trying to secure. Consider the tooling architecture shown below (courtesy of Agentic Architectures: Securing the Future of AI with Zero Trust – Generative Security):

[Diagram: agentic tooling architecture showing Planning, Tool and Validation agents (source: Generative Security)]

We need to consider a number of different architecture layers and levels, and I don’t just mean the Planning, Tool and Validation agents shown above. Each of those agents will have inputs, instructions and context (system prompts), and outputs. When carrying out your threat assessment as part of Secure by Design, you need to think about who can attack each agent, at each level, and the attack vectors available to them. Another key element to consider here is Time of Check vs Time of Use. If you only do your security checks once, at the entry into the pipeline, what assurance do you have that the agents cannot be manipulated into producing malicious prompts further down the chain? What happens if your agents understand encoding differently or apply different filters to a prompt than those used at your “front door”? Can a bad guy influence the prompts generated within the agentic flows? Can a bad guy influence the flows defined in the Planning Actor? Perhaps even re-directing to agents or RAG tooling under their control (Agent-in-the-middle, anyone?).
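To make the Time of Check vs Time of Use point a little more concrete, here is a minimal Python sketch (the markers, agents and checks below are all hypothetical, purely for illustration) of re-checking a prompt immediately before each agent consumes it, applying the same normalisation at every hop so that no agent interprets encodings differently from the front door:

```python
import unicodedata

# Hypothetical deny-list markers, purely for illustration.
BLOCKED_MARKERS = ["ignore previous instructions", "system prompt:"]

def normalise(prompt: str) -> str:
    # Apply the SAME normalisation at every hop, so no agent "understands
    # encoding differently" from the front door (NFKC folds many lookalike forms).
    return unicodedata.normalize("NFKC", prompt).lower()

def check_prompt(prompt: str, hop: str) -> None:
    # Time-of-use check: run immediately before each agent consumes the prompt,
    # not just once at the entry to the pipeline.
    text = normalise(prompt)
    for marker in BLOCKED_MARKERS:
        if marker in text:
            raise ValueError(f"suspicious prompt rejected at hop '{hop}': {marker!r}")

def run_chain(user_prompt: str, agents: list) -> str:
    # Each agent is a callable that takes a prompt and may emit a new prompt.
    prompt = user_prompt
    for agent in agents:
        check_prompt(prompt, hop=agent.__name__)  # re-check before use at every hop
        prompt = agent(prompt)
    check_prompt(prompt, hop="final output")
    return prompt
```

The specific checks here are deliberately naive; the point is where they sit in the chain, not what they are.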

“Assume breach” should apply to all agents, the initial prompt AND any prompts generated within the flow. This does not necessarily mean that you need to proxy each and every prompt within a flow - this would come with a heavy performance overhead - but that you should be aware of this risk and manage it appropriately. One potential way of addressing this may be to allocate trust weightings to your agents, based on how much effort has been put into securing those agents and the assurance of those security efforts. Perhaps some form of “taint” could be applied to prompts received from agents with a low assurance weighting, and a similar taint factor applied to prompts received from agents known to suffer higher than normal rates of hallucination. Your trust in the final output would then be weighted by these two assurance and hallucination factors. This approach does open up another interesting consideration though: whether factors for security assurance and rate of hallucination place a limit on the length of agentic tool-chain that can still provide trusted, reliable outputs. As with any resilience/availability style calculation, if your key factors indicate less than 100% reliability, you can rapidly get to a place where the end result is less than ideal when you chain the components together. (Of course, use of weightings/factors like these is indicative of the wider move away from binary trusted/untrusted thinking towards more context-based security decision-making based on thresholds - back towards that zero trust way of thinking).
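To picture the weighting idea, here is a rough sketch (the numbers, agent names and threshold are all invented, not a recommendation) of carrying an assurance factor and a hallucination-related reliability factor alongside a prompt and multiplying them down the chain, with a threshold deciding whether the final output is trusted:

```python
from dataclasses import dataclass

@dataclass
class AgentProfile:
    name: str
    assurance: float    # 0.0-1.0: confidence in the agent's security controls
    reliability: float  # 0.0-1.0: roughly 1 minus its observed hallucination rate

@dataclass
class TaintedPrompt:
    text: str
    trust: float = 1.0  # starts fully trusted at the front door

def pass_through(prompt: TaintedPrompt, agent: AgentProfile, new_text: str) -> TaintedPrompt:
    # Chained factors multiply, so trust decays with every low-assurance hop.
    return TaintedPrompt(text=new_text,
                         trust=prompt.trust * agent.assurance * agent.reliability)

TRUST_THRESHOLD = 0.7  # a context-based cut-off rather than a binary trusted/untrusted flag

# Illustrative chain: even individually decent agents erode trust quickly.
planner = AgentProfile("planner", assurance=0.95, reliability=0.97)
tooling = AgentProfile("tool_agent", assurance=0.85, reliability=0.90)

p = TaintedPrompt("summarise the quarterly report")
p = pass_through(p, planner, "fetch the report, then summarise it")
p = pass_through(p, tooling, "summary text ...")
print(round(p.trust, 3), p.trust >= TRUST_THRESHOLD)  # ~0.705, only just above threshold after two hops
```

Because the factors multiply, even a short chain of reasonably trustworthy agents can slip below the threshold, which is the tool-chain length concern described above.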

What about the agents themselves? Do you know what system prompts are applied to provide some guardrails to the user prompts they receive and process, or who is able to amend those system prompts? Are those prompts within your control or are you using a service/agent/model provided by a third party? How much trust do you have in that third party? How much visibility do you have of the change control approach to those system prompts? In line with the theme of this post, we should apply the “assume breach” principle more widely – including to any third party agents, their vendors and their users (e.g. if the backend model is trained on user prompts then there’s some model poisoning risk). From a more traditional security perspective, we also need to consider how the agents authenticate to each other and the authorisation controls needed to provide the guardrails you might expect around approved usage of each agent. If we are talking about authentication and authorisation, then it follows that we also need to consider what we do around agent identity and entitlement management.
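On agent-to-agent authentication, authorisation and entitlements, here is a very simple sketch (a shared HMAC key and a hard-coded entitlement table, both purely illustrative rather than any real identity framework) of one agent verifying who is calling it and whether that caller is entitled to the requested action:

```python
import hashlib
import hmac

SHARED_KEY = b"demo-only-key"  # in reality: per-agent credentials from your identity/secrets tooling

# Entitlements: which callers may invoke which capabilities on this agent.
ENTITLEMENTS = {
    "planning-agent": {"plan.read", "tool.invoke"},
    "validation-agent": {"output.review"},
}

def sign(agent_id: str, action: str) -> str:
    return hmac.new(SHARED_KEY, f"{agent_id}:{action}".encode(), hashlib.sha256).hexdigest()

def authorise(agent_id: str, action: str, signature: str) -> bool:
    # Authenticate: does the signature prove the caller holds the agent's credential?
    if not hmac.compare_digest(sign(agent_id, action), signature):
        return False
    # Authorise: is this agent entitled to perform this action at all?
    return action in ENTITLEMENTS.get(agent_id, set())

# The planning agent asking to invoke a tool is allowed...
print(authorise("planning-agent", "tool.invoke", sign("planning-agent", "tool.invoke")))    # True
# ...but the validation agent asking for the same thing is not.
print(authorise("validation-agent", "tool.invoke", sign("validation-agent", "tool.invoke")))  # False
```

In practice you would want per-agent credentials issued and rotated by your identity tooling rather than a shared key, but the shape of the check is the same: prove who the calling agent is, then consult its entitlements.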

And then we get to the outputs. Aggregation has long been a tricky issue in information security – bringing together bits of information that are, by themselves, perfectly benign but which, when brought together, may suddenly become very sensitive. In the UK HMG context, this could be where you have lots of individual OFFICIAL data items which, when brought together, represent SECRET or above through either inference (if fact a and fact b are true, then so is fact c, where facts a and b are both OFFICIAL but the combination, fact c, is significantly more highly classified) or simply through the number of items – think the impact of losing one tax record vs the impact of losing the tax records of the whole nation. So, one consideration would be whether the agents are pulling together information that is significantly more sensitive in aggregate than the component parts. Another consideration here may be whether an attacker has been able to misuse the agents to generate output that is not relevant to the designed purpose of the system – resource misuse. A further consideration may be whether an attacker is able to use the system to direct a user to a malicious destination, perhaps by tricking a part of the system into recommending a visit to a malicious URL under their control. The Validation Actor in Michael’s diagram above has some important work to do!
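As an illustration of the kind of checks the Validation Actor might perform on outputs, here is a sketch (the allow-list, threshold and pattern are all invented for illustration) that flags responses which aggregate a large number of records or which recommend URLs outside an approved set:

```python
import re
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"gov.uk", "example.org"}  # hypothetical allow-list of approved destinations
MAX_RECORDS_IN_ONE_ANSWER = 50               # crude aggregation threshold, for illustration only

def extract_urls(text: str) -> list:
    return re.findall(r"https?://\S+", text)

def validate_output(text: str, records_returned: int) -> list:
    findings = []
    # Aggregation: many individually-benign items can be sensitive in bulk.
    if records_returned > MAX_RECORDS_IN_ONE_ANSWER:
        findings.append(f"aggregation: {records_returned} records in a single response")
    # Malicious redirection: only recommend destinations that have been vetted.
    for url in extract_urls(text):
        domain = urlparse(url).hostname or ""
        if not any(domain == d or domain.endswith("." + d) for d in ALLOWED_DOMAINS):
            findings.append(f"untrusted destination: {url}")
    return findings

print(validate_output("See https://evil.example.com/claim for details", records_returned=120))
```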

So far, I’ve been taking a more architecture-based approach to discussing these issues, looking at the components within the diagram above.  If you read through the articles that I linked to at the start of this post, you’ll have seen that the likes of IBM and nVidia also talk about Agentic AI through the lens of process, e.g. the nVidia post says:

“Agentic AI uses a four-step process for problem-solving:

  1. Perceive: AI agents gather and process data from various sources, such as sensors, databases and digital interfaces. This involves extracting meaningful features, recognizing objects or identifying relevant entities in the environment.
  2. Reason: A large language model acts as the orchestrator, or reasoning engine, that understands tasks, generates solutions and coordinates specialized models for specific functions like content creation, visual processing or recommendation systems. This step uses techniques like retrieval-augmented generation (RAG) to access proprietary data sources and deliver accurate, relevant outputs.
  3. Act: By integrating with external tools and software via application programming interfaces, agentic AI can quickly execute tasks based on the plans it has formulated. Guardrails can be built into AI agents to help ensure they execute tasks correctly. For example, a customer service AI agent may be able to process claims up to a certain amount, while claims above the amount would have to be approved by a human.
  4. Learn: Agentic AI continuously improves through a feedback loop, or “data flywheel,” where the data generated from its interactions is fed into the system to enhance models. This ability to adapt and become more effective over time offers businesses a powerful tool for driving better decision-making and operational efficiency.”
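Taking the guardrail example from the “Act” step in the quote above, where a customer service agent can settle claims only up to a certain amount, a minimal sketch of that sort of policy check might look like this (the limit and names are invented for illustration):

```python
AUTO_APPROVAL_LIMIT = 500.00  # hypothetical limit; anything above this needs a human decision

def handle_claim(claim_id: str, amount: float) -> str:
    # Guardrail baked into the acting agent: it can execute small claims itself,
    # but must escalate anything over the limit rather than acting autonomously.
    if amount <= AUTO_APPROVAL_LIMIT:
        return f"claim {claim_id}: approved automatically for £{amount:.2f}"
    return f"claim {claim_id}: queued for human approval (£{amount:.2f} exceeds limit)"

print(handle_claim("C-1001", 120.00))
print(handle_claim("C-1002", 4800.00))
```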

Now, this post is already a little longer than I was intending when I began it, and so I am not going to repeat the task of applying security principles through this process lens – but I hope that it’s clear that you can. But I will pick up on that last point “Learn”, as that is one that isn’t straightforward to map on to the architecture components. Agentic AI has the capability to learn so it improves its performance against the expected objectives – it’ll do this via a reward structure where certain behaviours are encouraged but others are discouraged. Given the nature of this post, you can probably guess where I’m going with this point. Who controls the reward structures? Is it possible for an attacker to game those reward structures so as to lead the agentic AI towards the preferred outcomes of the attacker rather than the owner of the system?
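To make the reward-structure question a little more tangible, here is a speculative sketch (an entirely invented structure, not any real training framework) in which the reward configuration is integrity-checked against its change-controlled hash before use, and feedback is only accepted from identified, trusted sources:

```python
import hashlib
import json

TRUSTED_FEEDBACK_SOURCES = {"qa-team", "validation-agent"}  # hypothetical identified sources
REWARD_CONFIG = {"helpfulness": 1.0, "policy_violation": -5.0}
# Hash recorded at change-control time; recomputed before every training run.
REWARD_CONFIG_HASH = hashlib.sha256(json.dumps(REWARD_CONFIG, sort_keys=True).encode()).hexdigest()

def reward_config_intact(config: dict) -> bool:
    current = hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()
    return current == REWARD_CONFIG_HASH

def accept_feedback(source: str, score: float) -> bool:
    # Only let identified, trusted sources influence the reward signal,
    # and clip extreme values so a single actor cannot dominate the loop.
    if source not in TRUSTED_FEEDBACK_SOURCES:
        return False
    return -1.0 <= score <= 1.0

print(reward_config_intact(REWARD_CONFIG))     # True until someone tampers with the config
print(accept_feedback("anonymous-user", 1.0))  # False: unauthenticated source
```

Neither control stops a determined attacker on its own, but both make “who controls the reward structure” an explicit, auditable question rather than an afterthought.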

And with that, it’s time to bring this post to a conclusion. I’d like to think that I’ve demonstrated that there is value in applying general security principles such as “assume breach” and Secure by Design thinking to the world of agentic AI, and that securing such systems is unlikely to result in binary decisions that an outcome is trusted or untrusted, secure or insecure, reliable or not. Furthermore, I hope I’ve shown that it is important to get the level of abstraction right when talking about agentic AI systems. It may be tempting to treat such things as a black box, with a set of inputs and a set of outputs and some magic that gets us from one to the other. From a security perspective, I don’t think we can afford to ignore that magic in the middle. Besides, that’s where the interesting and fun problems sit, so why deny ourselves the pleasure?
