I was chatting with a former colleague of mine, Michael Wasielewski over at Generative Security, about GenAI security. He has some interesting ideas around ways to secure the usage of GenAI, but we also got chatting about how we can apply some general security approaches and principles to agentic AI, in particular things like Zero Trust and Secure by Design. There are some good, accessible overviews of agentic AI from IBM and nVidia that provide a bit of context for the rest of this post. I’m putting this post out on my personal blog because these are very much personal musings rather than anything that I’ve bounced around with my more learned colleagues at my employer, so any blunders in the below are purely my own. (This blog is called Security ¦ Life ¦ Musings for a reason – and this post is very much of the musings genre 😊).
When thinking about how to apply Secure by
Design and Zero Trust principles, e.g. Assume Breach, to agentic AI, it’s
important to understand the architecture that we are trying to secure. Consider
the tooling architecture shown below (courtesy of Agentic
Architectures: Securing the Future of AI with Zero Trust – Generative Security):
We need to consider a number of different architecture
layers and levels, and I don’t just mean the Planning, Tool and Validation
agents shown above. Each of those agents will have inputs, instructions and
context (system prompts), and outputs. When carrying out your threat assessment
as part of Secure by Design, you need to think about who can attack each
agent, at each level, and the attack vectors available to them. Another key
element to consider here is Time of Check vs Time of Use. If you only do your
security checks once, at the entry into the pipeline, what assurance do you
have that the agents cannot be manipulated into producing malicious prompts
further down the chain? What happens if your agents interpret encodings
differently, or apply different filters to the prompt than those used at your
“front door”? Can a bad guy influence the prompts generated within the agentic
flows? Can a bad guy influence the flows defined in the Planning Actor, perhaps
even re-directing to agents or RAG tooling under their control
(agent-in-the-middle, anyone?)?
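To make that Time of Check vs Time of Use point a little more concrete, here's a rough Python sketch of what re-checking a prompt at each hop might look like. The agent names, blocklist and policy are entirely illustrative – a real implementation would lean on proper prompt-inspection tooling rather than a keyword list – the point is simply that every hop normalises and re-validates what it receives:

```python
import unicodedata

BLOCKLIST = {"ignore previous instructions", "exfiltrate"}  # illustrative only


def normalise(prompt: str) -> str:
    # Apply the same normalisation at every hop, so a downstream agent can't
    # interpret an encoding differently to the filter that approved the prompt upstream.
    return unicodedata.normalize("NFKC", prompt).casefold()


def check_prompt(prompt: str, source_agent: str) -> str:
    cleaned = normalise(prompt)
    if any(term in cleaned for term in BLOCKLIST):
        raise ValueError(f"Prompt from {source_agent} failed policy check")
    return cleaned


# Each agent re-validates its input at time of use, rather than relying solely
# on the check done when the request first entered the pipeline.
user_prompt = check_prompt("Summarise today's sales figures", "front-door")
planner_prompt = check_prompt(f"Plan the steps for: {user_prompt}", "planning-agent")
tool_prompt = check_prompt(f"Run the report query for: {planner_prompt}", "tool-agent")
```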
“Assume breach” should apply to all agents,
the initial prompt AND any prompts generated within the flow. This does not
necessarily mean that you need to proxy each and every prompt within a flow -
this would come with a heavy performance overhead - but that you should be
aware of this risk and manage it appropriately. One potential way of addressing
this may be to allocate trust weightings to your agents, based on how much
effort has been put into securing those agents and the assurance of those security
efforts. Perhaps some form of “taint” could be applied to prompts that are
received from those agents with a low assurance weighting? I’d suggest that
this taint approach could also be useful when dealing with agents known to suffer
from high levels of hallucination: perhaps a second factor could be applied to
prompts received from those agents? Your trust in the final output would then be
weighted by these two taint factors, assurance and hallucination. This approach
does open up another interesting consideration, though: whether factors relating
to security assurance and rate of hallucination place limits on the length of the
agentic tool-chains that are capable of providing trusted, reliable outputs. As
with any resilience/availability style calculation, the factors multiply as you
chain the components together, so if each one indicates less than 100% reliability
you can rapidly get to a place where the end result is less than ideal. (Of course,
use of weightings/factors like these is indicative of the wider move away from
binary trusted/untrusted thinking towards more context-based security
decision-making based on thresholds - back towards that zero trust way of thinking).
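To illustrate that chaining effect with some made-up numbers, here's a minimal Python sketch. The per-agent factors and the simple multiplication rule are assumptions for the sake of illustration, not a defined scoring scheme:

```python
# Assumed per-agent factors: a security assurance weighting and a
# "reliability" factor (1 minus an assumed hallucination rate).
agents = [
    ("planning-agent",   0.95, 0.90),
    ("tool-agent",       0.90, 0.95),
    ("validation-agent", 0.99, 0.98),
]

end_to_end_trust = 1.0
for name, assurance, reliability in agents:
    end_to_end_trust *= assurance * reliability

print(f"Chained trust factor: {end_to_end_trust:.2f}")
# ~0.71 here - even with individually "good" agents, a longer chain can
# quickly fall below whatever threshold you consider acceptable.
```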
What about the agents themselves? Do you
know what system prompts are applied to provide some guardrails to the user
prompts they receive and process, or who is able to amend those system prompts?
Are those prompts within your control or are you using a service/agent/model
provided by a third party? How much trust do you have in that third party? How
much visibility do you have of the change control approach to those system
prompts? In line with the theme of this post, we should apply the “assume
breach” principle more widely – including to any third party agents, their
vendors and their users (e.g. if the backend model is trained on user prompts then
there’s some model poisoning risk). From
a more traditional security perspective, we also need to consider how the
agents authenticate to each other and the authorisation controls needed to
provide the guardrails you might expect around approved usage of each agent. If
we are talking about authentication and authorisation, then it follows that we also
need to consider what we do around agent identity and entitlement management.
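As a rough illustration of the agent-to-agent authentication and entitlement point, here's a minimal Python sketch using a shared-secret HMAC token and a static entitlement map. A real deployment would more likely use mTLS or platform-issued workload identities; the agent names and entitlements here are purely illustrative:

```python
import hashlib
import hmac

SHARED_SECRET = b"replace-with-a-managed-secret"   # illustrative only

# Which agents are entitled to call which downstream agents/tools.
ENTITLEMENTS = {
    "planning-agent": {"tool-agent"},
    "tool-agent": {"rag-store"},
    "validation-agent": set(),
}


def sign(agent_id: str) -> str:
    return hmac.new(SHARED_SECRET, agent_id.encode(), hashlib.sha256).hexdigest()


def authorise_call(agent_id: str, token: str, target: str) -> bool:
    # Authenticate the caller, then check it is entitled to reach the target.
    if not hmac.compare_digest(token, sign(agent_id)):
        return False
    return target in ENTITLEMENTS.get(agent_id, set())


assert authorise_call("planning-agent", sign("planning-agent"), "tool-agent")
assert not authorise_call("planning-agent", sign("planning-agent"), "rag-store")
```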
And then we get to the outputs. Aggregation
has long been a tricky issue in information security – bringing together bits
of information that are, by themselves, perfectly benign but which, when
brought together, may suddenly become very sensitive. In the UK HMG context,
this could be where you have lots of individual OFFICIAL data items which, when
brought together, represent SECRET or above. That can happen through inference
(if fact a and fact b are true then so is fact c, where facts a and b are both
OFFICIAL but the combination, fact c, is significantly more highly classified)
or simply through the sheer number of items – think the impact of losing one
tax record vs the impact of losing the tax records of the nation. So, one
consideration would be whether the agents are pulling together information that
is significantly more sensitive in aggregate than the component parts. Another
consideration here may include whether an attacker has been able to misuse the
agents to generate output that is not relevant to the designed purpose of the
system – resource misuse. A further
consideration may be whether an attacker is able to use the system to direct a
user to a malicious destination, perhaps by tricking a part of the system into
recommending a visit to a malicious URL under their control. The Validation
Actor in Michael’s diagram above has some important work to do!
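To give a flavour of the sort of checks a Validation Actor might run over an output, here's a minimal Python sketch covering two of the concerns above: a crude aggregation threshold and a URL allowlist. The threshold, the allowed domains and the regex are illustrative assumptions rather than a recommended policy:

```python
import re

MAX_RECORDS_PER_RESPONSE = 50                      # crude aggregation guard
ALLOWED_DOMAINS = {"gov.uk", "example.org"}        # illustrative allowlist


def validate_output(text: str, record_count: int) -> list:
    issues = []
    if record_count > MAX_RECORDS_PER_RESPONSE:
        issues.append(f"Aggregation risk: {record_count} records in one response")
    # Flag any recommended URL whose domain is not on the allowlist.
    for host in re.findall(r"https?://([^/\s]+)", text):
        domain = ".".join(host.split(".")[-2:])
        if domain not in ALLOWED_DOMAINS:
            issues.append(f"Untrusted URL recommended: {host}")
    return issues


print(validate_output("See https://evil.example.com/claim for details", record_count=3))
# ['Untrusted URL recommended: evil.example.com']
```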
So far, I’ve been taking a more architecture-based
approach to discussing these issues, looking at the components within the diagram
above. If you read through the articles
that I linked to at the start of this post, you’ll have seen that the likes of
IBM and nVidia also talk about Agentic AI through the lens of process, e.g. the
nVidia post says:
“Agentic AI uses a
four-step process for problem-solving:
- Perceive: AI agents gather and process data from various
sources, such as sensors, databases and digital interfaces. This involves
extracting meaningful features, recognizing objects or identifying
relevant entities in the environment.
- Reason: A large
language model acts
as the orchestrator, or reasoning engine, that understands tasks,
generates solutions and coordinates specialized models for specific
functions like content creation, visual processing or recommendation
systems. This step uses techniques like retrieval-augmented
generation (RAG)
to access proprietary data sources and deliver accurate, relevant outputs.
- Act: By integrating with external tools and software via application
programming interfaces, agentic AI can quickly execute tasks based on the
plans it has formulated. Guardrails can be built into AI agents to help
ensure they execute tasks correctly. For example, a customer service AI
agent may be able to process claims up to a certain amount, while claims
above the amount would have to be approved by a human.
- Learn: Agentic AI continuously improves through a feedback loop, or
“data flywheel,” where the data generated from its interactions is fed into the system to enhance models. This ability to adapt and become more effective over time offers businesses a powerful tool for driving better decision-making and operational efficiency.”
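As a tiny aside on that “Act” step, the claims-threshold guardrail nVidia describes might look something like the sketch below in practice (the limit and the routing are purely illustrative):

```python
CLAIM_AUTO_APPROVAL_LIMIT = 500.00   # illustrative limit


def handle_claim(amount: float) -> str:
    # Below the limit the agent acts autonomously; above it, a human decides.
    if amount <= CLAIM_AUTO_APPROVAL_LIMIT:
        return "processed-by-agent"
    return "escalated-to-human"


assert handle_claim(120.00) == "processed-by-agent"
assert handle_claim(5000.00) == "escalated-to-human"
```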
Now, this post is
already a little longer than I was intending when I began it, and so I am not
going to repeat the task of applying security principles through this process
lens – but I hope that it’s clear that you can. But I will pick up on that last
point “Learn”, as that is one that isn’t straightforward to map on to the
architecture components. Agentic AI has the capability to learn so that it improves
its performance against its expected objectives – it’ll do this via a reward
structure where certain behaviours are encouraged and others are discouraged.
Given the nature of this post, you can probably guess where I’m going with this
point. Who controls the reward structures? Is it possible for an attacker to
game those reward structures so as to lead the agentic AI towards the preferred
outcomes of the attacker rather than the owner of the system?
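One partial answer to the “who controls the reward structures?” question might be to treat the reward weights as security-relevant configuration: signed by the system owner and verified before the agent learns from them. The Python sketch below is a minimal, assumed illustration of that idea; the weights, key handling and naming are all hypothetical:

```python
import hashlib
import hmac
import json

OWNER_SIGNING_KEY = b"owner-held-signing-key"      # illustrative only


def sign_rewards(weights: dict) -> str:
    payload = json.dumps(weights, sort_keys=True).encode()
    return hmac.new(OWNER_SIGNING_KEY, payload, hashlib.sha256).hexdigest()


def load_rewards(weights: dict, signature: str) -> dict:
    # Refuse to learn from a reward structure that fails the integrity check.
    if not hmac.compare_digest(signature, sign_rewards(weights)):
        raise ValueError("Reward structure failed integrity check")
    return weights


reward_weights = {"task_completed": 1.0, "policy_violation": -5.0}
signature = sign_rewards(reward_weights)

load_rewards(reward_weights, signature)            # passes
# An attacker flipping the penalty into a reward would fail the check:
# load_rewards({"task_completed": 1.0, "policy_violation": 5.0}, signature)
```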
And with that, it’s time to bring this post
to a conclusion. I’d like to think that I’ve demonstrated that there is value
in applying general security principles such as “assume breach” and Secure by
Design thinking to the world of agentic AI, and that securing such systems is
unlikely to result in binary decisions that an outcome is trusted or untrusted,
secure or insecure, reliable or not. Furthermore, that it is important to get
the level of abstraction right when talking about agentic AI systems. It may be
tempting to treat such things as a black box, with a set of inputs and a set of
outputs and some magic that gets us from one to the other. From a security
perspective, I don’t think we can afford to ignore that magic in the middle.
Besides, that’s where the interesting and fun problems sit, so why deny
ourselves the pleasure?