Friday, 24 October 2025

Agents and cascading failures

Some Friday musings after reading through the extensive incident report from AWS about their recent outage.

Firstly, let me say that automation is a good thing. It reduces costs, accelerates response to incidents, enables things like self-healing, and lets folks get on with more interesting tasks. I am no Luddite (but I accept I may be getting increasingly curmudgeonly!). However… when it goes wrong, well – it can automate failure too.

Which brings me on to my main concern this morning: the rise of Agentic AI. My suspicion is that the rise of agents will increase the risk of stumbling across the kind of latent race conditions encountered by AWS, whilst the potential for non-deterministic outcomes (where agents are backed by LLMs) may cause the emergence of some particularly weird states when you have multiple decoupled agents just getting on with their jobs. Unexpected states are not really where we want to find ourselves from a security perspective, particularly if the creation and shape of those unexpected states can be influenced by an attacker. Unexpected states are also less than ideal from a resilience perspective, as documented in the AWS write-up.
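
To make the race-condition worry a bit more concrete, here’s a deliberately tiny, entirely hypothetical sketch in Python (nothing to do with how AWS’s systems actually work): two decoupled agents each try to reconcile the same record, and because each one reads, pauses for a non-deterministic amount of time, and then writes, the outcome is simply whichever agent happens to write last.

```python
import random
import threading
import time

# Toy illustration only: two decoupled "agents" each reconcile the same record
# towards what they believe the desired state should be. Each one reads,
# "thinks" for a non-deterministic amount of time, then writes - a classic
# read-modify-write race, so the final state depends purely on interleaving.
shared_record = {"replicas": 3}

def reconciling_agent(name: str, desired: int) -> None:
    observed = shared_record["replicas"]       # observe
    time.sleep(random.uniform(0, 0.1))         # non-deterministic "reasoning" delay
    if observed != desired:
        shared_record["replicas"] = desired    # act, unaware the world may have moved on
        print(f"{name}: replicas {observed} -> {desired}")

threads = [
    threading.Thread(target=reconciling_agent, args=("agent-A", 5)),
    threading.Thread(target=reconciling_agent, args=("agent-B", 2)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("final state:", shared_record)           # 5 or 2, depending on timing alone
```

Scale that pattern up to dozens of agents, add LLM-driven variability in what each one decides to do, and the space of reachable states gets hard to reason about very quickly.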

We’re not going to be putting the Agentic AI genie back in the bottle any time soon, so what should we be doing about it? I’d suggest several things:

  1. Tightly constrain the scope of your agents. The principle of least privilege should apply to your agents as it does to your humans.
  2. Adopt the old Unix philosophy of “Do one thing and do it well” – keep the scope of your agents simple and understandable.
  3. Focus on the initial observation/perception part of the workflow. If things look weird – STOP. Do not proceed. Do not risk cascading failure throughout the wider system.
  4. Circle back on the final state. Does it look sensible? Consider whether you can incorporate roll-back mechanisms.
  5. Log. Observability really matters in these scenarios. Track the actions of the agents so that you maintain traceability and improve recoverability if the overall system does find itself in a weird state (there’s a rough sketch of how some of these points might look in code after this list).
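
To show roughly where those guard-rails might live, here’s a minimal, hypothetical Python sketch. The ConstrainedAgent and Action names, and the looks_sane / final_state_ok / rollback hooks, are all made up for illustration rather than taken from any real agent framework – the point is simply where the checks sit: before the first action, around every action, and after the last one.

```python
import logging
from dataclasses import dataclass, field
from typing import Callable

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(name)s %(message)s")
log = logging.getLogger("agent")

@dataclass
class Action:
    """One step the agent wants to take, paired with how to undo it."""
    target: str
    description: str
    apply: Callable[[], None]
    rollback: Callable[[], None]

@dataclass
class ConstrainedAgent:
    """A narrowly-scoped agent: observe, sanity-check, act, verify, roll back."""
    name: str
    allowed_targets: set[str]               # point 1: the only things it may touch
    looks_sane: Callable[[dict], bool]      # point 3: is the observed state plausible?
    final_state_ok: Callable[[dict], bool]  # point 4: does the end state look sensible?
    applied: list[Action] = field(default_factory=list)

    def run(self, observation: dict, plan: list[Action],
            observe_again: Callable[[], dict]) -> bool:
        # Point 3: if things look weird, stop - do not proceed.
        if not self.looks_sane(observation):
            log.warning("%s: observation failed the sanity check, refusing to act", self.name)
            return False

        for action in plan:
            # Point 1: least privilege - refuse anything outside this agent's scope.
            if action.target not in self.allowed_targets:
                log.warning("%s: '%s' targets %s, which is out of scope - stopping",
                            self.name, action.description, action.target)
                break
            # Point 5: log every action so the trail exists if things go weird later.
            log.info("%s: applying '%s'", self.name, action.description)
            action.apply()
            self.applied.append(action)

        # Point 4: circle back on the final state and unwind if it doesn't look sensible.
        if not self.final_state_ok(observe_again()):
            log.warning("%s: final state looks wrong, rolling back %d action(s)",
                        self.name, len(self.applied))
            for action in reversed(self.applied):
                log.info("%s: rolling back '%s'", self.name, action.description)
                action.rollback()
            return False
        return True
```

Again, this is a sketch of the shape rather than an implementation – the interesting (and hard) part in practice is writing looks_sane and final_state_ok checks that are meaningful for your particular system.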

I think that’s enough musings for today. Feel free to comment with any other steps that should be considered to help maintain secure and resilient agentic-AI-supported systems (the purpose of this blog is for me to put some half-formed thoughts out there for discussion - I’m not precious, so do call them out as nonsense if you disagree).

(I’ll finish off by complimenting AWS on their transparency – it’s always interesting to peek under the hood of how the platform works!)