When thinking about making AI agents more secure, there are a some important classes into which we can place vulnerabilities. A couple of the more interesting ones are the Confused Deputy problem and Privilege Aggregation. Each of these is an issue on its own, but together they make for a very serious combination.
This post series aims to tackles some of the many dimensions of this space, elucidating what’s wrong, and suggesting some ways in which we might try mitigating them.
This post is about sandboxing, particularly sandboxing single- or limited-purpose agents.
The Problem of Agentic Restriction
In order for our agents to be truly useful, we tend to want them to have a lot of capabilities. That’s great! A lot of capabilities means there’s a lot that they can do.
It is also scary. A lot of capabilities means there’s a lot that they can do that you may not want at any given moment.
Tools taken in isolation are pretty easy to reason about, such as “internet access” or “financial access”. We kind of know what those mean, and you can probably easily come up with some thoughts about how to secure each of them. Put them together, however, and suddenly you have “financial access by anyone clever on the internet”, which is obviously a completely different story.
Prompt injection is the defining vulnerability of our times, and I don’t say that lightly. It triggers the Confused Deputy problem, a very well known issue in security. Essentially it describes what happens when a system that has a lot of access can be tricked into acting on behalf of someone who does not. Because an AI agent has no concept of the difference between data and instructions, every agent is a confused deputy on retainer.
How do you confuse an agent deputy? You inject instructions into a seemingly benign location. It can be as easy as posting agent-confusing text on a public ticket queue on Github, or sending a well crafted spam email. Any agent that reads those tickets or that inbox is now at risk of doing things that have nothing to do with reading tickets or email. It can potentially be induced to do anything you have set it up to be able to do for you, but by someone else and for their own nefarious purposes.
To an agent, everything is an instruction. Your agent can be instructed to do a lot.
This has been the big worry with MCP; because so many things are desirable for agents to help with, there is natural pressure to add MCP services to your agent, and there is nothing really preventing them from acting in concert. Add an MCP for Github, add another for Schwab, add another for email, and your agent now has all of your logins for all of those things in one place. That’s the Privilege Aggregation problem: a single place where the keys to all of your treasure chests are stored. If the agent gets confused by a spam mail, it could easily end up doing unexpected things to your finances: it has those keys.

Restricting agents seems quite important when you start to think about them in this way, but as soon as you try to lock them down directly, all of their utility starts to evaporate. You can, if you are worried, remove the ability to manage your email, but now it can’t do any useful email things. Combining capabilities to get emergent behavior is the very thing that makes agents useful. It is also the thing that makes them extremely difficult to secure, and MCP tends to be the poster child for this, not because it’s a singularly terrible protocol or anything like that, but because it is the de facto enabler of actions, and it’s nearly universal.
Need something done? Put a little MCP in it.
If you have a universally understood protocol, a place where all of your keys live, and a system that can be confused into using any of them, you have a recipe for some very sad days. You have just handed the keys to everything to what amounts to a five-year-old with a massive memory.
Single-Purpose Agents
This is going to take a few posts to really get untangled, but let’s assume for a moment that we can make some headway on this problem by dividing a big, capable agent into a bunch of small, limited agents. This sort of decomposition is standard security design: get a handle on capabilities in isolation, establish trust boundaries between them, and then manage interactions.
Assuming we have a good way to do this (and we do, stay tuned), part of the problem is now reduced to, “How do we keep this single agent from doing anything it isn’t supposed to do?” In other words, a foundational question is now whether we can make a solid sandbox for an agent.
Prompts are not enough for this. Context windows can be flooded. Injected content can override your preferences. Text you can’t even see can really ruin your week, no matter how sternly worded your instructions are to “Never touch banking unless I tell you to directly.”
What’s needed is not better system prompts, what’s needed is something structural.
A coding agent makes for a nice elucidating example. A typical interaction with something like Claude Code involves letting it roam anywhere it wants and do anything it wants on your system. What’s surprising is just how well this seems to work most of the time without true disasters. I have, in my time using it, however, had some very near misses with some very large potential blast radiuses, and I am not alone.
We don’t like near misses, so let’s tighten this down. Suppose we want a small,
single-purpose agent that just does git operations and nothing else. The
typical approach is to create a Docker (or similar) image and strip it of all
but the tools of interest. Essentially “tool off” or “tool on”. If it’s in the
system, it can be called, so in this image we have only git. There are
other approaches that can be used for file access control, network egress, etc.
(e.g., GVisor, EBPF), but if you squint at them right, it turns out that many
of those can be reduced to the same idea: tool availability.
You can’t hit what isn’t there.
MCP as a Disabler
The very thing that enables actions in an agent can be used to “disable” actions. If we have single-purpose agents and our main concern is to tighten up the boundaries of the sandbox, we can simply remove all of its shell-accessible tools and expose MCP tools that we want to allow it to have. Anything not exposed is simply not possible to do.
Let’s go back to the git access idea. If you give the agent access to the
git tool, it can do anything that git allows it to do. That might not be
what you want. You might only want it to do status, diff, and staging,
for example. You might not want it to be allowed to commit. You might even
want to limit which flags can be specified for some of the tools. But in a
container (modulo network credentials) access to git is pretty much all or
nothing. It’s there or it’s not.
What if, instead, you give your agent access to an MCP service that only
implements the git capabilities that you want to allow? Suddenly your
agent doesn’t need any shell access at all. It can run in a locked-down
environment that only allows network egress to a very specific endpoint. The
MCP tools, running nearby, with access to the real git and a chroot volume,
are going to do all the work. The agent just uses them, and crucially, can’t
do anything else because all it has is a single MCP server and the tools that
it exposes.

MCP: the cause of, and solution to, all our problems.
What’s kind of neat is that if the agent knows anything about git, which it
does because of its training data, it can easily infer how to use related MCP
tools with very little coaxing. I’ve seen this in practice; even replacing an
agent’s memory system with something like
Engram is just a matter of a few system
prompts and a new
tool.
It turns out that agents are shockingly good at making these kinds of leaps and
learning how to use new tools to do what they are asked to do. This is one of
the things that enables this security idea in the first place. In the past, if
you had a tool you wanted to run securely, you really needed to lock down a
vast attack surface with tools like GVisor that come with their own challenges.
After all, you have no access to the code you are trying to run, so you just
have to make sure the system itself is as generally constrained as possible.
With agents it’s a different story. If you want them to do git operations,
you just tell them, “Sorry, no, but you have this MCP thing over here that
accomplishes what you need,” and that’s it.
If an agent can be easily induced to use a new memory system, it can definitely
handle some creative git use via MCP tools.
MCP can make locking down single-purpose agents trivial. You can even use an agent to build the service for you. Bespoke work like this is well within reach for even a team of one.
Stay Tuned - Agent Orchestration is Neat
MCP, which in many ways has been something of a security headache, can be used to be a powerful sandboxing tool. MCP as a security solution isn’t something you hear every day, but in the right context, it can actually mitigate the single-tool access problem handily.
The single-tool problem is only one piece of the puzzle, of course. It only makes sense in a larger agent orchestration context. That’s a longer discussion, and has a lot of fun moving parts. I’m currently working on a system that I think has some really interesting security properties, and I’ll share that soon.
It will be using MCP for its sandbox.