In which I apply basic security decomposition, least privilege, sandboxing, and memory compartmentalization to AI agents, and discover that you can, in fact, make things better in this new world.
Security Principles in Play
The trouble with AI agents isn’t that they are fundamentally insecure, it’s that they are insecure by default. We take decades of understanding about untrusted data cleaning, separation of concerns, zero trust, etc., and chuck AI on top of all of it at once, only then to wonder why our previously secure systems are suddenly vulnerable.
It’s not AI, per se, it’s the application of it in a world where speed of execution is increasingly the only thing we measure. You get what you incentivize, so you don’t typically get security. It’s as simple as that.
But what if we took a breath and tried just a little harder to squeeze AI into a shape that permits these principles to be applied again? That’s what this series, and AgentQ, are trying to answer. In building AgentQ, which can only charitably be described as a reference implementation with some important lessons inside it, I have learned a great deal about how we can, with a little care, implement an AI agent that is traceable, observable, and generally operates on the principle of least privilege. More importantly, we can plausibly accomplish all of this without sacrificing emergent behaviors that make agents so desirable in the first place.
If you decide to go look at the code, be warned that it is anything but production-ready.
To reiterate and make the principles more complete, here are the things that I am tackling:
- Least Privilege: it should not be possible to do more than asked.
- Detection and Response: there must be places in the system that permit you to be surprised, and to act on that.
- Untrusted Input Sanitization: anything untrusted, including from the Internet and an LLM, has a chance to be filtered.
- System Authorization Enforcement: processes calling other processes must have unique credentials.
- User Authorization Enforcement: users should not suddenly grow new capabilities just because AI is involved.
There’s not very much to disagree with in there, really. We want agents to be bounded, to be observable, to be untrusting of random inputs, and to honor upstream privileges. Pretty basic stuff.
AgentQ - Behind Chat There Are Legions
Let’s talk first about what this is not. This is not, even though it has a similar shape, an agent orchestrator. It isn’t a way to farm a bunch of jobs out to parallel workers and have them do things and report back. It does operate on some similar principles, but it is absolutely not competing with these kinds of systems.
Given that the shape is similar, it will require going deep to understand what the real differences are, so strap in. In security, details matter; we’re going to get into some details.
Here is the basic idea:
- You interact with it like a regular chat agent.
- Underneath, it uses a small army of single-purpose agents to do the actual work.
- Each single-purpose agent is fully authenticated and fully locked down to its small set of capabilities.
- Every action is observable and traceable.
Like I said, this has the look, feel, and smell of sub-agent orchestration, but that’s not at all what this is. Multi-agent orchestrators have many general purpose agents running. They might all have different roles (system prompts), but in most cases, the mental model for agent orchestration is “just another copy of Claude”.
This model is effectively peering under the covers of a single agent in an orchestration system, where each agent is this entire picture. This is a crucial point, which is why I’m getting overly repetitive about it. The picture above, everything inside of it, is best thought of as one agent.
The Rubber and the Road
The architecture diagram, as such diagrams go, isn’t terribly exciting. It’s just a bunch of things talking to a competing consumer work queue. This particular queue is pretty awesome and capable, not least because it has inter-queue security baked directly into it, which enables some of the things I’m about to talk about below. It’s also more capable than other similar systems because it has true, built-in inter-queue-and-doc-store transactionality: it enables fully atomic mark-as-done, update-doc, and insert-next-task operations; everything succeeds or fails together. There are no race conditions that could see work committed and the next step lost due to a crash.
That’s kind of an aside, but it’s important on its own merits because the system is always in a consistent state, even if stuff goes very wrong in its environment. I consider that to not just be a correctness or reliability feature, but a security feature, as well. Proper security requires proper reasoning, and proper reasoning is enabled by correctness and reliability.
Where does the rubber meet the road on the security of AgentQ? It’s in a couple of places:
- Each individual Runner is fully sandboxed. Built-in tools don’t work, only the tools that we give it can be used.
- The Supervisor has zero privileges. The most it can do is specify which Runners to invoke, and pass voucher tokens so that they can obtain their own credentials.
- EntroQ’s inter-queue security measures mean that only vouched-for AgentQ Runners can submit completed work.
- Everything done by any part of the system has provenance tokens attached. It’s all traceable. Network isolation is employed.
- Standard authentication and authorization protocols are built in: OIDC, JWKS, JWTs for sessions, Vault support.
Let’s go over how this works.
The Sandbox
In a previous article, I did very little more than hint at MCP as a sandbox for single-purpose agents. This system is where that starts to make sense. In order for an LLM tool (CLI or API) to use MCP as a sandbox, it has to first be configured with a suitable MCP server, and for practical purposes, it has to then be able to configure that MCP server to only expose necessary tools.
The way this is done is through MCP dynamic configuration using the HTTP Streamable protocol. Zooming in on the AgentQ Runner from the picture above, we see that it is composed of three essential pieces:
- AgentQ: a process that receives tasks on its inbox queue from EntroQ.
- Runner: a process that can invoke an LLM tool, such as Claude CLI, or that can call an API.
- MCP: a process that exposes MCP tools based on the headers passed to it.
AgentQ is pretty basic and easy to understand: it is a queue worker. It claims on its inbox, and it pushes results to the supervisor’s inbox. The task that it is given contains things like
- A prompt, with a full (relevant to this sub-task) transcript
- Authorization tokens specific to this work
- A list of allowed MCP tools, in a signed JWT
The AgentQ container (a full pod when deployed in Kubernetes) has no privileges of its own. It can only read from and write to allowed queues on the EntroQ central service, and it can call into the Runner. Note that the link between AgentQ and the Runner can also go through EntroQ, and this is how the implementation will actually work, using EntroQ’s eqlink sidecar to transparently connect them via queueing. This gives them the same security limitations/capabilities that all other EntroQ inter-queue interactions get.
The Runner is the least capable container/pod in the bunch. It has nothing in its file system, no outside network access, and none of its internal tools work. The reference implementation runs Claude CLI using JSON streaming as a demonstration (you have to spin up a container with a mounted volume and log in once to get credentials stored; it’s a pain, but it’s cheap and easy). The Runner can see EntroQ and it can see an MCP service, and that is all. It runs once and shuts down with every request.
The API service sitting in front of it configures it with a pointer to the MCP server, and that MCP configuration knows to pass a special tool-constraint JWT to MCP when making requests. That JWT is signed, and the MCP server is configured to export only what that JWT allows, and only if that JWT validates properly. No JWT, no tools exposed. The listing itself is empty, and any attempted tool call not in the list simply fails. The Runner’s network is locked down to only EntroQ (via eqlink) and the sanctioned MCP service.
The JWT also contains other constraints, such as working directory for code operations.
And that’s it, really. Task comes in, MCP config propagates, only allowed tools can be called, result goes out. The sandbox is pretty tight, about as tight as you can imagine making it, because
- AgentQ has no keys; it just passes things through and does transcript management.
- Runner has no keys or built-in capabilities; it has to use MCP to do literally anything.
- MCP has only the public key and access to Vault; no other credentials are given to it.
The MCP service is capable of a lot. This is a reference implementation, so for the sake of demonstration, it is a monolith with all file access, all internet access, and all tool access. That’s obviously Pretty Bad (TM) from a security perspective, but it’s pretty easy to split it into multiple tool-subset services that get chosen by the right AgentQ instance. For now, we rely on the JWT and careful coding to keep it from going off the rails.
The Supervisor and Harness
The harness is intended to be just like any familiar chat window. Spoiler alert, it isn’t, really, since this is a demo of an idea. I want to wire up something like pi-agent or similar to make it nice, but for now agentq chat is what we get.
The original diagram shows it speaking directly to the supervisor, but that is kind of a lie. What it’s really doing is dropping a task into EntroQ with a session ID. Then any supervisor worker can pick it up. It then waits on a session-specific queue for results.
The Supervisor is responsible for managing overall state and making calls to various single-purpose agents to get its work done. It has access to the private key used to sign JWTs, and it can vouch for the downstream MCPs so that they can, as needed, obtain credentials from Vault to do their work. The Supervisor does things like mint auth tokens and handle rotation, then it packages them up into the tasks that go to individual AgentQ Runners.
It has to maintain a hierarchy of session context: each of the individual runners only needs the prompt information relevant to what it is doing. The Supervisor maintains a master transcript that contains basic information like “tool ran, gave good output” (highly reductive - there is nuance there, and opportunity for much more sophistication), and it maintains pointers to individual Runner transcripts. When the Harness displays the context tree, it has the option of flattening it for the user so that things look like normal tool calls.
Reliability
A common complaint with agent orchestration systems is lost or duplicated work. This is a real issue when you are giving something minimal attention because it’s desirable to have it do a lot for you. EntroQ, and the careful dance between the pieces of this system, make that much less a problem, because you can commit work, update a doc store, and create new tasks all in one go. Taking just the harness example, if a user submits a prompt, it just goes into a queue. The Supervisor picks it up and holds a lease on it, but only deletes it in the same transaction that it makes its reply in. If the Supervisor crashes (or is scaled down), the original task is picked up again, and the reply only comes through if the original is deleted simultaneously.
This means that work can be completed multiple times (making idempotent operations an important consideration), but it can’t be committed multiple times. It also means that abandoned work is never lost. If something dies while doing processing, that task is not lost, it will get picked up when its lease expires in a few seconds.
This is a critical feature of the system, because it makes the various pieces of the system very simple. They just accept work, do work, and delete/update when ready to commit. Reliability, in the sense of “do this exactly once” is outsourced to EntroQ, where the truly subtle logic lives.
Transitive Permissions
The astute among you will notice that the Supervisor, by virtue of being able to invoke any single-purpose agent, still has access to all the power. It doesn’t hold its own credentials, but it can make them. It doesn’t have any rights to system mutation, but it can invoke them. It doesn’t see any MCP tools, but it can specify that another process can use them.
This is unavoidable, and not hugely different from how things have been since forever. It’s just that previously, it was you. And it was whatever computer or phone you were using. There has to be a single point of trust somewhere in the system.
What makes it secure isn’t that nothing at all bad can happen, but that you have a degree of control over and visibility into what happens. The Supervisor is not what gives you that, at least not in isolation. The entire system is set up to give you that. The queueing system allows you to trace requests from one part to another. The fact that so much of the system is doing trusted work (Supervisor, AgentQ) means that making audit logs is deterministic and reliable.
Can you still get prompt injected? Sure, and frankly that’s always going to be possible as long as untrusted input can cause actions. With LLMs, everything is an instruction, so you can’t even apply mitigations that work for SQL injection, for example. At least with SQL there is a difference between instructions and data.
Given that, and the arms race that it creates, we’re not going to get all the way there without fundamental changes to how LLMs work, right down to neural network structure; there is just no way to make them better without imposing some structure on them, and this is that type of structure.
What I Learned
In building this system, I learned some very interesting things.
Prompt Transcripts Are Tricky
Any engineer working on AI systems could probably have saved me a bunch of time by telling me this, but I had to learn it from first principles: prompt transcripts are easy to get wrong, and can cause some really crazy failure modes.
If you didn’t already know, LLMs don’t just take turns with you. What’s happening behind the scenes is usually a full replay of everything that has happened since the beginning of your session. Every interaction submits the entire transcript for the LLM’s consideration.
There are some nuances here, and there is some caching in some cases, especially for tool calls, but ultimately everything goes up every time. The chat interface gives you the illusion of turn-based conversation, but the agent actually has no memory like you do; it remembers things because they are replayed every time, or because they are written to a file (the contents of which are replayed every time).
This is fairly easy to reason about when you have a single process doing all of the interaction, but becomes truly weird when you have sub-processes doing work for you; if not careful, you can have a transcript spawn more and more of the same action over and over again. I ran into this failure mode when I had separated “task submitted” from “task completed” into two blocks. As the replay was happening, it saw “task submitted” and no immediate “task completed” in the transcript, so before reading the next block, it kicked off the task again!
This would probably not happen with the API, where one has more complete control over transcript handling, but when using a CLI tool that has its own run loop, it is an issue. Sending a task is accomplished via a stripped-down MCP capability, and it just dutifully ran it again and again.
Login is a Mess
Companies are incentivized to make login difficult for their consumer tools. If you need the kind of control I’m looking for, they want to push you to their more expensive API tokens. This is, in a very real sense, unavoidable. But for a proof of concept, it didn’t make any sense. I therefore stuck with CLI tools and shoehorned them into a sort of API-like thing.
It’s important to note that in the process of doing this, I didn’t do anything untoward like pi-agent does (it did something with auth tokens that Google did not like and got me banned from Gemini for a time, and I was exceedingly unhappy about it). I used only official tools, with official login flows. It won’t scale, but it is great for a reference or demo.
API tokens are pricey, and generally overkill for a single developer, so we do what we can.
Stateless is Maybe Slower, but Definitely Better
When each CLI invocation is one-shot, things do get somewhat slower. At least, they feel slower. Claude CLI startup is actually pretty fast, so it might just be perception.
But stateless gives you some great system properties. It’s easy to reason about a transcript. It’s hard to reason about hidden state in a full run loop. Sometimes the juice is worth the squeeze.
Stateless MCP is also a big win. The original SSE protocol (Server-Sent Events) is not stateless and clients have to implement a bit of a state machine to work with it. The newer HTTP Streamable approach is much cleaner, and allows things like “validate JWT on every call” to work seamlessly.
When you can, go stateless every time.
MCP is Not Hard, and Provides Great Primitives
The MCP protocol is actually pretty simple, and there are great libraries out there for implementing your own. That was a very important part of this project; the ability to create my own MCP service with properties that I needed, properties that I hadn’t really seen before, was a big win.
The fact that MCP is basically universal is both a security curse, and in this case, a security blessing. A universal protocol can create opportunities for credential aggregation, which is very much not ideal. But it can also create opportunities for sandboxing; create your own, strip out everything else, and you are pretty much guaranteed that the LLM agent will be able to use the tools you specify.
That was a big win, and more is said about it in an earlier article. If you have the ability to register tools, then you usually have the ability to register only the tools you need. That’s pretty cool.
Configuration Triggers Fundamental Complexity
It’s all well and good that we can compartmentalize the work, but at some point the supervisor has to be able to assemble intent into a sequence of calls. At some point it has to know what types of agents are available to it. At some point it has to perform a partitioning of the prompt space and convert it to possibly-overlapping capabilities.
And at some point, you or an administrator must decide which capabilities are allowed to coexist within a runner. How do you configure that sanely?
It’s tough, and I have admittedly punted that down the road a bit. The way I think about it, though, is as capability groupings. It’s probably fine to group all kinds of git reads together. They should be okay coexisting. It’s probably fine to group various internal document access tasks together, modulo credential enforcement for different ownership. It’s almost certainly not fine to group internet access with system mutation.
The principles are there, and it’s quite possible to get reasonable security by just winging it in this case: reads not coupled with writes, internet access with some kind of summarization but never with financial access, etc.
This is an open question, and something I’m currently fiddling with on this project. It’s anything but done, except in my mind, and what’s there is now what’s here.
Is This Really Novel?
Probably not. I do think that the specific combination of things is somewhat new in this space, but you can’t point at any single piece of it and say, “Wow, that’s new.” It just isn’t. All of the bricks were already there.
I do feel like the concept of using MCP in a sandbox context is pretty interesting and possibly hasn’t been done before in this way, especially when combined with JWT as a mechanism for limiting tool exposure. I believe that most queueing systems don’t have such tight auth and integration to standard tools like OPA, like EntroQ does. I don’t think I’ve seen systems where the supervisor itself can scale horizontally like everything else. So there’s maybe some emergent newness here.
But even if none of this is new and I find out tomorrow that I pulled a Leibniz (or Newton, depending on who wrote the history) and reinvented something someone else already did, the things I learned from this are valuable. It’s been fun, it’s been rewarding, and it’s been a catalyst for other ideas like Engram, which I built while doing this work; it’s my daily driver for Claude memory, and that alone was worth the effort.
Novel or not, it excites me to see that there are indeed ways to make AI agents safer and more secure, and I’m happy to see that a system like this can be useful and have so many interesting properties, even if it really is only a demo with a lot of potential.
Maybe that will be true for you, too!