Why AI agents need an iron curtain

Researcher Niels Provos’ prototype IronCurtain architecture: a system designed to restrict the actions of AI agents through isolation and security policies.

An iron curtain for AI: how to improve autonomous AI agent security

Many AI visionaries see the universal smart assistant — one that takes over all sorts of routine tasks — as the key direction for the technology’s evolution. Experiments in this field are already in high gear and are yielding some results. Since the start of the year, the internet has been buzzing with stories of the miracles worked by the open-source AI agent OpenClaw, also known as Clawdbot and Moltbot.

If you’ve been following our blog, you already know the drill: every leap forward in AI innovation right now seems to come with serious issues regarding security and privacy. To actually get things done, these agents require access to virtually all of your digital services: email, calendars, cloud storage, messaging apps, and many more.

However, until recently, not a single project — OpenClaw included — could actually put a leash on these agents, or provide any real guarantee that they wouldn’t go off the rails. But that’s finally starting to change thanks to a new concept name IronCurtain — the brainchild of researcher Niels Provos.

The dangers of AI agents

Let’s keep the suspense going for a little longer, and first discuss what an AI agent gone rogue is actually capable of. It’s important to remember that at the most basic level, any modern AI tool is built on a language model — essentially a text-processing algorithm fed a massive volume of data in its training phase. The result is a statistical model capable of determining the probability of which word will most likely follow another.

A language model is a black box. In practice, this means nobody — not even its creators — fully understands exactly how an AI tool works under the hood. An obvious consequence is that AI developers themselves don’t entirely know how to control or restrict these systems at the model level; instead, they have to invent external guardrails of varying degrees of effectiveness and reliability.

Meanwhile, the methods used to bypass these safeguards often prove to be quite unexpected. For example, we recently shared how chatbots can be coaxed into forgetting almost all their safety instructions if you charm them with prompts written in verse.

But back to the threats posed by AI agents. The inability to fully control or predict the actions of smart assistants often leads to outcomes that no one could have expected. A prime example is the high-profile case where OpenClaw nuked every single email in its owner’s Gmail inbox — despite being explicitly told to wait for confirmation before doing anything — only to apologize afterwards and promise it wouldn’t happen again.

This chat between the OpenClaw bot and its owner resembles a conversation with a teenager who's just messed up

This chat between the OpenClaw bot and its owner resembles a conversation with a teenager who’s just messed up: “What did I tell you?!” – “Geez, Mom, I’m sorry, I won’t do it again — I promise.” Source

In another instance, a journalist testing an AI agent’s capabilities found that the system had pivoted to a highly questionable plan of action while executing a task. Instead of attempting a constructive solution, the agent decided to launch a phishing attack on the user. Seeing the system’s logic unfolding on the screen, the journalist immediately pulled the plug on the experiment.

Beyond spontaneous bad behavior, AI remains vulnerable to prompt injection attacks. In this type of attack, a threat actor smuggles their own malicious instructions into a command or the data being processed (direct prompt injection), or, in more sophisticated cases, even into third-party content used by the agent to do its job (indirect prompt injection). The large language model perceives these instructions as part of the user’s request; as a result, the AI may ignore its original constraints and help the attacker.

Additional danger stems from vulnerabilities within AI agents that could potentially allow attackers to access user data the agent is authorized to see — including passwords, encryption keys, and other secrets — or even grant the ability to execute arbitrary code on the host system.

Of course, this list of threats is by no means exhaustive. As we’ve said time and again, no one knows the full extent of the risks associated with AI. However, researcher Niels Provos recently proposed an approach to help put a leash on AI agents to make them more controllable and mitigate the potential threats.

How Iron Curtain makes AI agents safe to use

IronCurtain, Niels Provos’s new open-source solution, uses an added security buffer between the AI agent and the user’s system.

Instead of giving the AI agent free rein on your system, it forces the agent to work from inside an isolated virtual machine that sits between the bot and your actual accounts. This isolation allows the agent’s actions to be separated from the user’s own, reducing risks if the agent decides to go rogue.

Why did Provos use the name “IronCurtain”? Many will presume it’s a reference to the notional barrier that divided Western Europe and the Warsaw Pact countries of Eastern Europe in the second half of the 20th century. However, the author himself states there is no such connection.

The project’s name doesn’t refer to a political metaphor at all, but rather… to a theatrical term. In a theater, an iron curtain is a fireproof partition between the stage and the auditorium. If a fire breaks out on stage, the curtain drops to prevent the flames from spreading. By this analogy, the AI agent is “on stage”, while the user’s system with all its files and data is in the “auditorium”. IronCurtain acts as that protective barrier between them.

However, isolation is only part of the solution. At the heart of the system is a security policy that determines which actions the agent is permitted to perform. The design of IronCurtain allows the user to write their own security instructions — defining what the agent can and can’t do — in plain English (no word of support for other languages yet).

The system then uses AI to transform these instructions into a formalized security policy applied to the agent’s actions across the board. Every request it makes to external services — whether email, messaging, or file management — is run through this policy to make sure the agent isn’t overstepping its bounds.

The security policy set during the initial configuration can — and should — evolve over time. According to Provos’s vision, when encountering ambiguous situations, the AI should reach out to the user with follow-up questions and update the instructions from their responses.

IronCurtain is available to anyone on GitHub, but making it work on your computer takes some serious engineering skills. Remember too that, for now, this is merely an R&D prototype.

Can IronCurtain be a proper fix?

Niels Provos’s solution sure does look interesting, and aligns with some experts’ views on an ideal approach to AI safety. However, it’s too early to consider IronCurtain a definitive solution to the problem.

Its biggest obvious flaw is that it’s a resource hog. Using an isolated environment for every AI agent requires serious computing power, and complicates infrastructure — especially when multiple agents are running simultaneously.

Furthermore, as mentioned, IronCurtain is still very much in the prototype phase: practical effectiveness hasn’t been proven yet. In particular, there’s a significant question mark over how accurately natural language instructions can be converted into formalized security policies.

It’s also a coin toss as to whether this architecture can truly stop prompt injection. Sadly, the root of the problem is the fundamental inability of modern LLMs to distinguish between data and instructions.

Despite all its limitations, IronCurtain represents a major step toward safer and tamer AI agents. At a minimum, this approach provides a vital blueprint for future development, allowing for a substantive debate on how to make such systems reliable and effective.

How to use AI assistants safely

While architectures like IronCurtain remain experimental in nature, the responsibility for using AI safely rests primarily with users themselves. So, to wrap things up, let’s break down a few simple rules to help mitigate risks when working with AI assistants.

  • Evaluate the risks properly before experimenting with the next big thing. Think about what could go wrong and the possible fallout. The internet is already full of real-life examples from users, so you can learn from that collective experience.
  • Avoid giving AI agents excessive access privileges. If an assistant only needs access to a calendar or a specific folder, don’t connect your entire email, cloud storage, and work accounts to it.
  • Verify AI actions before they’re executed. Even if your agent offers to automate a task, it’s better to manually confirm important operations like sending emails, deleting data, or making payments. Yes, the agent might still misbehave, but you should at least try to rein it in.
  • Install a reliable security solution on all the devices you use, just in case a mischievous AI agent brings back some nasty malware as a souvenir from its uncontrolled wanderings across the web.

What else you should know about using AI safely:

How scammers use legitimate surveys to link to malicious sites

Survey-based scams

Spammers have figured out how to hide links to fraudulent sites within surveys created on legitimate platforms. We’re breaking down the ruse and sharing tips on how to avoid falling for it.

How scammers use legitimate surveys to link to malicious sites