Security
When people talk about AI agents that can write and run code, the conversation usually focuses on the writing part. What LLM is being used? What language? How does it handle edge cases? The running part — the actual execution of model-generated code — gets far less attention. This is a mistake.
Consider what it means to let an AI agent execute arbitrary code. The agent receives a task, reasons about how to accomplish it, writes code to do so, and runs it. In the best case, the code is correct and produces the intended result. In the less-than-best case, any of the following can happen:

- The code consumes unbounded CPU or memory and starves everything else on the host.
- The code traverses the filesystem and reads credentials or environment variables it was never meant to see.
- The code hangs past any reasonable deadline, leaving zombie processes and leaked resources behind.
- The code leaves state behind that contaminates the next execution to reuse the same environment.
None of these are hypothetical. Each of them has happened, and will happen again, to teams that implement code execution without a proper sandbox.
A code sandbox for agent use isn't just a Docker container. It's a set of isolation guarantees:

- Process isolation: the executing code cannot see or signal anything outside its own process tree.
- Filesystem isolation: it can read and write only what it has been explicitly granted.
- Resource limits: hard caps on CPU, memory, processes, and wall-clock time that hold even against adversarial code.
- Output control: whatever the code prints is captured and filtered before anything else sees it.
Most teams that decide to build their own code sandbox start with Docker. Docker is the obvious tool — it provides process and filesystem isolation out of the box. But production-grade sandbox infrastructure for an agent requires more than a Dockerfile:
You need container lifecycle management. A new container per execution is safe but slow. A pool of warm containers is fast but requires careful state management — a container that ran malicious code in the previous request cannot be reused. You need to solve the warm-start vs. clean-state tradeoff.
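A minimal sketch of the clean-state end of that tradeoff, using the Docker SDK for Python; the image name and every limit value here are illustrative assumptions, not settings from this article:

import docker

client = docker.from_env()

def run_in_fresh_container(code: str, timeout_s: int = 30) -> str:
    # One disposable container per execution: slower than a warm pool,
    # but nothing survives from a previous (possibly malicious) run.
    container = client.containers.run(
        image="sandbox-python:3.12",       # hypothetical prebuilt sandbox image
        command=["python", "-c", code],
        detach=True,
        network_disabled=True,             # no outbound network by default
        read_only=True,                    # immutable root filesystem
        mem_limit="256m",
        pids_limit=128,                    # blunts fork bombs
    )
    try:
        container.wait(timeout=timeout_s)
        return container.logs().decode()
    finally:
        container.remove(force=True)       # always destroy, never reuse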
You need resource limit enforcement that survives adversarial inputs. Docker's defaults impose almost no limits, and the memory and CPU flags you can pass do not cover everything: a fork bomb calls for a pids limit, runaway allocation calls for a hard memory cap with swap disabled, and all of it ultimately rests on correct kernel-level enforcement via cgroups v2.
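For a sense of what that kernel-level enforcement involves, here is a sketch that configures a cgroups v2 group directly; it assumes the unified hierarchy mounted at /sys/fs/cgroup, root privileges, and made-up limit values:

import os

def confine(pid: int, cg: str = "/sys/fs/cgroup/sandbox-run") -> None:
    # Place an already-started sandbox process under hard kernel limits.
    os.makedirs(cg, exist_ok=True)

    def write(name: str, value: str) -> None:
        with open(os.path.join(cg, name), "w") as f:
            f.write(value)

    write("memory.max", str(256 * 1024 * 1024))  # kernel kills the group past 256 MiB
    write("memory.swap.max", "0")                # no spilling the cap into swap
    write("pids.max", "128")                     # hard ceiling on processes and threads
    write("cpu.max", "50000 100000")             # 50 ms of CPU per 100 ms period
    write("cgroup.procs", str(pid))              # moving the PID in makes it all take effect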
You need graceful timeout handling. A code execution that times out must terminate cleanly, return a useful error to the model, and not leave zombie processes or leaked resources on the host.
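A sketch of that termination pattern with Python's standard library; killing the whole process group via SIGKILL and the exact return shape are choices made here for illustration:

import os
import signal
import subprocess

def run_with_timeout(cmd: list[str], timeout_s: float) -> dict:
    # start_new_session puts the child in its own process group, so a
    # timeout kill reaches its descendants, not just the direct child.
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        start_new_session=True,
    )
    try:
        out, _ = proc.communicate(timeout=timeout_s)
        return {"stdout": out.decode(), "exit_code": proc.returncode}
    except subprocess.TimeoutExpired:
        # Kill the whole group, then reap the child: no zombies, no orphans.
        os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
        proc.wait()
        return {"stdout": "", "exit_code": -1, "error": "timeout"}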
You need output sanitization. Code that prints sensitive data (credentials it discovered through filesystem traversal, environment variables it found) must not have that data surfaced to the model unfiltered.
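In its simplest form, that filtering might look like the sketch below; the two patterns are illustrative stand-ins for a real secret scanner:

import os
import re

# Illustrative patterns only; a real scanner combines many known formats
# with entropy-based detection.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                    # AWS access key IDs
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),  # PEM private key headers
]

def sanitize(output: str) -> str:
    for pattern in SECRET_PATTERNS:
        output = pattern.sub("[REDACTED]", output)
    # Also redact any value sitting in the host's own environment, in case
    # the code echoed an inherited variable verbatim.
    for value in os.environ.values():
        if len(value) >= 8 and value in output:
            output = output.replace(value, "[REDACTED]")
    return output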
The most dangerous aspect of the sandboxing problem is how silently it can fail. An agent that executes code without a proper sandbox might work fine in testing, where the code is benign and the environment is controlled. The failures arrive in production, when the code is less benign and the environment matters.
A team that ships "exec the model's code in a subprocess" in production has not shipped code execution — they've shipped a liability. The difference between that and a real sandbox is not visible in any benchmark or demo. It's visible exactly once, when something goes wrong.
For most agent developers, the right approach is the same as with browser automation: treat code execution as a primitive, not a DIY infrastructure problem. The agent calls run_code(language, code). The execution layer handles sandbox provisioning, resource limits, isolation, and returning structured output.
result = await legs.run_code(
    language="python",
    code="import pandas as pd\ndf = pd.read_csv('/data/report.csv')\nprint(df.describe().to_json())"
)
# Returns: {"stdout": "{...}", "exit_code": 0, "elapsed_ms": 342}
The agent gets a clean result. It never needs to know about cgroups, container pools, or output sanitization. That's exactly where this complexity belongs — hidden behind a well-designed abstraction, maintained by people who think about nothing else.
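Assuming only the result fields shown above (stdout, exit_code, elapsed_ms), one common pattern is to hand failures back to the model as feedback rather than raising; the script variable here is a stand-in for model-generated code:

script = 'print("hello from the sandbox")'  # stands in for model-generated code

result = await legs.run_code(language="python", code=script)

if result["exit_code"] != 0:
    # Hand the failure back to the model so it can revise its own code.
    feedback = f"Execution failed (exit {result['exit_code']}): {result['stdout']}"
else:
    feedback = result["stdout"]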
Agent Legs provides sandboxed code execution as a first-class action type. Python, JavaScript, Bash — with resource limits, process isolation, and structured output. No containers to manage. Get early access.
Free for 1,000 actions/month. No credit card required.