Computer Use Agent
What is “Computer Use”?
Anthropic’s Computer Use API lets Claude operate a desktop the same way a human does:
- Claude looks at a screenshot.
- Claude returns an action (left_click, type, key, scroll, zoom, …).
- Your code executes the action on a real screen.
- Your code captures a fresh screenshot.
- Repeat until Claude returns a final text turn.
It’s the closest thing to a general “give the LLM a computer” interface that exists in mid-2026. Agentium ships a small wrapper that runs the loop for you, including built-in support for the November 2025 enable_zoom capability (the model can request a zoomed-in screenshot of a region to read small text).
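The steps above can be sketched as a plain loop. This is only an illustration of the control flow, not Agentium's implementation: `ModelStep`, `AskModel`, `Execute`, and `runLoop` are hypothetical names, and the model call is replaced by a function you pass in so the sketch stays self-contained.

```typescript
// Hypothetical types for illustration; they are not part of @agentium/core.
type Action = { action: string; coordinate?: [number, number]; text?: string };

type ModelStep =
  | { kind: "action"; action: Action } // the model wants something done on screen
  | { kind: "final"; text: string };   // the model is finished

// Stand-in for the API call: receives the latest screenshot, returns the next step.
type AskModel = (screenshotBase64: string | undefined) => Promise<ModelStep>;

// Stand-in for the executor: performs the action and returns a fresh screenshot.
type Execute = (action: Action) => Promise<string>;

async function runLoop(
  askModel: AskModel,
  execute: Execute,
  maxIterations = 50,
): Promise<{ text: string; actions: Action[]; iterations: number }> {
  const actions: Action[] = [];
  let screenshot: string | undefined;

  for (let i = 1; i <= maxIterations; i++) {
    const step = await askModel(screenshot); // the model looks at the screen
    if (step.kind === "final") {
      // A final text turn ends the loop.
      return { text: step.text, actions, iterations: i };
    }
    actions.push(step.action);               // the model asked for an action
    screenshot = await execute(step.action); // run it and capture a fresh screenshot
  }

  // Safety cap reached without a final text turn.
  return {
    text: "[max iterations reached without final answer]",
    actions,
    iterations: maxIterations,
  };
}
```

Passing the model call in as a function also makes the loop easy to unit-test with a scripted stub.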
Architecture
                         ┌────────────────────────────────────┐
                         │ ComputerUseAgent                   │
                         │                                    │
user prompt ────────────▶│ loop:                              │
                         │   1. Anthropic.messages.create     │
                         │   2. iterate tool_use blocks       │
                         │   3. executor.execute(action)      │
                         │   4. append screenshot to context  │
                         │   5. repeat until final text turn  │
                         └────────────────────────────────────┘
                                           │
                                           ▼
                         ┌────────────────────────────────────┐
                         │ ComputerExecutor (you implement)   │
                         │                                    │
                         │ displayWidth, displayHeight        │
                         │ execute(action) -> screenshot      │
                         └────────────────────────────────────┘
The executor is intentionally abstract so the same agent can drive:
- Local desktops via screencapture (macOS) / scrot (Linux) + xdotool for input
- Remote VNC sessions via noVNC + a WebSocket bridge
- Headless Linux containers (compose with SandboxAgent)
- CI test runners where the “desktop” is a webdriver-controlled browser
Quick start
import { ComputerUseAgent, type ComputerExecutor } from "@agentium/core";

// screencaptureBase64() and cliclick() are placeholders for helpers you write yourself.
const executor: ComputerExecutor = {
  displayWidth: 1920,
  displayHeight: 1080,
  displayNumber: 1, // optional X11 display
  execute: async (action) => {
    // Implement against your platform. Example for macOS:
    switch (action.action) {
      case "screenshot":
        return { screenshotBase64: await screencaptureBase64() };
      case "left_click":
        if (action.coordinate) await cliclick(`c:${action.coordinate.join(",")}`);
        return { screenshotBase64: await screencaptureBase64() };
      // ... handle the other action types
      default:
        return { output: `unhandled: ${action.action}`, screenshotBase64: await screencaptureBase64() };
    }
  },
};

const agent = new ComputerUseAgent({
  apiKey: process.env.ANTHROPIC_API_KEY,
  model: "claude-sonnet-4-20250514",
  executor,
  enableZoom: true,
  maxIterations: 50,
  // Match the prompt to the platform your executor drives (macOS in this example).
  systemPrompt: "You are operating a macOS desktop. Be concise and decisive.",
});

const result = await agent.run("Open Firefox and search for the latest Node.js release.");
console.log(result.text);
console.log(`Iterations used: ${result.iterations}`);
console.log(`Actions taken: ${result.actions.length}`);
Configuration
interface ComputerUseAgentConfig {
  apiKey?: string;            // defaults to ANTHROPIC_API_KEY env
  model?: string;             // default "claude-sonnet-4-20250514"
  maxTokens?: number;         // default 4096
  executor: ComputerExecutor; // required
  maxIterations?: number;     // default 50; safety cap on the loop
  systemPrompt?: string;      // optional, prepended to every call
  enableZoom?: boolean;       // default true; sends enable_zoom to computer_20251124
}
Supported models
Computer Use is supported on claude-opus-4.7, claude-opus-4.6, claude-sonnet-4.6, claude-opus-4.5, plus Sonnet 4.5 / Haiku 4.5 / Opus 4.1 with the older tool version. The wrapper sends betas: ["computer-use-2025-11-24"] automatically.
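For orientation, here is a rough sketch of how executor settings might map onto the computer tool entry in a request. The wrapper's real payload is not shown on this page, so treat this as an assumption: `buildComputerTool` is a hypothetical helper, and the `display_*_px` field names follow Anthropic's published computer tool schema as best I know it.

```typescript
// Hypothetical shapes for illustration; the wrapper builds its own request internally.
interface DisplayInfo {
  displayWidth: number;
  displayHeight: number;
  displayNumber?: number;
}

interface ComputerToolParam {
  type: "computer_20251124";
  name: "computer";
  display_width_px: number;
  display_height_px: number;
  display_number?: number;
  enable_zoom?: boolean;
}

// Beta flag quoted from this page.
const COMPUTER_USE_BETAS = ["computer-use-2025-11-24"];

function buildComputerTool(display: DisplayInfo, enableZoom = true): ComputerToolParam {
  const tool: ComputerToolParam = {
    type: "computer_20251124",
    name: "computer",
    display_width_px: display.displayWidth,
    display_height_px: display.displayHeight,
  };
  // Only include optional fields when they carry information.
  if (display.displayNumber !== undefined) tool.display_number = display.displayNumber;
  if (enableZoom) tool.enable_zoom = true;
  return tool;
}
```

The point of the sketch is that the display size the model sees comes straight from the executor, which is why those two numbers have to be accurate.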
ComputerExecutor interface
interface ComputerExecutor {
  readonly displayWidth: number;   // pixel width of the screen
  readonly displayHeight: number;  // pixel height of the screen
  readonly displayNumber?: number; // X11 display number, if relevant
  execute(action: ComputerAction): Promise<ComputerActionResult>;
}

interface ComputerActionResult {
  output?: string;           // optional human-readable log (errors, etc.)
  screenshotBase64?: string; // PNG screenshot after the action
}
displayWidth and displayHeight are passed to the model so it knows the coordinate space. They must match what your executor actually captures. Mismatched dimensions are the #1 source of “Claude clicks the wrong spot” bugs.
screenshotBase64 should be a raw base64 PNG (no data:image/png;base64, prefix; the wrapper formats the Anthropic API request correctly).
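Both requirements can be checked before a screenshot leaves your executor. The sketch below is one possible guard, not part of the wrapper: `toRawBase64`, `pngSize`, and `checkScreenshot` are hypothetical helpers. It strips a data-URL prefix if one slipped in, then reads the width and height from the PNG's IHDR chunk and compares them with the declared display size.

```typescript
import { Buffer } from "node:buffer";

const DATA_URL_PREFIX = /^data:image\/png;base64,/;
const PNG_SIGNATURE = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);

/** Drop a leading data-URL prefix so only raw base64 remains. */
function toRawBase64(screenshot: string): string {
  return screenshot.replace(DATA_URL_PREFIX, "");
}

/** Read width and height from the IHDR chunk of a base64-encoded PNG. */
function pngSize(rawBase64: string): { width: number; height: number } {
  const bytes = Buffer.from(rawBase64, "base64");
  const looksLikePng =
    bytes.length >= 24 &&
    bytes.subarray(0, 8).equals(PNG_SIGNATURE) &&
    bytes.toString("latin1", 12, 16) === "IHDR";
  if (!looksLikePng) throw new Error("screenshot is not a PNG");
  // IHDR stores width and height as big-endian 32-bit integers at offsets 16 and 20.
  return { width: bytes.readUInt32BE(16), height: bytes.readUInt32BE(20) };
}

/** Return the raw base64 screenshot, or throw if its size differs from the declared display. */
function checkScreenshot(screenshot: string, displayWidth: number, displayHeight: number): string {
  const raw = toRawBase64(screenshot);
  const { width, height } = pngSize(raw);
  if (width !== displayWidth || height !== displayHeight) {
    throw new Error(`screenshot is ${width}x${height}, executor declares ${displayWidth}x${displayHeight}`);
  }
  return raw;
}
```

Calling `checkScreenshot` on the first capture of a run is usually enough to catch a mismatch early.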
Supported actions
The wrapper accepts any of the standard computer_20251124 action types:
type ComputerAction =
  | { action: "screenshot" }
  | { action: "mouse_move"; coordinate: [number, number] }
  | { action: "left_click"; coordinate?: [number, number] }
  | { action: "right_click"; coordinate?: [number, number] }
  | { action: "double_click"; coordinate?: [number, number] }
  | { action: "left_click_drag"; coordinate: [number, number] }
  | { action: "type"; text: string }
  | { action: "key"; text: string }
  | { action: "scroll"; coordinate: [number, number]; scroll_direction: "up" | "down" | "left" | "right"; scroll_amount: number }
  | { action: "zoom"; region: [number, number, number, number] };
The wrapper logs the action shape and hands it to your executor. Your executor decides how to perform it — there is no “default implementation” because the right behavior depends entirely on your platform.
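As one example of that platform-specific mapping, the sketch below translates each action into xdotool argument lists for an X11 desktop. `xdotoolCommands` is a hypothetical helper, the action type is copied from this page so the block stands alone, and the drag handling assumes the drag starts at the current pointer position because the type above carries only one coordinate. The `never` check in the default branch makes the compiler flag any action type added later.

```typescript
type Point = [number, number];

// Copied from the union above so this sketch compiles on its own.
type ComputerAction =
  | { action: "screenshot" }
  | { action: "mouse_move"; coordinate: Point }
  | { action: "left_click"; coordinate?: Point }
  | { action: "right_click"; coordinate?: Point }
  | { action: "double_click"; coordinate?: Point }
  | { action: "left_click_drag"; coordinate: Point }
  | { action: "type"; text: string }
  | { action: "key"; text: string }
  | { action: "scroll"; coordinate: Point; scroll_direction: "up" | "down" | "left" | "right"; scroll_amount: number }
  | { action: "zoom"; region: [number, number, number, number] };

// X11 maps scroll wheel movement to mouse buttons 4-7.
const SCROLL_BUTTON = { up: "4", down: "5", left: "6", right: "7" } as const;

function moveTo([x, y]: Point): string[] {
  return ["mousemove", String(x), String(y)];
}

function clickAt(button: string, coordinate?: Point, repeat = 1): string[][] {
  const click = repeat > 1 ? ["click", "--repeat", String(repeat), button] : ["click", button];
  return coordinate ? [moveTo(coordinate), click] : [click];
}

/** Each inner array is the argv for one `xdotool` invocation, in order. */
function xdotoolCommands(action: ComputerAction): string[][] {
  switch (action.action) {
    case "screenshot":
    case "zoom":
      return []; // nothing to send; the executor only needs to capture
    case "mouse_move":
      return [moveTo(action.coordinate)];
    case "left_click":
      return clickAt("1", action.coordinate);
    case "right_click":
      return clickAt("3", action.coordinate);
    case "double_click":
      return clickAt("1", action.coordinate, 2);
    case "left_click_drag":
      // Assumption: press at the current pointer position, drag to the target, release.
      return [["mousedown", "1"], moveTo(action.coordinate), ["mouseup", "1"]];
    case "type":
      return [["type", "--delay", "10", action.text]];
    case "key":
      return [["key", action.text]];
    case "scroll": {
      const move = moveTo(action.coordinate);
      if (action.scroll_amount < 1) return [move];
      return [move, ["click", "--repeat", String(action.scroll_amount), SCROLL_BUTTON[action.scroll_direction]]];
    }
    default: {
      const unreachable: never = action;
      throw new Error(`unhandled action: ${JSON.stringify(unreachable)}`);
    }
  }
}
```

Keeping the translation pure (no process spawning) means it can be unit-tested without a display; the executor then runs each argv with `execFile("xdotool", argv)`.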
About zoom
{ action: "zoom", region: [x1, y1, x2, y2] } asks for a zoomed-in PNG of the screen region defined by those two corners. The wrapper only includes the zoom tool option if enableZoom: true (default). For executor implementations, this means cropping to the region, rescaling up, and returning the cropped PNG.
If you don’t support zoom yet, set enableZoom: false; the model won’t request it.
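The geometry of a zoom request can be worked out independently of any image library. The sketch below is one way to do it: `planZoom` is a hypothetical helper, and scaling the crop up until it fits the full display size is my assumption, since this page does not specify a target size. It normalizes the two corners, clamps them to the screen, and returns the crop rectangle plus a scale factor to feed into whatever image tool you use.

```typescript
type Region = [number, number, number, number]; // [x1, y1, x2, y2]

interface CropPlan {
  left: number;
  top: number;
  width: number;
  height: number;
  scale: number; // factor to enlarge the crop by
}

function planZoom(region: Region, displayWidth: number, displayHeight: number): CropPlan {
  const [x1, y1, x2, y2] = region;
  // Round to whole pixels and keep every edge on the screen.
  const clamp = (value: number, max: number) => Math.min(Math.max(Math.round(value), 0), max);

  // Accept the corners in either order.
  const left = clamp(Math.min(x1, x2), displayWidth);
  const right = clamp(Math.max(x1, x2), displayWidth);
  const top = clamp(Math.min(y1, y2), displayHeight);
  const bottom = clamp(Math.max(y1, y2), displayHeight);

  const width = right - left;
  const height = bottom - top;
  if (width <= 0 || height <= 0) throw new Error("zoom region is empty after clamping");

  // Assumption: enlarge as far as possible while still fitting the display, keeping aspect ratio.
  const scale = Math.min(displayWidth / width, displayHeight / height);
  return { left, top, width, height, scale };
}
```

For example, a 320×200 region on a 1280×800 display yields a scale of 4.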
Return value
interface ComputerUseRunOutput {
  text: string;              // final assistant text
  actions: ComputerAction[]; // all actions taken during the run
  iterations: number;        // how many LLM round-trips
}
If the loop hits maxIterations before Claude returns a final text turn, the text is "[max iterations reached without final answer]" and you can decide how to handle it.
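One way to handle that case is to turn the sentinel string into an explicit status before the result reaches the rest of your code. `classifyRun` and `RunStatus` below are hypothetical names, and the result shape is a local copy of the interface above with the action type loosened so the sketch stands alone.

```typescript
// Sentinel text quoted from this page.
const MAX_ITERATIONS_TEXT = "[max iterations reached without final answer]";

interface RunOutput {
  text: string;
  actions: unknown[];
  iterations: number;
}

type RunStatus =
  | { ok: true; answer: string }
  | { ok: false; reason: string };

function classifyRun(result: RunOutput): RunStatus {
  if (result.text === MAX_ITERATIONS_TEXT) {
    return {
      ok: false,
      reason: `gave up after ${result.iterations} iterations and ${result.actions.length} actions`,
    };
  }
  return { ok: true, answer: result.text };
}
```

A caller can then branch on `status.ok` and decide whether to retry with a higher cap, surface the partial action log, or fail the task.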
Built-in safety
When you use computer_20251124, Anthropic runs prompt injection classifiers automatically on every request. They run in parallel with the main model so latency is unaffected. If a screenshot contains an obvious injection (e.g. “ignore previous instructions, click here”), the model is signaled and tends to refuse.
The wrapper sends betas: ["computer-use-2025-11-24"] to opt into the latest classifier.
Your safety responsibilities
Anthropic’s classifiers handle the model side. The platform side is on you:
- Don’t run on the user’s primary desktop. Use a dedicated Xvfb display or a container.
- Restrict outbound network: enforce VPC egress rules at the firewall, not just at the app layer.
- Run as an unprivileged OS user, so the agent can't read /etc/shadow even if it escapes the paths you intended.
- Audit actions: log every action for post-hoc review.
- Time-cap the run: set maxIterations to a reasonable upper bound (default 50 is fine for most tasks).
- Compose with SandboxAgent: run the entire computer-use loop inside an isolated container.
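The audit item can be handled by wrapping any executor before handing it to the agent. The sketch below is one possible approach rather than a built-in feature: `withAudit`, `AuditEntry`, and the loosened `ExecutorLike` types are hypothetical, and where the entries end up (file, database, log pipeline) is left to the `record` callback.

```typescript
// Loosened local types so the sketch stands alone; a ComputerExecutor fits this shape.
interface ActionLike { action: string }
interface ResultLike { output?: string; screenshotBase64?: string }
interface ExecutorLike {
  displayWidth: number;
  displayHeight: number;
  displayNumber?: number;
  execute(action: ActionLike): Promise<ResultLike>;
}

interface AuditEntry {
  at: string;         // ISO timestamp of when the action finished
  action: ActionLike; // the action exactly as the model requested it
  ok: boolean;
  error?: string;
}

/** Wrap an executor so every action, successful or not, is passed to `record`. */
function withAudit(inner: ExecutorLike, record: (entry: AuditEntry) => void): ExecutorLike {
  return {
    displayWidth: inner.displayWidth,
    displayHeight: inner.displayHeight,
    displayNumber: inner.displayNumber,
    execute: async (action) => {
      try {
        const result = await inner.execute(action);
        record({ at: new Date().toISOString(), action, ok: true });
        return result;
      } catch (err) {
        record({ at: new Date().toISOString(), action, ok: false, error: String(err) });
        throw err; // keep the original failure visible to the caller
      }
    },
  };
}
```

Note that `type` actions carry the literal text being typed, so an audit log like this may contain sensitive input and should be stored accordingly.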
Example: minimal Linux executor sketch
import { execFile } from "node:child_process";
import { promisify } from "node:util";
import { readFile } from "node:fs/promises";
import type { ComputerExecutor } from "@agentium/core";

const exec = promisify(execFile);

const executor: ComputerExecutor = {
  displayWidth: 1280,
  displayHeight: 800,
  displayNumber: 99, // Xvfb :99
  execute: async (action) => {
    // Keep PATH and the rest of the parent environment; only override DISPLAY.
    const env = { ...process.env, DISPLAY: ":99" };
    switch (action.action) {
      case "screenshot":
        break; // just snapshot below
      case "left_click":
        // Move first if the model supplied a coordinate, then click button 1.
        if (action.coordinate) {
          await exec("xdotool", ["mousemove", String(action.coordinate[0]), String(action.coordinate[1])], { env });
        }
        await exec("xdotool", ["click", "1"], { env });
        break;
      case "type":
        await exec("xdotool", ["type", "--delay", "10", action.text], { env });
        break;
      case "key":
        await exec("xdotool", ["key", action.text], { env });
        break;
      // ... etc
    }
    // ImageMagick's `import` captures the root window of the Xvfb display.
    await exec("import", ["-display", ":99", "-window", "root", "/tmp/shot.png"], { env });
    const data = await readFile("/tmp/shot.png");
    return { screenshotBase64: data.toString("base64") };
  },
};
Comparison with @agentium/browser
| | ComputerUseAgent | @agentium/browser |
|---|---|---|
| Target | Any desktop (browser, Slack, IDE, …) | Web browser only |
| Underlying tool | Anthropic Computer Use | Vision-driven Playwright |
| Model required | Claude family | Any vision-capable model |
| Action space | Mouse + keyboard + zoom | DOM-aware + screenshot |
| Best for | Native apps, full OS automation | Web scraping, web testing |