
About
Build AI agents that interact with computers the way humans do: screen understanding, mouse and keyboard operation, and task automation.
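name: computer-use-agents
description: Build AI agents that interact with computers the way humans do, by viewing the screen, moving the cursor, clicking buttons, and typing text. Covers Anthropic Computer Use, OpenAI Operator/CUA, and open-source alternatives.
risk: unknown
source: vibeship-spawner-skills (Apache 2.0)
date_added: 2026-02-27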
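Computer Use Agents
Build AI agents that interact with computers the way humans do, by viewing the screen, moving the cursor, clicking buttons, and typing text. Covers Anthropic Computer Use, OpenAI Operator/CUA, and open-source alternatives. The focus is on sandbox isolation, security, and the challenges unique to vision-based control.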
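Patterns
Perception-Reasoning-Action Loop
The basic architecture of a computer use agent: observe the screen, reason about the next action, execute it, and repeat. The loop is an iterative pipeline that ties a vision model to action execution.
Key components:
- Perception: a screenshot captures the current screen state
- Reasoning: a vision-language model analyzes the screenshot and plans the next step
- Action: mouse and keyboard operations are executed
- Feedback: the result is observed, and the agent continues or corrects course
Key insight: a vision agent is completely idle during its "thinking" phase (1-5 seconds), which produces a detectable pattern of pauses.
When to use: building any computer use agent from scratch, integrating a vision model with desktop control, or understanding agent behavior patterns.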
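To make the pause insight concrete, here is a minimal sketch that summarizes the idle gaps between consecutive input events. The names `PauseStats` and `summarize_pauses` are hypothetical, and the 1-second threshold is an assumption taken from the lower end of the 1-5 second range quoted above; a real detector would likely need more than a single threshold.

```python
from dataclasses import dataclass


@dataclass
class PauseStats:
    gaps: list[float]          # idle time between consecutive events, in seconds
    think_pauses: list[float]  # gaps at or above the threshold
    think_fraction: float      # share of total idle time spent in long pauses


def summarize_pauses(event_times: list[float], min_think: float = 1.0) -> PauseStats:
    """Summarize idle gaps between input events (timestamps in seconds, ascending).

    min_think is an assumed threshold: any gap at least this long is treated
    as a "thinking" pause in which no input was produced.
    """
    gaps = [later - earlier for earlier, later in zip(event_times, event_times[1:])]
    think_pauses = [gap for gap in gaps if gap >= min_think]
    total_idle = sum(gaps)
    fraction = sum(think_pauses) / total_idle if total_idle > 0 else 0.0
    return PauseStats(gaps, think_pauses, fraction)


# Example: short bursts of input separated by multi-second stalls.
stats = summarize_pauses([0.0, 0.5, 3.5, 4.0, 6.0])
print(stats.think_pauses)
```

A trace dominated by long, evenly spaced stalls between short bursts of input is the kind of signature the insight describes.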
import base64
import io
import time

import pyautogui
from anthropic import Anthropic
from PIL import Image


class ComputerUseAgent:
    """
    Perception-Reasoning-Action loop implementation.
    Based on Anthropic Computer Use patterns.
    """

    def __init__(self, client: Anthropic, model: str = "claude-sonnet-4-20250514"):
        self.client = client
        self.model = model
        self.max_steps = 50  # Prevent runaway loops
        self.action_delay = 0.5  # Seconds between actions
        self.image_size = (1280, 800)  # Size of the screenshot sent to the model

    def capture_screenshot(self) -> str:
        """Capture screen and return base64 encoded image."""
        screenshot = pyautogui.screenshot()
        # Resize for token efficiency (1280x800 is a good balance)
        screenshot = screenshot.resize(self.image_size, Image.LANCZOS)
        buffer = io.BytesIO()
        screenshot.save(buffer, format="PNG")
        return base64.b64encode(buffer.getvalue()).decode()

    def to_screen(self, x: float, y: float) -> tuple[int, int]:
        """Map coordinates in the resized screenshot back to screen coordinates.

        The model only sees the resized image, so its coordinates must be
        rescaled before they are passed to pyautogui.
        """
        screen_w, screen_h = pyautogui.size()
        return (
            round(x * screen_w / self.image_size[0]),
            round(y * screen_h / self.image_size[1]),
        )

    def execute_action(self, action: dict) -> dict:
        """Execute mouse/keyboard action on the computer."""
        action_type = action.get("type")
        if action_type == "click":
            x, y = self.to_screen(action["x"], action["y"])
            button = action.get("button", "left")
            pyautogui.click(x, y, button=button)
            return {"success": True, "action": f"clicked at ({x}, {y})"}
        elif action_type == "type":
            text = action["text"]
            # write() is the current name for typewrite()
            pyautogui.write(text, interval=0.02)
            return {"success": True, "action": f"typed {len(text)} chars"}
        elif action_type == "key":
            key = action["key"]
            pyautogui.press(key)
            return {"success": True, "action": f"pressed {key}"}
        elif action_type == "scroll":
            direction = action.get("direction", "down")
            amount = action.get("amount", 3)
            # pyautogui scrolls up for positive values, down for negative
            scroll = -amount if direction == "down" else amount
            pyautogui.scroll(scroll)
            return {"success": True, "action": f"scrolled {direction}"}
        elif action_type == "move":
            x, y = self.to_screen(action["x"], action["y"])
            pyautogui.moveTo(x, y)
            return {"success": True, "action": f"moved to ({x}, {y})"}
        else:
            return {"success": False, "error": f"Unknown action: {action_type}"}

    def run(self, task: str) -> dict:
        """
        Run perception-reasoning-action loop until task complete.

        The loop:
        1. Screenshot current state
        2. Send to vision model with task context
        3. Parse action from response
        4. Execute action
        5. Repeat until done or max steps
        """
        messages = []
        step_count = 0
        system_prompt = """You are a computer use agent. You can see the screen
        and control mouse/keyboard.
        Available actions (respond with JSON):
        - {"type": "click", "x": 100, "y": 200, "button": "left"}
        - {"type": "type", "text": "hello world"}
        - {"type": "key", "key": "enter"}
        - {"type": "scroll", "direction": "down", "amount": 3}
        - {"type": "move", "x": 100, "y": 200}
        - {"type": "done", "result": "task completed successfully"}
        Always respond with ONLY a JSON action object.
        Be precise with coordinates - click exactly where needed.
        If you see an error, try to recover.
        """
        while step_count < self.max_steps:
            step_count += 1
            # 1. PERCEPTION: Capture current screen
            screenshot_b64 = self.capture_screenshot()
            # 2. REASONING: Send to vision model
            pass
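The listing above stops at the reasoning step, so steps 2-5 of the docstring are not shown. The following is a minimal sketch of how the rest of the loop might be structured, not the original author's implementation. The names `parse_action` and `run_loop` are hypothetical, and the model call is passed in as a `decide` callable so the control flow can be exercised without a screen or an API key. With the class above, `decide` would presumably wrap `client.messages.create(...)` and send the screenshot as a base64 image content block.

```python
import json
import re
from typing import Callable


def parse_action(reply: str) -> dict:
    """Extract a JSON action object from a model reply.

    Models sometimes wrap the JSON in prose or code fences, so this takes
    the span from the first '{' to the last '}' rather than parsing the
    raw reply. Failures are returned as {"type": "error", ...}.
    """
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        return {"type": "error", "error": "no JSON object in reply"}
    try:
        action = json.loads(match.group(0))
    except json.JSONDecodeError as exc:
        return {"type": "error", "error": f"invalid JSON: {exc}"}
    if "type" not in action:
        return {"type": "error", "error": "missing 'type' field"}
    return action


def run_loop(
    task: str,
    capture: Callable[[], str],
    decide: Callable[[str, str, list], str],
    execute: Callable[[dict], dict],
    max_steps: int = 50,
) -> dict:
    """Run perception -> reasoning -> action until 'done' or max_steps."""
    history: list[dict] = []
    for step in range(1, max_steps + 1):
        screenshot_b64 = capture()                      # 1. perception
        reply = decide(task, screenshot_b64, history)   # 2. reasoning
        action = parse_action(reply)                    # 3. parse
        if action["type"] == "done":
            return {"success": True, "steps": step, "result": action.get("result")}
        if action["type"] == "error":
            # Record the parse failure so the model can correct itself next step.
            result = {"success": False, "error": action["error"]}
        else:
            result = execute(action)                    # 4. action
        history.append({"action": action, "result": result})  # 5. feedback
    return {"success": False, "steps": max_steps, "error": "max steps reached"}
```

In this sketch, `capture` and `execute` would map to `capture_screenshot` and `execute_action` on `ComputerUseAgent`. Feeding parse failures back through `history` instead of raising gives the model a chance to recover, in line with the "try to recover" instruction in the system prompt.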
Compatible tools
Claude Code, Cursor
Tags
AI and Machine Learning