计算机使用代理

低风险

作者 @sickn33已验证来源

4.7287 次安装v1.0.0更新于 2026年5月25日

使用方式

在 Claude Code 中运行以下命令

第一步：添加 Marketplace

/plugin marketplace add sickn33/antigravity-awesome-skills

第二步：安装插件

/plugin install computer-use-agents@antigravity-awesome-skills

关于

构建像人类一样与计算机交互的 AI 代理 — 屏幕理解、鼠标键盘操作和任务自动化。

name: computer-use-agents description: 构建像人类一样与计算机交互的 AI 代理——查看屏幕、移动光标、点击按钮和输入文本。涵盖 Anthropic Computer Use、OpenAI Operator/CUA 和开源替代方案。 risk: unknown source: vibeship-spawner-skills (Apache 2.0) date_added: 2026-02-27

计算机使用代理

构建像人类一样与计算机交互的 AI 代理——查看屏幕、移动光标、点击按钮和输入文本。涵盖 Anthropic Computer Use、OpenAI Operator/CUA 和开源替代方案。重点关注沙箱隔离、安全性以及处理基于视觉控制的独特挑战。

模式

感知-推理-行动循环

计算机使用代理的基本架构：观察屏幕、推理下一步动作、执行动作、重复。该循环通过迭代流水线将视觉模型与动作执行集成在一起。

关键组件：

感知：截图捕获当前屏幕状态
推理：视觉-语言模型分析并规划
行动：执行鼠标/键盘操作
反馈：观察结果，继续或纠正

关键洞察：视觉代理在"思考"阶段（1-5 秒）完全静止，产生可检测的暂停模式。

适用场景：从零构建任何计算机使用代理，将视觉模型与桌面控制集成，理解代理行为模式

from anthropic import Anthropic
from PIL import Image
import base64
import pyautogui
import time

class ComputerUseAgent:
    """
    Perception-Reasoning-Action loop implementation.
    Based on Anthropic Computer Use patterns.
    """

    def __init__(self, client: Anthropic, model: str = "claude-sonnet-4-20250514"):
        self.client = client
        self.model = model
        self.max_steps = 50  # Prevent runaway loops
        self.action_delay = 0.5  # Seconds between actions

    def capture_screenshot(self) -> str:
        """Capture screen and return base64 encoded image."""
        screenshot = pyautogui.screenshot()
        # Resize for token efficiency (1280x800 is good balance)
        screenshot = screenshot.resize((1280, 800), Image.LANCZOS)

        import io
        buffer = io.BytesIO()
        screenshot.save(buffer, format="PNG")
        return base64.b64encode(buffer.getvalue()).decode()

    def execute_action(self, action: dict) -> dict:
        """Execute mouse/keyboard action on the computer."""
        action_type = action.get("type")

        if action_type == "click":
            x, y = action["x"], action["y"]
            button = action.get("button", "left")
            pyautogui.click(x, y, button=button)
            return {"success": True, "action": f"clicked at ({x}, {y})"}

        elif action_type == "type":
            text = action["text"]
            pyautogui.typewrite(text, interval=0.02)
            return {"success": True, "action": f"typed {len(text)} chars"}

        elif action_type == "key":
            key = action["key"]
            pyautogui.press(key)
            return {"success": True, "action": f"pressed {key}"}

        elif action_type == "scroll":
            direction = action.get("direction", "down")
            amount = action.get("amount", 3)
            scroll = -amount if direction == "down" else amount
            pyautogui.scroll(scroll)
            return {"success": True, "action": f"scrolled {direction}"}

        elif action_type == "move":
            x, y = action["x"], action["y"]
            pyautogui.moveTo(x, y)
            return {"success": True, "action": f"moved to ({x}, {y})"}

        else:
            return {"success": False, "error": f"Unknown action: {action_type}"}

    def run(self, task: str) -> dict:
        """
        Run perception-reasoning-action loop until task complete.

        The loop:
        1. Screenshot current state
        2. Send to vision model with task context
        3. Parse action from response
        4. Execute action
        5. Repeat until done or max steps
        """
        messages = []
        step_count = 0

        system_prompt = """You are a computer use agent. You can see the screen
        and control mouse/keyboard.

        Available actions (respond with JSON):
        - {"type": "click", "x": 100, "y": 200, "button": "left"}
        - {"type": "type", "text": "hello world"}
        - {"type": "key", "key": "enter"}
        - {"type": "scroll", "direction": "down", "amount": 3}
        - {"type": "done", "result": "task completed successfully"}

        Always respond with ONLY a JSON action object.
        Be precise with coordinates - click exactly where needed.
        If you see an error, try to recover.
        """

        while step_count < self.max_steps:
            step_count += 1

            # 1. PERCEPTION: Capture current screen
            screenshot_b64 = self.capture_screenshot()

            # 2. REASONING: Send to vision model
            pass

兼容工具

Claude CodeCursor

计算机使用代理

关于

计算机使用代理

模式

感知-推理-行动循环

兼容工具

标签

相关推荐

RAG系统工程师

批量重构编排

Docx 文档处理

Azure AI Agents Java SDK

Azure Search 文档搜索

Azure AI Agent框架