Open Source · AGPL-3.0 License · v0.1.1

Give AI Eyes & Hands
On Your Desktop

ScreenHand is the open-source MCP server that lets Claude and any AI agent see your screen, click buttons, type text, and control any app — on macOS & Windows. 70+ tools at native speed (~50ms per action).

~50ms UI Actions
70+ Tools
macOS & Windows
The Problem

AI Can Think. But It Can't Do.

Your AI assistant writes brilliant code but can't click a button. It understands workflows but can't switch between apps. It's time to fix that.

3 Core Problems

Why AI Desktop Control Is Broken Today

Problem 01

Screenshot AI Is 2-5s Per Action

Capture screen, send to LLM, interpret pixels, guess coordinates, click. Each step takes seconds. A 10-step workflow takes a full minute.

Problem 02

Pixel Coordinates Break Constantly

Click at (x,y)? One window resize, one display scale change, and the AI clicks the wrong element. Coordinate guessing is fundamentally unreliable.

Problem 03

Your Desktop Has No API

Chrome, Excel, Slack, Jira — each app is an isolated silo. AI can't read from one and act in another. Your desktop needs a universal interface.

The Solution

Native APIs. Zero Guessing.
100x Faster.

ScreenHand reads the actual UI tree through OS Accessibility APIs. It knows every button, menu, and text field — instantly.

See Everything

Screenshots + Accessibility Tree

Not just pixels. ScreenHand reads the actual UI element tree through native APIs, plus Vision-framework OCR.

  • Full screenshots with OCR, bounding boxes, element positions
  • Accessibility tree: roles, titles, values — DevTools for any app
  • Find elements by text or role in ~50ms
~600ms OCR · ~50ms UI tree
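To make the "accessibility tree, not pixels" idea concrete, here is a minimal sketch of what searching such a tree by role or title looks like. The `UINode` shape and `findNode` helper are illustrative assumptions, not ScreenHand's actual `ui_tree`/`ui_find` output format:

```typescript
// Hypothetical shape of an accessibility-tree node, for illustration only.
interface UINode {
  role: string;          // e.g. "AXButton", "AXTextField"
  title?: string;        // accessibility title, if the app exposes one
  frame: { x: number; y: number; w: number; h: number };
  children: UINode[];
}

// Depth-first search by role and/or title — conceptually what a
// ui_find("Save")-style lookup does against the tree.
function findNode(
  node: UINode,
  match: { role?: string; title?: string }
): UINode | undefined {
  const roleOk = match.role === undefined || node.role === match.role;
  const titleOk = match.title === undefined || node.title === match.title;
  if (roleOk && titleOk) return node;
  for (const child of node.children) {
    const hit = findNode(child, match);
    if (hit) return hit;
  }
  return undefined;
}

// Example tree: a window containing a toolbar with a Save button.
const tree: UINode = {
  role: "AXWindow",
  title: "Untitled.txt",
  frame: { x: 0, y: 0, w: 1280, h: 800 },
  children: [
    {
      role: "AXToolbar",
      frame: { x: 0, y: 0, w: 1280, h: 48 },
      children: [
        {
          role: "AXButton",
          title: "Save",
          frame: { x: 12, y: 8, w: 64, h: 32 },
          children: [],
        },
      ],
    },
  ],
};

const save = findNode(tree, { title: "Save" });
console.log(save?.role, save?.frame); // the button and its current position
```

Because the match is by title rather than coordinates, the lookup still succeeds after the window is resized or moved — only `frame` changes.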
Click & Type

Target By Name, Not Coordinates

Click buttons by accessibility title. Resize, rearrange — ScreenHand still finds the right element.

  • ui_press("Save") — position-independent
  • Type text, set values, keyboard shortcuts
  • Drag, scroll, full mouse control
~50ms per click · 0 missed clicks
100x Faster

Native APIs Skip The Screenshot Loop

Screenshot tools: capture → LLM → interpret → guess → click. ScreenHand talks to the OS directly.

  • UI actions: ~50ms vs 2-5 seconds
  • Chrome CDP: ~10ms
  • Memory: ~0ms (O(1) lookup)
100x faster than Computer Use
Cross-App

One AI. Every App. Full Desktop.

A universal API for your entire desktop. Read from one app, act in another.

  • Spreadsheet → Chrome → Notes in one flow
  • Slack → Jira → Docs automation
  • Native apps + browser + system menus
  • Stealth mode for bot-protected sites (Instagram, LinkedIn)
  • Platform playbooks with pre-built selectors & error solutions
70+ tools · any app
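A cross-app flow like "spreadsheet → Chrome → Notes" reduces to a sequence of tool calls the agent emits one after another. A hedged sketch — the tool names come from ScreenHand's tool list above, but the call shape and arguments are illustrative, not the actual MCP schemas:

```typescript
// One cross-app flow as a sequence of tool calls.
// Call shape and arguments are hypothetical.
type ToolCall = { tool: string; args: Record<string, unknown> };

const flow: ToolCall[] = [
  { tool: "focus", args: { app: "Numbers" } },
  { tool: "ocr", args: {} },                                  // read the sheet
  { tool: "open", args: { url: "https://example.com" } },     // look up in Chrome
  { tool: "focus", args: { app: "Notes" } },
  { tool: "type_text", args: { text: "result goes here" } },  // write the result
];

console.log(flow.map((c) => c.tool).join(" -> "));
```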
Use Cases

What People Actually Use It For

Real workflows people automate with ScreenHand every day.

💻

Automate Repetitive Tasks

"Fill out this form on 10 websites" — ScreenHand opens each site, fills the fields, and submits. You watch.

Form filling · Data entry · Batch operations
🔎

Debug UIs Without Clicking Around

Ask Claude to inspect the UI tree, check button states, walk through a flow — all from your terminal. No manual clicking.

UI debugging · Element inspection · State checking
🌐

Browser Automation Without Selenium

Navigate pages, fill forms, run JavaScript, scrape data — through Chrome DevTools Protocol. Works even on sites that block bots.

Web scraping · Form automation · Testing
🚀

Cross-App Workflows

Read from a spreadsheet, search in Chrome, paste into Notes — chain actions across your entire desktop in one command.

Multi-app · Data transfer · Workflow chains

How it works: You tell your AI what to do in plain English. ScreenHand translates that into native OS actions — clicking buttons by name, typing into fields, reading screen content. No scripting needed.

Demo

Watch AI Control a Desktop in Real Time

Performance

Milliseconds, Not Seconds

~50ms
UI Actions
~10ms
Chrome CDP
~600ms
Screenshot+OCR
2-5s
Screenshot AI (others)
70+ Tools

Complete Desktop Control

Screen Vision

screenshot, screenshot_file, ocr — Full screenshots with OCR and bounding boxes.

App Control

apps, windows, focus, launch, ui_tree, ui_find, ui_press, ui_set_value, menu_click.

Keyboard & Mouse

click, click_text, type_text, key, drag, scroll — full input simulation.

Chrome Browser (CDP)

browser_tabs, open, navigate, js, dom, click, type, wait — full CDP control with React-compatible input events.

Learning Memory

Auto-learns strategies, tracks errors, O(1) recall, background web research for fixes.
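The O(1) recall claimed above can be pictured as a keyed strategy cache. This is a minimal sketch of the idea, assuming a hypothetical app+task key — not ScreenHand's actual memory implementation:

```typescript
// Strategy memory with O(1) recall, keyed by app + task.
// Illustrative only; the real store may differ.
interface Strategy {
  steps: string[];   // tool calls that worked last time
  failures: number;  // error count, used to decide when to re-learn
}

class StrategyMemory {
  private store = new Map<string, Strategy>();

  private key(app: string, task: string): string {
    return `${app}::${task}`;
  }

  learn(app: string, task: string, steps: string[]): void {
    this.store.set(this.key(app, task), { steps, failures: 0 });
  }

  recall(app: string, task: string): Strategy | undefined {
    return this.store.get(this.key(app, task)); // O(1) hash lookup
  }

  recordFailure(app: string, task: string): void {
    const s = this.store.get(this.key(app, task));
    if (s) s.failures += 1;
  }
}

const memory = new StrategyMemory();
memory.learn("Chrome", "open-new-tab", ['key("cmd+t")']);
console.log(memory.recall("Chrome", "open-new-tab")?.steps);
```

A `Map` gives constant-time lookup regardless of how many strategies have been learned, which is why recall cost stays flat (~0ms) as memory grows.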

AppleScript

Run any AppleScript for deep macOS system integration and app scripting.

Stealth & Anti-Detection

browser_stealth, browser_fill_form, browser_human_click — bypass bot detection on Instagram, LinkedIn, and more with human-like interactions.

Platform Playbooks

platform_guide, export_playbook — pre-built automation guides with selectors, flows, and error solutions. Auto-generate and share playbooks from your sessions.

Works Everywhere

Any MCP Client. 3 Lines of Config.

Claude Desktop
Claude Code
Cursor
Windsurf
Codex CLI
Architecture

Three Layers. Zero Cloud. All Local.

Everything runs on your machine. No data leaves your desktop.

AI Client

Claude Desktop, Claude Code, Cursor, Codex CLI

↓ MCP Protocol (stdio)

ScreenHand MCP Server

TypeScript — routes 70+ tools, manages sessions, Chrome CDP, stealth & playbooks

↓ JSON-RPC (stdio)

Native Bridge

Swift (macOS) · C# .NET 8 (Windows) — Accessibility APIs
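The hop from the MCP server to the native bridge is JSON-RPC over stdio. A minimal sketch of how one request could be framed — the method and parameter names here are illustrative, not ScreenHand's actual wire format:

```typescript
// Build a JSON-RPC 2.0 request as it would be written to the bridge's
// stdin, one message per line. Method/param names are hypothetical.
interface JsonRpcRequest {
  jsonrpc: "2.0";
  id: number;
  method: string;
  params?: Record<string, unknown>;
}

let nextId = 0;

function buildRequest(method: string, params?: Record<string, unknown>): string {
  const req: JsonRpcRequest = { jsonrpc: "2.0", id: ++nextId, method, params };
  return JSON.stringify(req) + "\n"; // newline-delimited framing
}

// e.g. ask the native bridge to press a button by accessibility title
const wire = buildRequest("ui_press", { app: "TextEdit", title: "Save" });
console.log(wire);
```

The bridge would reply on stdout with a response carrying the same `id`, letting the TypeScript server match answers to in-flight requests.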

Get Started

Running in 60 Seconds

terminal
# Clone & build
git clone https://github.com/manushi4/screenhand.git
cd screenhand && npm install
npm run build:native        # macOS
npm run build:native:windows # Windows
npm test                     # 95 tests
// ~/Library/Application Support/Claude/claude_desktop_config.json
{
  "mcpServers": {
    "screenhand": {
      "command": "npx",
      "args": ["tsx", "/path/to/screenhand/mcp-desktop.ts"]
    }
  }
}
// .mcp.json or ~/.claude/settings.json
{
  "mcpServers": {
    "screenhand": {
      "command": "npx",
      "args": ["tsx", "/path/to/screenhand/mcp-desktop.ts"]
    }
  }
}
// .cursor/mcp.json
{
  "mcpServers": {
    "screenhand": {
      "command": "npx",
      "args": ["tsx", "/path/to/screenhand/mcp-desktop.ts"]
    }
  }
}
# ~/.codex/config.toml
[mcp.screenhand]
command = "npx"
args = ["tsx", "/path/to/screenhand/mcp-desktop.ts"]
transport = "stdio"
FAQ

Frequently Asked Questions

What is ScreenHand?
An open-source MCP server giving AI assistants (Claude, Cursor, Codex) the ability to see and control your desktop. 70+ tools for screenshots, UI inspection, clicking, typing, and browser automation on macOS and Windows. Uses native Accessibility APIs for ~50ms per action.

How is it different from Computer Use?
Computer Use is cloud-based screenshot interpretation. ScreenHand is local-first using native OS APIs — ~100x faster (~50ms vs 2-5s), more reliable (no coordinate guessing), and all data stays on your machine.

Which AI clients does it work with?
Any MCP-compatible client: Claude Desktop, Claude Code, Cursor, Windsurf, OpenAI Codex CLI. Standard MCP over stdio — 3 lines of config.

How fast is it?
UI actions ~50ms, Chrome CDP ~10ms, Screenshot+OCR ~600ms, Memory ~0ms. 100x faster than screenshot-only tools because it reads the UI tree directly.

How does it compare to OpenClaw?
OpenClaw uses screenshots + LLM vision to guess where to click (seconds per action, coordinate-based). ScreenHand uses native Accessibility APIs (~50ms, exact element targeting). ScreenHand is an MCP server that works with any AI client, while OpenClaw is a standalone agent. ScreenHand auto-learns strategies with O(1) recall; OpenClaw has community skills but no automatic learning. ScreenHand is scoped and secure; OpenClaw requires careful sandboxing.

Does it work on both macOS and Windows?
Yes. macOS uses Swift + Accessibility APIs. Windows uses C# .NET 8 + UI Automation. Same protocol; all tools work identically.

Is it safe to run?
ScreenHand runs locally and never sends data externally. macOS needs Accessibility permission; Windows needs no admin rights. Dangerous tools are audit-logged.

Is it free?
Yes. Free and open-source under the AGPL-3.0 license. Full source at github.com/manushi4/screenhand. Built by Clazro Technology Private Limited.

Give AI Desktop
Superpowers

Open source. AGPL-3.0 licensed. 70+ tools. Native speed. Built for MCP.

Star on GitHub