Architecture Overview

Contop uses a tri-node architecture: a mobile client, a Python server, and a desktop host collaborate to deliver AI-powered remote desktop control.

System Topology

Node Responsibilities

Node 1: Mobile Client

  • Voice input via configurable STT (Google STT default, also supports OpenAI Whisper and OpenRouter)
  • Configurable conversation model (Gemini default, also supports OpenAI, Anthropic, OpenRouter) for intent classification
  • Execution thread UI rendering
  • Remote screen display via WebRTC video track
  • Manual control input (joystick, clicks, keyboard)
  • Session persistence and history

Node 2: Contop Server (Python / FastAPI)

  • WebRTC signaling (SDP/ICE exchange)
  • ADK execution agent with 30+ tools (40+ with optional skills)
  • Dual-Tool Evaluator security classification
  • Vision pipeline (9 backends for screen understanding)
  • JSONL audit logging
  • Skills engine (prompt, workflow, python, mixed)
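
The JSONL audit log is an append-only file with one JSON object per line, so records can be written atomically and parsed independently. A minimal sketch of the pattern (the record fields here are assumptions, not Contop's actual schema):

```python
import json
import time

def append_audit(path: str, event: str, **fields) -> dict:
    """Append one audit record as a single JSON line.

    Field names (ts, event, plus free-form extras) are illustrative,
    not Contop's real schema.
    """
    record = {"ts": time.time(), "event": event, **fields}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record

def read_audit(path: str) -> list[dict]:
    """Read the log back; each line parses on its own, so a truncated
    final line after a crash can simply be skipped."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```

Because every line is self-contained, the log survives crashes mid-write and can be tailed or filtered with ordinary text tools.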

Node 3: Desktop Host (Tauri / Rust)

  • Server lifecycle management (start/stop/restart sidecar)
  • Settings GUI and QR code display
  • Away Mode overlay with keyboard blocking
  • API key storage in local settings.json
  • CLI proxy lifecycle management (start/stop/health monitoring for subscription mode)
  • Device monitoring and OS notifications
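
Sidecar lifecycle management reduces to start/stop/restart plus a liveness check. The real host implements this in Rust/Tauri; this Python sketch (class and method names are my own, not Contop's API) shows only the pattern:

```python
import subprocess
import sys

class SidecarManager:
    """Minimal start/stop/restart pattern for a managed child process.

    Illustrative only: the actual desktop host is Rust, and a production
    health check would also poll an HTTP endpoint, not just process state.
    """

    def __init__(self, cmd: list[str]):
        self.cmd = cmd
        self.proc: subprocess.Popen | None = None

    def start(self) -> None:
        if not self.healthy():  # avoid spawning a second copy
            self.proc = subprocess.Popen(self.cmd)

    def stop(self) -> None:
        if self.healthy():
            self.proc.terminate()
            self.proc.wait(timeout=10)

    def restart(self) -> None:
        self.stop()
        self.start()

    def healthy(self) -> bool:
        """True while the child process is still running."""
        return self.proc is not None and self.proc.poll() is None
```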

Why Three Nodes?

| Concern | Handled By |
| --- | --- |
| User interface | Mobile Client — always in your pocket |
| AI reasoning + tool execution | Contop Server — needs desktop OS access |
| Native OS integration | Desktop Host — Rust for low-level APIs |
| Security isolation | Split between Server (evaluator) and Host (sandboxing) |

The mobile client handles user interaction; the server handles AI reasoning and tool execution; the desktop host handles native OS integration that Python can't do (overlay windows, keyboard hooks, process tree management).

Communication Protocols

| Path | Protocol | Purpose |
| --- | --- | --- |
| Phone ↔ Server | WebRTC Data Channel (DTLS) | Commands, progress, results |
| Phone ← Server | WebRTC Video Track (SRTP) | Live screen feed |
| Phone → Server | WebSocket (initial only) | SDP/ICE signaling exchange |
| Server ↔ Desktop | HTTP localhost | Settings, health, proxy lifecycle |
| Server → Cloud | HTTPS | LLM API calls, tunnel management |
| Server → CLI Proxy | HTTP localhost | Subscription mode LLM routing |

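Only three message kinds cross the initial WebSocket: an SDP offer, an SDP answer, and ICE candidates. A sketch of shape-checking such messages (the `{"type": ..., "sdp"/"candidate": ...}` layout is an assumed wire format, not Contop's documented one):

```python
def valid_signal(msg: dict) -> bool:
    """Return True if a signaling message has an expected shape.

    Assumed wire format for illustration: offers and answers carry an
    "sdp" string, ICE messages carry a "candidate" string.
    """
    kind = msg.get("type")
    if kind in ("offer", "answer"):
        return isinstance(msg.get("sdp"), str)
    if kind == "ice":
        return isinstance(msg.get("candidate"), str)
    return False
```

Once the offer/answer and ICE exchange completes, the WebSocket is no longer needed: commands move to the DTLS data channel and the screen feed to the SRTP video track.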
Subscription Mode Vision Limitation

CLI tools (claude -p, gemini, codex) accept only text — they cannot receive base64 images. In subscription mode, the execution agent's LLM vision fallback (direct screenshot analysis when no local vision backend processes a frame) is unavailable. The agent falls back to text-only tools like get_ui_context. The mobile app shows a NO VISION badge on the execution model card when subscription mode is active.
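
The resulting fallback order can be summarized as a small decision function (a sketch; the labels are descriptive rather than real tool names, except `get_ui_context`, which the docs mention):

```python
def pick_vision_strategy(subscription_mode: bool, local_backend_handled: bool) -> str:
    """Illustrative decision logic for how a frame gets understood."""
    if local_backend_handled:
        return "local_vision_backend"  # one of the local screen-understanding backends
    if subscription_mode:
        return "get_ui_context"        # CLI LLMs are text-only: no base64 images
    return "llm_vision_fallback"       # API-mode model analyzes the screenshot directly
```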


Related: Contop Server · WebRTC Transport · Tool Layers