Vision Routing

Contop's vision system routes screen-understanding tasks through nine backends, each optimized for a different scenario.

Pipeline

  1. Capture — mss screen capture (thread-local instances, since mss is not thread-safe)
  2. Downscale — Resize to 1280px max width for consistent coordinate space
  3. Backend selection — Route to the configured vision backend
  4. Parse — Detect UI elements, extract coordinates and labels
  5. Cache — Results cached behind _parse_lock (thread-safe) for shared access between observe_screen and execute_gui

Nine Vision Backends

| Backend | Source Module | Coordinates | Use Case |
| --- | --- | --- | --- |
| UI-TARS (default) | tools/vision_client.py | Pixel coordinates | Fast single API call via OpenRouter |
| OmniParser | tools/omniparser_local.py | 0–1 normalized bbox | Local-first, privacy-preserving, GPU/CPU adaptive |
| Gemini Computer Use | tools/gemini_computer_use.py | 0–999 normalized | Planning-only adapter with stateful history |
| Accessibility Tree | tools/ui_automation.py | Element-based (no coords) | Deterministic, best for native app UIs |
| Kimi Vision | tools/vision_client.py | Pixel coordinates | Alternative VLM via OpenRouter |
| Qwen Vision | tools/vision_client.py | Pixel coordinates | Alternative VLM via OpenRouter |
| Phi Vision | tools/vision_client.py | Pixel coordinates | Alternative VLM via OpenRouter |
| Molmo Vision | tools/vision_client.py | Pixel coordinates | Alternative VLM via OpenRouter |
| Holotron Vision | tools/vision_client.py | Pixel coordinates | Alternative VLM via OpenRouter |

Dual-Mode Vision

Grounding Mode (mode="grounding")

Returns element coordinates for GUI automation:

  • Bounding boxes around clickable elements
  • Text labels from OCR
  • Element types (button, text field, checkbox, etc.)

Understanding Mode (mode="understanding")

Returns a natural language description of the screen:

  • What's visible on screen
  • Layout and content summary
  • Used for context gathering before complex tasks
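The two modes share one entry point and differ only in their return shape. A minimal sketch of that contract, with illustrative field names and sample values rather than contop's actual API:

```python
def observe_screen(image, mode="grounding"):
    """Hypothetical dispatcher showing the dual-mode return shapes.

    Field names and sample values are illustrative assumptions."""
    if mode == "grounding":
        # Element list for GUI automation: bbox, OCR label, element type.
        return {
            "mode": "grounding",
            "elements": [
                {"bbox": (412, 308, 512, 340), "label": "Submit", "type": "button"},
            ],
        }
    if mode == "understanding":
        # Free-form description, used for context gathering before complex tasks.
        return {
            "mode": "understanding",
            "description": "A login form with two text fields and a Submit button.",
        }
    raise ValueError(f"unknown mode: {mode}")
```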

Coordinate Systems

Each backend uses a different coordinate system. The GUI automation layer normalizes everything to native screen coordinates:

| Backend | Raw Coordinates | Conversion |
| --- | --- | --- |
| UI-TARS | Pixel (screenshot-space) | _scale(x, y, sx, sy) |
| OmniParser | 0–1 normalized | x_pixel = bbox[0] * img_width, then scale |
| Gemini CU | 0–999 normalized | pixel = (normalized / 1000) * capture_dim |
| Accessibility | Element references | No coordinate conversion needed |

The _scale(x, y, sx, sy) function in gui_automation.py converts from screenshot-space (1280px max width) to native screen coordinates using scale factors from the frame context payload.
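The three conversions can be sketched as pure functions. `scale` mirrors the `_scale(x, y, sx, sy)` signature described above; the other two helper names are illustrative assumptions:

```python
def scale(x, y, sx, sy):
    """Screenshot-space (1280px max width) -> native screen pixels."""
    return round(x * sx), round(y * sy)

def from_unit_bbox(bbox, img_w, img_h, sx, sy):
    """OmniParser: 0-1 normalized (x0, y0, x1, y1) bbox ->
    native pixel coordinates of the box center."""
    x0, y0, x1, y1 = bbox
    cx = (x0 + x1) / 2 * img_w
    cy = (y0 + y1) / 2 * img_h
    return scale(cx, cy, sx, sy)

def from_gemini(nx, ny, cap_w, cap_h):
    """Gemini CU: 0-999 normalized -> capture-space pixels."""
    return round(nx / 1000 * cap_w), round(ny / 1000 * cap_h)
```

For a 2560×1440 screen captured at 1280×720 (sx = sy = 2.0), a screenshot-space point (640, 360) maps to the native center (1280, 720).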

OmniParser Details

OmniParser runs locally and adapts to available hardware:

| Mode | Features |
| --- | --- |
| GPU | Florence-2 captioning enabled, YOLO at imgsz=640 |
| CPU | Captioning disabled (~30s per pass is too slow), YOLO at imgsz=416 (2.4x speedup) |

Key thresholds: BOX_THRESHOLD=0.20, IOU_THRESHOLD=0.7, OCR_CONFIDENCE=0.8

Deduplication merges overlapping boxes (IoU > 0.8), preferring icon detections over OCR text.
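The dedup rule (drop boxes with IoU > 0.8, icon detections win over OCR text) can be sketched as follows. The function names and detection dict shape are illustrative assumptions, not OmniParser's actual code:

```python
def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def dedupe(detections, threshold=0.8):
    """Greedy merge: icon detections are considered first, so when an
    OCR box overlaps an icon box above the threshold, the icon wins."""
    kept = []
    for det in sorted(detections, key=lambda d: d["source"] != "icon"):
        if all(iou(det["bbox"], k["bbox"]) <= threshold for k in kept):
            kept.append(det)
    return kept
```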

Backend Fallback

Vision backends degrade gracefully:

  • UI-TARS → Returns None on failure (rate limit, timeout) → Agent falls back to OmniParser
  • OmniParser understanding mode → Bypasses to needs_llm_vision=True for direct LLM interpretation
  • Accessibility backend → Falls back to observe_screen + execute_gui when elements not found
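The common thread in these fallbacks is that a backend signals failure by returning None, and the caller moves down the chain. A minimal sketch of that pattern, with hypothetical backend stubs standing in for the real modules:

```python
def parse_with_fallback(image, backends):
    """Try each backend in order; None means failure
    (e.g. rate limit or timeout), so try the next one."""
    for backend in backends:
        result = backend(image)
        if result is not None:
            return result
    return None

# Hypothetical stubs: UI-TARS hits a rate limit, OmniParser answers.
def ui_tars(img):
    return None  # simulated rate-limit failure

def omniparser(img):
    return [{"label": "Submit", "type": "button"}]
```

With this chain, `parse_with_fallback(img, [ui_tars, omniparser])` returns OmniParser's result whenever UI-TARS fails.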

Hallucination Guardrails

Vision system prompts enforce:

  • No Fabricated Details — Report only what's visible on screen
  • Verify After execute_gui Type — Always re-observe the screen after typing to confirm the text was entered correctly

Related: Tool Layers · ADK Agent · Agent Execution