Architecture
Overview
┌──────────────┐
│ Webtask │ Browser lifecycle management
└──────┬───────┘
│ creates
↓
┌──────────────┐
│ Agent │ Task execution with LLM (text/visual/full mode)
└──────┬───────┘
│ uses
↓
┌──────────────┐
│ TaskRunner │ Executes steps with tools
└──────┬───────┘
│ controls
↓
┌──────────────┐
│AgentBrowser │ Page management, context building, coordinate scaling
└──────────────┘
Components
Webtask - Manages browser lifecycle, creates agents
Agent - Main interface with do(), verify(), extract() methods. Supports three modes.
TaskRunner - Executes tasks by calling LLM with available tools
AgentBrowser - Manages pages, builds context (DOM and/or screenshots), scales coordinates
Three Modes
- text - DOM-based tools (click, fill, type by element ID)
- visual - Pixel-based tools (click_at, type_text_at by coordinates)
- full - Both DOM and pixel tools
How it Works
- User calls
agent.do("task description") - TaskRunner builds context based on mode (DOM, screenshot, or both)
- LLM decides which tools to call
- TaskRunner executes tools via AgentBrowser
- Repeat until task complete or max steps reached
Tools
Common tools (all modes):
- goto - Navigate to URL
- wait - Wait for time
- go_back / go_forward - Browser history
- complete_work / abort_work - Task completion
Text mode tools:
- click - Click element by ID
- fill - Fill form field by ID
- type - Type into element by ID
Visual mode tools:
- click_at - Click at coordinates
- type_text_at - Type at coordinates
- hover_at - Hover at coordinates
- scroll_at - Scroll at coordinates
- drag_and_drop - Drag between coordinates
Stateful Mode
When stateful=True (default), Agent maintains conversation history across do() calls, allowing context to carry over between tasks.