Snapshots

Playwright MCP works primarily from accessibility snapshots rather than screenshots. Tools that interact with the page return a structured tree of accessible elements, with refs that identify each element for later actions.

Snapshot format

- heading "todos" [level=1]
- textbox "What needs to be done?" [ref=e5]
- listitem:
  - checkbox "Toggle Todo" [ref=e10]
  - text: "Buy groceries"
- listitem:
  - checkbox "Toggle Todo" [ref=e14]
  - text: "Water flowers"
- contentinfo:
  - text: "2 items left"
  - link "All" [ref=e20]
  - link "Active" [ref=e21]
  - link "Completed" [ref=e22]

Each interactive element gets a unique ref (e.g., ref=e5). The LLM uses these refs to interact:

browser_type   { ref: "e5", text: "Buy milk" }      → type into the new-todo textbox
browser_click  { ref: "e10" }                       → check the checkbox
browser_click  { ref: "e20" }                       → click the "All" link
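
As an illustration, a client could pull the refs out of the snapshot text with a regular expression and build the arguments for a tool call. This is a minimal sketch that assumes the line format shown above; `parse_refs` and `Element` are hypothetical names, not part of Playwright MCP, and the argument dict simply mirrors the shape of the calls listed above.

```python
import re
from typing import NamedTuple, Optional

# Matches lines such as:  - checkbox "Toggle Todo" [ref=e10]
# Lines without a [ref=...] attribute (headings, plain text) do not match.
LINE_RE = re.compile(
    r'^\s*-\s+(?P<role>\w+)(?:\s+"(?P<name>[^"]*)")?.*?\[ref=(?P<ref>e\d+)\]'
)


class Element(NamedTuple):
    role: str
    name: Optional[str]
    ref: str


def parse_refs(snapshot: str) -> list[Element]:
    """Return every element in the snapshot text that carries a ref."""
    elements = []
    for line in snapshot.splitlines():
        m = LINE_RE.match(line)
        if m:
            elements.append(Element(m["role"], m["name"], m["ref"]))
    return elements


# A shortened copy of the snapshot from this section.
SNAPSHOT = '''\
- heading "todos" [level=1]
- textbox "What needs to be done?" [ref=e5]
- listitem:
  - checkbox "Toggle Todo" [ref=e10]
  - text: "Buy groceries"
- contentinfo:
  - link "All" [ref=e20]
'''

elements = parse_refs(SNAPSHOT)

# Pick the first checkbox and build arguments in the shape shown above.
checkbox = next(e for e in elements if e.role == "checkbox")
click_args = {"ref": checkbox.ref}
```

The heading and text lines are skipped because they carry no ref, which matches the rule that only interactive elements are addressable.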

Element refs

Refs are stable within a single snapshot — the same element always has the same ref until the page changes. After navigation or DOM updates, the tool returns a fresh snapshot with new refs.

Property	Detail
Format	`e` followed by a number (e.g., `e1`, `e15`, `e203`)
Scope	Unique within a single snapshot
Lifetime	Valid until the next page change
Assignment	Only interactive elements get refs (buttons, links, inputs, etc.)
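
The scope and lifetime rules in the table can be sketched as a small tracker that accepts only refs from the most recent snapshot. This is an illustrative sketch, not Playwright MCP code: `SnapshotRefs` is a hypothetical helper, and it assumes refs appear in the snapshot text as `[ref=eN]`.

```python
import re

REF_FORMAT = re.compile(r"e\d+")           # `e` followed by a number
REF_IN_SNAPSHOT = re.compile(r"\[ref=(e\d+)\]")


class SnapshotRefs:
    """Remembers the refs of the latest snapshot and nothing older."""

    def __init__(self) -> None:
        self._refs: set[str] = set()

    def update(self, snapshot: str) -> None:
        # A new snapshot replaces the old ref set entirely.
        self._refs = set(REF_IN_SNAPSHOT.findall(snapshot))

    def is_valid(self, ref: str) -> bool:
        # Valid means: well-formed and present in the latest snapshot.
        return REF_FORMAT.fullmatch(ref) is not None and ref in self._refs


tracker = SnapshotRefs()
tracker.update('- button "Save" [ref=e3]')
ok_before = tracker.is_valid("e3")

# Hypothetical snapshot returned after the page changed.
tracker.update('- button "Save" [ref=e7]')
ok_after = tracker.is_valid("e3")
```

Checking a ref against the latest snapshot before sending it lets a client catch stale refs locally instead of discovering them through a failed tool call.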

On-demand snapshots

Use browser_snapshot to capture the page state on demand. Most tools also return a snapshot automatically after each action, so the LLM always has up-to-date page state.
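
For readers wiring up their own client, the request for an on-demand snapshot might look like the following. This sketch assumes the standard MCP JSON-RPC 2.0 `tools/call` framing and that `browser_snapshot` needs no arguments; `tool_call_request` is a hypothetical helper, and transport (stdio or HTTP) is left out.

```python
import json
from typing import Any, Optional


def tool_call_request(
    name: str,
    arguments: Optional[dict[str, Any]] = None,
    request_id: int = 1,
) -> str:
    """Serialize an MCP tools/call request for the named tool."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments or {}},
    }
    return json.dumps(message)


# Ask the server for a fresh snapshot of the current page.
snapshot_request = tool_call_request("browser_snapshot")
```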

Snapshots with screenshots

For pages where visual context matters (canvas apps, charts, image-heavy layouts), combine snapshots with screenshots:

Take a snapshot and a screenshot of the current page.

The LLM gets both the structured accessibility tree for interaction and the visual screenshot for understanding layout. See Vision Mode for coordinate-based interaction using screenshots.

Why snapshots over screenshots

Aspect	Snapshots	Screenshots
Token cost	~200-400 tokens	~3000-5000 tokens (vision model)
Precision	Exact — refs point to specific elements	Approximate — requires coordinate guessing
Speed	Instant — text parsing	Slower — vision model inference
Reliability	Deterministic — same structure = same interaction	Variable — layout changes break coordinates
Vision model	Not required	Required
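
The token gap compounds over a session. As a rough worked example, the sketch below multiplies the approximate per-observation figures from the table by a hypothetical 20-step session; actual snapshot size depends on the page, so treat the numbers as an order-of-magnitude estimate.

```python
# Approximate (low, high) tokens per observation, taken from the table above.
SNAPSHOT_TOKENS = (200, 400)
SCREENSHOT_TOKENS = (3000, 5000)


def session_cost(per_step: tuple[int, int], steps: int) -> tuple[int, int]:
    """Return the (low, high) token total for a session of `steps` observations."""
    low, high = per_step
    return (low * steps, high * steps)


STEPS = 20  # hypothetical session length

snapshot_total = session_cost(SNAPSHOT_TOKENS, STEPS)      # (4000, 8000)
screenshot_total = session_cost(SCREENSHOT_TOKENS, STEPS)  # (60000, 100000)
```

Under these assumptions, a 20-step session costs roughly 4,000 to 8,000 tokens with snapshots and 60,000 to 100,000 tokens with screenshots.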

Best practices

Use refs, not selectors — refs from snapshots are more reliable than CSS selectors because they point to the exact element the LLM just saw
Re-snapshot after navigation — refs are invalidated when the page changes
Combine with screenshots — when visual context is needed alongside structured data
Check for dialogs — if a tool reports a dialog is open, handle it before proceeding with other actions
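
One way to follow the re-snapshot practice is to remember an element by role and name and look its ref up again in each fresh snapshot, instead of reusing the old ref. The sketch below assumes the snapshot line format shown earlier; `find_ref` is a hypothetical helper, and the `after` snapshot is made-up data standing in for what a tool might return once the page changes.

```python
import re
from typing import Optional


def find_ref(snapshot: str, role: str, name: str) -> Optional[str]:
    """Return the ref of the first element with this role and name, if any."""
    pattern = re.compile(
        r'^\s*-\s+' + re.escape(role) + r'\s+"' + re.escape(name)
        + r'"[^\n]*?\[ref=(e\d+)\]',
        re.MULTILINE,
    )
    m = pattern.search(snapshot)
    return m.group(1) if m else None


before = '- link "Active" [ref=e21]\n- link "Completed" [ref=e22]'
# Hypothetical snapshot after a page change: same elements, new refs.
after = '- link "Active" [ref=e40]\n- link "Completed" [ref=e41]'

old_ref = find_ref(before, "link", "Completed")
new_ref = find_ref(after, "link", "Completed")
```

If the lookup returns `None`, the element is no longer on the page, which is a signal to re-read the snapshot rather than retry the old action.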
