
Every AI Browser Tool Is Broken Except One

Aria · February 20, 2026 · 12 min read

I tested Playwright, playwright-cli, OpenClaw's browser tool, and our own tappi on real tasks. Only one went 3/3 with correct data at a reasonable token cost, and it wasn't close.

Playwright couldn't log into Gmail. playwright-cli got CAPTCHA'd by Reddit on the first page. OpenClaw's browser tool burned 252K tokens doing what tappi did in 59K. And Playwright "scripted" its way to wrong answers on 4 out of 5 Reddit posts without even knowing.

4 AI agents. 4 browser tools. 3 real-world tasks. Same model (Claude Sonnet 4.6), same thinking level, same instructions.

The Scorecard

| | 🔹 tappi | 🔸 Browser Tool | 🔷 Playwright | 🔶 playwright-cli |
|---|---|---|---|---|
| Success Rate | 🟢 3/3 | 🟢 3/3 | 🟡 1/3* | 🔴 1/3 |
| Total Context | 59K | 252K | 44K | 52K |
| Total Time | 4m 13s | 8m 38s | 3m 42s | 3m 36s |
| Auth Tasks | ✅ | ✅ | ❌ | ❌ |
| Bot Detection | ✅ | ✅ | ✅ | ❌ |
| Shadow DOM | ✅ | ⚠️ Workaround | N/A | N/A |
| Data Quality | ⭐ High | ⭐ High | ⚠️ Low | N/A |
| Verdict | 🏆 Best overall | Reliable but heavy | Cheap but brittle | Too limited |

*Playwright's Reddit "success" returned automod bot comments instead of actual top comments on 4/5 posts, which makes the result functionally incorrect.

Task 1: Reddit Data Extraction

Navigate to r/LocalLLaMA, find top 5 posts from the past week, extract title, upvotes, and top comment for each.

  • tappi opened the subreddit, ran JavaScript to pull all titles and upvotes in one shot, visited each post, evaluated comment scores via DOM, and deliberately skipped automod bot comments. 8 tool calls. Done in under 2 minutes.
  • Browser tool followed the same strategy, but each page produced a full ARIA tree: tens of thousands of tokens. Same quality, 5.6x the cost.
  • Playwright wrote a script using old.reddit.com but blindly grabbed the first comment on each post โ€” automod bot on 4 of 5. No way to inspect and adjust.
  • playwright-cli never got past the front door. Reddit detected headless Chrome and served a visual reCAPTCHA.

| Tool | Context | Time | Result |
|---|---|---|---|
| 🔹 tappi | 21K | 1m 52s | ✅ Correct data |
| 🔸 Browser tool | 118K | 3m 00s | ✅ Correct, massive token cost |
| 🔷 Playwright | 14K | 1m 02s | ⚠️ Wrong data (bot comments) |
| 🔶 playwright-cli | 21K | 2m 22s | ❌ CAPTCHA blocked |
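
The difference between "first comment" and "top comment" is easy to see in code. Below is a minimal sketch of the selection step, assuming the comments have already been scraped into dictionaries. The field names (`author`, `score`, `body`), the `BOT_AUTHORS` set, and the sample data are all hypothetical and are not tappi's actual output format.

```python
from typing import Optional

# Hypothetical list of bot accounts to ignore, compared in lowercase.
BOT_AUTHORS = {"automoderator"}


def top_human_comment(comments: list[dict]) -> Optional[dict]:
    """Return the highest-scored comment not written by a known bot."""
    humans = [c for c in comments if c["author"].lower() not in BOT_AUTHORS]
    # default=None covers posts where only bots have commented.
    return max(humans, key=lambda c: c["score"], default=None)


# Made-up sample in page order: the bot comment comes first,
# which is the comment a "grab the first one" script would return.
comments = [
    {"author": "AutoModerator", "score": 1, "body": "Please read the rules."},
    {"author": "user_a", "score": 42, "body": "Works on my machine."},
    {"author": "user_b", "score": 310, "body": "Benchmarks or it didn't happen."},
]

first = comments[0]
best = top_human_comment(comments)
print(first["author"], "vs", best["author"])
```

Taking `comments[0]` returns the bot comment here, while filtering and ranking returns the comment a reader would call the top one.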

Task 2: Google Maps Lead Generation

Search for "plumbers in Houston TX" and extract top 5 results with name, rating, phone, address.

All four tools succeeded here. Google Maps is the great equalizer: single-page extraction on a site that doesn't aggressively block bots.

| Tool | Context | Time | Result |
|---|---|---|---|
| 🔹 tappi | 16K | 59s | ✅ 3 commands |
| 🔸 Browser tool | 21K | 38s | ✅ Single snapshot |
| 🔷 Playwright | 18K | 2m 34s | ✅ Works, slow |
| 🔶 playwright-cli | 20K | 42s | ✅ Elegant |

The insight: When everything's on one page, tool differences shrink. The real differentiation happens on multi-step, interactive tasks, which is most real-world agent work.

Task 3: Gmail (Send an Email)

Navigate to Gmail, compose, add two recipients, fill subject/body, send.

  • tappi navigated to Gmail (already signed in), clicked Compose, typed recipients, filled subject/body, clicked Send. Shadow DOM compose dialog? Pierced right through. 8 tool calls, 82 seconds.
  • Browser tool hit a wall: Gmail's floating compose dialog is invisible to the ARIA tree. After 5 minutes and 113K tokens of workarounds, it found Gmail's URL-based compose form. Email sent, but painfully.
  • Playwright and playwright-cli both launched fresh browsers. Google redirected to sign-in. No cookies. No session. Done in 30 seconds. Failed.

| Tool | Context | Time | Result |
|---|---|---|---|
| 🔹 tappi | 22K | 1m 22s | ✅ Email sent |
| 🔸 Browser tool | 113K | 5m 35s | ✅ Workaround needed |
| 🔷 Playwright | 12K | 26s | ❌ No auth |
| 🔶 playwright-cli | 11K | 32s | ❌ No auth |
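
For readers wondering what a "URL-based compose form" looks like: the benchmark logs don't record the exact URL the Browser tool landed on, so the sketch below is an assumption. It uses the commonly seen `view=cm` query pattern on `mail.google.com`; the parameter names (`to`, `su`, `body`) and the helper `gmail_compose_url` are illustrative, and Google may change this behavior.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed base URL and parameters; not taken from the benchmark run.
GMAIL_COMPOSE_BASE = "https://mail.google.com/mail/"


def gmail_compose_url(to: list[str], subject: str, body: str) -> str:
    """Build a prefilled compose URL (hypothetical helper)."""
    params = {
        "view": "cm",         # open the compose view
        "fs": "1",            # full-page compose instead of the floating dialog
        "to": ",".join(to),   # comma-separated recipients
        "su": subject,
        "body": body,
    }
    # urlencode percent-escapes "@", "," and spaces for us.
    return GMAIL_COMPOSE_BASE + "?" + urlencode(params)


url = gmail_compose_url(
    ["a@example.com", "b@example.com"],
    "Benchmark test",
    "Hello from the agent.",
)
print(url)

# Round-trip check: decoding the query gives back the original fields.
decoded = parse_qs(urlsplit(url).query)
```

A full-page form like this sidesteps the floating dialog entirely, which would explain why it worked as a fallback. It still requires a signed-in session.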

The Big Picture

tappi: the only tool to complete every task, with correct data, at reasonable token cost.

59K total tokens vs. 252K for the next-closest successful tool. That's 4.3x more efficient, and tappi didn't need any workarounds.
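
The totals and ratios follow directly from the per-task context numbers in the three task tables. A short sketch of the arithmetic (values in thousands of tokens, copied from those tables; the variable names are mine):

```python
# Context per task in thousands of tokens, from the per-task tables.
context_k = {
    "tappi":          {"reddit": 21,  "maps": 16, "gmail": 22},
    "browser tool":   {"reddit": 118, "maps": 21, "gmail": 113},
    "playwright":     {"reddit": 14,  "maps": 18, "gmail": 12},
    "playwright-cli": {"reddit": 21,  "maps": 20, "gmail": 11},
}

# Sum each tool's three tasks to get its total context.
totals = {tool: sum(tasks.values()) for tool, tasks in context_k.items()}

# Overall and Reddit-only cost of the Browser tool relative to tappi.
overall_ratio = totals["browser tool"] / totals["tappi"]
reddit_ratio = context_k["browser tool"]["reddit"] / context_k["tappi"]["reddit"]

for tool, total in totals.items():
    print(f"{tool}: {total}K")
print(f"overall: {overall_ratio:.1f}x, reddit: {reddit_ratio:.1f}x")
```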

Two fault lines exposed:

  1. Persistent sessions are non-negotiable. Without them, you can't access any authenticated service.
  2. Shadow DOM piercing matters. Gmail's compose dialog is invisible to accessibility-tree-based tools.
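
On the first point, here is a minimal sketch of what a persistent session can look like with Playwright's Python API, using `launch_persistent_context` so cookies and logins survive between runs. This is not how the benchmark was configured, and it is not tappi's implementation. The profile directory layout and the helper names are hypothetical, and there is no guarantee that Google will accept a sign-in from an automated browser.

```python
from pathlib import Path
from typing import Optional


def profile_dir(name: str, root: Optional[Path] = None) -> Path:
    """Return a stable on-disk location for a named browser profile.

    The default root is a hypothetical directory in the user's home.
    """
    base = root if root is not None else Path.home() / ".agent-browser-profiles"
    return base / name


def open_with_profile(url: str, name: str = "default") -> str:
    """Open `url` in a browser that reuses the named profile; return the page title."""
    # Imported lazily so this module loads even without Playwright installed.
    from playwright.sync_api import sync_playwright

    user_data_dir = profile_dir(name)
    user_data_dir.mkdir(parents=True, exist_ok=True)

    with sync_playwright() as p:
        # Cookies and local storage are kept in user_data_dir across runs,
        # unlike a fresh browser launch, which starts with an empty profile.
        context = p.chromium.launch_persistent_context(
            str(user_data_dir), headless=False
        )
        page = context.pages[0] if context.pages else context.new_page()
        page.goto(url)
        title = page.title()
        context.close()
        return title
```

With this setup you would sign in manually once, then later calls such as `open_with_profile("https://mail.google.com", "gmail")` start from the saved profile instead of a blank one.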

Try It

pip install tappi

Full benchmark breakdown on dev.to · GitHub · tappi.synthworx.com
