Headless browser scraping: Playwright vs Puppeteer in 2026
An opinionated comparison — Playwright vs Puppeteer vs newer alternatives. When each one wins, the bot-detection gap, and what production scraping infra actually looks like.
In 2026 the right default for headless browser scraping is Playwright. Puppeteer is still excellent and still actively maintained, but Playwright's auto-waiting model, multi-browser support, and stronger network interception API make it the safer choice for new projects. The interesting question is no longer Playwright vs Puppeteer; it's whether stock headless is enough or whether you need anti-detect tooling on top. For roughly 70% of scraping jobs, stock Playwright is fine. For the other 30% you need stealth plugins, residential proxies, and increasingly real browser farms.
We use Playwright across every scraping engagement at YAEL. This is what we've learned from running it in production against sites that don't want to be scraped.
The honest Playwright vs Puppeteer comparison
| | Playwright | Puppeteer |
|---|---|---|
| Maintained by | Microsoft | Chrome team |
| Browsers | Chromium, Firefox, WebKit | Chrome/Chromium, Firefox (no WebKit) |
| Auto-wait | Built-in, comprehensive | Manual in the classic API (locators are opt-in) |
| Network interception | Mature, full request/response control | Good but less ergonomic |
| Multiple contexts | Native, first-class | Workable |
| Selector engine | CSS, XPath, text, role, data-testid | CSS, XPath |
| Speed | Roughly tied | Roughly tied |
| Anti-bot detection | Detected by sophisticated sites | Same |
| Ecosystem | Strong, growing | Mature, stable |
Playwright wins on auto-wait and multi-browser. Puppeteer wins on size of the existing community. Both are detected by the same set of anti-bot platforms.
What Playwright auto-wait actually buys you
In Puppeteer, you write:
await page.click("button.submit");
await page.waitForSelector(".success-toast");
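const text = await page.$eval(".result", (el) => el.textContent);
In Playwright:
await page.click("button.submit");
const text = await page.locator(".result").textContent();
Playwright's locator waits for the element to be attached before reading it, and actions such as click additionally wait for the element to be actionable. The Puppeteer version throws intermittently when .result has not rendered by the time $eval runs, because the explicit wait targeted a different element. Playwright handles the wait transparently.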
This single difference cuts our flaky-scraper rate by roughly half. It's the biggest reason we recommend Playwright for new builds.
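To make concrete what auto-waiting removes, here is a minimal sketch of the polling helper that manual-wait scrapers tend to accumulate. It is not how Playwright implements auto-wait internally; `pollUntil`, its option names, and its defaults are hypothetical and exist only for illustration.

```typescript
// Hypothetical helper: poll a check until it yields a non-null value or time runs out.
// This is the retry logic that locator-based auto-waiting makes unnecessary.
async function pollUntil<T>(
  check: () => T | null | undefined | Promise<T | null | undefined>,
  { timeoutMs = 5_000, intervalMs = 100 }: { timeoutMs?: number; intervalMs?: number } = {},
): Promise<T> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const value = await check();
    if (value !== null && value !== undefined) return value;
    if (Date.now() >= deadline) {
      throw new Error(`pollUntil: condition not met within ${timeoutMs}ms`);
    }
    // Sleep before the next attempt so the loop does not spin.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

With Puppeteer's classic API you would wrap each lookup, for example `await pollUntil(() => page.$(".result"))`. Every call site needs its own timeout and interval, which is where the flakiness tends to creep in.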
When stock headless gets caught
Modern anti-bot platforms (Cloudflare Bot Management, DataDome, PerimeterX, Akamai Bot Manager, Kasada) detect headless browsers through:
- navigator.webdriver === true
- Missing or unusual plugins
- Canvas fingerprinting inconsistencies
- WebGL fingerprint mismatches
- Suspicious timing (no mouse movement, instantaneous clicks)
- Suspicious user-agent + IP combinations
- TLS fingerprint (JA3, JA4) mismatch with claimed browser
A stock chromium.launch({ headless: true }) trips most of these out of the box. We cover the full taxonomy in anti-bot defences: Cloudflare, DataDome, Akamai explained.
The stealth stack
For sites that bot-detect, you escalate:
- playwright-extra + puppeteer-extra-plugin-stealth: patches the most obvious detections (navigator.webdriver, plugin list, etc.). Free, fast, defeats the bottom 60% of detection.
- Residential or mobile proxies: IP reputation is a huge signal. Bright Data, Oxylabs, SmartProxy. ~$5-15 per GB. Defeats IP-based blocking.
- Real browser farm: services like Browserless, Browserbase, or self-hosted real Chrome with anti-detect profiles. Defeats canvas/WebGL fingerprinting that headless can't fake.
- Captcha solvers: last resort. 2Captcha, CapMonster, Anti-Captcha. Cents per solve.
We escalate one rung at a time and stop at whatever works. Most jobs end at the second rung: stealth plus residential proxies.
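The escalate-and-stop rule can be sketched as a small loop. This is an illustration under assumptions, not our production code: the rung names, the `Probe` type, and `cheapestWorkingRung` are hypothetical, and the probe stands in for whatever sample scrape you run against the target at each rung.

```typescript
// Rungs in escalation order; each rung is assumed to include everything below it.
const RUNGS = ["stock", "stealth", "residential-proxy", "browser-farm", "captcha-solver"] as const;
type Rung = (typeof RUNGS)[number];

// Hypothetical probe: runs a small sample scrape at the given rung and reports success.
type Probe = (rung: Rung) => Promise<boolean>;

// Try each rung in order and return the first whose probe succeeds, or null if none does.
async function cheapestWorkingRung(probe: Probe): Promise<Rung | null> {
  for (const rung of RUNGS) {
    if (await probe(rung)) return rung;
  }
  return null;
}
```

Running the probe against a handful of representative URLs, not just the homepage, gives a more honest answer, since detection is often stricter on search and listing pages.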
// Playwright + stealth + residential proxy
import { chromium } from "playwright-extra";
import stealth from "puppeteer-extra-plugin-stealth";

chromium.use(stealth());

const browser = await chromium.launch({
  proxy: {
    server: "http://proxy.brightdata.com:22225",
    username: process.env.BRIGHT_USER!,
    password: process.env.BRIGHT_PASS!,
  },
  headless: true,
});

const ctx = await browser.newContext({
  userAgent: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36",
  viewport: { width: 1440, height: 900 },
  locale: "en-GB",
  timezoneId: "Europe/London",
});
The user-agent, viewport, locale, and timezone need to match a plausible real browser. Mismatches are signals.
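One way to keep those four values from drifting apart is to bundle them into named profiles and pick a profile by the proxy's exit country. The sketch below assumes you know the exit country; `BrowserProfile`, `PROFILES`, and `profileForExitCountry` are hypothetical names, and the two profiles are illustrative values, not a vetted fingerprint set.

```typescript
// Hypothetical bundle of context options that should always change together.
interface BrowserProfile {
  userAgent: string;
  viewport: { width: number; height: number };
  locale: string;
  timezoneId: string;
}

// Illustrative profiles keyed by ISO country code of the proxy exit.
const PROFILES: Record<string, BrowserProfile> = {
  GB: {
    userAgent:
      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36",
    viewport: { width: 1440, height: 900 },
    locale: "en-GB",
    timezoneId: "Europe/London",
  },
  US: {
    userAgent:
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36",
    viewport: { width: 1920, height: 1080 },
    locale: "en-US",
    timezoneId: "America/New_York",
  },
};

// Fail loudly on an unknown country instead of silently sending a mismatched profile.
function profileForExitCountry(country: string): BrowserProfile {
  const profile = PROFILES[country.toUpperCase()];
  if (!profile) throw new Error(`No browser profile defined for exit country "${country}"`);
  return profile;
}
```

The returned object has the same field names as the context options used above, so it can be passed straight to `browser.newContext(profileForExitCountry("GB"))`.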
Network interception — Playwright's quiet win
Playwright's request routing is one of its strongest features. Block images, media, and fonts, return mock responses for ads, capture XHR payloads:
// Block heavy resources to speed up scraping
await page.route("**/*", (route) => {
  const t = route.request().resourceType();
  if (t === "image" || t === "media" || t === "font") return route.abort();
  return route.continue();
});

// Capture API responses while navigating
const responses: unknown[] = [];
page.on("response", async (res) => {
  if (res.url().includes("/api/products")) {
    responses.push(await res.json());
  }
});

await page.goto("https://example.com/products");
For sites that load data via XHR, intercepting the API response is often faster and more reliable than scraping the rendered HTML.
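Mocking ad requests follows the same routing pattern. The sketch below is written against a minimal structural subset of Playwright's route and request objects (`RouteLike`, `RequestLike`) so it stays self-contained; the ad host list is a made-up placeholder, and `handleRoute` is a hypothetical name.

```typescript
// Minimal structural subset of Playwright's Request and Route used by this sketch.
interface RequestLike {
  url(): string;
  resourceType(): string;
}
interface RouteLike {
  request(): RequestLike;
  abort(): Promise<void>;
  continue(): Promise<void>;
  fulfill(response: { status?: number; contentType?: string; body?: string }): Promise<void>;
}

// Hypothetical ad hosts; replace with whatever the target page actually loads.
const AD_HOSTS = ["ads.example.com", "tracker.example.net"];
const HEAVY_TYPES = ["image", "media", "font"];

async function handleRoute(route: RouteLike): Promise<void> {
  const req = route.request();
  const url = req.url();
  // Answer ad requests locally with an empty response so the page never waits on them.
  if (AD_HOSTS.some((host) => url.includes(host))) {
    return route.fulfill({ status: 204, body: "" });
  }
  // Drop heavy resources outright.
  if (HEAVY_TYPES.includes(req.resourceType())) return route.abort();
  return route.continue();
}
```

Registering it would look like `await page.route("**/*", handleRoute)`. Fulfilling rather than aborting ad requests is a judgment call: some pages retry or log errors on aborted requests, while an empty success response usually passes unnoticed.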
Browser per request vs persistent contexts
Two patterns.
Pattern A — fresh browser per job. Slowest but cleanest. No state leaks between jobs. Highest detection resistance because every job looks like a fresh user.
Pattern B — persistent context, reused across jobs. Faster. Useful for site-specific scrapers where you can amortize the cookie/session warm-up. Risk: state leak if you forget to clear cookies between distinct targets.
For most production scraping we run pattern A with a hot pool of browsers (5-10 warm Playwright instances waiting for jobs). The launch cost is amortized without the state-leak risk, provided each job opens a fresh context on the browser it is handed.
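// Hot-pool pattern: browsers are pre-launched; each job should open its own
// context with browser.newContext() and close it before calling release()
import { chromium, type Browser } from "playwright";

class BrowserPool {
  private pool: Browser[] = [];
  private size = 0;

  async init(size: number) {
    this.size = size;
    this.pool = await Promise.all(
      Array.from({ length: size }, () => chromium.launch()),
    );
  }

  async acquire() {
    // Fall back to a cold launch when the pool is empty
    const browser = this.pool.pop() ?? (await chromium.launch());
    return {
      browser,
      release: async () => {
        // Return healthy browsers to the pool; close surplus or crashed ones
        if (browser.isConnected() && this.pool.length < this.size) this.pool.push(browser);
        else await browser.close();
      },
    };
  }
}
The cost model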
A short cost-per-scrape comparison:
| Setup | Cost / 1k scrapes |
|---|---|
| Stock Playwright, datacenter proxies | ~$0.20 |
| Playwright + stealth + residential proxies | ~$2-5 |
| Browser farm (Browserless / Browserbase) | ~$5-15 |
| Real browser + manual captcha solving | ~$20+ |
Plan your job for the lowest tier that works. Don't pay browser farm prices for sites that fall to stealth + residential.
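To see what the tiers mean at volume, the per-1k figures can be turned into a monthly estimate. This is plain arithmetic on the table's rough numbers; `Tier`, `TIERS`, and `monthlyCost` are hypothetical names, and a `null` upper bound stands for the open-ended "$20+" row.

```typescript
interface Tier {
  name: string;
  lowPer1k: number;
  highPer1k: number | null; // null means no stated upper bound
}

// Figures copied from the table above (USD per 1,000 scrapes).
const TIERS: Tier[] = [
  { name: "Stock Playwright, datacenter proxies", lowPer1k: 0.2, highPer1k: 0.2 },
  { name: "Playwright + stealth + residential proxies", lowPer1k: 2, highPer1k: 5 },
  { name: "Browser farm (Browserless / Browserbase)", lowPer1k: 5, highPer1k: 15 },
  { name: "Real browser + manual captcha solving", lowPer1k: 20, highPer1k: null },
];

// Scale a tier's per-1k range to a monthly volume.
function monthlyCost(tier: Tier, scrapesPerMonth: number): { low: number; high: number | null } {
  const thousands = scrapesPerMonth / 1000;
  return {
    low: tier.lowPer1k * thousands,
    high: tier.highPer1k === null ? null : tier.highPer1k * thousands,
  };
}
```

At 500,000 scrapes a month, the stock tier comes to about $100, the stealth plus residential tier to $1,000-2,500, and a browser farm to $2,500-7,500, which is why it pays to confirm the cheaper tier really fails before moving up.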
What about playwright-recorder / codegen?
Useful for prototyping a scraper interactively. playwright codegen example.com opens a browser, records your clicks as Playwright code. We use it for the initial pass on a new target site, then refactor the generated code.
pnpm dlx playwright codegen --target javascript https://example.com
Do not ship the codegen output as production code. It tends to emit overly specific selectors (a text=Submit match in the wrong place, for example) that break on minor UI changes. Refactor into named selectors and add explicit waits.
Newer alternatives worth knowing
A short list:
- Browserless — managed Chrome with anti-detect, captcha solving built in. Pay per second of browser time.
- Browserbase — newer entrant, simpler API, well-funded.
- Camoufox — a fork of Firefox with anti-detect baked in. Open source. Useful when you need to look exactly like Firefox.
- Apify — fully managed scraping platform with built-in proxy rotation. Good for non-engineering teams.
For most engineering-led teams, Playwright self-hosted with proxies is the best cost-quality trade. The managed services are right when ops cost matters more than per-scrape cost.
What we ship by default
For a typical scraping engagement at YAEL:
- Playwright + playwright-extra + stealth plugin
- BullMQ queue with rate limits per target domain
- Residential proxies for any site that has bot detection
- Per-target adapter modules (one folder per site, isolated selectors)
- Snapshot tests on the parsing layer (save HTML, parse, assert)
- Daily smoke runs that catch site changes before customers notice
We can describe a typical scrape build in three pages. Most of the production complexity is in the operational layer — queue, retries, observability — not in the scraping code itself.
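As one example of that operational layer, per-domain rate limiting boils down to tracking when each domain is next allowed a request. The sketch below is an in-process illustration only; in the stack above the queue enforces this across workers. `DomainPacer` and `reserve` are hypothetical names.

```typescript
// In-process sketch of per-domain pacing with a fixed minimum gap between requests.
class DomainPacer {
  private nextFreeAt = new Map<string, number>();
  private minIntervalMs: number;

  constructor(minIntervalMs: number) {
    this.minIntervalMs = minIntervalMs;
  }

  // Reserve the next slot for `domain` and return how many ms the caller should wait.
  // `now` is injectable so the logic can be tested without real timers.
  reserve(domain: string, now: number = Date.now()): number {
    const freeAt = Math.max(this.nextFreeAt.get(domain) ?? now, now);
    this.nextFreeAt.set(domain, freeAt + this.minIntervalMs);
    return freeAt - now;
  }
}
```

A worker would call `reserve`, sleep for the returned delay, then issue the request. Keeping a separate slot per domain means a slow, strictly limited target never holds up jobs for other sites.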
FAQ
Is scraping legal?
Depends on what you scrape, where you are, and what the site's terms say. Generally: public data without bypassing technical controls = grey area. Behind a login or paywall = much riskier. Always read the site's robots.txt and ToS. We are not your lawyer.
Can I scrape JavaScript-rendered sites with fetch?
If you can reverse-engineer the API the page calls, yes — and it's much faster than headless. Always check Network tab first. Headless is the fallback when the API isn't usable.
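A minimal sketch of the API-first approach, assuming the page loads its data from a JSON endpoint you found in the Network tab. The endpoint shape (`/api/products?page=N` returning `{ items: [...] }`) and the names `FetchLike` and `fetchProducts` are assumptions for illustration; the fetch implementation is passed in so the function can be exercised without network access.

```typescript
// Structural type for the parts of fetch this sketch uses; the global fetch satisfies it.
type FetchLike = (
  url: string,
  init?: { headers?: Record<string, string> },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Hypothetical endpoint: GET {baseUrl}/api/products?page=N returning { items: [...] }.
async function fetchProducts(baseUrl: string, page: number, fetchImpl: FetchLike): Promise<unknown[]> {
  const res = await fetchImpl(`${baseUrl}/api/products?page=${page}`, {
    headers: { accept: "application/json" },
  });
  // A non-2xx status usually means auth, bot detection, or a changed endpoint.
  if (!res.ok) throw new Error(`API returned HTTP ${res.status}; consider falling back to headless`);
  const body = (await res.json()) as { items?: unknown[] };
  return body.items ?? [];
}
```

In production you would call it as `await fetchProducts("https://example.com", 1, fetch)`. Many real endpoints also expect cookies or tokens that the page sets first, in which case a single headless visit to collect them followed by plain fetch calls is a common middle ground.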
What's the cheapest scraper for low volume?
fetch + cheerio for static HTML. Playwright with stock residential proxies for JS-rendered sites. Everything else is overkill until you hit detection.
Do I need to use a captcha solver?
Only if your target site presents captchas. Most don't until they detect you. Get caught less and you won't need a solver. If you do, 2Captcha at ~$1 per 1k reCAPTCHA v2 solves is the cheapest production option.
What about Selenium?
Don't pick Selenium for a new project. Slower, older, more detectable. Playwright covers everything Selenium does and more.
Can I run Playwright on Vercel / Cloudflare Workers?
Vercel: yes, but slow. Workers: not directly — use Browserless or similar. For sustained scraping workloads, run Playwright on a long-lived VM or container.
How do I keep selectors stable when the site redesigns?
Wherever possible, prefer semantic selectors — getByRole, getByText, getByLabel — over CSS class names. Class names change every redeploy. Semantics rarely change.
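As a sketch of what that looks like in an adapter, the function below drives a search form purely through label and role lookups. It is typed against a small structural subset of Playwright's page and locator objects (`PageLike`, `LocatorLike`) to stay self-contained; the "Search" label, the button name, and `searchFor` are hypothetical.

```typescript
// Minimal structural subset of Playwright's Locator and Page used by this sketch.
interface LocatorLike {
  fill(value: string): Promise<void>;
  click(): Promise<void>;
}
interface PageLike {
  getByRole(role: "button" | "link" | "textbox", options?: { name?: string | RegExp }): LocatorLike;
  getByLabel(text: string | RegExp): LocatorLike;
}

// Fill the search box by its accessible label and submit via the button's accessible name.
// No CSS classes are involved, so a restyle that keeps the same labels does not break it.
async function searchFor(page: PageLike, query: string): Promise<void> {
  await page.getByLabel("Search").fill(query);
  await page.getByRole("button", { name: /search/i }).click();
}
```

Semantic selectors still break when copy changes, so a case-insensitive pattern for the button name is a little more forgiving than an exact string.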
What about LLM-based scraping?
Useful for one-off extractions on unstructured pages — give Claude the HTML and ask for structured data. Expensive at scale. We use it for the long-tail "we need data from 500 different sites, each with different HTML" case where building 500 adapters isn't economic.