Headless browser scraping: Playwright vs Puppeteer in 2026

An opinionated comparison — Playwright vs Puppeteer vs newer alternatives. When each one wins, the bot-detection gap, and what production scraping infra actually looks like.

YAEL Engineering · 14 Feb 2026 · 8 min read · 1,563 words

In 2026 the right default for headless browser scraping is Playwright. Puppeteer is still excellent and still actively maintained, but Playwright's auto-waiting model, multi-browser support, and stronger network interception API make it the safer choice for new projects. The interesting question is no longer Playwright vs Puppeteer; it is whether stock headless is enough or whether you need anti-detect tooling on top. For ~70% of scraping jobs stock Playwright is fine. For the other 30% you need stealth plugins, residential proxies, and, increasingly, real browser farms.

We use Playwright across every scraping engagement at YAEL. This is what we've learned from running it in production against sites that don't want to be scraped.

The honest Playwright vs Puppeteer comparison

| | Playwright | Puppeteer |
|---|---|---|
| Maintained by | Microsoft | Chrome team |
| Browsers | Chromium, Firefox, WebKit | Chromium only (officially) |
| Auto-wait | Built-in, comprehensive | Manual |
| Network interception | Mature, full request/response control | Good but less ergonomic |
| Multiple contexts | Native, first-class | Workable |
| Selector engine | CSS, XPath, text, role, data-testid | CSS, XPath |
| Speed | Roughly tied | Roughly tied |
| Anti-bot detection | Detected by sophisticated sites | Same |
| Ecosystem | Strong, growing | Mature, stable |

Playwright wins on auto-wait and multi-browser. Puppeteer wins on size of the existing community. Both are detected by the same set of anti-bot platforms.

What Playwright auto-wait actually buys you

In Puppeteer, you write:

await page.click("button.submit");
await page.waitForSelector(".success-toast");
const text = await page.$eval(".result", (el) => el.textContent);

In Playwright:

await page.click("button.submit");
const text = await page.locator(".result").textContent();

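Playwright's locator auto-waits for the element to exist before reading it, and actions such as click wait for the target to be actionable. The Puppeteer version depends on you placing the right waitForSelector before every read; miss one and $eval throws whenever the page is still mid-render. Playwright handles this transparently.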

This single difference cuts our flaky-scraper rate by roughly half. It's the biggest reason we recommend Playwright for new builds.
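
To make the auto-wait idea concrete, here is a rough sketch of the poll-until-ready loop you would otherwise hand-roll around each read. It is not Playwright's implementation; `waitFor`, its option names, and the default values are hypothetical and chosen only for illustration.

```typescript
// Hypothetical helper: poll a check until it yields a value or the timeout expires.
// This only illustrates the idea behind auto-waiting; it is not Playwright's implementation.
interface WaitOptions {
  timeoutMs?: number;  // give up after this long
  intervalMs?: number; // pause between polls
}

async function waitFor<T>(
  check: () => T | null | undefined | Promise<T | null | undefined>,
  options: WaitOptions = {},
): Promise<T> {
  const timeoutMs = options.timeoutMs ?? 5000;
  const intervalMs = options.intervalMs ?? 50;
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const value = await check();
    if (value !== null && value !== undefined) return value;
    if (Date.now() >= deadline) {
      throw new Error(`waitFor: condition not met within ${timeoutMs}ms`);
    }
    // Sleep before the next poll.
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Every read in a manual-wait codebase needs something like this wrapped around it, which is where the flakiness creeps in when one call site is missed.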

When stock headless gets caught

Modern anti-bot platforms (Cloudflare Bot Management, DataDome, PerimeterX, Akamai Bot Manager, Kasada) detect headless browsers through:

navigator.webdriver === true
Missing or unusual plugins
Canvas fingerprinting inconsistencies
WebGL fingerprint mismatches
Suspicious timing (no mouse movement, instantaneous clicks)
Suspicious user-agent + IP combinations
TLS fingerprint (JA3, JA4) mismatch with claimed browser

A stock chromium.launch({ headless: true }) trips several of these out of the box. We cover the full taxonomy in anti-bot defences: Cloudflare, DataDome, Akamai explained.

The stealth stack

For sites that bot-detect, you escalate:

1. playwright-extra + puppeteer-extra-plugin-stealth: patches the most obvious detections (navigator.webdriver, plugin list, etc). Free, fast, defeats the bottom 60% of detection.
2. Residential or mobile proxies: IP reputation is a huge signal. Bright Data, Oxylabs, SmartProxy. ~$5-15 per GB. Defeats IP-based blocking.
3. Real browser farm: services like Browserless, Browserbase, or self-hosted real Chrome with anti-detect profiles. Defeats canvas/WebGL fingerprinting that headless can't fake.
4. Captcha solvers: last resort. 2Captcha, CapMonster, Anti-Captcha. Cents per solve.

We escalate one rung at a time and stop at whatever works. Most jobs end at level 2.
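
The "one rung at a time" rule can be sketched as a small loop over tiers ordered from cheapest to most expensive. The `Tier` and `EscalationResult` types and the `escalate` function are hypothetical names; each tier's `run` would wrap whatever fetch strategy that rung uses.

```typescript
// Hypothetical types: each tier wraps one way of fetching the page (stock, stealth, proxy, farm).
interface Tier<T> {
  name: string;
  run: () => Promise<T>;
}

interface EscalationResult<T> {
  tier: string;     // first tier that succeeded
  value: T;
  failed: string[]; // cheaper tiers that were tried and failed
}

// Try tiers in order (cheapest first) and stop at the first one that works.
async function escalate<T>(tiers: Tier<T>[]): Promise<EscalationResult<T>> {
  const failed: string[] = [];
  for (const tier of tiers) {
    try {
      const value = await tier.run();
      return { tier: tier.name, value, failed };
    } catch (err) {
      // A real scraper would log err here; this sketch only records which tier failed.
      failed.push(tier.name);
    }
  }
  throw new Error(`all tiers failed: ${failed.join(", ")}`);
}
```

Recording which tier finally worked per target is useful in practice: later jobs against the same site can start at that rung instead of paying for the failed attempts again.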

// Playwright + stealth + residential proxy
import { chromium } from "playwright-extra";
import stealth from "puppeteer-extra-plugin-stealth";

chromium.use(stealth());

const browser = await chromium.launch({
  proxy: {
    server: "http://proxy.brightdata.com:22225",
    username: process.env.BRIGHT_USER!,
    password: process.env.BRIGHT_PASS!,
  },
  headless: true,
});
const ctx = await browser.newContext({
  userAgent: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36",
  viewport: { width: 1440, height: 900 },
  locale: "en-GB",
  timezoneId: "Europe/London",
});

The user-agent, viewport, locale, and timezone need to match a plausible real browser. Mismatches are signals.
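
One way to avoid mismatches is to treat these four settings as a single bundle and never mix fields between bundles. The sketch below assumes a hypothetical `PROFILES` list and `pickProfile` helper; the first profile repeats the values used above, and the second is an illustrative example rather than a tested fingerprint.

```typescript
// Each profile is one internally consistent bundle; never mix fields across bundles.
// Field names match the context options used above; the second profile's values are illustrative.
interface BrowserProfile {
  userAgent: string;
  viewport: { width: number; height: number };
  locale: string;
  timezoneId: string;
}

const PROFILES: BrowserProfile[] = [
  {
    userAgent:
      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36",
    viewport: { width: 1440, height: 900 },
    locale: "en-GB",
    timezoneId: "Europe/London",
  },
  {
    userAgent:
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36",
    viewport: { width: 1920, height: 1080 },
    locale: "en-US",
    timezoneId: "America/New_York",
  },
];

// Pick a stable profile per key (for example, per target domain) so a given
// target always sees one coherent fingerprint.
function pickProfile(key: string): BrowserProfile {
  let hash = 0;
  for (let i = 0; i < key.length; i++) {
    hash = (hash * 31 + key.charCodeAt(i)) >>> 0; // keep the hash in unsigned 32-bit range
  }
  return PROFILES[hash % PROFILES.length];
}
```

The returned object can be spread straight into the context options, as in `browser.newContext({ ...pickProfile("example.com") })`. The proxy's exit location should agree with the chosen timezone and locale too.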

Network interception — Playwright's quiet win

Playwright's request routing is one of its strongest features. Block images, media, and fonts, return mock responses for ads, capture XHR payloads:

// Block heavy resources to speed up scraping
await page.route("**/*", (route) => {
  const t = route.request().resourceType();
  if (t === "image" || t === "media" || t === "font") return route.abort();
  return route.continue();
});

// Capture API responses while navigating
const responses: unknown[] = [];
page.on("response", async (res) => {
  if (res.url().includes("/api/products")) {
    responses.push(await res.json());
  }
});

await page.goto("https://example.com/products");

For sites that load data via XHR, intercepting the API response is often faster and more reliable than scraping the rendered HTML.
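
Captured payloads arrive as `unknown`, so it helps to validate them before they reach the rest of the pipeline. This is a minimal sketch under assumptions: the `Product` shape and the idea that each payload is an array of items are hypothetical, so adjust both to what the target API actually returns.

```typescript
// Hypothetical payload shape: adjust the fields to what the target API actually returns.
interface Product {
  id: string;
  name: string;
  price: number;
}

// Type guard: narrows an unknown value to Product only if every field has the expected type.
function isProduct(value: unknown): value is Product {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v["id"] === "string" &&
    typeof v["name"] === "string" &&
    typeof v["price"] === "number"
  );
}

// Flatten captured payloads (assumed here to be arrays of items) and keep only well-formed products.
function extractProducts(payloads: unknown[]): Product[] {
  const products: Product[] = [];
  for (const payload of payloads) {
    if (!Array.isArray(payload)) continue;
    for (const item of payload) {
      if (isProduct(item)) products.push(item);
    }
  }
  return products;
}
```

Counting how many items fail the guard is a cheap early warning that the site's API has changed shape.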

Browser per request vs persistent contexts

Two patterns.

Pattern A — fresh browser per job. Slowest but cleanest. No state leaks between jobs. Highest detection resistance because every job looks like a fresh user.

Pattern B — persistent context, reused across jobs. Faster. Useful for site-specific scrapers where you can amortize the cookie/session warm-up. Risk: state leak if you forget to clear cookies between distinct targets.

For most production scraping we run pattern A with a hot-pool of browsers (5-10 warm Playwright instances waiting for jobs). The amortization pays off without the state-leak risk.

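// Hot-pool pattern
import { chromium, type Browser } from "playwright";

class BrowserPool {
  private pool: Browser[] = [];
  private size = 0;

  // Launch `size` browsers up front so jobs don't pay the startup cost.
  async init(size: number) {
    this.size = size;
    this.pool = await Promise.all(
      Array.from({ length: size }, () => chromium.launch()),
    );
  }

  // Hand out a warm browser, or launch a new one if the pool is empty.
  // Open a fresh context (browser.newContext()) per job so cookies and storage don't carry over.
  async acquire() {
    const browser = this.pool.pop() ?? (await chromium.launch());
    return {
      browser,
      // Return the browser to the pool, or close it if the pool is already full.
      release: async () => {
        if (this.pool.length < this.size) this.pool.push(browser);
        else await browser.close();
      },
    };
  }
}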

The cost model

A short cost-per-scrape comparison:

| Setup | Cost / 1k scrapes |
|---|---|
| Stock Playwright, datacenter proxies | ~$0.20 |
| Playwright + stealth + residential proxies | ~$2-5 |
| Browser farm (Browserless / Browserbase) | ~$5-15 |
| Real browser + manual captcha solving | ~$20+ |

Plan your job for the lowest tier that works. Don't pay browser farm prices for sites that fall to stealth + residential.
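
To turn the per-1k figures into a monthly budget, multiply by volume. The sketch below copies the ranges from the cost table; the `TierCost` type, `TIERS` list, and `monthlyCost` function are hypothetical names, and the open-ended "$20+" tier is modelled with no upper bound.

```typescript
// Per-1k-scrape price ranges (USD) copied from the cost table; null means no stated upper bound.
interface TierCost {
  name: string;
  lowPer1k: number;
  highPer1k: number | null;
}

const TIERS: TierCost[] = [
  { name: "Stock Playwright, datacenter proxies", lowPer1k: 0.2, highPer1k: 0.2 },
  { name: "Playwright + stealth + residential proxies", lowPer1k: 2, highPer1k: 5 },
  { name: "Browser farm (Browserless / Browserbase)", lowPer1k: 5, highPer1k: 15 },
  { name: "Real browser + manual captcha solving", lowPer1k: 20, highPer1k: null },
];

// Estimated monthly spend range for one tier at a given volume.
function monthlyCost(tier: TierCost, scrapesPerMonth: number): { low: number; high: number | null } {
  const thousands = scrapesPerMonth / 1000;
  return {
    low: tier.lowPer1k * thousands,
    high: tier.highPer1k === null ? null : tier.highPer1k * thousands,
  };
}
```

At 250,000 scrapes a month, for instance, the stealth + residential tier works out to roughly $500 to $1,250, against about $50 for stock Playwright on datacenter proxies.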

What about playwright-recorder / codegen?

Useful for prototyping a scraper interactively. playwright codegen example.com opens a browser and records your clicks as Playwright code. We use it for the initial pass on a new target site, then refactor the generated code.

pnpm dlx playwright codegen --target javascript https://example.com

Do not ship the codegen output as production code. It tends to use overly specific selectors (a text=Submit match that lands on the wrong element, for example) that break on minor UI changes. Refactor into named selectors and add explicit waits.

Newer alternatives worth knowing

A short list:

Browserless — managed Chrome with anti-detect, captcha solving built in. Pay per second of browser time.
Browserbase — newer entrant, simpler API, well-funded.
Camoufox — a fork of Firefox with anti-detect baked in. Open source. Useful when you need to look exactly like Firefox.
Apify — fully managed scraping platform with built-in proxy rotation. Good for non-engineering teams.

For most engineering-led teams, Playwright self-hosted with proxies is the best cost-quality trade. The managed services are right when ops cost matters more than per-scrape cost.

What we ship by default

For a typical scraping engagement at YAEL:

Playwright + playwright-extra + stealth plugin
BullMQ queue with rate limits per target domain
Residential proxies for any site that has bot detection
Per-target adapter modules (one folder per site, isolated selectors)
Snapshot tests on the parsing layer (save HTML, parse, assert)
Daily smoke runs that catch site changes before customers notice

We can describe a typical scrape build in three pages. Most of the production complexity is in the operational layer — queue, retries, observability — not in the scraping code itself.
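
As one example of that operational layer, per-domain pacing can be sketched in a few lines. This is an in-memory illustration only, not BullMQ's API; the `DomainRateLimiter` class and its `reserve` method are hypothetical names.

```typescript
// In-memory sketch of per-domain pacing. Not BullMQ's API; class and method names are hypothetical.
class DomainRateLimiter {
  private nextAllowed = new Map<string, number>();
  private minIntervalMs: number;

  constructor(minIntervalMs: number) {
    this.minIntervalMs = minIntervalMs;
  }

  // Returns how long (ms) the caller should wait before hitting this URL's host,
  // and reserves that slot so the next caller for the same host queues behind it.
  reserve(url: string, now: number = Date.now()): number {
    const host = new URL(url).hostname;
    const earliest = Math.max(now, this.nextAllowed.get(host) ?? now);
    this.nextAllowed.set(host, earliest + this.minIntervalMs);
    return earliest - now;
  }
}
```

A worker would call `reserve(job.url)`, sleep for the returned delay, and then run the scrape. Because slots are tracked per hostname, a slow target never holds back jobs for other domains.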

Need a production scraper that doesn't break weekly?

We've built scraping infrastructure into competitive intelligence platforms, price tracking products, and AI agent retrieval pipelines.

FAQ

Is scraping legal?

Depends on what you scrape, where you are, and what the site's terms say. Generally: public data without bypassing technical controls = grey area. Behind a login or paywall = much riskier. Always read the site's robots.txt and ToS. We are not your lawyer.

Can I scrape JavaScript-rendered sites with `fetch`?

If you can reverse-engineer the API the page calls, yes, and it's much faster than headless. Always check the Network tab first. Headless is the fallback when the API isn't usable.
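
A minimal sketch of calling such an API directly, assuming a hypothetical `/api/products` endpoint with a `page` query parameter (substitute whatever the Network tab shows). The `FetchLike` type and `fetchProductsPage` function are illustrative names; the fetch implementation is injectable so the function can be exercised without a network.

```typescript
// Minimal shape of the fetch function we rely on, so a stub can be injected in tests.
type FetchLike = (
  url: string,
  init?: { headers?: Record<string, string> },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Hypothetical endpoint and query parameter: replace with what the target site actually calls.
async function fetchProductsPage(
  baseUrl: string,
  page: number,
  fetchImpl: FetchLike = fetch,
): Promise<unknown> {
  const url = new URL("/api/products", baseUrl);
  url.searchParams.set("page", String(page));
  const res = await fetchImpl(url.toString(), {
    headers: { accept: "application/json" },
  });
  // Surface blocks and errors instead of trying to parse an error page as JSON.
  if (!res.ok) throw new Error(`request failed with status ${res.status}`);
  return res.json();
}
```

Many real endpoints also expect cookies or tokens that the page sets first, so a plain request like this may come back 401 or 403. That is the point where headless becomes the fallback.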

What's the cheapest scraper for low volume?

fetch + cheerio for static HTML. Stock Playwright with residential proxies for JS-rendered sites. Everything else is overkill until you hit detection.

Do I need to use a captcha solver?

Only if your target site presents captchas. Most don't until they detect you. Get caught less and you won't need a solver. If you do, 2Captcha at ~$1 per 1k reCAPTCHA v2 solves is the cheapest production option.

What about Selenium?

Don't pick Selenium for a new project. Slower, older, more detectable. Playwright covers everything Selenium does and more.

Can I run Playwright on Vercel / Cloudflare Workers?

Vercel: yes, but slow. Workers: not directly — use Browserless or similar. For sustained scraping workloads, run Playwright on a long-lived VM or container.

How do I keep selectors stable when the site redesigns?

Wherever possible, prefer semantic selectors — getByRole, getByText, getByLabel — over CSS class names. Class names change every redeploy. Semantics rarely change.

What about LLM-based scraping?

Useful for one-off extractions on unstructured pages — give Claude the HTML and ask for structured data. Expensive at scale. We use it for the long-tail "we need data from 500 different sites, each with different HTML" case where building 500 adapters isn't economic.
