Safe Web Scraping with Claude Code: Fetch, Playwright, and Audit Logs

Define the Boundary Before Claude Code Writes Code

Web scraping means reading information from web pages with software. It is useful for monitoring, documentation inventories, research, and quality checks, but it is not a free pass to collect everything that appears in a browser. Claude Code can make the implementation fast, so the boundary has to be written before the code is generated: public data only, respect terms, respect robots.txt, throttle requests, avoid personal data unless you have a lawful basis, and keep an audit trail.

The beginner-friendly order is simple. First look for an official API, RSS feed, sitemap, CSV export, or other documented data access path. These are usually more stable than HTML and come with clearer usage rules. Only move to HTML scraping when the task is legitimate, low volume, and there is no better structured source.
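
As a rough illustration of checking for a structured source first, the sketch below pulls Sitemap: lines out of a robots.txt body and <loc> entries out of a sitemap XML string. The names findSitemapUrls and extractSitemapLocs are hypothetical, and the regex-based XML handling is an assumption that suits small, flat sitemaps only; it does not follow sitemap index files or decode every XML entity.

```javascript
// Hypothetical helpers for illustration; the names are not from any library.

// Collect "Sitemap:" entries from a robots.txt body (the field name is case-insensitive).
function findSitemapUrls(robotsText) {
  const urls = [];
  for (const rawLine of robotsText.split(/\r?\n/)) {
    const match = rawLine.match(/^\s*sitemap\s*:\s*(\S+)/i);
    if (match) urls.push(match[1]);
  }
  return urls;
}

// Pull <loc> values from a flat sitemap XML string.
// Assumption: a small, simple sitemap; a real XML parser is safer for anything else.
function extractSitemapLocs(xml) {
  return [...xml.matchAll(/<loc>\s*([^<\s]+)\s*<\/loc>/gi)].map((m) =>
    m[1].replace(/&amp;/g, "&"),
  );
}

const sampleRobots =
  "User-agent: *\nDisallow: /admin/\nSitemap: https://example.com/sitemap.xml\n";
const sampleSitemap =
  "<urlset><url><loc>https://example.com/</loc></url>" +
  "<url><loc>https://example.com/docs/?a=1&amp;b=2</loc></url></urlset>";

console.log(findSitemapUrls(sampleRobots));
console.log(extractSitemapLocs(sampleSitemap));
```

If a sitemap already lists the URLs you need, the task may reduce to checking those URLs one at a time rather than discovering links from HTML.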

This article uses Claude Code as the implementation assistant, not as an excuse to bypass controls. It does not cover login bypass, CAPTCHA evasion, bot-protection bypass, mass email harvesting, or collection of restricted data. If a workflow touches personal information, regulated data, or outbound sales messages, confirm your legal basis, privacy policy, retention period, opt-out path, and local compliance obligations first.

Use primary references when you design the workflow: RFC 9309 for the robots.txt protocol, Google’s robots.txt documentation for search crawling behavior, MDN Fetch API for browser-standard fetching, and Playwright Browser contexts for isolated browser automation.

The Workflow

flowchart TD
  A["One clear purpose"] --> B["Check official API, RSS, and sitemap"]
  B --> C["Review terms and robots.txt"]
  C --> D{"Can static HTML answer it?"}
  D -->|Yes| E["Use Fetch one page at a time"]
  D -->|No, owned or approved| F["Use Playwright for rendered DOM"]
  E --> G["Save CSV plus URL and timestamp"]
  F --> G
  G --> H["Human sample review before use"]

That structure gives Claude Code concrete constraints. Instead of asking it to “scrape this site,” ask it to “fetch one allowed page from this origin, wait at least two seconds, write sourceUrl and fetchedAt to CSV, and stop when robots.txt blocks the path.” The more explicit the boundary, the less likely the generated code will drift into brittle selectors, aggressive loops, or data you should not store.

Fetch vs Playwright

Use fetch when the needed information is already in the HTML response. Static documentation pages, blog posts, pricing pages, status pages, and sitemap-driven URL checks often fit this model. Fetch is easier to audit because it makes a plain HTTP request and returns text. It is also lighter than launching a browser.

Use Playwright only when a real browser is necessary and the page is owned by you or explicitly approved for automation. Examples include a local preview of your Astro or Next.js site, a staging environment, or a consented internal QA workflow. Browser automation carries more operational risk because it loads scripts, cookies, storage, images, and client-side behavior. Keep contexts isolated so cookies and localStorage from one task do not leak into another.

Ask Claude Code to start with the Fetch version. Add Playwright only after the Fetch version shows that the page cannot be checked from static HTML. In review, look for fixed sleeps, accidental use of logged-in sessions, selectors that depend on styling classes, missing rate limits, and missing audit fields.

Practical Use Cases

The first use case is monitoring your own site. Check training pages, product pages, forms, article pages, canonical URLs, titles, and CTA text. Because you own the site, you can document the robots policy, create stable selectors, and run the check at a respectful cadence. This connects naturally with AI content operations and content funnel audits: the scraper tells you what changed, and the content workflow decides what to fix.
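
One way to picture "the scraper tells you what changed" is a field-by-field comparison of two rows saved for the same URL. The sketch below assumes rows shaped like the CSV columns used by the Fetch script later in this article (sourceUrl, fetchedAt, title, h1, metaDescription). The helper name diffRows is hypothetical.

```javascript
// Hypothetical helper: report which tracked fields changed between two runs of the same URL.
function diffRows(previous, current, fields) {
  if (previous.sourceUrl !== current.sourceUrl) {
    throw new Error("Rows describe different URLs");
  }
  return fields
    .filter((field) => String(previous[field] ?? "") !== String(current[field] ?? ""))
    .map((field) => ({
      field,
      before: previous[field] ?? "",
      after: current[field] ?? "",
      // Keep provenance on every change so a reviewer can trace it.
      sourceUrl: current.sourceUrl,
      fetchedAt: current.fetchedAt,
    }));
}

const changes = diffRows(
  { sourceUrl: "https://example.com/pricing", fetchedAt: "2024-01-01T00:00:00.000Z", title: "Pricing", h1: "Plans" },
  { sourceUrl: "https://example.com/pricing", fetchedAt: "2024-01-08T00:00:00.000Z", title: "Pricing", h1: "Plans and Pricing" },
  ["title", "h1"],
);
console.log(changes);
```

The output is a list of changes for a person to review, not a decision about what to fix.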

The second use case is collecting public documentation URLs. A team may want an index of official docs, internal handbook pages, or public knowledge-base articles. In many cases you do not need to store the full text. URL, title, source timestamp, and a short status field are enough to support search, review, or editorial planning without copying more content than necessary.

The third use case is manually reviewed competitor pricing checks. If a pricing page is public, low-volume monitoring can help a product or sales team notice plan-name changes or campaign language. The output should never be treated as truth by itself. Prices vary by region, tax, currency, and promotional conditions. Keep source URLs and timestamps, then have a human review samples before updating strategy, sales material, or comparison pages.

The fourth use case is lead research with guardrails. It is reasonable to collect a company name, public website, industry, or official contact page for a small research list. It is not reasonable to blindly harvest personal email addresses and push them into an outbound campaign. If outreach follows, add opt-out handling, sender identity, suppression lists, and human review. Pair this with Claude Code email automation only after the collection step is lawful and minimal.

Common Failure Cases

Ignoring robots.txt and terms is the fastest way to make the workflow indefensible. robots.txt is not the whole legal analysis, but it is a published machine-readable boundary and should be respected. Terms of service may add further restrictions, especially on automated collection, reuse, or commercial monitoring.

Collecting emails blindly is another common mistake. Publicly visible personal data can still be personal data. If you do not need it, do not collect it. If you do need it, document the purpose, legal basis, retention period, access controls, deletion process, and opt-out route.

A missing rate limit is both a technical and a reputational risk. A loop that requests hundreds of pages without delay can look like an attack to the site operator. Use small batches, clear user-agent text, request spacing, limited retries, and stop-on-error behavior.
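
The batching rules above can be sketched as a single sequential loop. This is a minimal illustration, not a hardened client. The helper name fetchBatchPolitely, the option names, and the default values (two-second spacing, one retry, five URLs per batch) are all assumptions. The fetch and sleep functions are injectable so the logic can be exercised without touching the network.

```javascript
// Hypothetical helper: fetch a small batch one URL at a time with spacing,
// limited retries, and stop-on-error behavior.
async function fetchBatchPolitely(urls, options = {}) {
  const {
    delayMs = 2000, // spacing before every request, including retries
    maxRetries = 1, // retries per URL, only for network errors and 5xx
    maxBatch = 5, // refuse large batches outright
    fetchImpl = fetch, // injectable for tests
    sleepImpl = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
  } = options;

  if (urls.length > maxBatch) {
    throw new Error(`Batch of ${urls.length} exceeds limit of ${maxBatch}`);
  }

  const results = [];
  for (const url of urls) {
    let attempt = 0;
    for (;;) {
      await sleepImpl(delayMs);
      let status = 0; // 0 stands for a network failure
      try {
        status = (await fetchImpl(url)).status;
      } catch {
        status = 0;
      }
      if (status >= 200 && status < 300) {
        results.push({ url, status, attempts: attempt + 1 });
        break;
      }
      const retryable = status === 0 || status >= 500;
      if (!retryable || attempt >= maxRetries) {
        // Stop the whole batch instead of pressing on against a failing or refusing server.
        results.push({ url, status, attempts: attempt + 1 });
        return { completed: false, results };
      }
      attempt += 1;
    }
  }
  return { completed: true, results };
}

// Demo with a stubbed fetch so nothing touches the network.
fetchBatchPolitely(["https://example.com/a"], {
  delayMs: 0,
  fetchImpl: async () => ({ status: 200 }),
}).then((outcome) => console.log(outcome.completed, outcome.results.length));
```

A 403 or 429 is treated as a signal to stop rather than to retry, which keeps the loop from turning a refusal into more traffic.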

Brittle selectors create quiet data quality failures. A selector like .card > div:nth-child(2) may work today and silently fail after a design update. Prefer semantic HTML, time[datetime], main h1, or owner-controlled data attributes. Missing required selectors should fail the job and write a diagnostic record.
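
A minimal sketch of the "fail the job and write a diagnostic record" idea follows. It assumes a row object like the one built by the Fetch script later in this article. The helper name checkRequiredFields and the shape of the diagnostic object are hypothetical.

```javascript
// Hypothetical helper: treat empty required fields as a failed run, not as blank data.
function checkRequiredFields(row, requiredFields) {
  const missing = requiredFields.filter((field) => {
    const value = row[field];
    return value === undefined || value === null || String(value).trim() === "";
  });
  return {
    ok: missing.length === 0,
    missing,
    // Diagnostic record to store next to the output for later review.
    diagnostic: {
      sourceUrl: row.sourceUrl ?? "",
      checkedAt: new Date().toISOString(),
      missingFields: missing,
    },
  };
}

const check = checkRequiredFields(
  { sourceUrl: "https://example.com/", title: "Example Domain", h1: "" },
  ["title", "h1"],
);
console.log(check.ok, check.missing);
```

In this sample the empty h1 is reported as missing. A real job would write check.diagnostic to disk and exit with a non-zero status instead of saving a row with a blank column.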

Bypassing protections is not a feature request. If Claude Code suggests CAPTCHA workarounds, bot-protection evasion, login-wall scraping, rotating identities, or rate-limit circumvention, stop the task and redesign it around an approved data source.

Sensitive storage is the last major pitfall. Do not dump raw HTML, authenticated data, tokens, customer information, or personal records into unreviewed CSV files. Store the smallest useful fields, include source metadata, and delete outputs that are no longer needed. For broader defensive practices, read Claude Code security best practices.
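
To make "store the smallest useful fields" concrete, here is a small sketch that keeps only allowlisted fields and blanks out anything resembling an email address before a record is written. The helper name minimizeRecord and the email pattern are assumptions. The pattern is deliberately simple and will miss some addresses, so it supplements a decision not to collect personal data rather than replacing it.

```javascript
// Simple, non-exhaustive email pattern (an assumption, not a validator).
const EMAIL_PATTERN = /[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}/gi;

// Hypothetical helper: copy only allowlisted fields and redact email-like strings.
function minimizeRecord(record, allowedFields) {
  const output = {};
  for (const field of allowedFields) {
    if (!(field in record)) continue;
    const value = record[field];
    output[field] =
      typeof value === "string" ? value.replace(EMAIL_PATTERN, "[redacted-email]") : value;
  }
  return output;
}

const stored = minimizeRecord(
  {
    sourceUrl: "https://example.com/contact",
    title: "Contact sales@example.com",
    rawHtml: "<html>...</html>",
    cookie: "session=abc",
  },
  ["sourceUrl", "title"],
);
console.log(stored);
```

The rawHtml and cookie fields never reach the output because they are not on the allowlist, which is the safer default than trying to filter them afterwards.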

Copy-Paste Fetch Scraper

The following script runs on Node 18 or newer. It fetches one allowed page, checks robots.txt conservatively, throttles the request, extracts basic page fields, writes a CSV with source URL and timestamp, and writes a JSON audit record. If robots.txt is missing (HTTP 404), the script stops unless you explicitly set ALLOW_WITHOUT_ROBOTS=true, which is useful only for your own site while you are still adding the file. This is deliberately stricter than RFC 9309, which permits crawling when robots.txt returns a 4xx status.

// scrape-allowed-page.mjs
import { writeFile } from "node:fs/promises";

const USER_AGENT = "ClaudeCodeLabAuditBot/1.0 (+https://example.com/bot-info)";
const BOT_TOKEN = "ClaudeCodeLabAuditBot";
const targetUrl = new URL(process.env.SCRAPE_URL ?? "https://example.com/");
const allowedOrigins = (process.env.ALLOWED_ORIGINS ?? "https://example.com")
  .split(",")
  .map((value) => new URL(value.trim()).origin);
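const delayMs = Number.parseInt(process.env.REQUEST_DELAY_MS ?? "2000", 10);
// Refuse to run without a usable delay: a NaN delay would make setTimeout fire almost immediately.
if (!Number.isFinite(delayMs) || delayMs < 0) {
  throw new Error("REQUEST_DELAY_MS must be a non-negative integer");
}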

if (!allowedOrigins.includes(targetUrl.origin)) {
  throw new Error(`Blocked by allowlist: ${targetUrl.origin}`);
}

function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

async function fetchText(url, accept) {
  await sleep(delayMs);
  return fetch(url, {
    headers: {
      "user-agent": USER_AGENT,
      accept,
    },
  });
}

async function loadRobots(origin) {
  const robotsUrl = new URL("/robots.txt", origin);
  const response = await fetchText(robotsUrl, "text/plain");
  if (response.status === 404) {
    return { url: robotsUrl.toString(), status: response.status, text: null };
  }
  if (!response.ok) {
    throw new Error(`robots.txt check failed: HTTP ${response.status}`);
  }
  return {
    url: robotsUrl.toString(),
    status: response.status,
    text: await response.text(),
  };
}

function parseRobots(text) {
  const groups = [];
  let agents = [];
  let rules = [];

  function commit() {
    if (agents.length > 0) {
      groups.push({ agents, rules });
    }
    agents = [];
    rules = [];
  }

  for (const rawLine of text.split(/\r?\n/)) {
    const cleaned = rawLine.split("#")[0].trim();
    if (!cleaned) continue;
    const separator = cleaned.indexOf(":");
    if (separator === -1) continue;

    const field = cleaned.slice(0, separator).trim().toLowerCase();
    const value = cleaned.slice(separator + 1).trim();

    if (field === "user-agent") {
      if (rules.length > 0) commit();
      agents.push(value.toLowerCase());
      continue;
    }

    if ((field === "allow" || field === "disallow") && agents.length > 0) {
      rules.push({ type: field, path: value });
    }
  }

  commit();
  return groups;
}

function escapeRegExp(value) {
  return value.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function pathMatches(pattern, path) {
  if (!pattern) return false;
  const exact = pattern.endsWith("$");
  const normalized = exact ? pattern.slice(0, -1) : pattern;
  const source = `^${escapeRegExp(normalized).replace(/\\\*/g, ".*")}${exact ? "$" : ""}`;
  return new RegExp(source).test(path);
}

function isAllowedByRobots(robotsText, url) {
  if (robotsText === null) {
    return process.env.ALLOW_WITHOUT_ROBOTS === "true";
  }

  const groups = parseRobots(robotsText);
  const bot = BOT_TOKEN.toLowerCase();
  const exactGroups = groups.filter((group) =>
    // An empty User-agent value must not match: bot.includes("") is always true.
    group.agents.some((agent) => agent !== "" && agent !== "*" && bot.includes(agent)),
  );
  const fallbackGroups = groups.filter((group) => group.agents.includes("*"));
  const selectedGroups = exactGroups.length > 0 ? exactGroups : fallbackGroups;
  const rules = selectedGroups.flatMap((group) => group.rules);
  const targetPath = `${url.pathname}${url.search}`;
  let winner = null;

  for (const rule of rules) {
    if (!pathMatches(rule.path, targetPath)) continue;
    const length = rule.path.replace(/[*$]/g, "").length;
    if (!winner || length > winner.length || (length === winner.length && rule.type === "allow")) {
      winner = { type: rule.type, length };
    }
  }

  return winner ? winner.type === "allow" : true;
}

function normalizeText(value) {
  return value
    .replace(/<script[\s\S]*?<\/script>/gi, " ")
    .replace(/<style[\s\S]*?<\/style>/gi, " ")
    .replace(/<[^>]*>/g, " ")
    .replace(/&amp;/g, "&")
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/\s+/g, " ")
    .trim();
}

function firstMatch(html, pattern) {
  const match = html.match(pattern);
  return match ? normalizeText(match[1]) : "";
}

function extractPageSummary(html) {
  const metaMatch =
    html.match(/<meta\s+[^>]*name=["']description["'][^>]*content=["']([^"']*)["'][^>]*>/i) ??
    html.match(/<meta\s+[^>]*content=["']([^"']*)["'][^>]*name=["']description["'][^>]*>/i);

  return {
    title: firstMatch(html, /<title[^>]*>([\s\S]*?)<\/title>/i),
    h1: firstMatch(html, /<h1[^>]*>([\s\S]*?)<\/h1>/i),
    metaDescription: metaMatch ? normalizeText(metaMatch[1]) : "",
    linkCount: [...html.matchAll(/<a\s+[^>]*href=["'][^"']+["']/gi)].length,
  };
}

function csvEscape(value) {
  const text = String(value ?? "");
  return /[",\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

const robots = await loadRobots(targetUrl.origin);
if (!isAllowedByRobots(robots.text, targetUrl)) {
  throw new Error(`Blocked by robots.txt: ${targetUrl.toString()}`);
}

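const response = await fetchText(targetUrl, "text/html");
if (!response.ok) {
  throw new Error(`Page fetch failed: HTTP ${response.status}`);
}
// fetch follows redirects by default, so confirm the final URL is still on the allowlist.
if (!allowedOrigins.includes(new URL(response.url).origin)) {
  throw new Error(`Redirected outside allowlist: ${response.url}`);
}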

const html = await response.text();
const fetchedAt = new Date().toISOString();
const row = {
  sourceUrl: targetUrl.toString(),
  fetchedAt,
  ...extractPageSummary(html),
};
const headers = ["sourceUrl", "fetchedAt", "title", "h1", "metaDescription", "linkCount"];
const csv = [headers.join(","), headers.map((header) => csvEscape(row[header])).join(",")].join("\n");

await writeFile("scrape-output.csv", `${csv}\n`, "utf8");
await writeFile(
  "scrape-audit.json",
  JSON.stringify(
    {
      checkedAt: fetchedAt,
      userAgent: USER_AGENT,
      robotsUrl: robots.url,
      robotsStatus: robots.status,
      allowedOrigins,
      sourceUrl: row.sourceUrl,
      method: "fetch",
    },
    null,
    2,
  ),
  "utf8",
);

console.log(`Saved scrape-output.csv for ${row.sourceUrl}`);

On PowerShell, run it like this after changing the domain to one you control or are allowed to check:

$env:SCRAPE_URL="https://your-domain.example/page"; $env:ALLOWED_ORIGINS="https://your-domain.example"; node scrape-allowed-page.mjs

The output is intentionally boring: one CSV row plus one JSON audit file. That boring evidence is what makes the result reviewable.

Playwright for Owned or Local Pages

This Playwright example is for your own site or a local preview, not for bypassing another site’s protections. It checks that rendered selectors exist and writes an audit file. The allowlist keeps the browser check limited to localhost or an owned domain.

// check-own-site-selectors.mjs
import { writeFile } from "node:fs/promises";
import { chromium } from "playwright";

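const target = process.env.LOCAL_PREVIEW_URL ?? "http://127.0.0.1:4321/blog/claude-code-web-scraping/";
// Compare parsed scheme and hostname. A string-prefix check would accept
// http://127.0.0.1:1@other-host/, where "127.0.0.1:1" is parsed as credentials.
const allowedTargets = [
  { protocol: "http:", hostname: "127.0.0.1" },
  { protocol: "http:", hostname: "localhost" },
  { protocol: "https:", hostname: "claudecodelab.com" },
];
const targetUrl = new URL(target);

if (
  !allowedTargets.some(
    (allowed) => allowed.protocol === targetUrl.protocol && allowed.hostname === targetUrl.hostname,
  )
) {
  throw new Error(`Playwright check is limited to owned or local pages: ${target}`);
}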

const browser = await chromium.launch();
const context = await browser.newContext({
  userAgent: "ClaudeCodeLabAuditBot/1.0 local-preview-check",
});
const page = await context.newPage();

await page.goto(target, { waitUntil: "domcontentloaded" });

const checks = [
  { name: "article title", selector: "main h1, article h1" },
  { name: "updated date", selector: "time, [data-updated-date]" },
  { name: "main article", selector: "main article, article" },
];
const results = [];

for (const check of checks) {
  const locator = page.locator(check.selector);
  const count = await locator.count();
  const firstText = count > 0 ? ((await locator.first().textContent()) ?? "").trim().slice(0, 120) : "";
  results.push({ ...check, count, firstText });
}

await writeFile(
  "selector-audit.json",
  JSON.stringify({ target, checkedAt: new Date().toISOString(), results }, null, 2),
  "utf8",
);

await context.close();
await browser.close();

const missing = results.filter((result) => result.count === 0);
if (missing.length > 0) {
  throw new Error(`Missing selectors: ${missing.map((result) => result.name).join(", ")}`);
}

console.log(`Saved selector-audit.json for ${target}`);

Browser contexts matter because they isolate cookies, localStorage, permissions, and other state. That is useful for tests and safer for automation. Do not point this at a real logged-in session unless the task is explicitly approved and the data is allowed to be processed.

A Prompt That Produces Safer Code

Use a prompt like this with Claude Code:

Add a one-page scraper for an allowlisted origin. First document whether an official API, RSS feed, or sitemap exists. Check robots.txt before fetching HTML. Save sourceUrl and fetchedAt in CSV. Do not collect email addresses, personal names, authenticated data, or secrets. Do not bypass CAPTCHA, login walls, bot protection, or rate limits. Add request throttling, stop on blocked paths, and show node --check results for the JavaScript files.

The point is to make Claude Code act like an auditable implementer. It should not decide the legal boundary for you. Your review should still inspect the diff, target URLs, saved fields, request cadence, deletion process, and sample output before anything runs on a schedule.

Operating Checklist

Prefer official APIs, RSS feeds, sitemaps, and exports before HTML scraping.
Confirm the page is public and the terms allow the intended use.
Respect robots.txt and record the check in an audit log.
Use an origin allowlist and small batches.
Add delay, limited retries, and stop-on-error behavior.
Use a clear User-Agent with a purpose or contact page.
Save source URL, fetch timestamp, method, and robots status.
Prefer semantic selectors or owner-controlled data attributes.
Do not store personal data, secrets, authenticated data, or raw HTML by default.
Review samples manually before using the output in business decisions.
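
The audit fields named in the checklist can be sketched as one JSON Lines entry per fetch. The helper name buildAuditLine and the exact field names are assumptions chosen to match the checklist wording. The function only builds the line, so where and how it is stored stays a separate decision.

```javascript
// Hypothetical helper: build one JSON Lines audit entry with the checklist fields.
function buildAuditLine({ sourceUrl, method, robotsStatus, userAgent, fetchedAt }) {
  const required = { sourceUrl, method, robotsStatus, userAgent };
  for (const [name, value] of Object.entries(required)) {
    if (value === undefined || value === null || value === "") {
      throw new Error(`Audit field missing: ${name}`);
    }
  }
  const entry = {
    sourceUrl,
    fetchedAt: fetchedAt ?? new Date().toISOString(),
    method, // for example "fetch" or "playwright"
    robotsStatus, // HTTP status returned by the robots.txt check
    userAgent,
  };
  return `${JSON.stringify(entry)}\n`;
}

const line = buildAuditLine({
  sourceUrl: "https://example.com/",
  method: "fetch",
  robotsStatus: 200,
  userAgent: "ClaudeCodeLabAuditBot/1.0 (+https://example.com/bot-info)",
});
console.log(line.trim());
```

Appending each line to a log file with appendFile from node:fs/promises keeps earlier runs on record, whereas a single JSON file rewritten on every run only shows the latest check.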

Also consider CSV injection if outputs are opened in spreadsheets. Treat scraped strings as untrusted input. Escape fields, avoid formulas, and keep the storage location narrow. If this is part of a broader automation system, connect it to security review, content automation, and outreach controls rather than letting it silently feed a CRM.
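
As a hedged illustration of the CSV injection point, the variant below extends the quoting approach of csvEscape in the Fetch script by prefixing a single quote to any value that begins with a character spreadsheets may interpret as a formula. The name csvEscapeSafe is hypothetical. The prefix approach is a common mitigation, not a guarantee for every spreadsheet application.

```javascript
// Hypothetical variant of csvEscape that also neutralizes spreadsheet formulas.
function csvEscapeSafe(value) {
  let text = String(value ?? "");
  // Values starting with =, +, -, @, tab, or carriage return may be evaluated as formulas.
  if (/^[=+\-@\t\r]/.test(text)) {
    text = `'${text}`;
  }
  // Standard CSV quoting: wrap in quotes and double any embedded quotes.
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

console.log(csvEscapeSafe("=1+1"));
console.log(csvEscapeSafe("Plain title"));
console.log(csvEscapeSafe('Pricing, "Pro" plan'));
```

One trade-off is that legitimate values starting with a minus sign, such as negative numbers, also receive the prefix, so apply it to scraped text columns rather than to numeric fields you generate yourself.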

Training and Consulting CTA

The code is the easy part. The hard part is deciding what not to collect, how to prove the collection was allowed, how to review changes, and how to delete stale output. ClaudeCodeLab can help teams turn this into CLAUDE.md rules, Playwright checks, CSV audit logs, and human approval steps through Claude Code training and consultation.

For solo use, start with one page, one run, public data only. Do not scale the page count until the audit trail, failure behavior, and review step are already in place.

Summary

Safe web scraping with Claude Code starts with boundaries, not selectors. Prefer APIs and sitemaps, then use Fetch for static allowed pages. Reserve Playwright for owned or approved dynamic pages. Always log source URL, timestamp, robots status, and user-agent details.

The avoid list is just as important: no terms violations, no blind email harvesting, no rate-limit bypass, no brittle silent failures, no protection evasion, and no sensitive data dumps. Claude Code can implement the mechanics quickly, but the workflow is only production-ready when a human can explain the purpose, source, timing, and deletion path.

After trying the workflow in this article, Masa found that limiting the first version to one page, one CSV row, and one JSON audit file made review much easier. sourceUrl and fetchedAt were especially useful when checking pricing pages and owned-site content. The prototypes that tried to collect many pages first had to be rewritten because selector failures and missing policy checks were harder to trace.
