Safe Web Scraping with Claude Code: Fetch, Playwright, and Audit Logs

Set the boundary before writing code

Web scraping means reading information from web pages through software. It is useful for monitoring your own site, collecting public documentation URLs, manually reviewing pricing pages, and content QA. But the fact that a page is publicly visible does not mean everything on it may be collected. Claude Code makes implementation very fast, so write down the boundary first: public data only, respect for the terms and robots.txt, a delay between requests, no personal data without a lawful basis, and an audit log for every run.

A safe order for beginners: first look for an official API, RSS feed, sitemap, CSV export, or other documented data source. These are more stable than HTML, and their usage rules are clearer. Scrape HTML only when the purpose is legitimate, the volume is small, and no better structured source is available.
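As a rough illustration of checking a structured source first, the sketch below pulls page URLs out of sitemap XML that has already been downloaded. The helper name `listSitemapUrls` and the sample XML are hypothetical, and the regex assumes a simple, well-formed `<urlset>` sitemap; it is not a full XML parser.

```javascript
// Hypothetical helper: extract <loc> entries from sitemap XML text.
// Assumes a plain <urlset> sitemap; a real XML parser is safer for anything complex.
function listSitemapUrls(xml, allowedOrigin) {
  const urls = [];
  for (const match of xml.matchAll(/<loc>\s*([^<\s]+)\s*<\/loc>/gi)) {
    const raw = match[1].replace(/&amp;/g, "&");
    let parsed;
    try {
      parsed = new URL(raw);
    } catch {
      continue; // skip entries that are not absolute URLs
    }
    // Keep only URLs on the origin we are allowed to check.
    if (parsed.origin === allowedOrigin) urls.push(parsed.toString());
  }
  return urls;
}

const sampleSitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/pricing?plan=team&amp;region=in</loc></url>
  <url><loc>https://other.example/page</loc></url>
</urlset>`;

console.log(listSitemapUrls(sampleSitemap, "https://example.com"));
```

In this sample the third entry is dropped because it sits on a different origin. If a sitemap already gives you the URL list, you may only need status checks rather than HTML parsing.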

This article uses Claude Code as an implementation assistant, not as a bypass tool. Login bypass, CAPTCHA bypass, evading anti-bot protection, bulk email harvesting, and restricted data collection are not covered here. If the workflow touches personal information, sales outreach, or regulated data, check the purpose, legal basis, privacy policy, retention, opt-out, and local compliance first.

Keep the official references at hand: RFC 9309 for the robots.txt protocol, the Google robots.txt docs for crawling behavior, the MDN Fetch API docs for the request model, and Playwright's Browser contexts docs for browser automation isolation.

Safe workflow

flowchart TD
  A["One clear purpose"] --> B["Check API, RSS, sitemap"]
  B --> C["Review terms and robots.txt"]
  C --> D{"Is static HTML enough?"}
  D -->|Yes| E["Fetch one page at a time"]
  D -->|No, approved| F["Check rendered DOM with Playwright"]
  E --> G["Write URL and time to CSV"]
  F --> G
  G --> H["Human sample review"]

This structure keeps the task given to Claude Code clear. Instead of “scrape this site,” write: “Fetch one page from this allowlisted origin, wait at least 2 seconds between requests, stop if robots.txt blocks the path, and write sourceUrl and fetchedAt to the CSV.” The clearer the boundary, the fewer unsafe loops, brittle selectors, and unnecessary data you end up with.
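One way to make that boundary checkable is to write it down as data and validate it before any request is sent. The sketch below is only an illustration: the job shape (`url`, `allowedOrigins`, `delayMs`, `maxPages`, `outputFields`) and the `validateJob` name are assumptions for this example, not a Claude Code format.

```javascript
// Hypothetical job description; the field names are illustrative only.
function validateJob(job) {
  const problems = [];
  const target = new URL(job.url);
  if (!job.allowedOrigins.includes(target.origin)) problems.push("origin not allowlisted");
  if (!(job.delayMs >= 2000)) problems.push("delay below 2000 ms");
  if (job.maxPages !== 1) problems.push("more than one page per run");
  // The CSV must always carry where and when the data came from.
  for (const field of ["sourceUrl", "fetchedAt"]) {
    if (!job.outputFields.includes(field)) problems.push(`missing output field: ${field}`);
  }
  return problems;
}

const exampleJob = {
  url: "https://example.com/pricing",
  allowedOrigins: ["https://example.com"],
  delayMs: 2000,
  maxPages: 1,
  outputFields: ["sourceUrl", "fetchedAt", "title"],
};

console.log(validateJob(exampleJob)); // an empty list means the job stays inside the boundary
```

A non-empty result is a reason to stop and fix the task description, not something to work around in code.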

When to use Fetch and when to use Playwright

If the required information is in the HTML response itself, use fetch. For static docs, blog posts, public pricing pages, status pages, and sitemap-based checks, Fetch is usually enough. It is easy to audit because it is a simple HTTP request that returns text.

Use Playwright only when a real browser is necessary and the page is yours or approved for automation. Local previews, staging, internal QA, and rendered DOM checks on your own site are good cases. Browser automation loads scripts, cookies, localStorage, and permissions, so keep browser contexts separate. A separate context means a separate working space, so sessions do not mix.

Ask Claude Code to build the Fetch version first. Add Playwright only when static HTML is not enough. In review, look for fixed sleeps, an accidentally logged-in session, selectors that depend on styling classes, a missing rate limit, and missing source metadata.
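A quick way to decide whether static HTML is enough is to test the raw response for the fields you need before reaching for a browser. The sketch below assumes two illustrative fields (`title`, `h1`) and a hypothetical helper name `missingStaticFields`; the regex checks are intentionally rough.

```javascript
// Hypothetical check: does the raw HTML response already contain the fields we need?
function missingStaticFields(html, requiredPatterns) {
  return Object.entries(requiredPatterns)
    .filter(([, pattern]) => !pattern.test(html))
    .map(([name]) => name);
}

// Each pattern requires at least one non-whitespace character inside the element.
const requiredPatterns = {
  title: /<title[^>]*>\s*\S[\s\S]*?<\/title>/i,
  h1: /<h1[^>]*>\s*\S[\s\S]*?<\/h1>/i,
};

const staticPage = "<html><head><title>Docs</title></head><body><h1>Guide</h1></body></html>";
const shellPage = '<html><head><title>App</title></head><body><div id="root"></div></body></html>';

console.log(missingStaticFields(staticPage, requiredPatterns)); // nothing missing
console.log(missingStaticFields(shellPage, requiredPatterns)); // h1 is missing from the static shell
```

If the list is empty, the Fetch version is probably sufficient. If fields are missing and the page is your own or approved, that is the point to consider a rendered DOM check.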

Use cases

The first use case is monitoring your own site. Check training pages, product pages, forms, articles, canonical URLs, titles, and CTA text. On your own site you control robots.txt and can keep selectors stable. Connect this output to AI content operations and a content funnel audit so that stale CTAs and broken links are caught early.

The second use case is collecting public documentation URLs. A team can build an index of official docs, an internal handbook, or a public knowledge base. Copying the full text is often unnecessary. The URL, title, checked time, and status can be enough.

The third use case is checking competitors' public pricing pages with manual review. Low-volume monitoring can signal that plan names or campaign text have changed. But do not treat the price output as ground truth. Region, tax, currency, and promotion conditions can vary. Always keep the source URL and timestamp, then have a human review the result.
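To show how source URL and timestamp support human review, here is a minimal sketch that compares two stored snapshots of the same page and reports which fields changed. The snapshot shape (`sourceUrl`, `fetchedAt`, `fields`) and the `diffSnapshots` name are assumptions for this example; the sample values are made up.

```javascript
// Hypothetical snapshot shape: { sourceUrl, fetchedAt, fields: { name: text } }.
function diffSnapshots(previous, current) {
  if (previous.sourceUrl !== current.sourceUrl) {
    throw new Error("Snapshots must come from the same sourceUrl");
  }
  const names = new Set([...Object.keys(previous.fields), ...Object.keys(current.fields)]);
  const changes = [];
  for (const name of names) {
    const before = previous.fields[name] ?? "";
    const after = current.fields[name] ?? "";
    if (before !== after) changes.push({ name, before, after });
  }
  return {
    sourceUrl: current.sourceUrl,
    comparedFrom: previous.fetchedAt,
    comparedTo: current.fetchedAt,
    changes,
    // A change is a signal to look at the page, not a verified fact.
    needsHumanReview: changes.length > 0,
  };
}

const lastWeek = {
  sourceUrl: "https://example.com/pricing",
  fetchedAt: "2024-01-01T00:00:00.000Z",
  fields: { planName: "Team", headline: "Simple pricing" },
};
const today = {
  sourceUrl: "https://example.com/pricing",
  fetchedAt: "2024-01-08T00:00:00.000Z",
  fields: { planName: "Team Plus", headline: "Simple pricing" },
};

console.log(diffSnapshots(lastWeek, today));
```

The report only flags that text differs between two timestamps. Whether the change reflects a real price or plan change still depends on region, currency, and promotion conditions, which a person has to check.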

The fourth use case is lead research, but guardrails are essential. Collecting a company name, official website, industry, and public contact page can be acceptable at small scale. Blindly collecting personal emails and feeding them into an outbound campaign is unsafe. If you do outreach, keep an opt-out, sender identity, a suppression list, and human approval in place. Connect the output to Claude Code email automation only when the collection is lawful and minimal.

Common pitfalls

The biggest mistake is ignoring robots.txt and the terms. robots.txt is not full legal permission, but it is the site's machine-readable boundary. The terms may separately restrict automation, reuse, or commercial monitoring.

The second mistake is saving an email address the moment you see one. Publicly visible personal data can still be personal data. If you do not need it, do not collect it. If you do need it, document the purpose, basis, retention, access, deletion, and opt-out.
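If extracted text may contain addresses you do not need, one defensive option is to redact them before anything is written to disk. This is a minimal sketch: `redactEmails` is a hypothetical helper, and the pattern is deliberately simple, so it will miss unusual address formats.

```javascript
// Deliberately simple pattern; it is not an email validator and will miss edge cases.
const EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

// Hypothetical helper: replace email-like strings and report how many were found.
function redactEmails(text) {
  let count = 0;
  const redacted = text.replace(EMAIL_PATTERN, () => {
    count += 1;
    return "[email removed]";
  });
  return { redacted, count };
}

const sampleText = "Contact jane.doe@example.com or sales@example.org.";
console.log(redactEmails(sampleText));
```

Logging only the count, not the addresses, keeps the audit trail useful without storing the personal data itself.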

Having no rate limit is also risky. Sending hundreds of requests without a pause can look like an attack. Use small batches, low frequency, a clear User-Agent, limited retries, and stop-on-error.
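Limited retries and stop-on-error can be expressed as two small pure functions, which makes the policy easy to review before it touches the network. The names `retryDelays` and `nextAction`, and the choice to treat only 429 and 503 as retryable, are assumptions for this sketch.

```javascript
// Hypothetical helper: delay before each retry, doubling from baseMs and capped at maxMs.
function retryDelays(baseMs, maxRetries, maxMs) {
  const delays = [];
  for (let attempt = 0; attempt < maxRetries; attempt += 1) {
    delays.push(Math.min(baseMs * 2 ** attempt, maxMs));
  }
  return delays;
}

// Decide what to do after a response. Which statuses count as retryable is an assumption.
function nextAction(status, attemptsUsed, maxRetries) {
  if (status >= 200 && status < 300) return "continue";
  const retryable = status === 429 || status === 503;
  if (retryable && attemptsUsed < maxRetries) return "retry";
  return "stop"; // stop-on-error: do not move on to the next URL
}

console.log(retryDelays(2000, 3, 10000));
console.log(nextAction(429, 0, 2), nextAction(429, 2, 2), nextAction(403, 0, 2));
```

Treating a 403 as "stop" rather than "retry" is the conservative choice: a refusal is a boundary, not a transient failure.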

Brittle selectors create silent data failures. .card > div:nth-child(2) may work today and break tomorrow after a design change. Prefer semantic HTML, time[datetime], main h1, or your own data attributes. If a required selector is missing, the job should fail and write a diagnostic log.
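A minimal sketch of "fail instead of writing empty data" is shown below. `assertRequiredFields` is a hypothetical helper, and attaching a `diagnostic` property to the error is just one possible way to carry details into a log.

```javascript
// Hypothetical guard: throw when a required extracted field is empty or absent.
function assertRequiredFields(row, requiredFields) {
  const missing = requiredFields.filter((name) => {
    const value = row[name];
    return value === undefined || value === null || String(value).trim() === "";
  });
  if (missing.length > 0) {
    const error = new Error(`Required fields missing: ${missing.join(", ")}`);
    // Attach details so the caller can write a diagnostic log entry.
    error.diagnostic = { sourceUrl: row.sourceUrl ?? null, missing };
    throw error;
  }
  return row;
}

const goodRow = { sourceUrl: "https://example.com/", title: "Docs", h1: "Guide" };
console.log(assertRequiredFields(goodRow, ["title", "h1"]).title);
```

Calling this before the CSV is written means a redesigned page produces a failed run with a named field, rather than a row of blanks that looks valid.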

Bypassing protections is not a feature. If Claude Code suggests a CAPTCHA workaround, login wall scraping, rotating identities, or a rate-limit bypass, stop the task and look for an approved data source.

Do not store sensitive data. Do not put raw HTML, authenticated data, tokens, customer information, or personal records into an unreviewed CSV. For a broader defensive setup, read Claude Code security best practices.

Copy-paste Fetch scraper

This script runs on Node 18 or newer. It fetches one allowed page, checks robots.txt conservatively, applies a delay, extracts basic fields, and writes a CSV plus a JSON audit file with sourceUrl and fetchedAt.

// scrape-allowed-page.mjs
import { writeFile } from "node:fs/promises";

const USER_AGENT = "ClaudeCodeLabAuditBot/1.0 (+https://example.com/bot-info)";
const BOT_TOKEN = "ClaudeCodeLabAuditBot";
const targetUrl = new URL(process.env.SCRAPE_URL ?? "https://example.com/");
const allowedOrigins = (process.env.ALLOWED_ORIGINS ?? "https://example.com")
  .split(",")
  .map((value) => new URL(value.trim()).origin);
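const delayMs = Number.parseInt(process.env.REQUEST_DELAY_MS ?? "2000", 10);
// An unparsable value becomes NaN, which setTimeout treats as no delay; fail instead of silently dropping the throttle.
if (!Number.isFinite(delayMs) || delayMs < 0) {
  throw new Error(`Invalid REQUEST_DELAY_MS: ${process.env.REQUEST_DELAY_MS}`);
}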

if (!allowedOrigins.includes(targetUrl.origin)) {
  throw new Error(`Blocked by allowlist: ${targetUrl.origin}`);
}

function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

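// Waits for the configured delay, then sends one GET request. Returns the Response; callers read the body.
async function fetchText(url, accept) {
  await sleep(delayMs);
  const response = await fetch(url, {
    headers: {
      "user-agent": USER_AGENT,
      accept,
    },
  });
  // fetch follows redirects by default, so re-check the final URL against the allowlist.
  if (!allowedOrigins.includes(new URL(response.url).origin)) {
    throw new Error(`Redirected outside allowlist: ${response.url}`);
  }
  return response;
}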

async function loadRobots(origin) {
  const robotsUrl = new URL("/robots.txt", origin);
  const response = await fetchText(robotsUrl, "text/plain");
  if (response.status === 404) {
    return { url: robotsUrl.toString(), status: response.status, text: null };
  }
  if (!response.ok) {
    throw new Error(`robots.txt check failed: HTTP ${response.status}`);
  }
  return {
    url: robotsUrl.toString(),
    status: response.status,
    text: await response.text(),
  };
}

function parseRobots(text) {
  const groups = [];
  let agents = [];
  let rules = [];

  function commit() {
    if (agents.length > 0) {
      groups.push({ agents, rules });
    }
    agents = [];
    rules = [];
  }

  for (const rawLine of text.split(/\r?\n/)) {
    const cleaned = rawLine.split("#")[0].trim();
    if (!cleaned) continue;
    const separator = cleaned.indexOf(":");
    if (separator === -1) continue;

    const field = cleaned.slice(0, separator).trim().toLowerCase();
    const value = cleaned.slice(separator + 1).trim();

    if (field === "user-agent") {
      if (rules.length > 0) commit();
      agents.push(value.toLowerCase());
      continue;
    }

    if ((field === "allow" || field === "disallow") && agents.length > 0) {
      rules.push({ type: field, path: value });
    }
  }

  commit();
  return groups;
}

function escapeRegExp(value) {
  return value.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function pathMatches(pattern, path) {
  if (!pattern) return false;
  const exact = pattern.endsWith("$");
  const normalized = exact ? pattern.slice(0, -1) : pattern;
  const source = `^${escapeRegExp(normalized).replace(/\\\*/g, ".*")}${exact ? "$" : ""}`;
  return new RegExp(source).test(path);
}

function isAllowedByRobots(robotsText, url) {
  if (robotsText === null) {
    return process.env.ALLOW_WITHOUT_ROBOTS === "true";
  }

  const groups = parseRobots(robotsText);
  const bot = BOT_TOKEN.toLowerCase();
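  // Skip empty agent values: every string includes "", so an empty User-agent line would match any bot.
  const exactGroups = groups.filter((group) =>
    group.agents.some((agent) => agent !== "" && agent !== "*" && bot.includes(agent)),
  );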
  const fallbackGroups = groups.filter((group) => group.agents.includes("*"));
  const selectedGroups = exactGroups.length > 0 ? exactGroups : fallbackGroups;
  const rules = selectedGroups.flatMap((group) => group.rules);
  const targetPath = `${url.pathname}${url.search}`;
  let winner = null;

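  // The longest matching pattern wins (the most-specific-match rule in RFC 9309); on a tie, allow beats disallow.
  for (const rule of rules) {
    if (!pathMatches(rule.path, targetPath)) continue;
    const length = rule.path.replace(/[*$]/g, "").length;
    if (!winner || length > winner.length || (length === winner.length && rule.type === "allow")) {
      winner = { type: rule.type, length };
    }
  }

  // No matching rule means the path is allowed.
  return winner ? winner.type === "allow" : true;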
}

function normalizeText(value) {
  return value
    .replace(/<script[\s\S]*?<\/script>/gi, " ")
    .replace(/<style[\s\S]*?<\/style>/gi, " ")
    .replace(/<[^>]*>/g, " ")
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    // Decode &amp; last so that "&amp;lt;" becomes "&lt;" and is not decoded twice.
    .replace(/&amp;/g, "&")
    .replace(/\s+/g, " ")
    .trim();
}

function firstMatch(html, pattern) {
  const match = html.match(pattern);
  return match ? normalizeText(match[1]) : "";
}

function extractPageSummary(html) {
  const metaMatch =
    html.match(/<meta\s+[^>]*name=["']description["'][^>]*content=["']([^"']*)["'][^>]*>/i) ??
    html.match(/<meta\s+[^>]*content=["']([^"']*)["'][^>]*name=["']description["'][^>]*>/i);

  return {
    title: firstMatch(html, /<title[^>]*>([\s\S]*?)<\/title>/i),
    h1: firstMatch(html, /<h1[^>]*>([\s\S]*?)<\/h1>/i),
    metaDescription: metaMatch ? normalizeText(metaMatch[1]) : "",
    linkCount: [...html.matchAll(/<a\s+[^>]*href=["'][^"']+["']/gi)].length,
  };
}

function csvEscape(value) {
  const text = String(value ?? "");
  // Quote cells that contain a quote, a comma, or any line break (including \r).
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

const robots = await loadRobots(targetUrl.origin);
if (!isAllowedByRobots(robots.text, targetUrl)) {
  throw new Error(`Blocked by robots.txt: ${targetUrl.toString()}`);
}

const response = await fetchText(targetUrl, "text/html");
if (!response.ok) {
  throw new Error(`Page fetch failed: HTTP ${response.status}`);
}

const html = await response.text();
const fetchedAt = new Date().toISOString();
const row = {
  sourceUrl: targetUrl.toString(),
  fetchedAt,
  ...extractPageSummary(html),
};
const headers = ["sourceUrl", "fetchedAt", "title", "h1", "metaDescription", "linkCount"];
const csv = [headers.join(","), headers.map((header) => csvEscape(row[header])).join(",")].join("\n");

await writeFile("scrape-output.csv", `${csv}\n`, "utf8");
await writeFile(
  "scrape-audit.json",
  JSON.stringify(
    {
      checkedAt: fetchedAt,
      userAgent: USER_AGENT,
      robotsUrl: robots.url,
      robotsStatus: robots.status,
      allowedOrigins,
      sourceUrl: row.sourceUrl,
    },
    null,
    2,
  ),
  "utf8",
);

console.log(`Saved scrape-output.csv for ${row.sourceUrl}`);

Example of running it in PowerShell: $env:SCRAPE_URL="https://your-domain.example/page"; $env:ALLOWED_ORIGINS="https://your-domain.example"; node scrape-allowed-page.mjs. The output is deliberately simple: one CSV row and one JSON audit file.

Playwright only for own or local pages

This Playwright example checks rendered selectors on your own site or a local preview. It is not for bypassing external protections.

// check-own-site-selectors.mjs
import { writeFile } from "node:fs/promises";
import { chromium } from "playwright";

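const target = process.env.LOCAL_PREVIEW_URL ?? "http://127.0.0.1:4321/blog/claude-code-web-scraping/";
// Compare the parsed host, not a string prefix: "http://127.0.0.1:1@other.example/" would pass a startsWith check.
const parsedTarget = new URL(target);
const isLocal =
  parsedTarget.protocol === "http:" && ["127.0.0.1", "localhost"].includes(parsedTarget.hostname);
const isOwned = parsedTarget.origin === "https://claudecodelab.com";

if (!isLocal && !isOwned) {
  throw new Error(`Playwright check is limited to owned or local pages: ${target}`);
}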

const browser = await chromium.launch();
const context = await browser.newContext({
  userAgent: "ClaudeCodeLabAuditBot/1.0 local-preview-check",
});
const page = await context.newPage();

await page.goto(target, { waitUntil: "domcontentloaded" });

const checks = [
  { name: "article title", selector: "main h1, article h1" },
  { name: "updated date", selector: "time, [data-updated-date]" },
  { name: "main article", selector: "main article, article" },
];
const results = [];

for (const check of checks) {
  const locator = page.locator(check.selector);
  const count = await locator.count();
  const firstText = count > 0 ? ((await locator.first().textContent()) ?? "").trim().slice(0, 120) : "";
  results.push({ ...check, count, firstText });
}

await writeFile(
  "selector-audit.json",
  JSON.stringify({ target, checkedAt: new Date().toISOString(), results }, null, 2),
  "utf8",
);

await context.close();
await browser.close();

const missing = results.filter((result) => result.count === 0);
if (missing.length > 0) {
  throw new Error(`Missing selectors: ${missing.map((result) => result.name).join(", ")}`);
}

console.log(`Saved selector-audit.json for ${target}`);

Browser contexts isolate cookies, localStorage, and permissions. Do not hand a real logged-in session to automation unless the task is approved and the data processing is clear.

Claude Code prompt

Prompt Claude Code like this:

Add a one-page scraper for an allowlisted origin. First, record in the README whether an official API, RSS feed, or sitemap is available. Check robots.txt before fetching HTML. Make sourceUrl and fetchedAt mandatory in the CSV. Do not collect emails, personal names, authenticated data, or secrets. Do not bypass CAPTCHAs, login walls, bot protection, or rate limits. Show the request throttling, the stop on a blocked path, and the node --check result for the JavaScript files.

The purpose is to make Claude Code an auditable implementer. The decision on legal boundaries belongs to a human. Before a scheduled run, review the diff, target URLs, saved fields, cadence, deletion process, and sample output.

Operational checklist

Check for an API, RSS feed, sitemap, or export first.
Confirm the page is public and the intended use does not conflict with the terms.
Respect robots.txt and record the check in the audit log.
Use an origin allowlist and small batches.
Keep a delay, limited retries, and stop-on-error.
Use a clear User-Agent.
Save the source URL, fetch timestamp, method, and robots status.
Prefer semantic selectors or your own data attributes.
Do not store personal data, secrets, session data, or raw HTML by default.
Do a human sample review before business use.
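The "save the source URL, fetch timestamp, method, and robots status" item can be sketched as one small record builder. The `buildAuditRecord` name, the field names, and the two method labels are assumptions for this example.

```javascript
// Hypothetical audit record builder covering the checklist's metadata fields.
function buildAuditRecord({ sourceUrl, method, robotsStatus, userAgent, now = new Date() }) {
  const allowedMethods = ["fetch", "playwright"]; // illustrative labels
  if (!allowedMethods.includes(method)) {
    throw new Error(`Unknown method: ${method}`);
  }
  return {
    sourceUrl: new URL(sourceUrl).toString(), // throws on an invalid URL
    fetchedAt: now.toISOString(),
    method,
    robotsStatus,
    userAgent,
  };
}

console.log(
  buildAuditRecord({
    sourceUrl: "https://example.com/pricing",
    method: "fetch",
    robotsStatus: 200,
    userAgent: "ClaudeCodeLabAuditBot/1.0",
  }),
);
```

Rejecting unknown methods keeps the log vocabulary small, which makes later review and filtering simpler.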

If the CSV will be opened in a spreadsheet, watch out for CSV injection. Text that comes from the web is untrusted input. Connect the scraper to security review, content automation, and outreach controls, not to a silent CRM import.
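One common mitigation for CSV injection is to prefix cells that start with a formula trigger character so a spreadsheet treats them as text. The sketch below is a hedged illustration: `neutralizeFormula` and `safeCsvCell` are hypothetical names, and the trigger set (=, +, -, @, tab, carriage return) follows commonly cited guidance rather than any one spreadsheet's documented behavior.

```javascript
// Prefix cells that a spreadsheet could interpret as a formula with a single quote.
function neutralizeFormula(value) {
  const text = String(value ?? "");
  return /^[=+\-@\t\r]/.test(text) ? `'${text}` : text;
}

// Neutralize first, then apply normal CSV quoting.
function safeCsvCell(value) {
  const text = neutralizeFormula(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

console.log(safeCsvCell("Pricing"));
console.log(safeCsvCell('=HYPERLINK("http://x","y")'));
```

The trade-off is that legitimate values such as negative numbers also get the prefix, so apply this to scraped text columns rather than to numeric fields you generate yourself.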

Training and consulting

The code is the easy part. The real work is deciding what not to collect, how to prove that a collection was allowed, how to review changes, and how to delete stale output. ClaudeCodeLab's Claude Code training and consultation can help turn CLAUDE.md rules, Playwright checks, CSV audit logs, and human approvals into a team workflow.

For a solo start, keep it to one page, one run, public data only. Before increasing volume, have the audit trail, failure behavior, and review step ready.

Summary

Safe web scraping with Claude Code starts with boundaries, not selectors. Prioritize APIs and sitemaps, use Fetch for static allowed pages, and use Playwright only on your own or approved dynamic pages. Always log the URL, time, robots status, and User-Agent.

The avoid list matters too: do not ignore terms, do not blindly collect emails, do not bypass rate limits, do not let selector failures stay silent, do not bypass protections, and do not dump sensitive data. Claude Code makes implementation fast, but a workflow is production-ready only when a human can explain its purpose, source, time, and deletion path.

When Masa tested this workflow, the first version with one page, one CSV row, and one JSON audit file was the easiest to review. sourceUrl and fetchedAt proved very useful for explaining pricing-page and own-site checks later. Prototypes that collected many pages from the start made selector failures and policy gaps hard to trace, so they had to be rewritten.