Logging and monitoring with Claude Code: a production observability guide

Set the operating rules first

If you only tell Claude Code to “add logging”, you usually get a few more console.log calls. That looks fine for local debugging, but it is not enough during a production incident. Good logging and monitoring tell you which user action arrived, which requestId travelled through the whole flow, which service slowed down, which dependency failed, and which data can safely be handed to the next on-call engineer.

The OpenTelemetry Observability primer describes observability as the ability to understand a system from the outside and to answer new questions about it. What is OpenTelemetry? also makes clear that OpenTelemetry is not a backend but a vendor-neutral instrumentation layer. This article carries that idea through to Claude Code prompts, copy-paste code, alert rules, dashboard review and incident handoff.

Masa tried this flow on a small checkout API. The first prompt was “make the logs detailed”. Claude Code suggested logging the entire request body on payment failure. There was no card number in it, but email, coupon, address and customer note could have leaked. Once the prompt listed the forbidden fields up front, made requestId and traceparent the correlation keys, and asked for a redaction test, the review became much cleaner.

Signal	Question it answers	Rule for Claude Code
Logs	What happened	JSON, fixed fields, PII redaction
Metrics	How bad is it	rate, p95, error rate
Traces	Where did the time go	Propagate `traceparent`
Health checks	Is the dependency usable	Status and latency for every dependency

Safe prompt and CLAUDE.md

The Claude Code memory docs explain that CLAUDE.md supplies project, user or organization instructions at the start of a session. For observability, put log levels, field names, forbidden fields, test commands, dashboard names and the incident handoff format in it. See also the internal guide CLAUDE.md best practices. For tool boundaries, the official Claude Code permissions and hooks are useful.

Claude Code task:
- Add observability to the checkout API only.
- Keep all changes inside src/checkout and tests/checkout.
- Use structured JSON logs with requestId and traceparent.
- Never log passwords, tokens, cookies, email, phone, address,
  raw prompt text, or full request/response bodies.
- Add tests proving redaction and requestId propagation.
- Add a /healthz report with database and cache latency.
- Add alert rules for 5xx rate, p95 latency, and redaction failure.
- Show a diff summary and remaining manual checks at the end.

The prompt is short and scoped. It stops Claude Code from changing unrelated routes, picking a new monitoring vendor, or pulling raw production logs into the prompt. The goal is an operational diff that can be reviewed, tested and rolled back.
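
A redaction test can start from the forbidden list itself. The sketch below is one possible shape, assuming a hypothetical findForbiddenFields helper and a key list copied from the prompt above; neither comes from an existing library.

```javascript
// Hypothetical list: mirrors the forbidden fields named in the prompt.
const FORBIDDEN_KEYS = new Set([
  "password", "token", "cookie", "email", "phone", "address",
]);

// Walk a log entry and return the path of every forbidden key it contains.
function findForbiddenFields(value, path = "") {
  if (Array.isArray(value)) {
    return value.flatMap((item, i) =>
      findForbiddenFields(item, `${path}[${i}]`));
  }
  if (!value || typeof value !== "object") return [];

  return Object.entries(value).flatMap(([key, item]) => {
    const here = path ? `${path}.${key}` : key;
    const hit = FORBIDDEN_KEYS.has(key.toLowerCase()) ? [here] : [];
    return [...hit, ...findForbiddenFields(item, here)];
  });
}

const entry = {
  requestId: "req_demo_001",
  customer: { email: "a@example.com", plan: "pro" },
  items: [{ sku: "A1", address: "1 Example St" }],
};
console.log(findForbiddenFields(entry));
// → [ 'customer.email', 'items[0].address' ]
```

A test can then assert that the returned list is empty for every entry the logger emits.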

Structured logs and request ID

The OWASP Logging Cheat Sheet treats logging as a security feature. Log injection, access control, disk exhaustion and logging failure should all be tested too. So ask Claude Code not just for messages, but for a redaction test, request ID propagation and failure behavior.

Below is a dependency-free JSON logger. Save it as structured-logger.mjs and run it on Node.js 18+.

import { randomUUID } from "node:crypto";

const rank = { debug: 10, info: 20, warn: 30, error: 40 };
const current = process.env.LOG_LEVEL || "info";
const threshold = rank[current] ?? rank.info;

const secretKeys = [
  "password",
  "token",
  "authorization",
  "cookie",
  "set-cookie",
  "apikey",
];

function cleanText(value) {
  return String(value).replace(/[\r\n\t]/g, " ").slice(0, 500);
}

function redact(value) {
  if (Array.isArray(value)) return value.map(redact);
  if (!value || typeof value !== "object") return value;

  return Object.fromEntries(
    Object.entries(value).map(([key, item]) => {
      if (secretKeys.includes(key.toLowerCase())) {
        return [key, "[REDACTED]"];
      }
      return [key, redact(item)];
    }),
  );
}

export function log(level, message, fields = {}) {
  if ((rank[level] ?? 99) < threshold) return;

  const entry = {
    // Caller fields go first so they cannot overwrite the fixed fields below.
    ...redact(fields),
    ts: new Date().toISOString(),
    level,
    service: process.env.SERVICE_NAME || "checkout-api",
    env: process.env.NODE_ENV || "development",
    requestId: fields.requestId || randomUUID(),
    msg: cleanText(message),
  };

  process.stdout.write(`${JSON.stringify(entry)}\n`);
}

log("info", "payment accepted", {
  requestId: "req_demo_001",
  userId: "user_123",
  amount: 4980,
  token: "sk_live_should_not_leak",
});
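
The logger above calls JSON.stringify directly, and JSON.stringify throws on circular references and BigInt values. Since logging failure is one of the behaviors worth testing, here is a minimal sketch of a fallback, assuming a hypothetical safeSerialize helper that is not part of the original logger.

```javascript
// Hypothetical helper: serialize a log entry without ever throwing.
function safeSerialize(entry) {
  try {
    return JSON.stringify(entry);
  } catch (error) {
    // Emit a small, fixed-shape record instead of losing the log line.
    return JSON.stringify({
      ts: new Date().toISOString(),
      level: "error",
      msg: "log serialization failed",
      requestId:
        typeof entry?.requestId === "string" ? entry.requestId : undefined,
      reason: error instanceof Error ? error.name : "unknown",
    });
  }
}

const circular = { requestId: "req_demo_002" };
circular.self = circular; // JSON.stringify cannot serialize this

console.log(safeSerialize(circular));
```

The fallback keeps requestId so the failed line can still be correlated with the rest of the request.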

In a web app, also return the request ID in a response header and keep it in async context. W3C Trace Context defines how traceparent is created and propagated. Treat x-request-id as the app correlation ID and traceparent as the distributed tracing carrier.

import { AsyncLocalStorage } from "node:async_hooks";
import { randomUUID } from "node:crypto";
import type { Request, Response, NextFunction } from "express";
import { log } from "./structured-logger";

type RequestContext = {
  requestId: string;
  traceparent?: string;
  userId?: string;
};

const storage = new AsyncLocalStorage<RequestContext>();

export function getRequestContext() {
  return storage.getStore();
}

export function requestContext(
  req: Request,
  res: Response,
  next: NextFunction,
) {
  const started = performance.now();
  const user = (req as Request & { user?: { id?: string } }).user;
  const requestId =
    req.get("x-request-id") ||
    req.get("cf-ray") ||
    randomUUID();

  const context = {
    requestId,
    traceparent: req.get("traceparent"),
    userId: user?.id,
  };

  res.setHeader("x-request-id", requestId);

  storage.run(context, () => {
    res.on("finish", () => {
      const durationMs = Math.round(performance.now() - started);
      const level = res.statusCode >= 500
        ? "error"
        : res.statusCode >= 400
          ? "warn"
          : "info";

      log(level, "http request completed", {
        requestId,
        method: req.method,
        path: req.path,
        statusCode: res.statusCode,
        durationMs,
      });
    });

    next();
  });
}
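
The middleware forwards whatever traceparent header arrives. If you want to reject malformed values before storing them, a strict check for version 00 of the W3C format (version, 32-hex trace ID, 16-hex parent ID, 2-hex flags) could look like this. parseTraceparent is a hypothetical helper, and it deliberately ignores future versions.

```javascript
// Sketch: strict parser for version 00 of the W3C traceparent header.
const TRACEPARENT_V00 =
  /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header) {
  if (typeof header !== "string") return null;
  const match = TRACEPARENT_V00.exec(header.trim());
  if (!match) return null;

  const [, traceId, parentId, flags] = match;
  // The spec treats all-zero trace and parent IDs as invalid.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;

  return {
    traceId,
    parentId,
    // Bit 0 of the flags byte is the "sampled" flag.
    sampled: (parseInt(flags, 16) & 1) === 1,
  };
}

console.log(
  parseTraceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
  ),
);
```

Returning null for invalid input lets the caller drop the header and start a fresh trace instead of propagating a broken one.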

Keep log levels boring: debug for local detail, info for normal important events, warn for recoverable risk, and error for anything that needs human investigation.

OpenTelemetry basics

OpenTelemetry standardizes app signals and then sends them to your team's backend. For Node.js setup, see the official JavaScript Node.js guide and the JavaScript exporters guide.

npm install @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-proto \
  @opentelemetry/exporter-metrics-otlp-proto \
  @opentelemetry/sdk-metrics

const opentelemetry = require("@opentelemetry/sdk-node");
const {
  getNodeAutoInstrumentations,
} = require("@opentelemetry/auto-instrumentations-node");
const {
  OTLPTraceExporter,
} = require("@opentelemetry/exporter-trace-otlp-proto");
const {
  OTLPMetricExporter,
} = require("@opentelemetry/exporter-metrics-otlp-proto");
const {
  PeriodicExportingMetricReader,
} = require("@opentelemetry/sdk-metrics");

process.env.OTEL_SERVICE_NAME ||= "checkout-api";

const endpoint =
  process.env.OTEL_EXPORTER_OTLP_ENDPOINT ||
  "http://localhost:4318";

const sdk = new opentelemetry.NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: `${endpoint}/v1/traces`,
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: `${endpoint}/v1/metrics`,
    }),
    exportIntervalMillis: 30000,
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

process.on("SIGTERM", () => {
  sdk.shutdown().finally(() => process.exit(0));
});

flowchart LR
  A["User action"] --> B["Application"]
  B --> C["Structured log"]
  B --> D["Metric"]
  B --> E["Trace span"]
  C --> F["Log store"]
  D --> G["Alert rule"]
  E --> H["Trace backend"]
  F --> I["Incident handoff"]
  G --> I
  H --> I

Health checks and alerts

A good health check is more than a 200 OK. It checks the database, cache, queue and external APIs separately, reports latency, and exposes no secrets.

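// Run one check against a deadline and clear the timer either way,
// so a finished check does not leave an 800 ms timer pending.
async function withTimeout(check, ms) {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error("timeout")), ms);
  });
  try {
    return await Promise.race([check(), deadline]);
  } finally {
    clearTimeout(timer);
  }
}

export async function buildHealthReport(checks) {
  const started = Date.now();
  const results = {};

  for (const [name, check] of Object.entries(checks)) {
    const before = Date.now();
    try {
      await withTimeout(check, 800);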
      results[name] = {
        status: "ok",
        latencyMs: Date.now() - before,
      };
    } catch (error) {
      const message =
        error instanceof Error ? error.message : String(error);
      results[name] = {
        status: "fail",
        latencyMs: Date.now() - before,
        reason: message.slice(0, 120),
      };
    }
  }

  const failed = Object.values(results)
    .filter((item) => item.status === "fail")
    .length;

  return {
    status: failed ? "degraded" : "ok",
    uptimeSec: Math.round(process.uptime()),
    totalLatencyMs: Date.now() - started,
    checks: results,
  };
}
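
The report above includes a reason taken from the error message, which can carry hostnames or connection details. One option, sketched here with a hypothetical toPublicHealth helper, is to strip reason before the report leaves the service and to map the overall status to an HTTP code. Returning 503 for a degraded report is an assumption; some teams prefer 200 with a degraded body.

```javascript
// Hypothetical helper: shape a health report for an unauthenticated endpoint.
function toPublicHealth(report) {
  const checks = Object.fromEntries(
    Object.entries(report.checks).map(([name, result]) => [
      name,
      // Keep status and latency, drop the free-text `reason`.
      { status: result.status, latencyMs: result.latencyMs },
    ]),
  );

  return {
    httpStatus: report.status === "ok" ? 200 : 503,
    body: { status: report.status, checks },
  };
}

const sampleReport = {
  status: "degraded",
  uptimeSec: 120,
  totalLatencyMs: 812,
  checks: {
    database: { status: "ok", latencyMs: 12 },
    cache: { status: "fail", latencyMs: 800, reason: "timeout" },
  },
};
console.log(JSON.stringify(toPublicHealth(sampleReport)));
```

The full report, including reason, can still go to the structured log where access is controlled.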

Build alerts on rates and percentiles over a time window, not on a single error log.

groups:
  - name: checkout-api
    rules:
      - alert: CheckoutHigh5xxRate
        expr: |
          sum(rate(http_requests_total{
            service="checkout-api",
            status_code=~"5.."
          }[5m]))
          /
          sum(rate(http_requests_total{
            service="checkout-api"
          }[5m])) > 0.02
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Checkout 5xx rate is above 2%"

      - alert: CheckoutP95LatencyHigh
        expr: |
          histogram_quantile(
            0.95,
            sum by (le) (
              rate(http_request_duration_seconds_bucket{
                service="checkout-api"
              }[5m])
            )
          ) > 1.5
        for: 15m
        labels:
          severity: ticket
        annotations:
          summary: "Checkout p95 latency is above 1.5s"
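
To see what the two expressions measure, the same quantities can be computed over raw samples. This is only an illustration with made-up request data: percentile uses the nearest-rank method, whereas histogram_quantile interpolates inside histogram buckets, so real values will differ slightly.

```javascript
// Nearest-rank percentile; `pct` is an integer such as 95.
function percentile(values, pct) {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((pct * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Share of requests in the window that ended with a 5xx status.
function errorRate(requests) {
  if (requests.length === 0) return 0;
  const failed = requests.filter((r) => r.statusCode >= 500).length;
  return failed / requests.length;
}

// Made-up window: 20 requests, 100 ms to 2000 ms, one of them a 503.
const windowRequests = Array.from({ length: 20 }, (_, i) => ({
  statusCode: i === 0 ? 503 : 200,
  durationMs: (i + 1) * 100,
}));

const p95 = percentile(windowRequests.map((r) => r.durationMs), 95);
const rate = errorRate(windowRequests);

console.log({ p95, rate, page: rate > 0.02, ticket: p95 > 1500 });
// → { p95: 1900, rate: 0.05, page: true, ticket: true }
```

The `for:` clauses in the rules add one more condition that this sketch leaves out: the threshold has to stay breached for the whole duration before the alert fires.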

Three concrete use cases

The first use case is an ecommerce checkout API. Keep orderId, requestId, paymentProvider and amount, but not card data, email, address or tokens. Split alerts into 5xx rate, payment failure rate and provider p95 latency.
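
For a route like this, an allowlist is often easier to review than a denylist, because anything not named is dropped by default. A minimal sketch, assuming a hypothetical pickLogFields helper and using the four field names mentioned above:

```javascript
// Hypothetical allowlist built from the fields named for checkout logs.
const CHECKOUT_LOG_FIELDS = [
  "orderId", "requestId", "paymentProvider", "amount",
];

// Copy only allowlisted keys; everything else never reaches the logger.
function pickLogFields(source, allowed = CHECKOUT_LOG_FIELDS) {
  return Object.fromEntries(
    allowed
      .filter((key) => source[key] !== undefined)
      .map((key) => [key, source[key]]),
  );
}

const order = {
  orderId: "ord_1001",
  requestId: "req_demo_003",
  paymentProvider: "example-pay",
  amount: 4980,
  email: "buyer@example.com",
  address: "1 Example St",
};
console.log(pickLogFields(order));
```

Only orderId, requestId, paymentProvider and amount survive; email and address are dropped without needing a redaction rule.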

The second use case is a SaaS admin dashboard. Logins, permission changes, member invites and plan changes go into audit logs. Invitation email bodies and private notes stay out of logs. Ask Claude Code to separate audit logs from app logs, keep actor ID and target user ID distinct, and add RBAC tests.
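
One way to keep actor and target apart is to make both explicit fields of a dedicated audit event. The builder below is a sketch; buildAuditEvent, the stream field and the action names are assumptions chosen for illustration.

```javascript
// Hypothetical action names for the four audited events.
const AUDIT_ACTIONS = [
  "login", "permission_change", "member_invite", "plan_change",
];

function buildAuditEvent({ action, actorId, targetUserId, requestId }) {
  if (!AUDIT_ACTIONS.includes(action)) {
    throw new Error(`unknown audit action: ${action}`);
  }
  if (!actorId) throw new Error("actorId is required");

  return {
    stream: "audit", // lets a log router separate audit from app logs
    ts: new Date().toISOString(),
    action,
    actorId, // who performed the action
    targetUserId: targetUserId ?? null, // who was affected, if anyone
    requestId: requestId ?? null,
  };
}

console.log(
  buildAuditEvent({
    action: "permission_change",
    actorId: "user_admin_1",
    targetUserId: "user_123",
    requestId: "req_demo_004",
  }),
);
```

The event carries IDs only, so invitation text and private notes have no field to land in.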

The third use case is a media site or blog CMS. Track publish events, CTA clicks, lead form successes, image generation failures and missing translations. Page views alone do not improve revenue. Keep cta_click and generate_lead separate, and review the dashboard alongside the analytics implementation guide.

If you run microservices, read the microservices guide too. Once service.name and environment labels start to drift, OpenTelemetry data becomes hard to search.

Failure modes and incident handoff

Common failures are: logging the full request body, writing free-form messages, creating multiple correlation IDs, not checking dependencies in /healthz, and not assigning an owner to alerts. With Claude Code there is an extra risk: raw production logs can get pasted into a prompt. Mask the logs first, aggregate the metrics, and share only short request IDs or trace IDs.
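
Masking can be a small script that runs before any log excerpt is copied. The patterns below are assumptions (a simple email shape, Bearer tokens, and keys that look like the sk_live_ demo value used earlier); they are a starting point, not a complete PII filter.

```javascript
// Hypothetical patterns; extend them for identifiers your own logs contain.
const MASKS = [
  [/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g, "[EMAIL]"],
  [/Bearer\s+[A-Za-z0-9._~+\/-]+=*/g, "Bearer [TOKEN]"],
  [/\bsk_(?:live|test)_[A-Za-z0-9_]+/g, "[API_KEY]"],
];

// Apply every mask in order and return the cleaned text.
function maskForPrompt(text) {
  return MASKS.reduce(
    (masked, [pattern, label]) => masked.replace(pattern, label),
    String(text),
  );
}

console.log(
  maskForPrompt(
    "user a.b@example.com sent Bearer abc.def-123 with key sk_live_should_not_leak",
  ),
);
// → user [EMAIL] sent Bearer [TOKEN] with key [API_KEY]
```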

{
  "incident_id": "INC-2026-06-02-001",
  "severity": "SEV2",
  "owner": "oncall-api",
  "customer_impact": "Checkout errors for some card payments",
  "first_seen": "2026-06-02T09:15:00+09:00",
  "request_ids": ["req_7f3a", "req_8b21"],
  "trace_ids": ["7bba9f33312b3dbb8b2c2c62bb7abe2d"],
  "dashboards": ["Checkout API overview"],
  "current_hypothesis": "Payment provider latency spike",
  "actions_taken": ["Disabled checkout_v2 feature flag"],
  "next_checks": ["Compare p95 latency by region"],
  "do_not_do": ["Do not paste raw customer data into prompts"]
}
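
A handoff is only useful if the key fields are actually filled in. As a sketch, a hypothetical missingHandoffFields check could run before the JSON is posted; which fields count as required is an assumption here, and a team would set its own list.

```javascript
// Assumed minimum set of fields for a usable handoff.
const REQUIRED_HANDOFF_FIELDS = [
  "incident_id", "severity", "owner", "customer_impact",
  "first_seen", "current_hypothesis", "next_checks",
];

// Return the names of required fields that are absent or empty.
function missingHandoffFields(handoff) {
  return REQUIRED_HANDOFF_FIELDS.filter((key) => {
    const value = handoff[key];
    if (value === undefined || value === null || value === "") return true;
    return Array.isArray(value) && value.length === 0;
  });
}

const draft = {
  incident_id: "INC-2026-06-02-001",
  severity: "SEV2",
  owner: "oncall-api",
  customer_impact: "Checkout errors for some card payments",
  first_seen: "2026-06-02T09:15:00+09:00",
  current_hypothesis: "",
  next_checks: [],
};
console.log(missingHandoffFields(draft));
// → [ 'current_hypothesis', 'next_checks' ]
```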

The permissions guide is also useful for read-only investigation rules.

Dashboard review, CTA and verification

Review the dashboard every week: top five errors, p95 latency, log volume growth, false alerts, redaction failures and open incidents. Every month, take one real incident and check whether a new on-call person can reach the root cause quickly from the logs, metrics and traces.
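
If the log store has no built-in view for the first item, the top errors can be derived from structured entries directly. A minimal sketch, assuming entries shaped like the logger output (level and msg fields) and a hypothetical topErrors helper:

```javascript
// Count error-level entries by message and return the most frequent ones.
function topErrors(entries, limit = 5) {
  const counts = new Map();
  for (const entry of entries) {
    if (entry.level !== "error") continue;
    counts.set(entry.msg, (counts.get(entry.msg) ?? 0) + 1);
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit)
    .map(([msg, count]) => ({ msg, count }));
}

const weekOfLogs = [
  { level: "error", msg: "payment provider timeout" },
  { level: "info", msg: "payment accepted" },
  { level: "error", msg: "cache unavailable" },
  { level: "error", msg: "payment provider timeout" },
];
console.log(topErrors(weekOfLogs));
```

Grouping by msg only works when messages are fixed strings, which is one more reason to keep free-form text out of log messages.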

A solo developer can start by setting daily checks from the free Claude Code cheatsheet. For reusable prompts and setup material, see the products page. If a team needs to design logging standards, CLAUDE.md, permissions, CI and the incident workflow on a real repository, Claude Code training and consultation is the right next step.

When trying this flow, the biggest win came from writing the forbidden fields first. structured-logger.mjs turned the token into [REDACTED] and collapsed newlines into a single line. Simulating a cache failure made the health report go degraded; keeping requestId and traceId in the handoff JSON shortened the review discussion.
