Claude Code Logging and Monitoring: Practical Observability Guide

Start With The Operating Contract

If you ask Claude Code to “add logging”, you will often get more console.log calls. That is useful during a local experiment, but weak during a production incident. A real logging and monitoring baseline must answer: which user action triggered the request, which request ID followed it, which service slowed down, which dependency failed, and what can be shared safely with the next engineer.

The OpenTelemetry observability primer frames observability as understanding a system from the outside and answering new questions without already knowing the internals. The OpenTelemetry overview also makes a practical point: OpenTelemetry is vendor-neutral instrumentation, not the backend itself. This article turns that idea into Claude Code prompts, code snippets, dashboards, alert rules, and review habits.

Masa’s first test on a small checkout API failed in a useful way. The prompt said “make logs detailed”, so Claude Code proposed logging the request body around payment failures. The code did not include card numbers, but it could still leak email addresses, coupons, addresses, and raw customer notes. The better prompt defined forbidden fields first, required requestId and traceparent, and asked for tests proving redaction. The review changed from “does the code look plausible?” to “can this survive production data?”

Signal	Main question	Claude Code constraint
Logs	What happened?	JSON fields, fixed names, PII redaction
Metrics	How bad is it?	rates, p95 latency, error budget signals
Traces	Where did time go?	propagate `traceparent` and name spans
Health checks	Is a dependency usable?	status and latency per dependency

Safe Prompt And CLAUDE.md Rules

Claude Code’s memory documentation explains that CLAUDE.md gives project, user, or organization instructions that Claude reads at the start of sessions. For observability, put durable rules there: log levels, field names, forbidden fields, test commands, dashboards, and incident handoff format. Pair this with the internal CLAUDE.md best practices guide and the official Claude Code permissions and hooks docs when tool boundaries matter.

Claude Code task:
- Add observability to the checkout API only.
- Keep all changes inside src/checkout and tests/checkout.
- Use structured JSON logs with requestId and traceparent.
- Never log passwords, tokens, cookies, email, phone, address,
  raw prompt text, or full request/response bodies.
- Add tests proving redaction and requestId propagation.
- Add a /healthz report with database and cache latency.
- Add alert rules for 5xx rate, p95 latency, and redaction failure.
- Show a diff summary and remaining manual checks at the end.

This prompt is intentionally narrow. It does not ask Claude Code to invent a monitoring platform, change unrelated routes, or paste raw production logs into the conversation. It asks for a reviewable operational diff. In a team rollout, add the same rules to CLAUDE.md so future edits keep the same field names and safety boundaries.

Structured Logs And Correlation IDs

The OWASP Logging Cheat Sheet treats logging as a security feature that must be tested, reviewed, and protected. It also calls out log injection, access control, disk exhaustion, and logging-system failures. That maps directly to Claude Code work: ask it to add redaction tests, failure tests, and stable schemas, not just text messages.

The first snippet is a dependency-free JSON logger. Save it as structured-logger.mjs and run it with Node.js 18 or newer.

import { randomUUID } from "node:crypto";

const rank = { debug: 10, info: 20, warn: 30, error: 40 };
const current = process.env.LOG_LEVEL || "info";
const threshold = rank[current] ?? rank.info;

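// Keys are compared in lowercase. The list mirrors the forbidden
// fields from the prompt above; extend it to match your CLAUDE.md.
const secretKeys = [
  "password",
  "token",
  "authorization",
  "cookie",
  "set-cookie",
  "apikey",
  "email",
  "phone",
  "address",
];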

function cleanText(value) {
  return String(value).replace(/[\r\n\t]/g, " ").slice(0, 500);
}

function redact(value) {
  if (Array.isArray(value)) return value.map(redact);
  if (!value || typeof value !== "object") return value;

  return Object.fromEntries(
    Object.entries(value).map(([key, item]) => {
      if (secretKeys.includes(key.toLowerCase())) {
        return [key, "[REDACTED]"];
      }
      return [key, redact(item)];
    }),
  );
}

export function log(level, message, fields = {}) {
  if ((rank[level] ?? 99) < threshold) return;

  const entry = {
    ts: new Date().toISOString(),
    level,
    service: process.env.SERVICE_NAME || "checkout-api",
    env: process.env.NODE_ENV || "development",
    requestId: fields.requestId || randomUUID(),
    msg: cleanText(message),
    ...redact(fields),
  };

  process.stdout.write(`${JSON.stringify(entry)}\n`);
}

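// Demo call so the file prints one line when run directly.
// Remove it before importing log() from other modules.
log("info", "payment accepted", {
  requestId: "req_demo_001",
  userId: "user_123",
  amount: 4980,
  token: "sk_live_should_not_leak",
});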
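
The prompt earlier asks for tests proving redaction. The following is a minimal sketch of such a check, not a full test suite. It assumes a compact copy of the redact helper so it can run on its own, and findLeaks plus the sample values are hypothetical names introduced only for illustration.

```javascript
// Compact copy of the redact helper so this sketch runs on its own.
// In a real repository you would export `redact` from the logger instead.
const secretKeys = ["password", "token", "authorization", "cookie"];

function redact(value) {
  if (Array.isArray(value)) return value.map(redact);
  if (!value || typeof value !== "object") return value;
  return Object.fromEntries(
    Object.entries(value).map(([key, item]) =>
      secretKeys.includes(key.toLowerCase())
        ? [key, "[REDACTED]"]
        : [key, redact(item)],
    ),
  );
}

// Hypothetical helper: returns the secret values that survived redaction.
function findLeaks(fields, secrets) {
  const line = JSON.stringify(redact(fields));
  return secrets.filter((secret) => line.includes(secret));
}

// Nested objects and arrays are the cases that are easy to miss.
const sample = {
  requestId: "req_demo_001",
  headers: { Authorization: "Bearer abc123", accept: "application/json" },
  items: [{ sku: "A1", token: "sk_live_should_not_leak" }],
};

const leaks = findLeaks(sample, ["abc123", "sk_live_should_not_leak"]);
console.log(leaks.length === 0 ? "redaction ok" : `leaked: ${leaks}`);
```

Checking the serialized line rather than individual keys is a deliberate choice here: it catches secrets that slip through in nested structures the key-by-key view would miss.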

For web apps, return the request ID to the caller and keep it in async context. The W3C Trace Context processing model explains how traceparent is created or propagated across systems. Use x-request-id as your application correlation ID and traceparent as the distributed tracing carrier.

import { AsyncLocalStorage } from "node:async_hooks";
import { randomUUID } from "node:crypto";
import type { Request, Response, NextFunction } from "express";
import { log } from "./structured-logger";

type RequestContext = {
  requestId: string;
  traceparent?: string;
  userId?: string;
};

const storage = new AsyncLocalStorage<RequestContext>();

export function getRequestContext() {
  return storage.getStore();
}

export function requestContext(
  req: Request,
  res: Response,
  next: NextFunction,
) {
  const started = performance.now();
  const user = (req as Request & { user?: { id?: string } }).user;
  const requestId =
    req.get("x-request-id") ||
    req.get("cf-ray") ||
    randomUUID();

  const context = {
    requestId,
    traceparent: req.get("traceparent"),
    userId: user?.id,
  };

  res.setHeader("x-request-id", requestId);

  storage.run(context, () => {
    res.on("finish", () => {
      const durationMs = Math.round(performance.now() - started);
      const level = res.statusCode >= 500
        ? "error"
        : res.statusCode >= 400
          ? "warn"
          : "info";

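      log(level, "http request completed", {
        requestId,
        // Undefined when the caller sent no traceparent header;
        // JSON.stringify then omits the field.
        traceparent: context.traceparent,
        method: req.method,
        path: req.path,
        statusCode: res.statusCode,
        durationMs,
      });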
    });

    next();
  });
}
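
The middleware above passes the inbound traceparent header through unchanged. If you want to reject malformed values before they reach logs, a validator along these lines may help. It is a sketch based on the four-field header layout in W3C Trace Context (version, trace ID, parent ID, flags), it accepts only that layout, and parseTraceparent is a hypothetical helper name.

```javascript
// Sketch of a traceparent validator based on the W3C Trace Context
// header layout: version-traceid-parentid-flags, all lowercase hex.
const TRACEPARENT =
  /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Hypothetical helper: returns the parsed parts, or null when invalid.
function parseTraceparent(header) {
  if (typeof header !== "string") return null;
  const match = TRACEPARENT.exec(header.trim());
  if (!match) return null;

  const [, version, traceId, parentId, flags] = match;
  // Version ff is forbidden, and all-zero IDs are invalid.
  if (version === "ff") return null;
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;

  return {
    version,
    traceId,
    parentId,
    // The lowest bit of the flags byte is the "sampled" flag.
    sampled: (parseInt(flags, 16) & 1) === 1,
  };
}

const parsed = parseTraceparent(
  "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
);
console.log(parsed?.traceId, parsed?.sampled);
```

Logging only the parsed traceId instead of the raw header keeps the log schema stable even when a client sends an unexpected value.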

Keep level semantics boring: debug is temporary local detail, info is a meaningful normal event, warn is a recoverable risk, and error requires investigation. Code that Claude Code generates should never let a failed log call change control flow.
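
One way to keep a failing log call away from business logic is a thin wrapper. The sketch below assumes a logger with the same (level, message, fields) shape as the earlier sample. makeSafeLog and brokenLog are hypothetical names, and the stderr fallback is only one possible choice.

```javascript
// Hypothetical wrapper: `logFn` stands in for the log() function above.
function makeSafeLog(logFn) {
  return function safeLog(level, message, fields = {}) {
    try {
      logFn(level, message, fields);
      return true;
    } catch (error) {
      // Last-resort fallback; never rethrow into business logic.
      try {
        process.stderr.write(`log failure: ${String(error)}\n`);
      } catch {
        // Nothing else to do if stderr is also unavailable.
      }
      return false;
    }
  };
}

// Simulate a logger that fails, for example on a circular structure.
const brokenLog = () => {
  throw new Error("serialization failed");
};
const safeLog = makeSafeLog(brokenLog);

const ok = safeLog("info", "payment accepted", { requestId: "req_1" });
console.log(ok);
```

The boolean return value exists mainly so a test can assert that the failure was swallowed; request handlers can ignore it.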

OpenTelemetry Basics

OpenTelemetry is where logs, metrics, and traces become easier to correlate. Use it to standardize how telemetry leaves your app, then send that data to the backend your team already uses. The official JavaScript Node.js guide and JavaScript exporters guide are the primary references for the Node setup below.

npm install @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-proto \
  @opentelemetry/exporter-metrics-otlp-proto \
  @opentelemetry/sdk-metrics

const opentelemetry = require("@opentelemetry/sdk-node");
const {
  getNodeAutoInstrumentations,
} = require("@opentelemetry/auto-instrumentations-node");
const {
  OTLPTraceExporter,
} = require("@opentelemetry/exporter-trace-otlp-proto");
const {
  OTLPMetricExporter,
} = require("@opentelemetry/exporter-metrics-otlp-proto");
const {
  PeriodicExportingMetricReader,
} = require("@opentelemetry/sdk-metrics");

process.env.OTEL_SERVICE_NAME ||= "checkout-api";

const endpoint =
  process.env.OTEL_EXPORTER_OTLP_ENDPOINT ||
  "http://localhost:4318";

const sdk = new opentelemetry.NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: `${endpoint}/v1/traces`,
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: `${endpoint}/v1/metrics`,
    }),
    exportIntervalMillis: 30000,
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

process.on("SIGTERM", () => {
  sdk.shutdown().finally(() => process.exit(0));
});
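
The setup above builds exporter URLs by string concatenation, so a base endpoint that ends with a slash would yield a double slash in the path. A small helper can guard against that. This is a sketch: otlpUrl is a hypothetical name and the collector hostname is made up for the example.

```javascript
// Hypothetical helper: joins an OTLP base endpoint and a signal path
// without producing a double slash when the base ends with "/".
function otlpUrl(base, signal) {
  const trimmed = base.replace(/\/+$/, "");
  return `${trimmed}/v1/${signal}`;
}

console.log(otlpUrl("http://localhost:4318/", "traces"));
console.log(otlpUrl("http://collector:4318", "metrics"));
```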

flowchart LR
  A["User action"] --> B["Application"]
  B --> C["Structured log"]
  B --> D["Metric"]
  B --> E["Trace span"]
  C --> F["Log store"]
  D --> G["Alert rule"]
  E --> H["Trace backend"]
  F --> I["Incident handoff"]
  G --> I
  H --> I

Health Checks And Alert Rules

A useful health check is not just 200 OK. It should check the database, cache, queue, and critical external APIs separately, return latency per dependency, and avoid secrets. It should be cheap enough for load balancers and clear enough for humans.

// Run one check against a deadline and always clear the timer,
// so a fast check does not leave a pending timer behind.
async function withTimeout(check, ms) {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error("timeout")), ms);
  });
  try {
    return await Promise.race([check(), deadline]);
  } finally {
    clearTimeout(timer);
  }
}

export async function buildHealthReport(checks) {
  const started = Date.now();
  const results = {};

  // Checks run one at a time; each gets its own 800 ms budget.
  for (const [name, check] of Object.entries(checks)) {
    const before = Date.now();
    try {
      await withTimeout(check, 800);
      results[name] = {
        status: "ok",
        latencyMs: Date.now() - before,
      };
    } catch (error) {
      const message =
        error instanceof Error ? error.message : String(error);
      results[name] = {
        status: "fail",
        latencyMs: Date.now() - before,
        // Truncated on purpose; driver errors can be long.
        reason: message.slice(0, 120),
      };
    }
  }

  const failed = Object.values(results)
    .filter((item) => item.status === "fail")
    .length;

  return {
    status: failed ? "degraded" : "ok",
    uptimeSec: Math.round(process.uptime()),
    totalLatencyMs: Date.now() - started,
    checks: results,
  };
}
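
Load balancers usually read only the HTTP status code, so the report needs a mapping to one. The sketch below shows one possible policy, where only dependencies you mark as critical turn the response into a 503. httpStatusFor and the default critical list are assumptions, not part of the report format above.

```javascript
// Hypothetical mapping from a health report to an HTTP status code.
// Which dependencies count as critical is an assumption per service.
function httpStatusFor(report, critical = ["database"]) {
  const failedCritical = critical.some(
    (name) => report.checks[name]?.status === "fail",
  );
  return failedCritical ? 503 : 200;
}

// Same shape as the report returned by buildHealthReport.
const report = {
  status: "degraded",
  checks: {
    database: { status: "ok", latencyMs: 12 },
    cache: { status: "fail", latencyMs: 800, reason: "timeout" },
  },
};

console.log(httpStatusFor(report)); // 200: cache is down, database is up
console.log(httpStatusFor(report, ["database", "cache"])); // 503
```

Keeping the JSON body detailed while the status code stays coarse lets humans read the per-dependency view without the load balancer removing an instance over a non-critical cache.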

Alert on rates and percentiles, not one noisy log line. This Prometheus-style example separates paging alerts from ticket alerts.

groups:
  - name: checkout-api
    rules:
      - alert: CheckoutHigh5xxRate
        expr: |
          sum(rate(http_requests_total{
            service="checkout-api",
            status_code=~"5.."
          }[5m]))
          /
          sum(rate(http_requests_total{
            service="checkout-api"
          }[5m])) > 0.02
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Checkout 5xx rate is above 2%"

      - alert: CheckoutP95LatencyHigh
        expr: |
          histogram_quantile(
            0.95,
            sum by (le) (
              rate(http_request_duration_seconds_bucket{
                service="checkout-api"
              }[5m])
            )
          ) > 1.5
        for: 15m
        labels:
          severity: ticket
        annotations:
          summary: "Checkout p95 latency is above 1.5s"
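
Before committing to thresholds such as 2% and 1.5 seconds, it can help to compute the same two signals from a handful of raw samples. The sketch below does that with a plain ratio and a nearest-rank percentile. Prometheus estimates quantiles from histogram buckets, so its numbers will differ somewhat, and the sample values here are invented.

```javascript
// Share of responses with a 5xx status code.
function errorRatio(statusCodes) {
  if (statusCodes.length === 0) return 0;
  const failures = statusCodes.filter((code) => code >= 500).length;
  return failures / statusCodes.length;
}

// Nearest-rank percentile over raw samples (p between 0 and 1).
function percentile(samples, p) {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil(p * sorted.length));
  return sorted[rank - 1];
}

// Invented samples: ten responses and their durations in seconds.
const codes = [200, 200, 500, 201, 404, 200, 200, 503, 200, 200];
const durations = [0.2, 0.3, 0.25, 1.9, 0.4, 0.35, 0.5, 0.45, 0.3, 0.28];

console.log(errorRatio(codes)); // 0.2
console.log(percentile(durations, 0.95)); // 1.9
```

With these samples both rules would fire if the condition held for the full `for` window: 0.2 is above 0.02 and 1.9 is above 1.5.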

Three Practical Use Cases

The first use case is an ecommerce checkout API. Keep orderId, requestId, paymentProvider, and amount, but never log card data, email, address, or access tokens. Alerts should separate 5xx rate, payment failure rate, and provider latency. During an incident, logs identify the order, traces show the payment-provider call, and metrics show the blast radius.
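
For checkout logs, an allowlist is often easier to review than a growing denylist. The sketch below copies only the four fields named above and drops everything else. pickLogFields, the provider name, and the sample event are hypothetical.

```javascript
// Allowlist for checkout logs, taken from the field names above.
const CHECKOUT_LOG_FIELDS = [
  "orderId",
  "requestId",
  "paymentProvider",
  "amount",
];

// Hypothetical helper: copies only allowlisted top-level fields.
function pickLogFields(event, allowed = CHECKOUT_LOG_FIELDS) {
  const picked = {};
  for (const key of allowed) {
    if (event[key] !== undefined) picked[key] = event[key];
  }
  return picked;
}

const event = {
  orderId: "ord_42",
  requestId: "req_7f3a",
  paymentProvider: "example-pay",
  amount: 4980,
  email: "customer@example.com",
  cardLast4: "4242",
};

console.log(JSON.stringify(pickLogFields(event)));
```

A new field added to the event later stays out of the logs until someone adds it to the list, which turns every logging change into a reviewable diff.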

The second use case is a SaaS admin dashboard. Login, permission changes, member invitations, and plan changes belong in audit logs. Invitation email bodies and private notes do not. Ask Claude Code to separate audit logs from application logs, store actor ID and target user ID as different fields, and add RBAC tests.
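
A sketch of what a separate audit event could look like follows. The stream field, the action names, and buildAuditEvent are illustrative assumptions; the point is that actor and target are distinct fields and that unknown actions are rejected instead of logged as free text.

```javascript
// Hypothetical audit-event builder; field names are illustrative.
const AUDIT_ACTIONS = new Set([
  "login",
  "permission_change",
  "member_invite",
  "plan_change",
]);

function buildAuditEvent({ action, actorId, targetUserId, requestId }) {
  if (!AUDIT_ACTIONS.has(action)) {
    throw new Error(`unknown audit action: ${action}`);
  }
  if (!actorId) throw new Error("actorId is required");

  return {
    stream: "audit", // keeps audit events apart from application logs
    ts: new Date().toISOString(),
    action,
    actorId,
    targetUserId: targetUserId ?? null,
    requestId: requestId ?? null,
  };
}

const auditEvent = buildAuditEvent({
  action: "permission_change",
  actorId: "user_admin_1",
  targetUserId: "user_123",
  requestId: "req_8b21",
});
console.log(auditEvent.stream, auditEvent.actorId, auditEvent.targetUserId);
```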

The third use case is a media site or blog CMS. Track publish events, CTA clicks, lead-form success, image-generation failures, and missing translations. Page views alone do not improve revenue. Use separate cta_click and generate_lead events, then pair this article with the internal analytics implementation guide.

If your system is split into services, read the microservices guide as well. OpenTelemetry becomes frustrating when service.name, route names, or deployment labels drift between services.

Failure Modes And Incident Handoff

The common failures are predictable. Teams log whole request bodies because they want context. They use free-form messages that cannot be queried. They create requestId, correlationId, traceId, and transactionId without deciding which one is authoritative. Their /healthz endpoint always says ok because it never checks dependencies. They create alerts but never decide who owns the dashboard.

Claude Code adds one more risk: raw production logs can become prompt input. Before asking for incident help, mask logs, aggregate metrics, and share only short trace IDs or request IDs. The Claude Code permissions guide is a good companion when you need read-only investigation habits.

{
  "incident_id": "INC-2026-06-02-001",
  "severity": "SEV2",
  "owner": "oncall-api",
  "customer_impact": "Checkout errors for some card payments",
  "first_seen": "2026-06-02T09:15:00+09:00",
  "request_ids": ["req_7f3a", "req_8b21"],
  "trace_ids": ["7bba9f33312b3dbb8b2c2c62bb7abe2d"],
  "dashboards": ["Checkout API overview"],
  "current_hypothesis": "Payment provider latency spike",
  "actions_taken": ["Disabled checkout_v2 feature flag"],
  "next_checks": ["Compare p95 latency by region"],
  "do_not_do": ["Do not paste raw customer data into prompts"]
}
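
A handoff like this can be checked mechanically before anyone pastes it into a prompt. The sketch below verifies a few required keys and rejects anything that looks like an email address. Which keys are required is an assumption, and validateHandoff is a hypothetical helper.

```javascript
// Assumed minimum set of keys; adjust to your own handoff format.
const REQUIRED_KEYS = [
  "incident_id",
  "severity",
  "owner",
  "request_ids",
  "next_checks",
];
const EMAIL_LIKE = /[^\s@"]+@[^\s@"]+\.[^\s@"]+/;

// Hypothetical check: returns a list of problems, empty when clean.
function validateHandoff(handoff) {
  const problems = [];
  for (const key of REQUIRED_KEYS) {
    if (handoff[key] === undefined) problems.push(`missing ${key}`);
  }
  if (EMAIL_LIKE.test(JSON.stringify(handoff))) {
    problems.push("contains an email-like value");
  }
  return problems;
}

const handoff = {
  incident_id: "INC-2026-06-02-001",
  severity: "SEV2",
  owner: "oncall-api",
  request_ids: ["req_7f3a", "req_8b21"],
  next_checks: ["Compare p95 latency by region"],
};

console.log(validateHandoff(handoff)); // []
console.log(validateHandoff({ ...handoff, owner: "masa@example.com" }));
```

The email pattern is intentionally loose. It is meant to catch obvious customer data, not to validate addresses.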

Dashboard Review, CTA, And Verification

Review dashboards weekly. Look at the top five errors, p95 latency, log-volume growth, false-positive alerts, redaction failures, and unresolved incidents. Once a month, pick a real incident and ask whether the logs, metrics, and traces would have led a new on-call engineer to the cause faster.
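
If your log store cannot produce a "top five errors" view yet, the same list can be derived from exported JSON log entries. The sketch below assumes entries shaped like the structured logger output, with level and msg fields. topErrors is a hypothetical helper.

```javascript
// Hypothetical helper: counts error-level entries by message.
function topErrors(entries, limit = 5) {
  const counts = new Map();
  for (const entry of entries) {
    if (entry.level !== "error") continue;
    counts.set(entry.msg, (counts.get(entry.msg) ?? 0) + 1);
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit)
    .map(([msg, count]) => ({ msg, count }));
}

const entries = [
  { level: "error", msg: "payment provider timeout" },
  { level: "info", msg: "payment accepted" },
  { level: "error", msg: "payment provider timeout" },
  { level: "error", msg: "cache unavailable" },
];

console.log(topErrors(entries));
```

Grouping by msg only works when messages are fixed strings with variable data kept in fields, which is one more reason to avoid free-form log text.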

For a quick personal baseline, start with the free Claude Code cheatsheet. For reusable prompts and setup material, use the products page. If your team needs help designing logging standards, CLAUDE.md, permissions, CI checks, and incident workflows against a real repository, use Claude Code training and consultation.

In a trial run of the workflow in this article, the biggest improvement came from writing forbidden fields before writing any logger code. The structured-logger.mjs sample redacted the token and normalized newlines into one log line. A simulated cache failure changed the health report to degraded, and a short handoff JSON with requestId and traceId made review discussion noticeably shorter.
