Claude Code × AWS CloudWatch: Logs, Metrics, Alarms, Dashboards, and Incident Review

In a production incident, the biggest problem is not always missing logs. More often there are too many logs, field names vary between services, alarms are noisy, and the team does not know which signal to check first. AWS CloudWatch offers Logs, Metrics, Alarms, Dashboards, and Logs Insights, but they only become useful when the application writes structured data and the team has repeatable queries and a runbook.

This guide shows how to use Claude Code to turn CloudWatch into a practical observability workflow for Lambda, ECS, API Gateway, ALB, and business metrics. The goal is not for the agent to change production on its own. The goal is for Claude Code to quickly draft the JSON log format, Logs Insights queries, metric filters, CloudFormation/SAM alarms, dashboard widgets, a read-only IAM policy, and an incident review prompt.

A few terms in plain language: a structured log is a log with fixed fields, for example JSON. A metric is a number that can be graphed over time, such as request count, 5xx count, or p95 latency. An alarm is a threshold rule on a metric. A runbook is the checklist followed during an incident. Least privilege means that Claude Code initially gets only the read permissions it needs.

Architecture

flowchart LR
  App["Lambda / ECS / API Gateway"] --> Logs["CloudWatch Logs"]
  App --> Metrics["CloudWatch Metrics"]
  Logs --> Insights["Logs Insights queries"]
  Logs --> Filter["Metric filters"]
  Metrics --> Alarms["CloudWatch Alarms"]
  Metrics --> Dash["Dashboards"]
  Insights --> Claude["Claude Code review"]
  Alarms --> Runbook["SNS / PagerDuty / runbook"]

Claude Code works better when the input is clear: log format, time window, service names, the purpose of each alarm, and the allowed commands.

Four real use cases

The first use case is a Lambda batch failure. The REPORT line alone does not reveal the business root cause. Add jobId, the external API name, retry count, and exception name to the JSON log. Then ask Claude Code whether the failures from the last two hours correlate with a particular partner, deploy version, or input type.

The second use case is rising 5xx errors in an ECS API. The ALB metric HTTPCode_Target_5XX_Count only tells you that a target returned an error. Grouping by route, statusCode, and durationMs in Logs Insights lets Claude Code pick out noisy routes, routes with high p95, and new errors that appeared after a deploy.

The third use case is API Gateway latency. Looking at Latency and IntegrationLatency separately shows whether the problem is in the gateway/authorizer or in the Lambda/ECS backend. On the dashboard, favor p95 and p99 over the average.

The fourth use case is alarm fatigue. Warnings can go to Slack and Critical alerts to PagerDuty. Multiple alarms tied to the same root cause can be consolidated with a composite alarm.
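To make the third use case concrete, here is a minimal sketch of why a dashboard should show p95 and p99 rather than only the average. The sample durations are made up for illustration, and the nearest-rank percentile used here is an assumption; CloudWatch may interpolate percentiles differently.

```javascript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values, p) {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

function average(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Hypothetical sample: 90 fast requests (100 ms) and 10 slow ones (3000 ms).
const durations = [...Array(90).fill(100), ...Array(10).fill(3000)];

console.log({
  avg: average(durations),
  p95: percentile(durations, 95),
  p99: percentile(durations, 99),
});
```

In this sample the average looks moderate while p95 and p99 land on the slow requests, which is the tail that users actually feel.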

Structured JSON logs

This Node.js logger can run in both Lambda and ECS. What matters is that the field names stay stable.

// logger.mjs
export function logEvent(level, message, fields = {}) {
  const entry = {
    timestamp: new Date().toISOString(),
    level,
    message,
    service: process.env.SERVICE_NAME || "checkout-api",
    env: process.env.NODE_ENV || "development",
    requestId: fields.requestId || "unknown",
    route: fields.route,
    statusCode: fields.statusCode,
    durationMs: fields.durationMs,
    userId: fields.userId,
    errorName: fields.error?.name,
    errorMessage: fields.error?.message,
  };

  console.log(JSON.stringify(entry));
}

logEvent("ERROR", "payment authorization failed", {
  requestId: "req-123",
  route: "POST /checkout",
  statusCode: 502,
  durationMs: 842,
  userId: "user-456",
  error: new Error("upstream timeout"),
});

Local test:

node logger.mjs | jq .

Do not log card numbers, access tokens, or raw email addresses. CloudWatch Logs has data protection features, but the best design is one where the sensitive value is never emitted in the first place.
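The logger above copies only named fields, which already works as an allowlist. If a service later starts passing arbitrary objects into its logs, a small guard like the following sketch can drop sensitive keys before anything is emitted. The key names in SENSITIVE_KEYS and the helper name dropSensitive are hypothetical; adjust them to your own payloads.

```javascript
// Hypothetical deny-list; extend it to match the fields your services handle.
const SENSITIVE_KEYS = new Set(["cardNumber", "accessToken", "email", "password"]);

// Return a shallow copy without the sensitive keys, so they never reach the log line.
function dropSensitive(fields) {
  const safe = {};
  for (const [key, value] of Object.entries(fields)) {
    if (!SENSITIVE_KEYS.has(key)) safe[key] = value;
  }
  return safe;
}

const safeFields = dropSensitive({
  requestId: "req-123",
  route: "POST /checkout",
  email: "person@example.com",
  cardNumber: "4111111111111111",
});

console.log(JSON.stringify(safeFields));
```

This only checks top-level keys. Nested objects and sensitive values embedded in free-text messages need separate handling.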

Logs Insights queries

Give Claude Code both the schema and the question.

claude -p "
Write CloudWatch Logs Insights queries.
The logs are JSON with the fields timestamp, level, message, service, route, statusCode, durationMs, requestId, userId, errorName.
I need:
1. Top 10 routes by 5xx count in the last 1 hour
2. Routes with the highest p95 latency
3. The timeline for a single requestId
4. errorName values that increased after a deploy
Return only the queries, with short comments.
"

Starter queries:

# 5xx by route
fields @timestamp, route, statusCode, requestId
| filter statusCode >= 500
| stats count(*) as errors by route
| sort errors desc
| limit 10

# p95 latency by route
fields route, durationMs
| filter ispresent(durationMs)
| stats pct(durationMs, 95) as p95, count(*) as requests by route
| sort p95 desc
| limit 20

# requestId timeline
fields @timestamp, level, message, route, statusCode, durationMs
| filter requestId = "req-123"
| sort @timestamp asc

# error names
fields @timestamp, errorName, route
| filter level = "ERROR" and ispresent(errorName)
| stats count(*) as count by errorName, route
| sort count desc
| limit 20

The cost pitfall is a query that scans too large a time range. Logs Insights charges by the amount of data scanned, so keep the time window and the set of log groups narrow.
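One way to keep the window narrow is to compute it in code and refuse anything above a cap. The sketch below is an illustration only: the helper name queryWindow and the 180-minute cap are assumptions. It returns epoch seconds, the unit that `aws logs start-query` expects for `--start-time` and `--end-time`.

```javascript
// Hypothetical cap; pick a value that fits your log volume and budget.
const MAX_WINDOW_MINUTES = 180;

// Build a bounded time range in epoch seconds, ending at `now` (milliseconds).
function queryWindow(minutes, now = Date.now()) {
  if (!Number.isInteger(minutes) || minutes <= 0 || minutes > MAX_WINDOW_MINUTES) {
    throw new RangeError(`window must be 1..${MAX_WINDOW_MINUTES} minutes`);
  }
  const endTime = Math.floor(now / 1000);
  return { startTime: endTime - minutes * 60, endTime };
}

// Example: the last hour.
console.log(queryWindow(60));
```

A request for a multi-day window then fails fast instead of quietly scanning a large amount of data.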

Metric Filter and SAM alarm

Use a Metric Filter when you need to turn a log event into a metric.

Resources:
  CheckoutLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: /aws/lambda/checkout-api
      RetentionInDays: 30

  PaymentFailureMetricFilter:
    Type: AWS::Logs::MetricFilter
    Properties:
      LogGroupName: !Ref CheckoutLogGroup
      FilterPattern: '{ $.level = "ERROR" && $.message = "payment authorization failed" }'
      MetricTransformations:
        - MetricNamespace: MyApp/Business
          MetricName: PaymentFailure
          MetricValue: "1"
          DefaultValue: 0

  PaymentFailureAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: prod-payment-failure-critical
      AlarmDescription: Five or more payment failures in each of two consecutive 5-minute periods
      Namespace: MyApp/Business
      MetricName: PaymentFailure
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 2
      DatapointsToAlarm: 2
      Threshold: 5
      ComparisonOperator: GreaterThanOrEqualToThreshold
      TreatMissingData: notBreaching
      AlarmActions:
        - arn:aws:sns:ap-south-1:123456789012:prod-alerts

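EvaluationPeriods is how many periods to look at. DatapointsToAlarm is how many of those periods must breach the threshold for the alarm to fire. In the template above, the alarm fires only when the 5-minute Sum is 5 or more in two consecutive periods. A single period reacts quickly but is noisy; too many periods delay detection.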
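To get a feel for how EvaluationPeriods and DatapointsToAlarm interact, the following sketch replays a series of per-period sums against the same settings as the template. It is a simplified model, not CloudWatch's actual evaluator: it assumes every period has a datapoint and ignores TreatMissingData. The function name wouldAlarm is hypothetical.

```javascript
// Simplified M-of-N check over the most recent periods (no missing-data handling).
function wouldAlarm(sums, { threshold, evaluationPeriods, datapointsToAlarm }) {
  const recent = sums.slice(-evaluationPeriods);
  if (recent.length < evaluationPeriods) return false; // not enough history yet
  // Mirrors ComparisonOperator: GreaterThanOrEqualToThreshold
  const breaching = recent.filter((value) => value >= threshold).length;
  return breaching >= datapointsToAlarm;
}

const config = { threshold: 5, evaluationPeriods: 2, datapointsToAlarm: 2 };

console.log(wouldAlarm([0, 7, 1], config)); // one spike, then recovery
console.log(wouldAlarm([0, 6, 9], config)); // two breaching periods in a row
console.log(wouldAlarm([0, 7, 1], { ...config, datapointsToAlarm: 1 })); // 1-of-2 is more sensitive
```

Replaying a few days of real sums through a model like this is a cheap way to compare threshold choices before wiring the alarm to a pager.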

Dashboard JSON

Keep the dashboard definition in JSON rather than maintaining it by hand in the console.

{
  "widgets": [
    {
      "type": "metric",
      "x": 0,
      "y": 0,
      "width": 12,
      "height": 6,
      "properties": {
        "region": "ap-south-1",
        "title": "API health: requests, 5xx, latency",
        "view": "timeSeries",
        "metrics": [
          ["AWS/ApplicationELB", "RequestCount", "LoadBalancer", "app/myapp/abc", { "stat": "Sum" }],
          [".", "HTTPCode_Target_5XX_Count", ".", ".", { "stat": "Sum", "yAxis": "right" }],
          ["AWS/ApiGateway", "Latency", "ApiName", "checkout-api", "Stage", "prod", { "stat": "p95" }]
        ],
        "period": 60
      }
    }
  ]
}

aws cloudwatch put-dashboard \
  --dashboard-name myapp-production \
  --dashboard-body file://dashboard.json
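Once a dashboard grows past a few widgets, hand-editing x, y, and width gets error-prone. The sketch below generates one evenly spaced row of metric widgets in the same shape as the JSON above. The helper name metricRow and the widget titles are illustrative assumptions; the metric arrays reuse the placeholder names from the example.

```javascript
const GRID_WIDTH = 24; // CloudWatch dashboards lay widgets out on a 24-column grid

// Build one row of equally wide metric widgets at vertical position y.
function metricRow(y, widgets, height = 6) {
  const width = Math.floor(GRID_WIDTH / widgets.length);
  return widgets.map((widget, index) => ({
    type: "metric",
    x: index * width,
    y,
    width,
    height,
    properties: {
      region: widget.region,
      title: widget.title,
      view: "timeSeries",
      metrics: widget.metrics,
      period: 60,
    },
  }));
}

const dashboard = {
  widgets: metricRow(0, [
    {
      region: "ap-south-1",
      title: "ALB 5xx",
      metrics: [["AWS/ApplicationELB", "HTTPCode_Target_5XX_Count", "LoadBalancer", "app/myapp/abc", { stat: "Sum" }]],
    },
    {
      region: "ap-south-1",
      title: "API p95 latency",
      metrics: [["AWS/ApiGateway", "Latency", "ApiName", "checkout-api", "Stage", "prod", { stat: "p95" }]],
    },
  ]),
};

console.log(JSON.stringify(dashboard, null, 2));
```

Redirecting the output to dashboard.json produces a file that can be passed to the put-dashboard command shown above.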

Incident review prompt and IAM

Keep the commands limited during an investigation.

claude -p "
Review this production incident. List facts and hypotheses separately.
Scope:
- log groups: /aws/lambda/checkout-api, /ecs/checkout-api
- window: 2026-06-02T10:00:00+09:00 to 2026-06-02T11:00:00+09:00
- recent change: checkout-api v1.42.0 deploy
Allowed read-only commands:
- aws logs start-query / get-query-results
- aws cloudwatch get-metric-data
- aws cloudwatch describe-alarms
Output: timeline, impact, top 3 root-cause hypotheses, immediate mitigation, prevention, missing alarms.
"

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:StartQuery",
        "logs:GetQueryResults",
        "logs:FilterLogEvents",
        "cloudwatch:GetMetricData",
        "cloudwatch:DescribeAlarms",
        "cloudwatch:GetDashboard"
      ],
      "Resource": "*"
    }
  ]
}
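The IAM policy is the real enforcement boundary. If you also run commands through your own wrapper script, an extra check like the sketch below can reject anything outside the allowed list before it executes. This is a hypothetical helper and not a Claude Code feature. It assumes commands arrive as an argument array (as with execFile, without a shell) and that the service and operation are the first two arguments after aws.

```javascript
// Operations from the "Allowed read-only commands" list in the prompt above.
const ALLOWED_OPERATIONS = new Set([
  "logs start-query",
  "logs get-query-results",
  "cloudwatch get-metric-data",
  "cloudwatch describe-alarms",
]);

// Accept only `aws <service> <operation> ...` calls whose operation is on the list.
function isAllowedAwsCall(argv) {
  if (!Array.isArray(argv) || argv.length < 3) return false;
  const [bin, service, operation] = argv;
  return bin === "aws" && ALLOWED_OPERATIONS.has(`${service} ${operation}`);
}

console.log(isAllowedAwsCall(["aws", "logs", "start-query", "--log-group-name", "/ecs/checkout-api"]));
console.log(isAllowedAwsCall(["aws", "logs", "delete-log-group", "--log-group-name", "/ecs/checkout-api"]));
```

The check is deliberately conservative: a call that puts global options before the service name is rejected rather than parsed.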

Common pitfalls

Alarm fatigue is the most common mistake. Prioritize user-impact metrics such as 5xx, p95 latency, queue backlog, checkout failures, and failed jobs over internal symptoms such as CPU.

Not setting log retention also gets expensive. Start with 30 days for Lambda and ECS log groups; keep logs longer only when there is an audit requirement.

Avoid high-cardinality metrics. Do not put userId or requestId into metric dimensions. Keep detail in logs and aggregates in metrics.
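A small helper can enforce that split at the point where metrics are published. In the sketch below, the list of keys that are safe as dimensions (DIMENSION_KEYS) and the function name are assumptions. It also assumes route holds a template such as "POST /checkout" rather than a raw URL containing IDs.

```javascript
// Hypothetical allowlist of low-cardinality keys that may become metric dimensions.
const DIMENSION_KEYS = ["service", "env", "route"];

// Split fields into metric dimensions and log-only detail.
function splitForMetrics(fields) {
  const dimensions = {};
  const detail = {};
  for (const [key, value] of Object.entries(fields)) {
    if (DIMENSION_KEYS.includes(key)) {
      dimensions[key] = value;
    } else {
      detail[key] = value;
    }
  }
  return { dimensions, detail };
}

const result = splitForMetrics({
  service: "checkout-api",
  route: "POST /checkout",
  requestId: "req-123",
  userId: "user-456",
});

console.log(JSON.stringify(result));
```

With this in place, requestId and userId stay searchable in the logs without multiplying the number of metric time series.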

Do not paste large volumes of raw logs into Claude Code. Narrow them with a query first, mask sensitive data, and then ask for an evidence-based conclusion.

Next step

Pick one critical API and implement five things: JSON logs, a 5xx query, a p95 query, one Critical alarm, and one dashboard row. Then combine this with Claude Code × AWS ECS/Fargate and Claude Code × AWS IAM.

Hands-on verification: the logger output was checked locally with jq. The SAM snippet can be dropped into an existing template after changing the log group, region, account, and SNS ARN. Calibrate thresholds in a test account before enabling real paging.
