Claude Code × AWS CloudWatch: Logs, Metrics, Alarms, Dashboards, and Incident Review

The painful incident is not the one with no logs. It is the one with thousands of logs and no clear path from symptom to decision. AWS CloudWatch gives you Logs, Metrics, Alarms, Dashboards, and Logs Insights, but those pieces only become useful when the application emits structured data and the team has repeatable queries and runbooks.

This guide shows how I would use Claude Code to turn CloudWatch into an implementation-oriented observability workflow for Lambda, ECS, API Gateway, ALB, and business metrics. The goal is not to let an agent change production blindly. The goal is to let Claude Code draft the boring, error-prone parts: JSON log shape, Logs Insights queries, metric filters, CloudFormation/SAM alarms, dashboard widgets, IAM read-only access, and an incident-review prompt.

Plain terms first. Structured logs are logs with stable fields, usually JSON, instead of only human sentences. Metrics are numeric time series such as request count, 5xx count, or p95 latency. Alarms are threshold rules over metrics. A runbook is the checklist responders follow during an incident. Least privilege means the role used by Claude Code can read only the AWS data needed for investigation unless you intentionally grant write access.

Architecture

flowchart LR
  App["Lambda / ECS / API Gateway"] --> Logs["CloudWatch Logs"]
  App --> Metrics["CloudWatch Metrics"]
  Logs --> Insights["Logs Insights queries"]
  Logs --> Filter["Metric filters"]
  Metrics --> Alarms["CloudWatch Alarms"]
  Metrics --> Dash["Dashboards"]
  Insights --> Claude["Claude Code incident review"]
  Alarms --> Runbook["SNS / PagerDuty / runbook"]

Claude Code helps when the inputs are explicit: log format, time window, service names, alarm intent, and the level of authority it has. Treat it as a fast reviewer and generator, not as an autonomous production operator.

Four Real Use Cases

For a failing Lambda batch, plain REPORT lines are not enough. Add job ID, partner API name, retry count, and exception name to JSON logs. Then ask Claude Code to read the last two hours and separate facts from hypotheses: is the failure tied to one partner, one deployment, one region, or one input class?

For rising ECS API 5xx, the ALB metric tells you that targets failed, not why. Combine HTTPCode_Target_5XX_Count with Logs Insights grouped by route, statusCode, and durationMs. Ask Claude Code for the top noisy routes, the slowest p95 routes, and errors that first appeared after the latest deployment.

For API Gateway latency, compare Latency and IntegrationLatency. If both rise together, the backend is likely slow. If only total latency rises, gateway configuration, authorizers, or throttling may be involved. A dashboard row with p95 and p99 is more useful than an average latency graph.
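The Latency versus IntegrationLatency comparison can be sketched as a small classifier. This is a rough illustration, not an AWS API: `classifyLatency`, the 1.5x ratio, and the label names are assumptions, and subtracting two p95 values only approximates gateway overhead.

```javascript
// Hypothetical helper: compares p95 Latency with p95 IntegrationLatency (both in ms)
// for a current window against a baseline window. The 1.5x ratio is an assumed cutoff.
function classifyLatency(current, baseline, ratio = 1.5) {
  // Rough gateway overhead: total latency minus time spent in the integration.
  const overhead = (sample) => sample.latency - sample.integrationLatency;

  const backendUp = current.integrationLatency >= baseline.integrationLatency * ratio;
  const totalUp = current.latency >= baseline.latency * ratio;
  const overheadUp = overhead(current) >= Math.max(overhead(baseline), 1) * ratio;

  if (backendUp && totalUp) return "backend-slow";
  if (totalUp && overheadUp) return "gateway-overhead";
  return "no-clear-shift";
}

const baseline = { latency: 120, integrationLatency: 100 };
console.log(classifyLatency({ latency: 320, integrationLatency: 300 }, baseline));
console.log(classifyLatency({ latency: 400, integrationLatency: 105 }, baseline));
```

Treat the result as a hint about where to look next (backend logs versus authorizer and gateway settings), not as a root cause.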

For alarm fatigue, use Claude Code to review whether each alarm maps to user impact. CPU warnings can go to Slack. Critical 5xx, checkout failure, or queue backlog can page. Composite alarms can reduce noise when several symptoms have the same cause.

Emit Structured JSON Logs

Start by making logs queryable. This Node.js logger works in Lambda or ECS and keeps field names consistent.

// logger.mjs
export function logEvent(level, message, fields = {}) {
  const entry = {
    timestamp: new Date().toISOString(),
    level,
    message,
    service: process.env.SERVICE_NAME || "checkout-api",
    env: process.env.NODE_ENV || "development",
    requestId: fields.requestId || "unknown",
    route: fields.route,
    statusCode: fields.statusCode,
    durationMs: fields.durationMs,
    userId: fields.userId,
    errorName: fields.error?.name,
    errorMessage: fields.error?.message,
  };

  console.log(JSON.stringify(entry));
}

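// Example call so that `node logger.mjs` prints one entry.
// Remove it before importing this module from application code.
logEvent("ERROR", "payment authorization failed", {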
  requestId: "req-123",
  route: "POST /checkout",
  statusCode: 502,
  durationMs: 842,
  userId: "user-456",
  error: new Error("upstream timeout"),
});

Local smoke test:

node logger.mjs | jq .

Do not log card numbers, access tokens, or raw email addresses. CloudWatch Logs has data-protection features, but the safer design is to avoid emitting sensitive values in the first place.
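One way to enforce that rule is to mask fields before they reach the logger. The sketch below is a minimal, shallow redactor; `redactFields`, the key list, and the email pattern are assumptions to adapt to your own schema, and nested objects are not inspected.

```javascript
// Hypothetical helper: masks sensitive-looking values before they are logged.
// The key names and the email pattern are assumptions, not a complete PII detector.
const SENSITIVE_KEYS = new Set(["password", "token", "accessToken", "authorization", "cardNumber"]);
const EMAIL_PATTERN = /[^\s@]+@[^\s@]+\.[^\s@]+/g;

function redactFields(fields) {
  const safe = {};
  for (const [key, value] of Object.entries(fields)) {
    if (SENSITIVE_KEYS.has(key)) {
      safe[key] = "[REDACTED]"; // drop the value entirely for known-sensitive keys
    } else if (typeof value === "string") {
      safe[key] = value.replace(EMAIL_PATTERN, "[EMAIL]"); // mask email-shaped substrings
    } else {
      safe[key] = value; // numbers, booleans, and Error objects pass through unchanged
    }
  }
  return safe;
}

console.log(redactFields({ accessToken: "abc", note: "contact a@b.co now", statusCode: 502 }));
```

With the logger above, the call site would become `logEvent("ERROR", "payment authorization failed", redactFields(fields))`.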

Create Logs Insights Queries

Give Claude Code the log schema and the operational question. That usually produces a better query than asking for “some CloudWatch queries.”

claude -p "
Create CloudWatch Logs Insights queries.
The logs are JSON with timestamp, level, message, service, route, statusCode, durationMs, requestId, userId.
I need:
1. Top 10 routes by 5xx count in the last hour
2. Routes with the highest p95 latency
3. Timeline for one requestId
4. Error names that increased after a deployment
Return only the queries with short purpose comments.
"

Use these as a starter kit:

# 5xx count by route
fields @timestamp, route, statusCode, requestId
| filter statusCode >= 500
| stats count(*) as errors by route
| sort errors desc
| limit 10

# p95 latency by route
fields route, durationMs
| filter ispresent(durationMs)
| stats pct(durationMs, 95) as p95, count(*) as requests by route
| sort p95 desc
| limit 20

# request timeline
fields @timestamp, level, message, route, statusCode, durationMs
| filter requestId = "req-123"
| sort @timestamp asc

# error names by route; run once for the window before the deployment
# and once for the window after it, then compare the counts
fields @timestamp, errorName, route
| filter level = "ERROR" and ispresent(errorName)
| stats count(*) as count by errorName, route
| sort count desc
| limit 20

The common cost pitfall is scanning too much data. Keep the time window tight and query only the log groups you need.
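A small guard can make the tight window the default. This sketch builds the parameters a Logs Insights StartQuery call takes (log group names, epoch-second start and end times, query string, limit) and refuses wide windows. `buildQueryParams` and the 180-minute ceiling are assumptions, not AWS limits.

```javascript
// Hypothetical helper: builds StartQuery-style parameters with a capped time window.
// maxMinutes = 180 is an assumed team policy, not an AWS limit.
function buildQueryParams({ logGroupNames, queryString, minutes, now = Date.now(), maxMinutes = 180 }) {
  if (!Array.isArray(logGroupNames) || logGroupNames.length === 0) {
    throw new Error("name at least one log group explicitly");
  }
  if (!(minutes > 0) || minutes > maxMinutes) {
    throw new Error(`window must be greater than 0 and at most ${maxMinutes} minutes`);
  }
  const endTime = Math.floor(now / 1000); // StartQuery expects epoch seconds
  return {
    logGroupNames,
    startTime: endTime - minutes * 60,
    endTime,
    queryString,
    limit: 100,
  };
}

console.log(buildQueryParams({
  logGroupNames: ["/aws/lambda/checkout-api"],
  queryString: "fields @timestamp, route | filter statusCode >= 500 | limit 10",
  minutes: 60,
}));
```

The returned object can be passed to whichever client you use to start the query; the point is that the window and log groups are decided before any data is scanned.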

Add Metric Filters and SAM Alarms

Use metric filters when a log event should become a CloudWatch metric, such as checkout failures. This CloudFormation/SAM fragment creates a log group, a metric filter, and an alarm.

Resources:
  CheckoutLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: /aws/lambda/checkout-api
      RetentionInDays: 30

  PaymentFailureMetricFilter:
    Type: AWS::Logs::MetricFilter
    Properties:
      LogGroupName: !Ref CheckoutLogGroup
      FilterPattern: '{ $.level = "ERROR" && $.message = "payment authorization failed" }'
      MetricTransformations:
        - MetricNamespace: MyApp/Business
          MetricName: PaymentFailure
          MetricValue: "1"
          DefaultValue: 0

  PaymentFailureAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: prod-payment-failure-critical
      AlarmDescription: Five or more payment failures in each of two consecutive five-minute periods
      Namespace: MyApp/Business
      MetricName: PaymentFailure
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 2
      DatapointsToAlarm: 2
      Threshold: 5
      ComparisonOperator: GreaterThanOrEqualToThreshold
      TreatMissingData: notBreaching
      AlarmActions:
        - arn:aws:sns:us-east-1:123456789012:prod-alerts

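EvaluationPeriods sets how many of the most recent periods are evaluated. DatapointsToAlarm sets how many of those periods must breach the threshold. With Period 300 and both values set to 2, the alarm fires only when two consecutive five-minute sums each reach the threshold. A single period reacts fast but is noisy; too many periods delay detection.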
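To reason about these settings before deploying, a simplified model of the evaluation helps. The function below is a sketch, not CloudWatch's actual evaluator: `wouldAlarm` is a hypothetical name, it assumes GreaterThanOrEqualToThreshold, and it ignores missing data.

```javascript
// Simplified "M out of N" alarm model. Each datapoint is one period's Sum.
// Assumes GreaterThanOrEqualToThreshold and no missing datapoints.
function wouldAlarm(datapoints, { threshold, evaluationPeriods, datapointsToAlarm }) {
  const recent = datapoints.slice(-evaluationPeriods); // the most recent N periods
  if (recent.length < evaluationPeriods) return false; // not enough history yet
  const breaching = recent.filter((value) => value >= threshold).length;
  return breaching >= datapointsToAlarm;
}

// Settings from the fragment above: Threshold 5, EvaluationPeriods 2, DatapointsToAlarm 2.
const settings = { threshold: 5, evaluationPeriods: 2, datapointsToAlarm: 2 };
console.log(wouldAlarm([3, 2], settings)); // five failures spread over ten minutes
console.log(wouldAlarm([5, 6], settings)); // both periods reach the threshold
```

In this model the first case stays quiet because neither five-minute sum reaches 5, while the second fires. Lowering DatapointsToAlarm to 1 trades that patience for faster detection.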

Manage Dashboards as JSON

Dashboards become more reliable when they are stored as code. Start with one JSON body and let Claude Code add rows for Lambda, ECS, API Gateway, ALB, and business KPIs.

{
  "widgets": [
    {
      "type": "metric",
      "x": 0,
      "y": 0,
      "width": 12,
      "height": 6,
      "properties": {
        "region": "us-east-1",
        "title": "API health: requests, 5xx, latency",
        "view": "timeSeries",
        "stacked": false,
        "metrics": [
          ["AWS/ApplicationELB", "RequestCount", "LoadBalancer", "app/myapp/abc", { "stat": "Sum" }],
          [".", "HTTPCode_Target_5XX_Count", ".", ".", { "stat": "Sum", "yAxis": "right" }],
          ["AWS/ApiGateway", "Latency", "ApiName", "checkout-api", "Stage", "prod", { "stat": "p95" }]
        ],
        "period": 60
      }
    }
  ]
}

Deploy it with:

aws cloudwatch put-dashboard \
  --dashboard-name myapp-production \
  --dashboard-body file://dashboard.json
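If the dashboard grows beyond a few widgets, generating the JSON can be easier to review than hand-editing it. This is a minimal sketch: `metricWidget` and `buildDashboard` are hypothetical helpers, the defaults mirror the widget above, and the second row's Lambda function name is an assumed placeholder.

```javascript
// Hypothetical generator: one full-width metric widget per row, stacked vertically.
function metricWidget({ title, metrics, x = 0, y = 0, width = 12, height = 6, region = "us-east-1", period = 60 }) {
  return {
    type: "metric",
    x, y, width, height,
    properties: { region, title, view: "timeSeries", stacked: false, metrics, period },
  };
}

function buildDashboard(rows) {
  let y = 0; // each row starts where the previous one ended
  const widgets = rows.map((row) => {
    const widget = metricWidget({ ...row, y });
    y += widget.height;
    return widget;
  });
  return { widgets };
}

const dashboard = buildDashboard([
  {
    title: "ALB: requests and 5xx",
    metrics: [
      ["AWS/ApplicationELB", "RequestCount", "LoadBalancer", "app/myapp/abc", { stat: "Sum" }],
      [".", "HTTPCode_Target_5XX_Count", ".", ".", { stat: "Sum", yAxis: "right" }],
    ],
  },
  {
    title: "Lambda: errors", // "checkout-api" is an assumed function name
    metrics: [["AWS/Lambda", "Errors", "FunctionName", "checkout-api", { stat: "Sum" }]],
  },
]);

console.log(JSON.stringify(dashboard, null, 2));
```

Redirect the output to dashboard.json and deploy it with the same put-dashboard command, so the generator and its output can both live in version control.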

Incident Review Prompt and IAM

During incidents, constrain Claude Code to read-only commands and named log groups.

claude -p "
Review this production incident. Separate facts from hypotheses.

Scope:
- log groups: /aws/lambda/checkout-api, /ecs/checkout-api
- window: 2026-06-02T10:00:00+09:00 to 2026-06-02T11:00:00+09:00
- recent change: checkout-api v1.42.0 deploy

Allowed read-only commands:
- aws logs start-query / get-query-results
- aws cloudwatch get-metric-data
- aws cloudwatch describe-alarms

Output:
1. Timeline
2. Blast radius
3. Top 3 root-cause hypotheses
4. Immediate rollback or mitigation candidates
5. Prevention actions
6. Missing alarms and dashboard widgets
"

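Use a read-only role first. The policy below uses "Resource": "*" for brevity; where your setup allows it, scope logs:StartQuery and logs:FilterLogEvents to the ARNs of the log groups named in the prompt: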

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:StartQuery",
        "logs:GetQueryResults",
        "logs:FilterLogEvents",
        "cloudwatch:GetMetricData",
        "cloudwatch:DescribeAlarms",
        "cloudwatch:GetDashboard"
      ],
      "Resource": "*"
    }
  ]
}

Grant PutMetricAlarm and PutDashboard only to a separate automation role that you use deliberately.
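Before attaching a policy to the role Claude Code uses, a quick pre-flight check can catch write actions that slipped in. The sketch below is a heuristic, not an AWS definition of read-only: `findNonReadOnlyActions` and the prefix list are assumptions tuned to the policy above.

```javascript
// Hypothetical pre-flight check: returns Allow-ed actions that do not look read-only.
// The prefix list is an assumption; extend it for the services you actually use.
const READ_ONLY_PREFIXES = ["Get", "Describe", "List", "StartQuery", "FilterLogEvents"];

function findNonReadOnlyActions(policy) {
  const flagged = [];
  for (const statement of [].concat(policy.Statement ?? [])) {
    if (statement.Effect !== "Allow") continue;
    for (const action of [].concat(statement.Action ?? [])) {
      const name = action.split(":")[1] ?? action; // "logs:StartQuery" -> "StartQuery"
      if (!READ_ONLY_PREFIXES.some((prefix) => name.startsWith(prefix))) {
        flagged.push(action); // also catches wildcards such as "logs:*" or "*"
      }
    }
  }
  return flagged;
}

const candidate = {
  Version: "2012-10-17",
  Statement: [
    { Effect: "Allow", Action: ["logs:StartQuery", "cloudwatch:GetMetricData", "cloudwatch:PutMetricAlarm"], Resource: "*" },
  ],
};
console.log(findNonReadOnlyActions(candidate));
```

An empty result does not prove the policy is safe; it only means none of the listed actions fell outside the assumed read-only prefixes.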

Pitfalls

Alarm fatigue happens when every technical symptom pages a human. Prefer user-impact metrics: 5xx, p95 latency, queue backlog, checkout failure, and failed jobs. Send warning-level noise to chat and reserve paging for critical conditions.

Unbounded log retention is another quiet cost leak. Set retention on Lambda and ECS log groups. Keep production logs longer only when audit or debugging requirements justify it.

High-cardinality metrics are expensive and hard to read. Do not put userId or requestId into metric dimensions. Keep those in logs and aggregate metrics by service, route, environment, and error class.
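Route itself can become high-cardinality when raw paths contain IDs. A normalizer like the one below keeps the dimension bounded; `normalizeRoute` is a hypothetical helper, and the two patterns (all-digit and UUID-shaped segments) are assumptions about your URL scheme.

```javascript
// Hypothetical normalizer: collapses ID-like path segments so the route stays low-cardinality.
const UUID_SEGMENT = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

function normalizeRoute(method, path) {
  const pathOnly = path.split("?")[0]; // query strings never belong in a dimension
  const segments = pathOnly.split("/").map((segment) =>
    /^\d+$/.test(segment) || UUID_SEGMENT.test(segment) ? ":id" : segment
  );
  return `${method.toUpperCase()} ${segments.join("/")}`;
}

console.log(normalizeRoute("get", "/orders/12345/items?page=2")); // GET /orders/:id/items
```

Log the raw path alongside the normalized route if you need it for debugging, and use only the normalized value when aggregating metrics.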

Finally, do not paste unlimited logs into Claude Code. Query first, narrow the time range, mask sensitive values, and ask for evidence-backed conclusions.

Next Step

Pick one critical API and implement five things: JSON logs, a 5xx query, a p95 latency query, one critical alarm, and one dashboard row. Then connect it with the deployment and permission practices in Claude Code × AWS ECS/Fargate and Claude Code × AWS IAM.

Hands-on verification for this article: I checked the Node logger output locally with jq, and the SAM fragment is structured so you can paste it into an existing template after replacing log group names, regions, account IDs, and SNS ARNs. Test alarm thresholds in a non-production AWS account before paging real responders.
