Claude Code × AWS CloudWatch: Logs, Metrics, Alarms, Dashboards, and Incident Review
Build practical CloudWatch logs, metrics, alarms, dashboards, and incident reviews with Claude Code.
The painful incident is not the one with no logs. It is the one with thousands of logs and no clear path from symptom to decision. AWS CloudWatch gives you Logs, Metrics, Alarms, Dashboards, and Logs Insights, but those pieces only become useful when the application emits structured data and the team has repeatable queries and runbooks.
This guide shows how I would use Claude Code to turn CloudWatch into an implementation-oriented observability workflow for Lambda, ECS, API Gateway, ALB, and business metrics. The goal is not to let an agent change production blindly. The goal is to let Claude Code draft the boring, error-prone parts: JSON log shape, Logs Insights queries, metric filters, CloudFormation/SAM alarms, dashboard widgets, IAM read-only access, and an incident-review prompt.
Plain terms first. Structured logs are logs with stable fields, usually JSON, instead of only human sentences. Metrics are numeric time series such as request count, 5xx count, or p95 latency. Alarms are threshold rules over metrics. A runbook is the checklist responders follow during an incident. Least privilege means the role used by Claude Code can read only the AWS data needed for investigation unless you intentionally grant write access.
Architecture
flowchart LR
App["Lambda / ECS / API Gateway"] --> Logs["CloudWatch Logs"]
App --> Metrics["CloudWatch Metrics"]
Logs --> Insights["Logs Insights queries"]
Logs --> Filter["Metric filters"]
Metrics --> Alarms["CloudWatch Alarms"]
Metrics --> Dash["Dashboards"]
Insights --> Claude["Claude Code incident review"]
Alarms --> Runbook["SNS / PagerDuty / runbook"]
Claude Code helps when the inputs are explicit: log format, time window, service names, alarm intent, and the level of authority it has. Treat it as a fast reviewer and generator, not as an autonomous production operator.
Four Real Use Cases
For a failing Lambda batch, plain REPORT lines are not enough. Add job ID, partner API name, retry count, and exception name to JSON logs. Then ask Claude Code to read the last two hours and separate facts from hypotheses: is the failure tied to one partner, one deployment, one region, or one input class?
For rising ECS API 5xx, the ALB metric tells you that targets failed, not why. Combine HTTPCode_Target_5XX_Count with Logs Insights grouped by route, statusCode, and durationMs. Ask Claude Code for the top noisy routes, the slowest p95 routes, and errors that first appeared after the latest deployment.
For API Gateway latency, compare Latency and IntegrationLatency. If both rise together, the backend is likely slow. If only total latency rises, gateway configuration, authorizers, or throttling may be involved. A dashboard row with p95 and p99 is more useful than an average latency graph.
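The Latency versus IntegrationLatency comparison can be sketched as a small helper. This is only an illustration, not an AWS API: `splitLatency` and the 50 ms threshold are hypothetical, and the inputs are assumed to be p95 values (in milliseconds) that you already pulled for a baseline window and the current window.

```javascript
// Hypothetical helper: decides whether a latency increase sits in the backend
// integration or in the gateway layer. All values are milliseconds.
function splitLatency(baseline, current, thresholdMs = 50) {
  // Gateway overhead = total latency minus time spent in the backend integration.
  const baseOverhead = baseline.latency - baseline.integrationLatency;
  const currentOverhead = current.latency - current.integrationLatency;

  const backendDelta = current.integrationLatency - baseline.integrationLatency;
  const overheadDelta = currentOverhead - baseOverhead;

  let suspect = "none";
  if (backendDelta >= thresholdMs && backendDelta >= overheadDelta) {
    suspect = "backend"; // both metrics rose together
  } else if (overheadDelta >= thresholdMs) {
    suspect = "gateway"; // only total latency rose
  }
  return { backendDelta, overheadDelta, suspect };
}

console.log(
  splitLatency(
    { latency: 120, integrationLatency: 100 },
    { latency: 620, integrationLatency: 600 }
  )
);
```

A "gateway" result is only a pointer to look at authorizers, throttling, and gateway configuration; it does not prove any of them is the cause.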
For alarm fatigue, use Claude Code to review whether each alarm maps to user impact. CPU warnings can go to Slack. Critical 5xx, checkout failure, or queue backlog can page. Composite alarms can reduce noise when several symptoms have the same cause.
Emit Structured JSON Logs
Start by making logs queryable. This Node.js logger works in Lambda or ECS and keeps field names consistent.
// logger.mjs
export function logEvent(level, message, fields = {}) {
  const entry = {
    timestamp: new Date().toISOString(),
    level,
    message,
    service: process.env.SERVICE_NAME || "checkout-api",
    env: process.env.NODE_ENV || "development",
    requestId: fields.requestId || "unknown",
    route: fields.route,
    statusCode: fields.statusCode,
    durationMs: fields.durationMs,
    userId: fields.userId,
    errorName: fields.error?.name,
    errorMessage: fields.error?.message,
  };
  console.log(JSON.stringify(entry));
}

logEvent("ERROR", "payment authorization failed", {
  requestId: "req-123",
  route: "POST /checkout",
  statusCode: 502,
  durationMs: 842,
  userId: "user-456",
  error: new Error("upstream timeout"),
});
Local smoke test:
node logger.mjs | jq .
Do not log card numbers, access tokens, or raw email addresses. CloudWatch Logs has data-protection features, but the safer design is to avoid emitting sensitive values in the first place.
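One way to keep sensitive values out of the logs is to pass fields through an allowlist before they reach the logger. The sketch below is an assumption on my part, not part of the logger above: `sanitizeFields`, `ALLOWED_FIELDS`, and the email pattern are hypothetical names and choices, and the pattern is deliberately simple.

```javascript
// Hypothetical allowlist: only these keys may reach the log entry.
const ALLOWED_FIELDS = ["requestId", "route", "statusCode", "durationMs", "userId", "error"];

// Deliberately simple pattern; it catches obvious addresses, not every valid one.
const EMAIL_PATTERN = /[^\s@]+@[^\s@]+\.[^\s@]+/g;

function sanitizeFields(fields = {}) {
  const safe = {};
  for (const key of ALLOWED_FIELDS) {
    if (!(key in fields)) continue;
    const value = fields[key];
    // Mask email-like text in string values; other types pass through unchanged.
    // Note: Error objects are not inspected, so error messages still need care.
    safe[key] = typeof value === "string"
      ? value.replace(EMAIL_PATTERN, "[redacted-email]")
      : value;
  }
  return safe;
}

console.log(
  sanitizeFields({
    requestId: "req-123",
    cardNumber: "4111111111111111", // dropped: not on the allowlist
    userId: "jane@example.com",     // masked: looks like an email
    statusCode: 502,
  })
);
```

With this in place, a call would look like `logEvent("ERROR", "payment authorization failed", sanitizeFields(rawFields))`. An allowlist fails closed: a new field stays out of the logs until someone adds it on purpose.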
Create Logs Insights Queries
Give Claude Code the log schema and the operational question. That usually produces a better query than asking for “some CloudWatch queries.”
claude -p "
Create CloudWatch Logs Insights queries.
The logs are JSON with timestamp, level, message, service, env, route, statusCode, durationMs, requestId, userId, errorName, errorMessage.
I need:
1. Top 10 routes by 5xx count in the last hour
2. Routes with the highest p95 latency
3. Timeline for one requestId
4. Error names that increased after a deployment
Return only the queries with short purpose comments.
"
Use these as a starter kit:
# 5xx count by route
fields @timestamp, route, statusCode, requestId
| filter statusCode >= 500
| stats count(*) as errors by route
| sort errors desc
| limit 10

# p95 latency by route
fields route, durationMs
| filter ispresent(durationMs)
| stats pct(durationMs, 95) as p95, count(*) as requests by route
| sort p95 desc
| limit 20

# request timeline
fields @timestamp, level, message, route, statusCode, durationMs
| filter requestId = "req-123"
| sort @timestamp asc

# error names by route; run once for the window before the deployment and once for the window after, then compare
fields @timestamp, errorName, route
| filter level = "ERROR" and ispresent(errorName)
| stats count(*) as count by errorName, route
| sort count desc
| limit 20
The common cost pitfall is scanning too much data. Keep the time window tight and query only the log groups you need.
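A guard in code makes the tight window hard to forget. The sketch below builds a parameter object in the shape of the Logs Insights StartQuery request (`logGroupNames`, `queryString`, and `startTime`/`endTime` in epoch seconds). The helper name `buildQueryParams` and the 60-minute cap are my own assumptions for illustration.

```javascript
// Hypothetical helper: refuses unbounded queries before they reach AWS.
function buildQueryParams({
  logGroupNames,
  queryString,
  endTime = new Date(),
  windowMinutes = 60,
  maxWindowMinutes = 60, // illustrative cap; tune it to your own cost tolerance
}) {
  if (!Array.isArray(logGroupNames) || logGroupNames.length === 0) {
    throw new Error("name at least one log group explicitly");
  }
  if (windowMinutes > maxWindowMinutes) {
    throw new Error(`window of ${windowMinutes} min exceeds the ${maxWindowMinutes} min cap`);
  }
  const endSeconds = Math.floor(endTime.getTime() / 1000);
  return {
    logGroupNames,
    queryString,
    startTime: endSeconds - windowMinutes * 60, // StartQuery expects epoch seconds
    endTime: endSeconds,
  };
}

console.log(
  buildQueryParams({
    logGroupNames: ["/aws/lambda/checkout-api"],
    queryString: "fields @timestamp, route | filter statusCode >= 500 | limit 10",
    windowMinutes: 30,
  })
);
```

The returned object can be passed to `StartQueryCommand` from `@aws-sdk/client-cloudwatch-logs`, or translated into the matching `aws logs start-query` flags.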
Add Metric Filters and SAM Alarms
Use metric filters when a log event should become a CloudWatch metric, such as checkout failures. This CloudFormation/SAM fragment creates a log group, a metric filter, and an alarm.
Resources:
  CheckoutLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: /aws/lambda/checkout-api
      RetentionInDays: 30
  PaymentFailureMetricFilter:
    Type: AWS::Logs::MetricFilter
    Properties:
      LogGroupName: !Ref CheckoutLogGroup
      FilterPattern: '{ $.level = "ERROR" && $.message = "payment authorization failed" }'
      MetricTransformations:
        - MetricNamespace: MyApp/Business
          MetricName: PaymentFailure
          MetricValue: "1"
          DefaultValue: 0
  PaymentFailureAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: prod-payment-failure-critical
      AlarmDescription: Five or more payment failures in each of two consecutive five-minute periods
      Namespace: MyApp/Business
      MetricName: PaymentFailure
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 2
      DatapointsToAlarm: 2
      Threshold: 5
      ComparisonOperator: GreaterThanOrEqualToThreshold
      TreatMissingData: notBreaching
      AlarmActions:
        - arn:aws:sns:us-east-1:123456789012:prod-alerts
EvaluationPeriods controls how many of the most recent periods are evaluated. DatapointsToAlarm controls how many of those periods must breach the threshold. A single period reacts fast but is noisy; too many periods delay detection.
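To build intuition for the two settings, here is a simplified model of "M out of N" evaluation. It is a sketch, not CloudWatch's actual evaluator: `isAlarming` is a hypothetical function, it assumes every period has a datapoint, and it models only the `GreaterThanOrEqualToThreshold` comparison. Real CloudWatch also applies the TreatMissingData setting.

```javascript
// Simplified "M out of N" model. datapoints holds one value per period, oldest first.
function isAlarming(datapoints, { threshold, evaluationPeriods, datapointsToAlarm }) {
  const recent = datapoints.slice(-evaluationPeriods); // the N most recent periods
  // Simplification: too little history counts as "not alarming" here,
  // whereas CloudWatch has a separate INSUFFICIENT_DATA state.
  if (recent.length < evaluationPeriods) return false;
  const breaching = recent.filter((value) => value >= threshold).length;
  return breaching >= datapointsToAlarm;
}

const settings = { threshold: 5, evaluationPeriods: 2, datapointsToAlarm: 2 };
console.log(isAlarming([0, 6, 7], settings)); // both recent periods breach
console.log(isAlarming([0, 6, 1], settings)); // only one of the two breaches
```

Replaying a past incident's per-period sums through a model like this is a cheap way to compare settings before you change the real alarm.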
Manage Dashboards as JSON
Dashboards become more reliable when they are stored as code. Start with one JSON body and let Claude Code add rows for Lambda, ECS, API Gateway, ALB, and business KPIs.
{
  "widgets": [
    {
      "type": "metric",
      "x": 0,
      "y": 0,
      "width": 12,
      "height": 6,
      "properties": {
        "region": "us-east-1",
        "title": "API health: requests, 5xx, latency",
        "view": "timeSeries",
        "stacked": false,
        "metrics": [
          ["AWS/ApplicationELB", "RequestCount", "LoadBalancer", "app/myapp/abc", { "stat": "Sum" }],
          [".", "HTTPCode_Target_5XX_Count", ".", ".", { "stat": "Sum", "yAxis": "right" }],
          ["AWS/ApiGateway", "Latency", "ApiName", "checkout-api", "Stage", "prod", { "stat": "p95" }]
        ],
        "period": 60
      }
    }
  ]
}
Deploy it with:
aws cloudwatch put-dashboard \
--dashboard-name myapp-production \
--dashboard-body file://dashboard.json
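Once there are several rows, generating the JSON is less error-prone than hand-editing coordinates. The generator below is a minimal sketch under my own assumptions: `buildDashboard` and the row/widget input shape are hypothetical, and it relies on the dashboard grid being 24 columns wide.

```javascript
// Hypothetical generator: lays out one dashboard row per entry, widgets side by side.
function buildDashboard(rows, { region = "us-east-1", widgetHeight = 6 } = {}) {
  const widgets = [];
  rows.forEach((row, rowIndex) => {
    // Split the 24-column grid evenly across the widgets in this row.
    const width = Math.floor(24 / row.widgets.length);
    row.widgets.forEach((widget, colIndex) => {
      widgets.push({
        type: "metric",
        x: colIndex * width,
        y: rowIndex * widgetHeight,
        width,
        height: widgetHeight,
        properties: {
          region,
          title: `${row.name}: ${widget.title}`,
          view: "timeSeries",
          stacked: false,
          metrics: widget.metrics,
          period: 60,
        },
      });
    });
  });
  return { widgets };
}

const body = buildDashboard([
  {
    name: "ALB",
    widgets: [
      { title: "requests", metrics: [["AWS/ApplicationELB", "RequestCount", "LoadBalancer", "app/myapp/abc", { stat: "Sum" }]] },
      { title: "5xx", metrics: [["AWS/ApplicationELB", "HTTPCode_Target_5XX_Count", "LoadBalancer", "app/myapp/abc", { stat: "Sum" }]] },
    ],
  },
  {
    name: "Business",
    widgets: [{ title: "payment failures", metrics: [["MyApp/Business", "PaymentFailure", { stat: "Sum" }]] }],
  },
]);
console.log(JSON.stringify(body, null, 2));
```

Redirecting the output to dashboard.json keeps the put-dashboard command unchanged while the layout lives in a reviewable script.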
Incident Review Prompt and IAM
During incidents, constrain Claude Code to read-only commands and named log groups.
claude -p "
Review this production incident. Separate facts from hypotheses.
Scope:
- log groups: /aws/lambda/checkout-api, /ecs/checkout-api
- window: 2026-06-02T10:00:00+09:00 to 2026-06-02T11:00:00+09:00
- recent change: checkout-api v1.42.0 deploy
Allowed read-only commands:
- aws logs start-query / get-query-results
- aws cloudwatch get-metric-data
- aws cloudwatch describe-alarms
Output:
1. Timeline
2. Blast radius
3. Top 3 root-cause hypotheses
4. Immediate rollback or mitigation candidates
5. Prevention actions
6. Missing alarms and dashboard widgets
"
Use a read-only role first:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:StartQuery",
        "logs:GetQueryResults",
        "logs:FilterLogEvents",
        "cloudwatch:GetMetricData",
        "cloudwatch:DescribeAlarms",
        "cloudwatch:GetDashboard"
      ],
      "Resource": "*"
    }
  ]
}
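Grant cloudwatch:PutMetricAlarm and cloudwatch:PutDashboard only to a separate automation role that you use deliberately. The "Resource": "*" above keeps the example short; where an action supports resource-level permissions, such as logs:StartQuery and logs:FilterLogEvents on log groups, narrow it to the ARNs of the log groups named in the prompt.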
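A small check in CI can catch write actions creeping into the investigation role. This is a sketch only: `findUnexpectedActions` and the allowlist are hypothetical, and it compares action strings literally, so a wildcard such as `logs:*` is flagged rather than expanded.

```javascript
// Hypothetical lint: the only actions the investigation role is expected to allow.
const READ_ONLY_ACTIONS = new Set([
  "logs:StartQuery",
  "logs:GetQueryResults",
  "logs:FilterLogEvents",
  "cloudwatch:GetMetricData",
  "cloudwatch:DescribeAlarms",
  "cloudwatch:GetDashboard",
]);

function findUnexpectedActions(policy) {
  const unexpected = [];
  // Statement and Action may each be a single value or an array in IAM JSON.
  for (const statement of [].concat(policy.Statement ?? [])) {
    if (statement.Effect !== "Allow") continue;
    for (const action of [].concat(statement.Action ?? [])) {
      if (!READ_ONLY_ACTIONS.has(action)) unexpected.push(action);
    }
  }
  return unexpected;
}

console.log(
  findUnexpectedActions({
    Version: "2012-10-17",
    Statement: [
      { Effect: "Allow", Action: ["logs:StartQuery", "cloudwatch:PutMetricAlarm"], Resource: "*" },
    ],
  })
);
```

An empty result means every allowed action is on the list. It says nothing about whether the Resource scope is tight enough.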
Pitfalls
Alarm fatigue happens when every technical symptom pages a human. Prefer user-impact metrics: 5xx, p95 latency, queue backlog, checkout failure, and failed jobs. Send warning-level noise to chat and reserve paging for critical conditions.
Unbounded log retention is another quiet cost leak. Set retention on Lambda and ECS log groups. Keep production logs longer only when audit or debugging requirements justify it.
High-cardinality metrics are expensive and hard to read. Do not put userId or requestId into metric dimensions. Keep those in logs and aggregate metrics by service, route, environment, and error class.
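The same rule can be enforced where metrics are published. In this sketch, `toMetricDimensions` and the dimension allowlist are hypothetical; the output uses the `{ Name, Value }` shape that CloudWatch metric dimensions take, and `errorClass` is an assumed field name for a coarse error category.

```javascript
// Hypothetical allowlist of low-cardinality fields that may become dimensions.
const METRIC_DIMENSIONS = ["service", "route", "env", "errorClass"];

function toMetricDimensions(entry) {
  return METRIC_DIMENSIONS
    .filter((name) => entry[name] !== undefined)
    .map((name) => ({ Name: name, Value: String(entry[name]) }));
}

console.log(
  toMetricDimensions({
    service: "checkout-api",
    route: "POST /checkout",
    env: "production",
    userId: "user-456",   // stays in logs only
    requestId: "req-123", // stays in logs only
  })
);
```

Because the helper only reads allowlisted keys, a per-user or per-request field cannot become a dimension by accident.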
Finally, do not paste unlimited logs into Claude Code. Query first, narrow the time range, mask sensitive values, and ask for evidence-backed conclusions.
Next Step
Pick one critical API and implement five things: JSON logs, a 5xx query, a p95 latency query, one critical alarm, and one dashboard row. Then connect it with the deployment and permission practices in Claude Code × AWS ECS/Fargate and Claude Code × AWS IAM.
Hands-on verification for this article: I checked the Node logger output locally with jq, and the SAM fragment is structured so you can paste it into an existing template after replacing log group names, regions, account IDs, and SNS ARNs. Test alarm thresholds in a non-production AWS account before paging real responders.
About the Author
Masa
Engineer focused on practical Claude Code workflows. Runs claudecode-lab.com, a 10-language technical media site.