Harness Engineering Complete Guide: Build AI Agents the Claude Code Way
Harness engineering with Claude Code patterns, policy JSON, runnable Node.js code, and real workflows.
The era of “write a clever prompt and hope the model handles the rest” is over. The serious work in AI agent development is moving toward harness engineering: the design of the scaffold around the model. If you have heard the term test harness in software testing, think of this as the agent version: the tools, context, permission rules, verification steps, and control loop that let an LLM do real work without drifting or causing damage.
Claude Code is a useful reference because it is not just a chat interface. It wraps the model with filesystem tools, project instructions, hooks, permissions, subagents, and memory. This article turns that architecture into a beginner-friendly but deep playbook. You will see why harness engineering is trending, how to implement a minimal Node.js harness, and how to apply the same idea to content automation, code review, SaaS integration, cloud operations, and security boundaries.
Why Harness Engineering Is Trending
LLMs are good at deciding what to do next, but they are not naturally good at observing a workspace, enforcing policy, rolling back a mistake, or proving that an output is safe to publish. Agent products are improving because the model is surrounded by better operational machinery.
That is exactly where Claude Code points the industry. The Claude Agent SDK documentation describes support for Claude Code-style filesystem features such as CLAUDE.md, skills, hooks, and permissions. The permissions docs explain allow rules, deny rules, permission modes, and runtime callbacks. The prompt caching docs explain how repeated static context can be reused. These are harness features, not “just prompting.”
The OODA loop makes the division clear:
| Phase | What Happens | Owner |
|---|---|---|
| Observe | Read files, tickets, logs, URLs, and API state | Harness |
| Orient | Compress and structure the context | Harness |
| Decide | Choose the next action | LLM |
| Act | run a tool, write a file, call an API, or stop | Harness |
Three of the four phases are mostly harness work. A weak harness creates a weak agent even when the model is strong.
What A Harness Contains
A practical harness answers four questions before the model starts working:
- What input is the agent allowed to read?
- What artifact must it produce?
- Which checks prove that the work is correct?
- Which actions are automatic, ask-first, or forbidden?
For a blog workflow, that might mean: read existing slugs, pick a non-duplicate topic, write MDX, check code fences, validate frontmatter, add official links, add internal links, place /products/ and /training/ calls to action, build the site, and inspect the public URL. The prompt is only one part of that workflow.
| Harness Layer | Example | Pitfall |
|---|---|---|
| Context | project rules, style guide, prior failures | stale assumptions stay alive |
| Tools | read, grep, write, run tests, call APIs | too many broad tools confuse the model |
| Policy | allow, ask, deny, rate limit, sandbox | destructive work runs unattended |
| Verification | tests, diff, screenshots, public URLs | generated output looks plausible but is broken |
| Memory | reusable preferences and decisions | temporary notes become permanent rules |
Concept Diagram
A harness is the control layer before and after the model. The important work is not only the prompt; it is the policy, context, tools, permission gate, and verification loop.
```mermaid
flowchart LR
    A["Goal"] --> B["Harness policy"]
    B --> C["Context"]
    B --> D["Tools"]
    B --> E["Permissions"]
    C --> F["LLM decision"]
    D --> F
    E --> G["Safe action"]
    F --> G
    G --> H["Verification"]
    H --> I["Artifact"]
    H --> B
```
A Runnable Node.js Mini Harness
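The smallest useful harness has a model, two tools, a policy file, a loop, and readable errors. Create a disposable folder and set the ANTHROPIC_API_KEY environment variable first. The script uses top-level await in an .mjs file, so a current Node.js LTS release is assumed.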
```bash
mkdir harness-demo
cd harness-demo
npm init -y
npm install @anthropic-ai/sdk
node -e "const fs=require('node:fs');fs.mkdirSync('sandbox',{recursive:true});fs.writeFileSync('sandbox/README.md','# Demo\nShip a safer agent workflow.\nKeep writes inside sandbox.\n');"
```
Save this as policy.json.
```json
{
  "workspace": "./sandbox",
  "maxSteps": 6,
  "tools": {
    "read_file": {
      "allow": true,
      "risk": "Read UTF-8 text only inside workspace"
    },
    "write_file": {
      "allow": true,
      "risk": "Write UTF-8 text only inside workspace"
    }
  }
}
```
Save this as mini-harness.mjs.
```javascript
import Anthropic from "@anthropic-ai/sdk";
import { mkdir, readFile, writeFile } from "node:fs/promises";
import path from "node:path";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment
const policy = JSON.parse(await readFile(new URL("./policy.json", import.meta.url), "utf8"));
const model = process.env.ANTHROPIC_MODEL || "claude-sonnet-4-6";
// Resolved against the current working directory, so run the script from harness-demo.
const workspace = path.resolve(policy.workspace);

// Resolve a model-supplied path and reject anything outside the workspace.
function safePath(requestedPath) {
  const resolved = path.resolve(workspace, requestedPath);
  const inside = resolved === workspace || resolved.startsWith(workspace + path.sep);
  if (!inside) {
    throw new Error(`Path escapes workspace: ${requestedPath}. Use a path under ${policy.workspace}.`);
  }
  return resolved;
}

// Policy gate: a tool runs only if policy.json explicitly allows it.
function ensureAllowed(toolName) {
  const rule = policy.tools?.[toolName];
  if (!rule?.allow) {
    throw new Error(`Tool '${toolName}' is not allowed by policy.json.`);
  }
}

// Tool schemas sent to the model.
const tools = [
  {
    name: "read_file",
    description: "Read a UTF-8 text file from the allowed workspace.",
    input_schema: {
      type: "object",
      properties: { path: { type: "string" } },
      required: ["path"],
      additionalProperties: false
    }
  },
  {
    name: "write_file",
    description: "Write a UTF-8 text file inside the allowed workspace.",
    input_schema: {
      type: "object",
      properties: {
        path: { type: "string" },
        content: { type: "string" }
      },
      required: ["path", "content"],
      additionalProperties: false
    }
  }
];

// Execute one tool call after the policy and path checks pass.
async function executeTool(name, input) {
  ensureAllowed(name);
  if (name === "read_file") {
    return await readFile(safePath(input.path), "utf8");
  }
  if (name === "write_file") {
    const target = safePath(input.path);
    await mkdir(path.dirname(target), { recursive: true });
    await writeFile(target, input.content, "utf8");
    return `written ${input.path}`;
  }
  throw new Error(`Unknown tool: ${name}`);
}

// Control loop: ask the model, run requested tools, feed results back, stop at maxSteps.
async function run(goal) {
  const messages = [{ role: "user", content: goal }];
  for (let step = 0; step < policy.maxSteps; step++) {
    const response = await client.messages.create({
      model,
      max_tokens: 1200,
      tools,
      system: "You are a careful file assistant. Use tools when needed. Keep writes under policy workspace.",
      messages
    });
    messages.push({ role: "assistant", content: response.content });

    const toolUses = response.content.filter((block) => block.type === "tool_use");
    if (toolUses.length === 0) {
      // No tool requested: print the final answer and finish.
      const text = response.content
        .filter((block) => block.type === "text")
        .map((block) => block.text)
        .join("\n");
      console.log(text);
      return;
    }

    const results = [];
    for (const toolUse of toolUses) {
      try {
        const output = await executeTool(toolUse.name, toolUse.input);
        // Truncate long outputs so one file cannot flood the context.
        results.push({ type: "tool_result", tool_use_id: toolUse.id, content: String(output).slice(0, 8000) });
      } catch (error) {
        // Return the failure to the model as a readable tool error instead of crashing.
        results.push({
          type: "tool_result",
          tool_use_id: toolUse.id,
          is_error: true,
          content: error instanceof Error ? error.message : String(error)
        });
      }
    }
    messages.push({ role: "user", content: results });
  }
  throw new Error(`Max steps reached: ${policy.maxSteps}`);
}

const goal = process.argv.slice(2).join(" ") || "Read README.md and write summary.md with three bullet points.";
await run(goal);
```
Run it:
```bash
node mini-harness.mjs
```
This is intentionally small, but it already has the essential pattern: tool schema, policy, sandboxed paths, loop limit, readable tool errors, and a concrete artifact. Add grep, test execution, approval UI, SaaS API calls, and hooks, and you are moving toward a Claude Code-style harness.
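One possible next step is an ask-first tier between allow and deny. The sketch below assumes a hypothetical `mode` field per tool (`"allow"`, `"ask"`, or `"deny"`) in place of the boolean `allow` in policy.json, and a hypothetical `approve` callback standing in for an approval UI. Neither exists in the mini harness above.

```javascript
// Hypothetical policy shape: each tool has a mode instead of a boolean allow.
const examplePolicy = {
  tools: {
    read_file: { mode: "allow" },
    write_file: { mode: "ask" },
    delete_file: { mode: "deny" }
  }
};

// Resolves when the call may proceed; throws a readable error otherwise.
// `approve` is an assumed async callback (CLI prompt, chat message, web UI).
async function authorize(policy, toolName, input, approve) {
  const mode = policy.tools?.[toolName]?.mode ?? "deny"; // unknown tools are denied
  if (mode === "allow") return;
  if (mode === "ask" && (await approve(toolName, input))) return;
  const reason = mode === "ask" ? "was rejected by the approver" : "is denied by policy";
  throw new Error(`Tool '${toolName}' ${reason}. Choose another tool or stop.`);
}
```

In the mini harness, `executeTool` could await `authorize(...)` at the point where it currently calls `ensureAllowed(name)`, so a rejected call still reaches the model as a readable tool error.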
Five Concrete Use Cases
1. Content automation
A weak prompt says “write a blog post.” A harness reads existing posts, chooses a non-duplicate topic, writes MDX, checks frontmatter, validates code fences, adds official links, inserts internal links, includes /products/ and /training/ CTAs, builds the site, and verifies the public page. The pitfall is mass-producing thin localized summaries that pass a superficial word count but fail reader intent.
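Two of those checks, frontmatter and code fences, are easy to make deterministic. The following is a minimal sketch, assuming simple `key: value` frontmatter between `---` lines; `checkDraft` and the required field names are hypothetical and should be replaced by whatever your site actually requires.

```javascript
// Built at runtime so this snippet itself contains no literal fence marker.
const FENCE = "`".repeat(3);

// Returns a list of readable problems; an empty list means the draft passed.
function checkDraft(mdx, requiredFields = ["title", "description"]) {
  const problems = [];
  const lines = mdx.split(/\r?\n/);

  // Frontmatter: the first line must be '---' and a later line must close it.
  const end = lines[0] === "---" ? lines.indexOf("---", 1) : -1;
  if (end === -1) {
    problems.push("Missing or unclosed frontmatter block at the top of the file.");
  } else {
    const keys = lines.slice(1, end).map((line) => line.split(":")[0].trim());
    for (const field of requiredFields) {
      if (!keys.includes(field)) problems.push(`Frontmatter is missing '${field}'.`);
    }
  }

  // Code fences: every opening fence needs a closing one, so the count must be even.
  const body = end === -1 ? lines : lines.slice(end + 1);
  const fenceLines = body.filter((line) => line.trimStart().startsWith(FENCE));
  if (fenceLines.length % 2 !== 0) {
    problems.push(`Unbalanced code fences: found ${fenceLines.length} fence line(s).`);
  }
  return problems;
}
```

Because the function returns messages rather than a boolean, the harness can pass the list straight back to the model as the next correction request.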
2. Code review
A review harness reads git diff, test output, changed files, and local review rules. It returns findings first, with severity and file references. The risk is that the model summarizes the patch instead of finding regressions. The harness should force bug-first output and require a test-gap note.
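One way to force that shape is to ask the model for structured output and reject anything that does not fit. The sketch below assumes a hypothetical review object with a `findings` array and a `testGaps` note; the field names and severity levels are illustrative, not a Claude Code format.

```javascript
const SEVERITIES = ["critical", "major", "minor"];

// Validates a parsed review object and returns findings sorted most severe first.
// Throws a readable error so the harness can send it back to the model for a retry.
function validateReview(review) {
  if (!Array.isArray(review.findings)) {
    throw new Error("Review must contain a 'findings' array (empty if none were found).");
  }
  for (const [i, finding] of review.findings.entries()) {
    if (!SEVERITIES.includes(finding.severity)) {
      throw new Error(`findings[${i}].severity must be one of: ${SEVERITIES.join(", ")}.`);
    }
    if (typeof finding.file !== "string" || finding.file.length === 0) {
      throw new Error(`findings[${i}] needs a 'file' reference.`);
    }
  }
  if (typeof review.testGaps !== "string" || review.testGaps.trim() === "") {
    throw new Error("Review needs a non-empty 'testGaps' note, even if it says 'none found'.");
  }
  const rank = (finding) => SEVERITIES.indexOf(finding.severity);
  return [...review.findings].sort((a, b) => rank(a) - rank(b));
}
```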
3. SaaS integration
For tools like Notion, HubSpot, Stripe, or a CRM, the harness should separate read-only lookup, dry-run mutation, and approved write. A practical workflow might classify consultation leads and stage CRM updates, but ask before writing to production. The pitfall is letting a misclassified lead, billing change, or customer note update immediately.
4. Cloud operations
Cloud work needs more than a deploy command. The harness should check environment variables, build output, diff, deployment target, rollback plan, health endpoint, and public URL. The risk is chasing the last visible log line instead of the root cause. A bounded retry loop and log summarizer help keep the agent honest.
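The split between dry-run mutation and approved write can be sketched with two small functions. This is an illustration under assumptions: `stageUpdate`, `applyUpdate`, and the `writeRecord` callback are hypothetical names, and no real Notion, HubSpot, or Stripe API is called.

```javascript
// Stage a change without touching the remote system. `current` is the record as read
// from the SaaS API; `patch` is what the model proposes.
function stageUpdate(recordId, current, patch) {
  const changes = Object.entries(patch)
    .filter(([key, value]) => current[key] !== value)
    .map(([key, value]) => ({ field: key, from: current[key], to: value }));
  return { recordId, changes, approved: false };
}

// Apply a staged change only after a human has set `approved` to true.
// `writeRecord` is an assumed async function wrapping the real API client.
async function applyUpdate(staged, writeRecord) {
  if (!staged.approved) {
    throw new Error(`Update for ${staged.recordId} is staged but not approved. Ask before writing.`);
  }
  if (staged.changes.length === 0) return "nothing to write";
  const patch = Object.fromEntries(staged.changes.map((change) => [change.field, change.to]));
  await writeRecord(staged.recordId, patch);
  return `wrote ${staged.changes.length} field(s) to ${staged.recordId}`;
}
```

The staged object doubles as the audit record: it shows the before and after value of every field a reviewer is being asked to approve.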
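A bounded retry loop for the health check might look like the sketch below. `waitForHealthy` and its options are hypothetical names; `checkHealth` is an assumed async function that returns true when the deployment is healthy, for example a wrapper around a request to your own health endpoint.

```javascript
// Polls `checkHealth` up to `maxAttempts` times, then gives up with a summary
// of every failure instead of looping forever.
async function waitForHealthy(checkHealth, { maxAttempts = 5, delayMs = 2000 } = {}) {
  const failures = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      if (await checkHealth()) return { healthy: true, attempts: attempt, failures };
      failures.push(`attempt ${attempt}: endpoint responded but reported unhealthy`);
    } catch (error) {
      failures.push(`attempt ${attempt}: ${error instanceof Error ? error.message : String(error)}`);
    }
    // Wait between attempts, but not after the last one.
    if (attempt < maxAttempts) await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  return { healthy: false, attempts: maxAttempts, failures };
}
```

Returning the full failure list, not only the last error, gives the model the history it needs to look for a root cause.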
5. Security boundaries
Security is not a final polish step. Reads can be broad, writes should be scoped, shell commands should be allow-listed, and destructive commands should be denied or ask-first. File deletion with rm, force pushes, production database writes, billing changes, and secret access each deserve an explicit policy rule. The harness exists because trust alone is not a control.
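An allow, ask, and deny split for shell commands can be sketched as a small classifier. The rule patterns below are hypothetical examples, not a complete policy, and pattern matching on a command string is only a first filter. A real harness should also run commands without a shell and inside a sandbox.

```javascript
// Hypothetical rules: deny wins over ask, ask wins over allow.
const commandRules = {
  deny: [
    /[;&|<>$`]/, // chaining, redirection, or substitution could hide a second command
    /^rm\s/,
    /^git\s+push\b.*(--force|-f)\b/,
    /\bDROP\s+TABLE\b/i
  ],
  ask: [/^git\s+push\b/, /^npm\s+publish\b/],
  allow: [/^git\s+(status|diff|log)\b/, /^npm\s+(test|run\s+build)\b/, /^ls\b/]
};

// Returns "deny", "ask", or "allow" for a single command string.
function classifyCommand(command, rules = commandRules) {
  const trimmed = command.trim();
  for (const decision of ["deny", "ask", "allow"]) {
    if (rules[decision].some((pattern) => pattern.test(trimmed))) return decision;
  }
  return "ask"; // unmatched commands are never run unattended
}
```

Checking deny first means a command such as `ls; rm -rf build` is rejected even though it starts with an allowed prefix.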
What To Borrow From Claude Code
Borrow three ideas first.
Context layering: keep stable project rules in CLAUDE.md or equivalent project instructions, session-only notes in a task plan, and durable preferences in memory. This prevents stale one-off decisions from becoming permanent rules.
Hooks: move deterministic checks out of the model. Formatting, linting, tests, link checks, and screenshot verification should run as commands. Claude can interpret failures, but the checks should not depend on Claude’s mood.
Delegation: isolate noisy work. Long logs, broad search, multi-language translation, and large refactors can be handled by subagents or separate workflow stages. The main context should retain decisions, not every intermediate token.
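The hooks idea can be sketched as a small runner that executes deterministic checks as child processes and hands only pass or fail plus a log tail back to the model. This is an illustration, not Claude Code's hook mechanism; `runChecks` and the shape of the check list are hypothetical.

```javascript
// Runs each check as a child process and collects the results.
// Each check is { name, command, args }; substitute your own formatter, linter, and tests.
async function runChecks(checks) {
  // Dynamic import keeps this sketch loadable from both ESM and CommonJS files.
  const { spawnSync } = await import("node:child_process");
  const results = [];
  for (const { name, command, args } of checks) {
    const proc = spawnSync(command, args, { encoding: "utf8", timeout: 60_000 });
    results.push({
      name,
      passed: proc.status === 0,
      // Keep only the tail of the output so a long log does not flood the model's context.
      output: `${proc.stdout ?? ""}${proc.stderr ?? ""}`.slice(-2000)
    });
  }
  return results;
}
```

A call such as `runChecks([{ name: "tests", command: "npm", args: ["test"] }])` would then gate the next step on every result having `passed` set to true, whatever the model says about the state of the code.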
Common Pitfalls
Too many tools: 30 tools create choice overload. Start with 5 to 10 focused tools and split the rest into separate workflows.
Unreadable errors: a message like “Error: failed” is useless. Return what failed, what was expected, and what the model can try next.
No prompt caching: repeated static instructions cost time and money. Anthropic prompt caching defaults to a 5-minute lifetime, with a 1-hour option for slower workflows.
No verification: generated content is not the same as accepted content. Articles need frontmatter and code-fence checks. Code needs tests. Cloud work needs health checks. SaaS writes need audit logs.
Permission drift: temporary convenience often becomes permanent danger. Review allow, ask, and deny rules regularly.
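For the prompt caching pitfall, the usual fix is to mark the static part of the prompt as cacheable. The following is a minimal sketch, assuming the `cache_control` block format described in Anthropic's prompt caching docs. `buildCachedParams` and `staticRules` are hypothetical names, and the `ttl` value and any minimum cacheable prompt length should be checked against the current docs before you rely on them.

```javascript
// Builds Messages API parameters with a cache breakpoint on the static system text.
// `staticRules` stands for long, unchanging instructions (for example project rules).
function buildCachedParams({ model, staticRules, messages, oneHour = false }) {
  return {
    model,
    max_tokens: 1200,
    system: [
      {
        type: "text",
        text: staticRules,
        // Default lifetime is about 5 minutes; the 1-hour option is assumed to use ttl "1h".
        cache_control: oneHour ? { type: "ephemeral", ttl: "1h" } : { type: "ephemeral" }
      }
    ],
    messages
  };
}
```

In the mini harness, the result could be passed to `client.messages.create(...)` together with the `tools` array. Keep anything that changes per request, such as the goal, out of the cached block.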
Next Steps
If you want to harden your first harness, read the Claude Code Permissions Guide next. For project-level context, use the CLAUDE.md best practices guide. For splitting heavy work, see Claude Code subagent patterns, and for cost control read Claude Code token optimization.
For a lightweight desk reference, keep the free Claude Code Quick Reference Cheatsheet open. If you want packaged templates and playbooks, start from /products/. If your team needs a safer workflow for permissions, review gates, publishing, or revenue operations, use /training/ when the operating model is the hard part.
What I Tested
The practical lesson from using this pattern on ClaudeCodeLab is simple: quality improves when failure becomes visible. A prompt can produce an article, but a harness can tell you whether the article has enough depth, valid code fences, working links, correct frontmatter, localized CTAs, and a public URL that renders. That is the difference between trusting an output and operating a workflow.
Summary
Harness engineering is the discipline of deciding what the model may see, what it may do, where it must stop, and how the result is verified. Claude Code is one of the clearest teaching examples because its value is not only the model but the scaffold around it. Start with the mini harness above, then add one boundary and one verification step for your own use case.
About the Author
Masa
Engineer focused on practical Claude Code workflows. Runs claudecode-lab.com, a 10-language technical media site.