Harness Engineering Complete Guide: Build AI Agents the Claude Code Way

The era of “write a clever prompt and hope the model handles the rest” is over. The serious work in AI agent development is moving toward harness engineering: the design of the scaffold around the model. If you have heard the term test harness in software testing, think of this as the agent version: the tools, context, permission rules, verification steps, and control loop that let an LLM do real work without drifting or causing damage.

Claude Code is a useful reference because it is not just a chat interface. It wraps the model with filesystem tools, project instructions, hooks, permissions, subagents, and memory. This article turns that architecture into a beginner-friendly but deep playbook. You will see why harness engineering is trending, how to implement a minimal Node.js harness, and how to apply the same idea to content automation, code review, SaaS integration, cloud operations, and security boundaries.

Why Harness Engineering Is Trending

LLMs are good at deciding what to do next, but they are not naturally good at observing a workspace, enforcing policy, rolling back a mistake, or proving that an output is safe to publish. Agent products are improving because the model is surrounded by better operational machinery.

That is exactly where Claude Code points the industry. The Claude Agent SDK documentation describes support for Claude Code-style filesystem features such as CLAUDE.md, skills, hooks, and permissions. The permissions docs explain allow rules, deny rules, permission modes, and runtime callbacks. The prompt caching docs explain how repeated static context can be reused across requests. These are harness features. They are not “just prompting.”

The OODA loop makes the division clear:

Phase	What Happens	Owner
Observe	Read files, tickets, logs, URLs, and API state	Harness
Orient	Compress and structure the context	Harness
Decide	Choose the next action	LLM
Act	Run a tool, write a file, call an API, or stop	Harness

Three of the four phases are mostly harness work. A weak harness creates a weak agent even when the model is strong.

What A Harness Contains

A practical harness answers four questions before the model starts working:

What input is the agent allowed to read?
What artifact must it produce?
Which checks prove that the work is correct?
Which actions are automatic, ask-first, or forbidden?

For a blog workflow, that might mean: read existing slugs, pick a non-duplicate topic, write MDX, check code fences, validate frontmatter, add official links, add internal links, place /products/ and /training/ calls to action, build the site, and inspect the public URL. The prompt is only one part of that workflow.
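The four questions can be written down as data before any prompt exists. The sketch below is one possible shape for such a task contract, applied to the blog workflow. The field names (`inputs`, `artifact`, `checks`, `actions`), the file paths, and the `validateContract` helper are all hypothetical and are not part of Claude Code or any SDK.

```javascript
// Hypothetical task contract: one field per question a harness should answer.
const blogContract = {
  inputs: ["content/posts/**/*.mdx"],              // what the agent may read
  artifact: "content/posts/new-post.mdx",          // what it must produce
  checks: ["frontmatter", "code-fences", "build"], // what proves correctness
  actions: {                                       // automatic, ask-first, forbidden
    automatic: ["read_file", "write_file"],
    askFirst: ["publish"],
    forbidden: ["delete_file"]
  }
};

// Returns a list of problems; an empty list means the contract is complete.
function validateContract(contract) {
  const problems = [];
  if (!Array.isArray(contract.inputs) || contract.inputs.length === 0) {
    problems.push("inputs: list at least one readable path or source");
  }
  if (typeof contract.artifact !== "string" || contract.artifact === "") {
    problems.push("artifact: name the file or record the agent must produce");
  }
  if (!Array.isArray(contract.checks) || contract.checks.length === 0) {
    problems.push("checks: list at least one verification step");
  }
  // An action listed in two tiers is ambiguous, so flag it.
  const tiers = contract.actions ?? {};
  const seen = new Set();
  for (const tier of ["automatic", "askFirst", "forbidden"]) {
    for (const action of tiers[tier] ?? []) {
      if (seen.has(action)) {
        problems.push(`actions: '${action}' appears in more than one tier`);
      }
      seen.add(action);
    }
  }
  return problems;
}

console.log(validateContract(blogContract)); // prints []
```

Refusing to start a run while this list is non-empty is a cheap way to catch a missing verification step before the model spends any tokens.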

Harness Layer	Example	Pitfall
Context	project rules, style guide, prior failures	stale assumptions stay alive
Tools	read, grep, write, run tests, call APIs	too many broad tools confuse the model
Policy	allow, ask, deny, rate limit, sandbox	destructive work runs unattended
Verification	tests, diff, screenshots, public URLs	generated output looks plausible but is broken
Memory	reusable preferences and decisions	temporary notes become permanent rules

Concept Diagram

A harness is the control layer before and after the model. The important work is not only the prompt; it is the policy, context, tools, permission gate, and verification loop.

flowchart LR
  A["Goal"] --> B["Harness policy"]
  B --> C["Context"]
  B --> D["Tools"]
  B --> E["Permissions"]
  C --> F["LLM decision"]
  D --> F
  E --> G["Safe action"]
  F --> G
  G --> H["Verification"]
  H --> I["Artifact"]
  H --> B

A Runnable Node.js Mini Harness

The smallest useful harness has a model, two tools, a policy file, a loop, and readable errors. Create a disposable folder and set ANTHROPIC_API_KEY first.

mkdir harness-demo
cd harness-demo
npm init -y
npm install @anthropic-ai/sdk
node -e "const fs=require('node:fs');fs.mkdirSync('sandbox',{recursive:true});fs.writeFileSync('sandbox/README.md','# Demo\nShip a safer agent workflow.\nKeep writes inside sandbox.\n');"

Save this as policy.json.

{
  "workspace": "./sandbox",
  "maxSteps": 6,
  "tools": {
    "read_file": {
      "allow": true,
      "risk": "Read UTF-8 text only inside workspace"
    },
    "write_file": {
      "allow": true,
      "risk": "Write UTF-8 text only inside workspace"
    }
  }
}

Save this as mini-harness.mjs.

import Anthropic from "@anthropic-ai/sdk";
import { mkdir, readFile, writeFile } from "node:fs/promises";
import path from "node:path";

const client = new Anthropic();
const policy = JSON.parse(await readFile(new URL("./policy.json", import.meta.url), "utf8"));
const model = process.env.ANTHROPIC_MODEL || "claude-sonnet-4-6";
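// Resolved against the current working directory, so run the script from harness-demo.
const workspace = path.resolve(policy.workspace);

// Resolve a model-supplied path and reject anything outside the workspace.
// Note: this is a lexical check; it does not follow symlinks inside the workspace.
function safePath(requestedPath) {
  const resolved = path.resolve(workspace, requestedPath);
  const inside = resolved === workspace || resolved.startsWith(workspace + path.sep);
  if (!inside) {
    throw new Error(`Path escapes workspace: ${requestedPath}. Use a path under ${policy.workspace}.`);
  }
  return resolved;
}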

function ensureAllowed(toolName) {
  const rule = policy.tools?.[toolName];
  if (!rule?.allow) {
    throw new Error(`Tool '${toolName}' is not allowed by policy.json.`);
  }
}

const tools = [
  {
    name: "read_file",
    description: "Read a UTF-8 text file from the allowed workspace.",
    input_schema: {
      type: "object",
      properties: { path: { type: "string" } },
      required: ["path"],
      additionalProperties: false
    }
  },
  {
    name: "write_file",
    description: "Write a UTF-8 text file inside the allowed workspace.",
    input_schema: {
      type: "object",
      properties: {
        path: { type: "string" },
        content: { type: "string" }
      },
      required: ["path", "content"],
      additionalProperties: false
    }
  }
];

async function executeTool(name, input) {
  ensureAllowed(name);
  if (name === "read_file") {
    return await readFile(safePath(input.path), "utf8");
  }
  if (name === "write_file") {
    const target = safePath(input.path);
    await mkdir(path.dirname(target), { recursive: true });
    await writeFile(target, input.content, "utf8");
    return `written ${input.path}`;
  }
  throw new Error(`Unknown tool: ${name}`);
}

async function run(goal) {
  const messages = [{ role: "user", content: goal }];

  for (let step = 0; step < policy.maxSteps; step++) {
    const response = await client.messages.create({
      model,
      max_tokens: 1200,
      tools,
      system: "You are a careful file assistant. Use tools when needed. Keep writes under policy workspace.",
      messages
    });

    messages.push({ role: "assistant", content: response.content });
    const toolUses = response.content.filter((block) => block.type === "tool_use");

    if (toolUses.length === 0) {
      const text = response.content
        .filter((block) => block.type === "text")
        .map((block) => block.text)
        .join("\n");
      console.log(text);
      return;
    }

    const results = [];
    for (const toolUse of toolUses) {
      try {
        const output = await executeTool(toolUse.name, toolUse.input);
        results.push({ type: "tool_result", tool_use_id: toolUse.id, content: String(output).slice(0, 8000) });
      } catch (error) {
        results.push({
          type: "tool_result",
          tool_use_id: toolUse.id,
          is_error: true,
          content: error instanceof Error ? error.message : String(error)
        });
      }
    }
    messages.push({ role: "user", content: results });
  }

  throw new Error(`Max steps reached: ${policy.maxSteps}`);
}

const goal = process.argv.slice(2).join(" ") || "Read README.md and write summary.md with three bullet points.";
await run(goal);

Run it:

node mini-harness.mjs

This is intentionally small, but it already has the essential pattern: tool schema, policy, sandboxed paths, loop limit, readable tool errors, and a concrete artifact. Add grep, test execution, approval UI, SaaS API calls, and hooks, and you are moving toward a Claude Code-style harness.
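One thing the mini harness does not do yet is check the artifact: it stops when the model stops calling tools. As a sketch of the missing verification step for the default goal, the hypothetical `checkSummary` function below counts Markdown bullets in the text of summary.md. It takes a string so that it stays independent of the filesystem; in the harness you would pass it the result of `readFile(safePath("summary.md"), "utf8")`.

```javascript
// Hypothetical verification step for the default goal: summary.md should
// contain exactly three Markdown bullet points.
function checkSummary(text, expectedBullets = 3) {
  const bullets = text
    .split(/\r?\n/)
    .filter((line) => /^\s*[-*]\s+\S/.test(line));
  if (bullets.length === expectedBullets) {
    return { ok: true, message: `found ${expectedBullets} bullet points` };
  }
  return {
    ok: false,
    message:
      `expected ${expectedBullets} bullet points, found ${bullets.length}; ` +
      "rewrite summary.md as a plain '-' list"
  };
}

console.log(checkSummary("- one\n- two\n- three\n"));
```

If the check fails, the message can be sent back to the model as one more user turn, which gives the loop a reason to continue other than "the model felt done".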

Five Concrete Use Cases

1. Content automation: A weak prompt says “write a blog post.” A harness reads existing posts, chooses a non-duplicate topic, writes MDX, checks frontmatter, validates code fences, adds official links, inserts internal links, includes /products/ and /training/ CTAs, builds the site, and verifies the public page. The pitfall is mass-producing thin localized summaries that pass a superficial word count but fail reader intent.
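Two of the content checks named above, frontmatter and code fences, are easy to make deterministic. The following is a minimal sketch, assuming a post whose frontmatter is delimited by `---` lines. The required keys and the `checkPost` name are hypothetical, and the key extraction is line-based rather than a real YAML parse.

```javascript
// Hypothetical pre-publish check for an MDX post. The required keys are
// assumptions for illustration; use whatever your site's schema demands.
const REQUIRED_KEYS = ["title", "description", "pubDate"];
const FENCE = "`".repeat(3); // the three-backtick code fence marker

function checkPost(source) {
  const problems = [];
  const lines = source.split(/\r?\n/);

  // Frontmatter: a '---' line at the top and a closing '---' line later.
  const end = lines[0] === "---" ? lines.indexOf("---", 1) : -1;
  if (end === -1) {
    problems.push("frontmatter block is missing or unclosed");
  } else {
    const keys = lines.slice(1, end).map((line) => line.split(":")[0].trim());
    for (const key of REQUIRED_KEYS) {
      if (!keys.includes(key)) problems.push(`frontmatter is missing '${key}'`);
    }
  }

  // Code fences: an odd number of fence lines means one fence never closes.
  const fenceCount = lines.filter((line) => line.trimStart().startsWith(FENCE)).length;
  if (fenceCount % 2 !== 0) problems.push("unbalanced code fence");

  return problems;
}
```

A check like this cannot judge depth or reader intent, which is why the thin-content pitfall still needs a human or a separate review stage.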

2. Code review: A review harness reads git diff, test output, changed files, and local review rules. It returns findings first, with severity and file references. The risk is that the model summarizes the patch instead of finding regressions. The harness should force bug-first output and require a test-gap note.
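"Force bug-first output" can mean rejecting any review that does not fit a fixed shape. The sketch below assumes the model is asked to return a JSON object with `findings` and `testGaps`. That shape, the severity names, and both helper functions are hypothetical choices for illustration.

```javascript
// Hypothetical shape for review output: findings first, then a test-gap note.
const SEVERITIES = ["high", "medium", "low"];

function validateReview(review) {
  const problems = [];
  if (!Array.isArray(review.findings)) {
    return ["findings must be an array (use [] when nothing was found)"];
  }
  review.findings.forEach((finding, index) => {
    if (!SEVERITIES.includes(finding.severity)) {
      problems.push(`finding ${index}: severity must be one of ${SEVERITIES.join(", ")}`);
    }
    if (!/^.+:\d+$/.test(finding.location ?? "")) {
      problems.push(`finding ${index}: location must look like path/to/file.js:42`);
    }
  });
  if (typeof review.testGaps !== "string" || review.testGaps.trim() === "") {
    problems.push("testGaps: say which behavior is untested, or state 'none found'");
  }
  return problems;
}

// Sort a copy so the most severe findings are reported first.
function sortFindings(findings) {
  return [...findings].sort(
    (a, b) => SEVERITIES.indexOf(a.severity) - SEVERITIES.indexOf(b.severity)
  );
}
```

When `validateReview` returns problems, the harness can return them to the model as an error and ask for a corrected review instead of passing a patch summary on to a human.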

3. SaaS integration: For tools like Notion, HubSpot, Stripe, or a CRM, the harness should separate read-only lookup, dry-run mutation, and approved write. A practical workflow might classify consultation leads and stage CRM updates, but ask before writing to production. The pitfall is letting a misclassified lead, billing change, or customer note update immediately.
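The lookup, dry-run, and approved-write separation can be sketched as a thin gate in front of whatever client talks to the SaaS API. Everything below is hypothetical: `createCrmGate` is an illustrative name, and `fakeCrm` is an in-memory stand-in rather than a real Notion, HubSpot, or Stripe client.

```javascript
// Hypothetical three-stage gate for SaaS writes: lookup, dry run, approved write.
// 'crm' is any object with get(id) and update(id, fields).
function createCrmGate(crm) {
  return {
    // Stage 1: read-only lookup, always allowed.
    lookup(id) {
      return crm.get(id);
    },
    // Stage 2: describe the change without applying it.
    dryRun(id, fields) {
      const current = crm.get(id);
      const changes = Object.keys(fields)
        .filter((key) => current[key] !== fields[key])
        .map((key) => ({ key, from: current[key], to: fields[key] }));
      return { id, changes };
    },
    // Stage 3: apply only a plan that a human has explicitly approved.
    apply(plan, approval) {
      if (approval?.approved !== true) {
        throw new Error(
          `Write to ${plan.id} needs approval; show the dry-run plan to a reviewer first.`
        );
      }
      const fields = Object.fromEntries(plan.changes.map((c) => [c.key, c.to]));
      crm.update(plan.id, fields);
      return { id: plan.id, applied: plan.changes.length, approvedBy: approval.by };
    }
  };
}

// In-memory stand-in used only for this sketch.
const records = new Map([["lead-1", { stage: "new", owner: "unassigned" }]]);
const fakeCrm = {
  get: (id) => ({ ...records.get(id) }),
  update: (id, fields) => records.set(id, { ...records.get(id), ...fields })
};
```

Exposing only `lookup` and `dryRun` as model tools, and keeping `apply` behind a human action, means a misclassified lead produces a plan to review rather than a changed record.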

4. Cloud operations: Cloud work needs more than a deploy command. The harness should check environment variables, build output, diff, deployment target, rollback plan, health endpoint, and public URL. The risk is chasing the last visible log line instead of the root cause. A bounded retry loop and log summarizer help keep the agent honest.
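A log summarizer for this use case can be very simple. The sketch below covers only the summarizer half of that advice: it reports the first error-like line with a little preceding context, on the assumption that later errors are often consequences of the first. The `summarizeLog` name and the error pattern are hypothetical and would need tuning for real build or deploy logs.

```javascript
// Hypothetical log summarizer: report the FIRST error-like line and how many
// error-like lines there were in total, instead of the last visible line.
const ERROR_PATTERN = /\b(error|failed|fatal|exception)\b/i;

function summarizeLog(logText, contextLines = 2) {
  const lines = logText.split(/\r?\n/);
  const errorIndexes = [];
  lines.forEach((line, index) => {
    if (ERROR_PATTERN.test(line)) errorIndexes.push(index);
  });
  if (errorIndexes.length === 0) {
    return { firstError: null, errorCount: 0, context: [] };
  }
  const first = errorIndexes[0];
  return {
    firstError: lines[first],
    errorCount: errorIndexes.length,
    // A few lines before the first error usually show what was being attempted.
    context: lines.slice(Math.max(0, first - contextLines), first)
  };
}
```

Feeding the model this summary instead of the raw log also keeps long deploy output from crowding out the rest of the context.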

5. Security boundaries: Security is not a final polish step. Reads can be broad, writes should be scoped, shell commands should be allow-listed, and destructive commands should be denied or ask-first. Deletions with rm, force pushes, production database writes, billing changes, and secret access each deserve an explicit policy rule. The harness exists because trust alone is not a control.
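For shell commands, the allow, ask, and deny split can be expressed as ordered pattern lists. This is a minimal sketch under stated assumptions: the patterns, the `commandPolicy` object, and `classifyCommand` are hypothetical, and prefix matching on a command string is a starting point rather than a complete defense.

```javascript
// Hypothetical command policy. Deny rules are checked first, then ask-first
// rules, then the allow-list; anything unmatched falls back to "ask".
const commandPolicy = {
  deny: [/^rm\s+-[a-z]*r/i, /^git\s+push\s+.*(--force|-f)\b/],
  ask: [/^git\s+push\b/, /^npm\s+publish\b/],
  allow: [/^git\s+(status|diff|log)\b/, /^npm\s+(test|run\s+build)\b/, /^ls\b/]
};

// Shell metacharacters: ; & | backtick (written as \x60) and $( substitution.
const CHAINING = /[;&|\x60]|\$\(/;

function classifyCommand(command, policy = commandPolicy) {
  const trimmed = command.trim();
  if (policy.deny.some((pattern) => pattern.test(trimmed))) return "deny";
  // Chained or substituted commands could hide a risky command behind an
  // allowed prefix, so they always go to a human.
  if (CHAINING.test(trimmed)) return "ask";
  if (policy.ask.some((pattern) => pattern.test(trimmed))) return "ask";
  if (policy.allow.some((pattern) => pattern.test(trimmed))) return "allow";
  return "ask";
}

console.log(classifyCommand("git status"));           // prints allow
console.log(classifyCommand("git push origin main")); // prints ask
console.log(classifyCommand("rm -rf build"));         // prints deny
```

Defaulting unmatched commands to "ask" rather than "allow" is the design choice that matters most here: new commands start out supervised.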

What To Borrow From Claude Code

Borrow three ideas first.

Context layering: keep stable project rules in CLAUDE.md or equivalent project instructions, session-only notes in a task plan, and durable preferences in memory. This prevents stale one-off decisions from becoming permanent rules.

Hooks: move deterministic checks out of the model. Formatting, linting, tests, link checks, and screenshot verification should run as commands. Claude can interpret failures, but the checks should not depend on Claude’s mood.

Delegation: isolate noisy work. Long logs, broad search, multi-language translation, and large refactors can be handled by subagents or separate workflow stages. The main context should retain decisions, not every intermediate token.

Common Pitfalls

Too many tools: 30 tools create choice overload. Start with 5 to 10 focused tools and split the rest into separate workflows.

Unreadable errors: A message such as “Error: failed” is useless. Return what failed, what was expected, and what the model can try next.
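As a sketch of what a readable error might look like, the hypothetical `toolError` helper below builds a three-part message. The field names are illustrative; the point is only that each part is always present.

```javascript
// Hypothetical helper that turns a failure into a message the model can act on:
// what failed, what was expected, and what to try next.
function toolError({ tool, failed, expected, next }) {
  return [
    `${tool} failed: ${failed}`,
    `Expected: ${expected}`,
    `Try next: ${next}`
  ].join("\n");
}

const message = toolError({
  tool: "read_file",
  failed: "no file at 'docs/README.md'",
  expected: "a path relative to the workspace root, such as 'README.md'",
  next: "read 'README.md', or write the file first if it should exist"
});
console.log(message);
```

In a harness like the mini example, a string like this would go into the `content` of a `tool_result` block with `is_error: true`, so the model sees it on its next turn.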

No prompt caching: repeated static instructions cost time and money. Anthropic prompt caching defaults to a 5-minute lifetime, with a 1-hour option for slower workflows.
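In the Messages API, caching is requested by marking a content block with `cache_control`. The sketch below only builds request parameters and does not call the API. The `buildCachedRequest` name is hypothetical, and the `ttl` field for the 1-hour option is written from memory of the prompt caching docs, so treat it as an assumption to verify before use.

```javascript
// Sketch: build Messages API parameters with the static instructions marked
// as cacheable. Nothing here calls the API; the result is meant to be passed
// to client.messages.create(...) in a harness.
function buildCachedRequest({ model, staticInstructions, messages, ttl }) {
  const cacheControl = { type: "ephemeral" };
  // Assumption: the optional ttl field accepts "5m" or "1h"; confirm against
  // the current prompt caching docs before relying on the 1-hour option.
  if (ttl) cacheControl.ttl = ttl;
  return {
    model,
    max_tokens: 1200,
    system: [
      { type: "text", text: staticInstructions, cache_control: cacheControl }
    ],
    messages
  };
}

const request = buildCachedRequest({
  model: process.env.ANTHROPIC_MODEL || "your-model-id", // placeholder, not a real model name
  staticInstructions: "Project rules, style guide, and tool usage notes go here.",
  messages: [{ role: "user", content: "Summarize README.md." }]
});
console.log(JSON.stringify(request.system, null, 2));
```

Very short prompts fall below the documented minimum cacheable length and are not cached, so this pays off mainly when the static block is large. The `usage` object on the response reports cache writes and reads (`cache_creation_input_tokens` and `cache_read_input_tokens`), which is the simplest way to confirm the cache is being hit.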

No verification: generated content is not the same as accepted content. Articles need frontmatter and code-fence checks. Code needs tests. Cloud work needs health checks. SaaS writes need audit logs.

Permission drift: temporary convenience often becomes permanent danger. Review allow, ask, and deny rules regularly.
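A regular review is easier when each rule records when it was granted and why. The sketch below assumes a hypothetical rule format with `grantedAt` and `reason` fields and flags allow rules older than a chosen age. Nothing here corresponds to a real Claude Code settings schema.

```javascript
// Hypothetical rule format: every rule records when it was granted and why.
// Only allow rules are flagged, since stale deny rules are the safe direction.
function findStaleRules(rules, now, maxAgeDays = 30) {
  const maxAgeMs = maxAgeDays * 24 * 60 * 60 * 1000;
  return rules.filter(
    (rule) => rule.decision === "allow" && now - new Date(rule.grantedAt) > maxAgeMs
  );
}

const rules = [
  { pattern: "npm test", decision: "allow", grantedAt: "2025-01-05", reason: "CI parity" },
  { pattern: "git push", decision: "allow", grantedAt: "2025-03-20", reason: "release week" },
  { pattern: "rm -rf", decision: "deny", grantedAt: "2024-11-01", reason: "never unattended" }
];

const stale = findStaleRules(rules, new Date("2025-04-01"));
console.log(stale.map((rule) => `${rule.pattern} (${rule.reason})`));
```

Printing the recorded reason next to each stale rule makes the review question concrete: is "release week" still true, or did a temporary convenience outlive its purpose?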

Next Steps

If you want to harden your first harness, read the Claude Code Permissions Guide next. For project-level context, use the CLAUDE.md best practices guide. For splitting heavy work, see Claude Code subagent patterns, and for cost control read Claude Code token optimization.

For a lightweight desk reference, keep the free Claude Code Quick Reference Cheatsheet open. If you want packaged templates and playbooks, start from /products/. If the hard part for your team is the operating model (permissions, review gates, publishing, or revenue operations), use /training/.

What I Tested

The practical lesson from using this pattern on ClaudeCodeLab is simple: quality improves when failure becomes visible. A prompt can produce an article, but a harness can tell you whether the article has enough depth, valid code fences, working links, correct frontmatter, localized CTAs, and a public URL that renders. That is the difference between trusting an output and operating a workflow.

Summary

Harness engineering is the discipline of deciding what the model may see, what it may do, where it must stop, and how the result is verified. Claude Code is one of the clearest teaching examples because its value is not only the model but the scaffold around it. Start with the mini harness above, then add one boundary and one verification step for your own use case.
