Safe Markdown and MDX Processing with Claude Code

Why Markdown Processing Needs More Than Regex

A published Markdown or MDX article is not just text. It carries frontmatter, SEO metadata, headings, generated IDs, code fences, internal links, external references, locale-specific routes, and sometimes raw HTML. If you ask Claude Code to “clean up this article” without a processing contract, the result may read better while silently changing the slug, dropping a CTA, breaking a code fence, or leaving one locale as a thin summary.

The practical rule is simple: let Claude Code edit prose, but make the structure machine-checkable. Use an AST, or abstract syntax tree, for Markdown and MDX. Validate frontmatter as data. Treat HTML sanitization as a security boundary. Run locale and build checks before publishing.

I verified the core references on June 2, 2026. The unified guide explains the parse, transform, and stringify pipeline, while unified syntax trees explain why AST nodes are safer than line matching. Markdown parsing is covered by remark and remark-parse. MDX syntax is documented in the MDX docs. For frontmatter, use gray-matter. For raw HTML and XSS risk, compare rehype-sanitize with the OWASP XSS Prevention Cheat Sheet. Claude Code workflow boundaries are easier to enforce when you also read the official Claude Code overview and settings.

flowchart LR
  A["MDX file"] --> B["frontmatter"]
  B --> C["schema validation"]
  A --> D["remark / MDX AST"]
  D --> E["headings, fences, links"]
  D --> F["rehype HTML pipeline"]
  F --> G["sanitize"]
  C --> H["locale and build checks"]
  E --> H
  G --> H

Choose the Parser by the Job

The first instruction to Claude Code should name the tool chain. A vague “parse Markdown” request often produces a quick regex. That works for a toy file and fails on real articles.

| Need | Better choice | Risky shortcut |
| --- | --- | --- |
| Read headings, links, and code fences | `remark-parse` plus AST traversal | `^##` regex on raw text |
| Handle JSX inside `.mdx` | `remark-mdx` or the MDX compiler | Markdown-only parser |
| Render HTML | `remark-rehype` into the rehype pipeline | String concatenation |
| Accept raw HTML | `rehype-raw` followed by `rehype-sanitize` | `allowDangerousHtml` alone |
| Read frontmatter | `gray-matter` and schema checks | Splitting YAML by hand |

Regex is still useful for narrow checks, such as finding an exact literal in a file. It should not be the source of truth for Markdown structure. A code fence can contain a line such as `## Not a heading`; an MDX component can carry links as props; YAML can turn a list into a string if the author forgets the brackets. AST traversal and schema validation catch those cases before readers see them.
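To make the code-fence case concrete, here is a minimal sketch in plain JavaScript with no dependencies. The sample text and the `headingsOutsideFences` helper are assumptions made up for illustration. The helper is a toy fence tracker, not a Markdown parser, and a real pipeline would read `heading` nodes from `remark-parse` instead.

```javascript
// Build the fence marker indirectly so this snippet can itself sit inside a code fence.
const fence = "`".repeat(3);

// Hypothetical article body: one real heading, one heading-like line inside a fence.
const sample = [
  "## Real heading",
  "",
  fence + "md",
  "## Not a heading",
  fence,
].join("\n");

// Naive approach: every line that starts with "## " counts as a heading.
const regexHeadings = sample.match(/^## .+$/gm) ?? [];

// Toy fence-aware scan, only to show why structure matters.
function headingsOutsideFences(text) {
  const found = [];
  let inFence = false;
  for (const line of text.split("\n")) {
    if (line.startsWith(fence)) {
      inFence = !inFence; // entering or leaving a fenced block
      continue;
    }
    if (!inFence && line.startsWith("## ")) found.push(line);
  }
  return found;
}

console.log(regexHeadings.length); // 2
console.log(headingsOutsideFences(sample).length); // 1
```

The regex reports two headings because it cannot tell that the second line lives inside a fence. Even the toy scanner only covers this one case; nested fences, indented code, and MDX expressions are the reason to rely on a real AST.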

Four Practical Use Cases

The first use case is refreshing a published blog article. You need updated metadata, a stronger introduction, official links, internal links, working code, and a revenue CTA. For ClaudeCodeLab, that means connecting this topic to guides such as CLAUDE.md best practices and web scraping with Claude Code without changing unrelated slugs.

The second use case is a docs site that mixes Markdown and MDX components. Callouts, tabs, pricing cards, and live examples are useful, but they make regex parsing brittle. The checker must understand both Markdown nodes and MDX nodes.

The third use case is multilingual publishing. A strong Japanese canonical article does not help if English, Spanish, Indonesian, and other locales become short summaries. Each locale needs the same depth: use cases, pitfalls, snippets, official links, internal links, CTA, and verification notes.

The fourth use case is commercial content operations. Product pages, Gumroad landing pages, training pages, and email resources often reuse Markdown. Broken code fences and unsafe HTML reduce trust exactly where conversion matters. If an article sends readers to products or training, the content pipeline has to protect those links.

Copy-Paste Setup

The snippets below use Node.js 18 or newer. They are written as ES modules (ESM) with top-level await, so they can run directly from a small tools folder.

mkdir mdx-audit-demo
cd mdx-audit-demo
npm init -y
npm pkg set type=module
npm install unified remark-parse remark-mdx remark-gfm gray-matter
npm install unist-util-visit github-slugger
npm install remark-rehype rehype-raw rehype-sanitize rehype-stringify
mkdir tools

Example 1: Audit Frontmatter, Headings, Fences, and Links

This script reads frontmatter with gray-matter, parses the body with remark and MDX support, then reports headings, code fences, and links. It fails if required metadata is missing, the description is too long, a code fence has no language, or the article lacks internal or external links.

// tools/audit-mdx.mjs
import fs from "node:fs/promises";
import matter from "gray-matter";
import GithubSlugger from "github-slugger";
import { unified } from "unified";
import remarkParse from "remark-parse";
import remarkMdx from "remark-mdx";
import remarkGfm from "remark-gfm";
import { visit } from "unist-util-visit";

const file = process.argv[2];
if (!file) {
  throw new Error("Usage: node tools/audit-mdx.mjs article.mdx");
}

const source = await fs.readFile(file, "utf8");
const { data, content } = matter(source);
const errors = [];
const links = { internal: [], external: [] };
const codeBlocks = [];
const headings = [];

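// Text fields must be non-empty strings.
for (const key of ["title", "description", "heroImage", "lang"]) {
  if (typeof data[key] !== "string" || data[key].trim() === "") {
    errors.push(`frontmatter.${key} is required`);
  }
}

// YAML parses an unquoted date such as 2026-06-02 into a Date, so accept both forms.
const hasPubDate =
  data.pubDate instanceof Date ||
  (typeof data.pubDate === "string" && data.pubDate.trim() !== "");
if (!hasPubDate) errors.push("frontmatter.pubDate is required");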

if ([...String(data.description ?? "")].length > 120) {
  errors.push("description must be 120 characters or fewer");
}

if (!Array.isArray(data.tags) || data.tags.length === 0) {
  errors.push("frontmatter.tags must be a non-empty array");
}

const tree = unified()
  .use(remarkParse)
  .use(remarkMdx)
  .use(remarkGfm)
  .parse(content);

const slugger = new GithubSlugger();

visit(tree, (node) => {
  if (node.type === "heading") {
    const text = plainText(node);
    headings.push({ depth: node.depth, text, slug: slugger.slug(text) });
  }

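  if (node.type === "code") {
    codeBlocks.push({ lang: node.lang || "", meta: node.meta || "" });
    if (!node.lang) {
      // Line numbers are relative to the body after frontmatter, not the whole file.
      const line = node.position?.start.line ?? "?";
      errors.push(`code fence at body line ${line} is missing a language`);
    }
  }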

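  if (node.type === "link") {
    const url = String(node.url || "");
    // Count only real http(s) URLs as external, and keep protocol-relative
    // "//host" URLs out of the internal list.
    if (/^https?:\/\//i.test(url)) links.external.push(url);
    else if (url.startsWith("/") && !url.startsWith("//")) links.internal.push(url);
  }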
});

if (links.internal.length === 0) errors.push("missing internal link");
if (links.external.length === 0) errors.push("missing external link");

if (errors.length > 0) {
  console.error(errors.map((error) => `- ${error}`).join("\n"));
  process.exit(1);
}

console.log(JSON.stringify({ headings, codeBlocks, links }, null, 2));

function plainText(node) {
  if (typeof node.value === "string") return node.value;
  if (!Array.isArray(node.children)) return "";
  return node.children.map(plainText).join("");
}

Run it against one file first, then wire it into CI only after the false positives are understood.

node tools/audit-mdx.mjs site/src/content/blog-en/example.mdx
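One detail in the audit script is easy to miss: it creates a single GithubSlugger instance and reuses it for every heading, which is what keeps two identical headings from receiving the same anchor. The sketch below imitates that idea in plain JavaScript. `createSlugger` is a hypothetical, simplified stand-in, not github-slugger's actual algorithm, and its character handling is an assumption.

```javascript
// Simplified stand-in for a slugger. NOT github-slugger's real algorithm.
function createSlugger() {
  const seen = new Map(); // base slug -> how many times it has been issued
  return function slug(text) {
    const base = text
      .toLowerCase()
      .trim()
      .replace(/[^\p{L}\p{N}\s-]/gu, "") // drop punctuation, keep letters and digits
      .replace(/\s+/g, "-");
    const count = seen.get(base) ?? 0;
    seen.set(base, count + 1);
    // The first occurrence keeps the plain slug; repeats get a numeric suffix.
    return count === 0 ? base : `${base}-${count}`;
  };
}

const slug = createSlugger();
const slugs = ["Setup", "Usage", "Setup"].map((heading) => slug(heading));
console.log(slugs); // [ 'setup', 'usage', 'setup-1' ]
```

If a fresh slugger were created per heading, both "Setup" sections would resolve to the same anchor. This toy version also ignores the case where a heading already ends in a number, so treat it as an explanation of the idea rather than a replacement for the library.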

Example 2: Convert Markdown to Safe HTML

If you never need raw HTML in Markdown, do not enable it. If the product requires raw HTML, sanitize after parsing. The unsafe pattern is to pass allowDangerousHtml and stop there.

// tools/markdown-to-safe-html.mjs
import fs from "node:fs/promises";
import { unified } from "unified";
import remarkParse from "remark-parse";
import remarkGfm from "remark-gfm";
import remarkRehype from "remark-rehype";
import rehypeRaw from "rehype-raw";
import rehypeSanitize, { defaultSchema } from "rehype-sanitize";
import rehypeStringify from "rehype-stringify";

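const file = process.argv[2];
if (!file) {
  throw new Error("Usage: node tools/markdown-to-safe-html.mjs article.md");
}
const markdown = await fs.readFile(file, "utf8");

// Start from the default allowlist and permit only language-* class names
// on <code>, so syntax-highlighting hooks survive sanitization.
const schema = {
  ...defaultSchema,
  attributes: {
    ...defaultSchema.attributes,
    code: [["className", /^language-/]],
  },
};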

const html = await unified()
  .use(remarkParse)
  .use(remarkGfm)
  .use(remarkRehype, { allowDangerousHtml: true })
  .use(rehypeRaw)
  .use(rehypeSanitize, schema)
  .use(rehypeStringify)
  .process(markdown);

console.log(String(html));

The important detail is the order. rehype-raw parses raw HTML into the HTML tree; rehype-sanitize then removes disallowed tags and attributes. Without the second step, user-authored HTML can carry unsafe attributes into the rendered page.

Example 3: Check Locale Files for the Same Slug

For a ten-language site, run a small consistency script before review. This catches the common mistake where the canonical file is updated but one translation keeps the old hero image or is missing updatedDate.

// tools/check-locales.mjs
import fs from "node:fs";
import path from "node:path";
import matter from "gray-matter";

const slug = "claude-code-markdown-processing.mdx";
const expectedHero = "/images/hero/hero-077.png";
const locales = [
  ["ja", "site/src/content/blog"],
  ["en", "site/src/content/blog-en"],
  ["zh", "site/src/content/blog-zh"],
  ["ko", "site/src/content/blog-ko"],
  ["es", "site/src/content/blog-es"],
  ["fr", "site/src/content/blog-fr"],
  ["de", "site/src/content/blog-de"],
  ["pt", "site/src/content/blog-pt"],
  ["hi", "site/src/content/blog-hi"],
  ["id", "site/src/content/blog-id"],
];

const errors = [];

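for (const [lang, dir] of locales) {
  const file = path.join(dir, slug);
  // Report a missing translation instead of crashing on the first one.
  if (!fs.existsSync(file)) {
    errors.push(`${lang}: file not found`);
    continue;
  }
  const source = fs.readFileSync(file, "utf8");
  const { data, content } = matter(source);
  // Unquoted YAML dates arrive as Date objects, so normalize before comparing.
  const updated =
    data.updatedDate instanceof Date
      ? data.updatedDate.toISOString().slice(0, 10)
      : String(data.updatedDate ?? "");
  if (data.lang !== lang) errors.push(`${lang}: lang mismatch`);
  if (data.heroImage !== expectedHero) errors.push(`${lang}: hero changed`);
  if (updated !== "2026-06-02") {
    errors.push(`${lang}: updatedDate mismatch`);
  }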
  if ([...String(data.description ?? "")].length > 120) {
    errors.push(`${lang}: description too long`);
  }
  if (!content.includes("https://")) errors.push(`${lang}: no external link`);
  if (!content.includes("](/")) errors.push(`${lang}: no internal link`);
}

if (errors.length > 0) {
  console.error(errors.map((error) => `- ${error}`).join("\n"));
  process.exit(1);
}

console.log("locale set is consistent");
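Both scripts measure the description with a spread (`[...String(value)].length`) rather than plain `.length`. The reason matters on a multilingual site: `.length` counts UTF-16 code units, so characters outside the Basic Multilingual Plane count twice. A minimal sketch follows; `codePointLength` is a hypothetical helper name and the sample string is made up.

```javascript
// Count Unicode code points rather than UTF-16 code units.
function codePointLength(value) {
  return [...String(value ?? "")].length;
}

// "𠮷" (U+20BB7) sits outside the Basic Multilingual Plane, so it takes two code units.
const description = "𠮷野家";
console.log(description.length); // 4
console.log(codePointLength(description)); // 3
```

Code points are still not the same as what a reader perceives as one character. Combined emoji and some accented sequences span several code points, so a limit of 120 is best treated as an approximate budget.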

Failure Modes to Show Claude Code Up Front

| Failure | Result | Guardrail |
| --- | --- | --- |
| Regex reads headings | Code fence text enters the table of contents | Traverse `heading` nodes |
| `tags` becomes a string | Filters and related posts break | Validate frontmatter types |
| Duplicate headings | Anchor links point to the wrong section | Generate slugs consistently |
| Raw HTML is trusted | XSS risk through attributes or tags | Sanitize with a schema |
| External links are not checked | Official docs move silently | Probe before publishing |
| Prompt scope is broad | Other workers’ files are modified | Lock `owned_files` |

The failure examples matter because Claude Code responds better to explicit constraints than to taste-based review comments. “Make it high quality” is weak. “Do not parse headings with regex, preserve heroImage, keep description under 120 characters, and do not touch other slugs” is actionable.
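The "probe before publishing" guardrail can be sketched in a few lines. This is a minimal version under stated assumptions: it relies on the global `fetch` that ships with Node.js 18 and newer, `probeLinks` is a hypothetical helper name, and the `fetchImpl` parameter exists only so the function can be tried offline with a stub.

```javascript
// Sketch: request each external link and record whether it still resolves.
// `fetchImpl` is injectable so the function can be exercised without network access.
async function probeLinks(urls, fetchImpl = fetch) {
  const results = [];
  for (const url of urls) {
    try {
      const response = await fetchImpl(url, { method: "HEAD", redirect: "follow" });
      results.push({ url, ok: response.ok, status: response.status });
    } catch (error) {
      // Network failures (DNS, TLS, timeouts) are reported instead of thrown.
      const reason = error instanceof Error ? error.message : String(error);
      results.push({ url, ok: false, status: null, reason });
    }
  }
  return results;
}

// Offline demonstration with a stub in place of the real fetch.
const stubFetch = async (url) =>
  url.includes("moved") ? { ok: false, status: 404 } : { ok: true, status: 200 };

const probeDemo = probeLinks(
  ["https://example.com/docs", "https://example.com/moved"],
  stubFetch,
);
probeDemo.then((results) => console.log(results.filter((result) => !result.ok)));
```

Some servers reject HEAD requests, so a real check may need to fall back to GET before declaring a link broken. Feeding this function the `links.external` array from the audit script is one plausible way to connect the two.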

Safe Prompt Contract

Use a prompt contract like this when refreshing published content in a busy repository.

task: "Refresh one published MDX article"
owned_files:
  - "site/src/content/blog-en/claude-code-markdown-processing.mdx"
preserve:
  - "slug path"
  - "heroImage"
  - "unrelated dirty files"
required:
  - "updatedDate: 2026-06-02"
  - "description <= 120 characters"
  - "AST-based Markdown checks"
  - "official external links"
  - "internal links"
  - "monetization CTA"
forbidden:
  - "regex-only heading parsing"
  - "raw HTML without sanitization"
  - "thin locale summaries"
verification:
  - "node scripts/check-code-fences.mjs"
  - "node scripts/check-updated-article-quality.mjs"
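A contract is most useful when part of it can be checked mechanically. As one possible sketch, the `owned_files` rule can be enforced by comparing the list of changed files against the contract. `findOutOfScope` is a hypothetical helper, and the second path in `changedFiles` is invented for the demonstration; in practice the list might come from `git diff --name-only`.

```javascript
// Hypothetical helper: report changed files that fall outside the contract.
function findOutOfScope(changedFiles, ownedFiles) {
  const owned = new Set(ownedFiles);
  return changedFiles.filter((file) => !owned.has(file));
}

// Taken from the contract above.
const ownedFiles = ["site/src/content/blog-en/claude-code-markdown-processing.mdx"];

// Made-up change list for illustration.
const changedFiles = [
  "site/src/content/blog-en/claude-code-markdown-processing.mdx",
  "site/src/content/blog-en/another-article.mdx",
];

const outOfScope = findOutOfScope(changedFiles, ownedFiles);
console.log(outOfScope); // [ 'site/src/content/blog-en/another-article.mdx' ]
```

A non-empty result is a reason to stop and review before committing, since it means the edit touched a file the contract did not grant.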

Publish Checks and CTA

Before publishing, run the local scripts and then do a human pass. The machine pass checks fences, metadata, links, and article depth. The human pass checks whether the examples fit the reader, whether paragraphs are short enough on mobile, and whether the CTA is natural.

node tools/audit-mdx.mjs site/src/content/blog-en/claude-code-markdown-processing.mdx
node tools/check-locales.mjs
node scripts/check-code-fences.mjs
node scripts/check-updated-article-quality.mjs

For monetization, keep the next step contextual. Individual users can start with the free Claude Code cheatsheet. Readers who want repeatable review and writing prompts can use Claude Code prompt templates. Teams that need permissions, CI checks, locale workflow, and review habits should use Claude Code training and consultation.

Hands-On Verification Note

For this refresh, Masa treated the article as a real content pipeline problem rather than a prose-only rewrite. The most useful checks were: description length, updatedDate, hero image preservation, code-fence languages, official external links, locale depth, and internal CTA links. The AST-based audit caught the category of mistakes that regex would miss, especially headings inside code blocks and MDX syntax near components. The final local commands were node scripts/check-code-fences.mjs and node scripts/check-updated-article-quality.mjs. The main lesson is that Claude Code becomes reliable when the article contract is executable, not when the prompt merely asks for better writing.
