Performance Regression Detection (Benchmark)
Core Concepts / How It Works
Establishing a Performance Baseline
The Benchmark skill first measures the current state of performance and saves it as a baseline. Without a baseline, there's no way to judge whether something is "faster" or "slower." Typical measurement targets include:
- Page load time: Time to First Byte (TTFB), First Contentful Paint (FCP), Largest Contentful Paint (LCP)
- Core Web Vitals: LCP / FID (Interaction to Next Paint, INP) / CLS
- Resource size: Total bundle JS, CSS, images, and fonts
- Lighthouse score: Performance / Accessibility / Best Practices / SEO
When to Use
- When you wonder "why is this page so slow?"
- When you want to verify the performance impact of a PR in numbers before merging
- When you want to regularly track Lighthouse scores and Core Web Vitals like LCP, FID, and CLS
- When monitoring whether bundle size is silently growing with each deployment
- When you receive tasks containing keywords like "performance", "benchmark", "page speed", "lighthouse", "web vitals", "bundle size", "load time"
Browse Daemon Integration
This skill uses a Browse daemon internally. The Browse daemon controls a headless browser to simulate a real user environment and collects screenshots and performance metrics. Measurements can be automated via CLI commands or WebSocket messages.
Per-PR Comparison (Before/After)
Once you have a baseline, you can run the same measurements again each time a PR branch is deployed and output a diff table of numerical changes. For example, if LCP increases from 1.8s to 2.4s, a regression is flagged for that PR.
Trend Tracking
Beyond single comparisons, storing measurement history as a time series lets you see which commit caused a performance drop on a graph. Integrating into a CI pipeline enables continuous monitoring of main branch performance trends.
One-Line Summary
A skill that uses the Browse daemon to establish performance baselines (page load time, Core Web Vitals, bundle size, etc.) and automatically detect performance regressions by comparing before/after for each PR.
Getting Started
/benchmarkSKILL.md file location: ~/.claude/skills/benchmark/SKILL.md
Copy and modify the SKILL.md content if customization is needed.
Practical Example
Situation: Club members have reported that the notice list page (/notices) of a "Student Club Notice Board" built with Next.js 15 + TypeScript feels slower after deployment. You need to identify which PR caused it.
Step 1 — Measuring the Baseline
# In a Claude Code session
> Use the benchmark skill to measure the current performance baseline of the /notices pageClaude starts the Browse daemon and returns measurements like this:
[Benchmark] Baseline established: /notices
┌─────────────────────┬──────────┐
│ Metric │ Current │
├─────────────────────┼──────────┤
│ TTFB │ 312 ms │
│ FCP │ 890 ms │
│ LCP │ 1 820 ms │
│ CLS │ 0.04 │
│ JS bundle (gzipped) │ 142 KB │
│ Lighthouse Perf │ 78 │
└─────────────────────┴──────────┘
Baseline saved → .benchmark/baseline-notices.jsonStep 2 — Checking Out the Suspect PR and Comparing
git checkout feature/add-realtime-notifications
pnpm build && pnpm start &
# In a Claude Code session
> Compare the branch I just built with the baseline[Benchmark] Comparison result: /notices (baseline vs feature/add-realtime-notifications)
┌─────────────────────┬──────────┬──────────┬──────────────────────┐
│ Metric │ baseline │ PR │ Change │
├─────────────────────┼──────────┼──────────┼──────────────────────┤
│ LCP │ 1 820 ms │ 2 490 ms │ +37% ⚠ Regression │
│ JS bundle (gzipped) │ 142 KB │ 218 KB │ +53% ⚠ Critical │
│ Lighthouse Perf │ 78 │ 61 │ -17 pts ⚠ Regression │
└─────────────────────┴──────────┴──────────┴──────────────────────┘
Suspected cause: socket.io-client (76 KB) included in bundle — needs investigationStep 3 — Code Fix and Improvement
// app/notices/page.tsx — Remove unnecessary client-side import in Server Component
// Before: socket client imported in a server component
// After: split with dynamic import + ssr: false
import dynamic from 'next/dynamic'
// Socket connection only needed on client → code splitting
const RealtimeNotifier = dynamic(
() => import('@/components/RealtimeNotifier'),
{ ssr: false }
)
export default async function NoticesPage() {
const notices = await fetchNotices() // fetch initial data on server
return (
<>
<NoticeList notices={notices} />
<RealtimeNotifier /> {/* client-only, bundle separated */}
</>
)
}Learning Points / Common Pitfalls
- "Gut feel" must be converted to numbers before discussion is possible: If a club member says "it's slow" but you don't know what the LCP is in ms, you can't verify whether it's improved. Benchmark is the first step in converting subjective feedback into objective data.
- Common mistake — measuring only in development environment: HMR mode in
pnpm devdoesn't apply production bundle optimizations, distorting measurements. Always measure usingpnpm build && pnpm start(or a staging URL). - Next.js 15 tip —
next/bundle-analyzer: When a bundle size regression is detected, immediately runANALYZE=true pnpm buildto visually identify which package grew. - CI integration tip: Running Benchmark for every PR in GitHub Actions and marking the PR check as failed when LCP exceeds +20% ensures performance regressions aren't missed by humans.
- Core Web Vitals thresholds: By Google's standard, LCP ≤ 2.5s, INP ≤ 200ms, CLS ≤ 0.1 is "Good." Using these numbers as thresholds makes pass/fail judgments clear.
Related Resources
- browse — Quick browser validation for a single scenario (the engine behind Benchmark)
- qa — Full site quality scoring
- canary — Real-time post-deployment performance regression detection
| Field | Value |
|---|---|
| Original URL | https://docs.anthropic.com/en/docs/claude-code/skills |
| Author/Source | Anthropic (gstack ecosystem) |
| License | Commentary MIT, original for reference only |
| Translation Date | 2026-04-13 |