
How we benchmark React libraries: methodology and tools

Learn the 5-axis framework we use to benchmark React libraries. Covers bundle analysis, runtime profiling, accessibility audits, and statistical rigor.

DomiDex, Creator of Tour Kit
April 8, 2026 · 13 min read

Most library benchmarks are theater. Someone runs a single vite build, screenshots the output size, and declares victory. No confidence intervals. No controlled environment. No mention of what they actually measured.

We publish comparison articles on this blog, and we got tired of other benchmarks that hand-wave through methodology. So we built a protocol. Five measurement axes, statistical significance requirements, and reproducible test setups that anyone can run themselves.


This article documents the exact methodology behind every benchmark we publish. We built Tour Kit, so any comparison involving it comes with built-in bias. Publishing our methodology is how we keep ourselves honest.

What is a library benchmark methodology?

A library benchmark methodology is a documented, reproducible protocol for measuring how a React library affects your application across five dimensions: bundle weight, runtime speed, accessibility compliance, developer experience, and long-term maintenance health. Unlike ad hoc benchmarks that test one metric in isolation, a methodology defines controlled environments, statistical thresholds, and measurement tools before any data collection begins. Fewer than 5% of "benchmark comparison" blog posts in the React ecosystem describe their methodology in enough detail to reproduce the results (Smashing Magazine, 2022), and our spot checks through April 2026 suggest that hasn't improved.

Why benchmarking methodology matters for React teams

Picking a React library based on a single metric (usually GitHub stars or a blog post's bundle size screenshot) is how teams end up with regret six months later. A disciplined benchmark methodology prevents three specific failures: choosing a library that tanks your Core Web Vitals in production, missing accessibility violations that only surface during audits, and ignoring maintenance decay that leaves you stuck on a dead project. According to HTTP Archive field data, median page JavaScript weight hit 509KB in 2025, and every library you add compounds that problem. The methodology below is how we separate signal from noise.

Why single-run benchmarks fail

Running vite build once and comparing output sizes tells you almost nothing. JavaScript engines apply JIT optimizations that vary between runs. Background processes steal CPU cycles unpredictably. Garbage collection timing shifts results by 10-30% across identical executions on the same machine.

Nolan Lawson, who built Google's Tachometer benchmarking tool, catalogs the common traps: "Measuring unintended code paths, confirmation bias ('got the answer you wanted, so you stopped looking'), cached performance skewing results, JavaScript engine optimizations eliminating test code, inadequate sample sizes" (source).

The fix isn't complicated. Run enough iterations to reach statistical significance. Interleave tests so environment drift affects all candidates equally. Report confidence intervals, not just averages.

The Node.js project requires an independent 2-sample t-test with p < 0.05 before accepting any benchmark claim. We apply the same standard.
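The t-statistic itself is a few lines of arithmetic. A minimal sketch in TypeScript (Welch's variant, which does not assume equal variances; the sample arrays and helper names are illustrative, not part of any tool):

```typescript
// Welch's two-sample t-statistic: compares two independent benchmark
// sample sets without assuming equal variances.
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function variance(xs: number[]): number {
  const m = mean(xs);
  return xs.reduce((a, b) => a + (b - m) ** 2, 0) / (xs.length - 1);
}

function welchT(a: number[], b: number[]): number {
  const se = Math.sqrt(variance(a) / a.length + variance(b) / b.length);
  return (mean(a) - mean(b)) / se;
}

// Illustrative timings (ms) from two libraries, 5 iterations each.
const libA = [12.1, 11.8, 12.4, 12.0, 11.9];
const libB = [14.2, 13.9, 14.5, 14.1, 14.0];

// |t| well above ~2 suggests the gap is unlikely to be noise; a real
// pipeline converts t to a p-value using the Welch–Satterthwaite
// degrees of freedom before comparing against p < 0.05.
console.log(welchT(libA, libB).toFixed(2));
```

In practice we let Tachometer handle this, but the hand-rolled version is useful for sanity-checking one-off measurements.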

The five-axis evaluation framework

A complete React library evaluation requires measuring five distinct axes: bundle weight, runtime performance, accessibility compliance, developer experience, and maintenance health. Testing only bundle size or only runtime speed gives you a distorted picture, because a 3KB library that fails WCAG audits isn't actually better than a 12KB one that ships accessible markup and focus management. Most published benchmarks cover one axis, maybe two. We cover all five.

| Axis | What we measure | Primary tool | Pass threshold |
| --- | --- | --- | --- |
| Bundle weight | Gzipped production size, tree-shaking effectiveness, dependency count | source-map-explorer + bundlephobia | <15KB gzipped for a tour library |
| Runtime performance | Initialization time, re-render cost, INP impact, memory allocation | Tachometer + Chrome DevTools | INP <200ms with library active |
| Accessibility | axe-core violations, keyboard navigation, screen reader announcements | axe-core + manual audit | Zero critical/serious violations |
| Developer experience | Time-to-first-component, TypeScript coverage, API surface area | Stopwatch + TS compiler | Under 30 min to working tour |
| Maintenance health | Commit frequency, open issue age, React 19 support, breaking changes per year | GitHub API + npm registry | Active within 90 days |

No library wins all five. Tour Kit scores well on bundle weight and accessibility but has a smaller community than React Joyride (603K weekly npm downloads as of April 2026). That's a real tradeoff, and a methodology that ignores it is just marketing.

Axis 1: bundle weight analysis

Bundle weight analysis measures the gzipped bytes a library adds to your production JavaScript after tree-shaking, not the number reported on its npm page or GitHub README. Bundle size is the most commonly benchmarked metric and the most commonly botched. Bundlephobia gives you a pre-install estimate, but your actual production cost depends on tree-shaking, your bundler configuration, and which entry points you import.

We measure three things for each library:

  1. Bundlephobia baseline: the published package size as a sanity check
  2. Production build size: vite build with the library imported into an identical test app, measured via source-map-explorer against the generated source maps
  3. Tree-shaking effectiveness: import a single function, then measure whether unused code gets eliminated

The test app matters. We use a Vite 6.2 project with React 19.1 and TypeScript 5.7. Each library gets an identical 5-step tour targeting the same DOM elements. Same Tailwind config, same tsconfig, same Vite plugins.

// benchmark/measure-bundle.ts
import { execSync } from "node:child_process";

const libraries = ["@tourkit/core", "react-joyride", "shepherd.js", "driver.js"];

for (const lib of libraries) {
  // Install only the candidate library; the test app selects its tour
  // implementation from the VITE_TOUR_LIB env var at build time.
  execSync(`npm install ${lib} --no-save`);
  execSync(`rm -rf node_modules/.vite dist`);
  execSync(`vite build --mode production`, {
    env: { ...process.env, VITE_TOUR_LIB: lib },
  });

  // Analyze with source-map-explorer
  const output = execSync(
    `npx source-map-explorer dist/assets/*.js --json`
  ).toString();

  // Sum the bytes attributed to files under the library's package path,
  // across every emitted bundle. Filtering by bundle name would miss
  // library code inlined into the app chunk.
  const data = JSON.parse(output) as {
    results: { files: Record<string, { size: number }> }[];
  };
  const libBytes = data.results
    .flatMap((r) => Object.entries(r.files))
    .filter(([path]) => path.includes(`node_modules/${lib}`))
    .reduce((sum, [, file]) => sum + file.size, 0);

  console.log(`${lib}: ${(libBytes / 1024).toFixed(1)}KB (raw, pre-gzip)`);
}

Cross-check everything against bundlephobia. If your measured size diverges by more than 20% from bundlephobia's estimate, something is wrong with your build config.
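That 20% sanity check is easy to automate once both numbers are in hand. A sketch (the helper name and default threshold are ours, not part of any tool):

```typescript
// Flags a measured bundle size that diverges from bundlephobia's
// published estimate by more than a given fraction (default 20%).
function divergesFromEstimate(
  measuredKB: number,
  estimateKB: number,
  threshold = 0.2
): boolean {
  return Math.abs(measuredKB - estimateKB) / estimateKB > threshold;
}

// 14.6KB measured against a 12KB estimate is ~22% off, so flag it
// and go inspect the build config before publishing the number.
console.log(divergesFromEstimate(14.6, 12));
```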

Axis 2: runtime performance profiling

Runtime performance profiling measures how a React library affects initialization speed, interaction responsiveness (INP), memory allocation, and re-render cost under controlled conditions with CPU throttling that simulates real-world devices. Statistical rigor matters most here because a single measurement is noise, not data.

We use Tachometer for automated benchmarks because it runs iterations until reaching statistical significance and launches fresh browser profiles between runs. No JIT warmup bleeding between measurements. No cached state skewing results.

For manual profiling, Chrome DevTools Performance tab with 4x CPU throttling simulates a mid-range Android device (roughly Moto G4 level). React 19.2+ adds Performance Tracks that surface React-specific events directly in the DevTools timeline. Component renders, state updates, and Suspense boundaries appear as labeled spans instead of anonymous function calls.

What we measure:

  • Initialization time: mount the tour provider, measure time to first interactive step
  • Step transition cost: navigate between steps, measure INP (Interaction to Next Paint)
  • Memory allocation: heap snapshots before and after a complete tour run
  • Re-render impact: React Profiler API measuring render counts and durations during tour navigation

Google evaluates Core Web Vitals at the 75th percentile of real visitor data (web.dev). We apply the same standard: p75, not averages, across at least 30 iterations.
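Computing p75 from a batch of iterations is a one-liner once the samples are sorted. A sketch using the nearest-rank method (an assumption on our part; other percentile definitions interpolate between samples):

```typescript
// 75th percentile via the nearest-rank method: sort ascending and
// take the value at ceil(0.75 * n) - 1.
function p75(samples: number[]): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil(0.75 * sorted.length) - 1;
  return sorted[rank];
}

// 8 INP-style samples (ms); p75 is the 6th smallest value.
console.log(p75([180, 120, 95, 210, 150, 140, 100, 160])); // → 160
```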

As of 2026, INP (Interaction to Next Paint) has replaced FID as the responsiveness metric. The threshold is 200ms for "good." A tour library that pushes your app above 200ms INP is adding real user-facing latency, regardless of how small its bundle is.

Axis 3: accessibility auditing

Accessibility auditing for React libraries combines automated axe-core scanning (which catches 30-50% of WCAG violations) with manual keyboard navigation testing, screen reader verification, and focus management validation to produce a complete compliance picture. Lighthouse gives you a number. That number is incomplete. As Chrome for Developers states: "While Lighthouse can help you improve the accessibility of your website, you can't rely on it to find all potential accessibility issues, which requires manual testing" (source).

Our accessibility benchmark has two layers:

Automated (axe-core): We mount each library's tour component in a Playwright test, trigger axe-core, and count violations by severity (critical, serious, moderate, minor). The test runs against an identical page with the same DOM structure.

// benchmark/a11y-audit.ts
import { test, expect } from "@playwright/test";
import AxeBuilder from "@axe-core/playwright";

test("tour accessibility audit", async ({ page }) => {
  await page.goto("/benchmark/tour-active");
  await page.waitForSelector("[data-tour-step]");

  // Scope the scan to the tour UI; chained include() calls each add
  // a selector to the analysis context.
  const results = await new AxeBuilder({ page })
    .include("[data-tour-overlay]")
    .include("[data-tour-step]")
    .include("[role='dialog']")
    .analyze();

  const bySeverity = (impact: string) =>
    results.violations.filter((v) => v.impact === impact);

  console.log(`Critical: ${bySeverity("critical").length}`);
  console.log(`Serious: ${bySeverity("serious").length}`);

  expect(bySeverity("critical")).toHaveLength(0);
});

Manual audit: Automated tools catch roughly 30-50% of real accessibility issues. We also test keyboard navigation (Tab/Escape/Arrow keys), screen reader announcements (VoiceOver on macOS, NVDA on Windows), and focus trap behavior. These results go into the comparison table as pass/fail per library.

Axis 4: developer experience metrics

Developer experience benchmarking quantifies how quickly a developer can go from npm install to a working implementation by measuring time-to-first-component, lines of code for equivalent functionality, TypeScript autocompletion coverage, and total API surface area. "Nice API" means nothing without numbers. So we measure concrete proxies:

  • Time-to-first-tour: from npm install to a working 3-step tour, measured with a stopwatch on a fresh project. We do this three times per library and report the median
  • Lines of code: for an identical 5-step tour with tooltip, highlight, and navigation buttons
  • TypeScript coverage: does the library export types? Does autocomplete work for step configuration? Are event callbacks typed?
  • API surface area: total exports count. Fewer exports usually means a more focused API
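Export-surface counting can be done by reflecting on the module's namespace object. A sketch using Node's built-in `node:path` as a stand-in for a library under test (the approach only sees runtime exports; type-only exports such as interfaces would need a pass over the library's `.d.ts` files):

```typescript
// Counts the named runtime exports of a module. Type-only exports
// (interfaces, type aliases) are invisible at runtime and are not
// counted here.
async function exportSurface(specifier: string): Promise<number> {
  const mod = await import(specifier);
  return Object.keys(mod).length;
}

// node:path stands in for a library under evaluation.
exportSurface("node:path").then((n) =>
  console.log(`node:path exports: ${n}`)
);
```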

Stephane Goetz at Swissquote Engineering discovered a subtle trap here: one library appeared fastest because it skipped number formatting that other libraries performed. "We tend to have a confirmation bias toward finding the test that puts your library in the best light" (source). When measuring DX, verify you're comparing equivalent functionality, not a simpler API that just does less.

Axis 5: maintenance health signals

Maintenance health signals track whether a library's development is active and sustainable by monitoring commit frequency, open issue response time, React version compatibility, breaking change rate, and transitive dependency health. A library can ace bundle size and runtime performance today and be abandoned tomorrow. We pull maintenance data from the GitHub API and npm registry:

  • Last commit date: anything older than 90 days gets flagged
  • Open issue median age: how fast does the maintainer respond?
  • React version support: does it work with React 19? Does it use deprecated APIs like findDOMNode or componentWillMount?
  • Breaking changes per year: frequent major versions signal an unstable API
  • Dependency health: are its transitive dependencies maintained?

This data changes constantly. We re-pull it for every published comparison and date-stamp the results. A maintenance assessment from six months ago is already stale.
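The 90-day staleness flag reduces to date arithmetic once you have the repo's last-commit timestamp from the GitHub API. A sketch of the check itself (fetching is omitted; pass in the commit's ISO-8601 timestamp):

```typescript
// Flags a repository whose most recent commit is older than a cutoff
// (90 days by default), given that commit's ISO-8601 timestamp.
function isStale(lastCommitISO: string, now: Date, maxAgeDays = 90): boolean {
  const ageMs = now.getTime() - new Date(lastCommitISO).getTime();
  return ageMs / (1000 * 60 * 60 * 24) > maxAgeDays;
}

// A commit from early January is stale by mid-June of the same year.
console.log(isStale("2026-01-05T12:00:00Z", new Date("2026-06-15T00:00:00Z")));
```

Passing `now` explicitly (rather than calling `new Date()` inside) keeps the check deterministic, which matters when the date-stamped results need to be reproducible.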

Setting up a reproducible test environment

A reproducible benchmark test environment is a single controlled project where the only variable that changes between test runs is the library under evaluation, while the page layout, DOM structure, bundler config, and browser profile remain identical. Reproducibility is what separates a benchmark from an opinion. Our test project is a Vite app with one route per library. Every route renders the same page layout, same DOM elements, same Tailwind styles.

The environment spec:

  • Runtime: Vite 6.2, React 19.1, TypeScript 5.7
  • OS: Ubuntu 24.04 (CI) / macOS 15 (local verification)
  • Browser: Chrome 131 (headless for CI, headed for profiling)
  • CPU throttling: 4x slowdown via Chrome DevTools Protocol
  • Iterations: minimum 30, or until Tachometer reports statistical significance
  • Isolation: fresh browser profile per iteration, node_modules cached but .vite cache cleared

We run benchmarks in GitHub Actions for CI consistency and verify locally on a MacBook Pro M3. If CI and local results diverge by more than 15%, we investigate before publishing.

Common mistakes we've learned to avoid

Benchmarking development builds. Don't. React's development mode adds timing instrumentation, StrictMode double-renders, and console warnings that don't exist in production. Always benchmark NODE_ENV=production builds.

Measuring the wrong thing. One library looked 40% slower until we realized our test page had a layout shift triggering extra re-renders unrelated to the tour library. Isolate the variable you're testing.

Ignoring tree-shaking. A library reporting 8KB on bundlephobia might actually contribute 15KB to your production bundle if it doesn't tree-shake well with your specific bundler configuration. Measure your actual build output, not the published package size.

Skipping accessibility. This one bit us early. A library can score perfectly on bundle and runtime axes while shipping inaccessible overlays that trap focus incorrectly. Tour Kit's core is under 8KB gzipped, but that size advantage would be meaningless without proper ARIA and focus management.

Tools we use (and why)

| Category | Tool | Why this one |
| --- | --- | --- |
| Bundle analysis | source-map-explorer | Per-file byte attribution via source maps. More accurate than webpack-bundle-analyzer for measuring a single dependency's contribution |
| Pre-install sizing | Bundlephobia | Quick cross-check. Not authoritative for tree-shaken builds |
| Runtime benchmarks | Tachometer | Auto-determines sample size, interleaves runs, reports confidence intervals. Built by Google's Chrome team |
| Profiling | Chrome DevTools + React Profiler | React Performance Tracks (19.2+) show component-level timing in the Performance panel |
| Accessibility | axe-core via Playwright | Same engine that powers Lighthouse a11y scoring, but run in real browser DOM for accurate results |
| CWV measurement | web-vitals library + CrUX | Lab data (web-vitals) for development, field data (CrUX) for production validation |
| CI integration | GitHub Actions | Consistent environment. Ubuntu runners eliminate local machine variance |

Some tools we evaluated and skipped: Benchmark.js works well for micro-benchmarks (operations/second) but doesn't handle browser-level measurements like INP or CLS. Hyperfine is excellent for CLI tools but irrelevant for React component benchmarks. Million Lint auto-detects React performance issues but doesn't provide the controlled comparison framework we need.

Applying this to product tour libraries

Nobody benchmarks product tour libraries rigorously. The space is small enough that most teams pick a library based on GitHub stars and a quick README scan. But tour libraries affect your app in ways that generic UI components don't:

  • They render overlays that cover the entire viewport, affecting CLS and INP
  • They manage focus traps that can break keyboard navigation for the whole page
  • They use scroll manipulation to bring target elements into view, which triggers layout recalculations
  • They often observe DOM mutations via ResizeObserver and MutationObserver, adding persistent runtime cost

Our 2026 benchmark applies this 5-axis framework to five libraries: React Joyride, Shepherd.js, Tour Kit, Driver.js, and Intro.js. The methodology article you're reading now is the foundation for that comparison, and for every comparison we publish going forward.

FAQ

How many iterations should a JavaScript benchmark run?

A JavaScript benchmark should run enough iterations for the statistical test to reach significance. Tachometer auto-determines this, running until the 95% confidence interval is narrow enough to draw a conclusion. As a floor, 30 iterations gives you enough data for a meaningful t-test. Complex benchmarks with high variance may need 100+.

What tools measure React component bundle size accurately?

Source-map-explorer on your actual production build is the most accurate tool for measuring a library's real bundle contribution after tree-shaking. Bundlephobia gives a quick pre-install estimate but doesn't account for your bundler's tree-shaking behavior. Webpack Bundle Analyzer and Rollup Plugin Visualizer provide similar treemap views at build time.

Does Lighthouse catch all accessibility issues?

Lighthouse catches roughly 30-50% of accessibility issues through its axe-core integration. Automated testing misses interaction patterns like keyboard trap behavior, screen reader announcement quality, and focus management during dynamic content changes. Tour Kit targets WCAG 2.1 AA compliance, which requires both automated scanning and manual testing with assistive technologies.

What replaced FID in Core Web Vitals?

INP (Interaction to Next Paint) replaced FID as the Core Web Vitals responsiveness metric in March 2024. Unlike FID, which only measured the first interaction's delay, INP tracks all interactions throughout the page lifecycle and reports the worst at the 75th percentile. The "good" threshold is under 200ms.

How often should library benchmarks be updated?

Library benchmarks should be updated whenever a compared library ships a major version, or at minimum every 90 days. Stale benchmarks mislead readers and lose search visibility. We re-run ours monthly and date-stamp every data point.



Ready to try Tour Kit?

$ pnpm add @tour-kit/react