
How to A/B test product tours (complete guide with metrics)
Most teams measure whether users finish a product tour. That's the wrong metric. A tour someone clicks through just to dismiss it shows 100% completion and zero activation. The real question isn't "did they finish?" but "did they do the thing the tour was supposed to teach them?"
As of April 2026, the median completion rate for a 5-step product tour is 34% (Product Fruits, 2026). But that number means nothing without knowing what happened after. This guide covers how to set up A/B tests that measure actual outcomes, not vanity completion rates.
```bash
npm install @tourkit/core @tourkit/react
```

Tour Kit is our project, and it's what we use in the code examples below. The methodology applies to any tour library or SaaS tool; the principles don't change based on your stack.
What is A/B testing for product tours?
A/B testing for product tours means showing two or more variants of the same onboarding flow to different user segments, then measuring which variant drives the intended behavior. One group sees variant A (the control, your current tour) while the other sees variant B (the experiment, a modified version). You run the test until you reach statistical significance at a 95% confidence level, then ship the winner. Tour experiments carry an extra constraint that landing page tests don't: both variants must maintain accessibility compliance and avoid disrupting the user's primary workflow.
The concept is straightforward. The execution is where teams go wrong.
Why A/B testing product tours matters for activation
Teams that ship product tours without testing them are guessing. According to the 2026 State of Customer Onboarding report, 57% of leaders say onboarding friction directly impacts revenue realization (OnRamp, 2026). A tour that confuses users instead of activating them doesn't just fail silently; it actively pushes new signups toward churn. A/B testing replaces gut feelings with measured outcomes, letting you iterate on the one touchpoint that every new user encounters.
Product Fruits found that removing friction from onboarding flows improved completion by 22%, while issue-based fixes reduced churn by 18%. Those numbers come from companies that tested. Teams that don't test ship the same underperforming tour for months without knowing it's broken.
Why most teams measure the wrong thing
Tour completion rate is the default metric in every onboarding analytics dashboard. Appcues shows it. Pendo shows it. UserGuiding shows it. And it's the wrong primary metric for A/B tests.
Here's why. A tour that auto-advances on a timer will show higher completion than one that waits for user interaction. A tour with a prominent "Skip" button will show lower completion than one that buries the dismiss option. Neither of those signals tells you whether the user learned anything.
Milan, a DAP expert with experience across WalkMe, Pendo, and Appcues, put it directly on the Intercom community forum: "there is no single answer, not even a range of % of completion you should expect" (Intercom Community, 2026). Benchmarks don't exist because context varies too much.
So what should you measure?
Choosing primary and secondary metrics
The primary metric for any product tour A/B test should be the downstream activation event, meaning the action the tour was designed to teach. If your tour walks users through creating their first dashboard, the primary metric is "created first dashboard within 24 hours." Not "finished tour." Completion rate belongs in the secondary column because it measures attention, not learning. A 5-step tour with 80% completion and 12% activation is worse than one with 40% completion and 30% activation.
Secondary metrics provide supporting context:
| Metric | Type | What it tells you | Watch out for |
|---|---|---|---|
| Activation event rate | Primary | Did the tour teach the intended behavior? | Set a time window (24h, 48h, 7d) and stick to it |
| Tour completion rate | Secondary | Did users reach the final step? | High completion + low activation = bad tour |
| Step drop-off rate | Secondary | Where do users abandon? | Some drop-off is healthy (not every user needs every step) |
| Time to activation | Secondary | Does the tour speed up the path? | Faster isn't always better; comprehension matters |
| Support ticket volume | Secondary | Did the tour reduce confusion? | Lag indicator; needs 2-4 weeks of data |
Product Fruits confirmed this framing in their 2026 best practices report: "Tours stop being 'a tour' and become a system: adaptive, segmented, and increasingly personalized" (Product Fruits, 2026). Treating tours as independent products with their own test cycles means holding them to product-level metrics, not completion percentages.
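The primary-metric definition above reduces to a simple predicate: did this user fire the activation event within the window after seeing the tour? A minimal sketch, assuming a flat event log; the `dashboard_created` event name and the event shape are illustrative, not a real analytics schema.

```typescript
// Did the user perform the tour's target action within the metric window?
// The event name and shape below are hypothetical examples.
interface AnalyticsEvent {
  name: string;
  userId: string;
  timestamp: number; // epoch ms
}

function activatedWithin(
  events: AnalyticsEvent[],
  userId: string,
  tourShownAt: number,
  windowMs = 24 * 60 * 60 * 1000, // the 24h window from the table above
): boolean {
  return events.some(
    (e) =>
      e.userId === userId &&
      e.name === 'dashboard_created' &&
      e.timestamp >= tourShownAt &&
      e.timestamp - tourShownAt <= windowMs,
  );
}
```

The key discipline is the fixed window: pick 24h, 48h, or 7d before the test starts, and score every user against the same cutoff.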
Setting up your first product tour A/B test
Running a product tour A/B test requires five phases: establishing a baseline, forming a hypothesis, calculating sample size, implementing with feature flags, and committing to a fixed test duration without peeking at intermediate results. Most failed experiments skip phase one or three, which poisons every conclusion that follows. Here's the full sequence.
1. Establish the baseline
Run your current tour unchanged for at least two weeks. Measure the activation event rate (not completion) for users who saw the tour. This is your control group's expected performance.
2. Form a hypothesis
"Replacing the 7-step linear tour with a 3-step contextual tour will increase first-dashboard creation from 28% to 35% within 48 hours." Be specific about the metric, the expected lift, and the time window.
3. Calculate sample size
You need enough users in each variant to reach 95% confidence. For a B2B SaaS app with 500 daily active users where the baseline activation rate is 28% and you want to detect a 7-percentage-point lift:
| Parameter | Value |
|---|---|
| Baseline conversion | 28% |
| Minimum detectable effect | 7 percentage points (25% relative lift) |
| Confidence level | 95% |
| Statistical power | 80% |
| Required sample per variant | ~690 users |
| Total users needed | ~1,380 |
| Estimated test duration (500 DAU) | ~2 weeks (the sample arrives in ~4 days at 70% eligibility; the rest is buffer for the novelty effect) |
Most A/B testing calculators assume e-commerce traffic levels. A SaaS app with 500 DAU and a 70% eligibility rate sends only 350 users per day into the test, so the ~1,380 users you need arrive in about four days. Keep the test open for at least two full weeks anyway, so the novelty effect has time to wear off. Smaller effects get expensive fast: halving the detectable lift roughly quadruples the required sample, and detecting a 3-point lift instead of 7 takes roughly 3,600 users per variant, about three weeks of traffic at the same rate.
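If you'd rather compute this yourself than trust a calculator, the standard two-proportion formula is a few lines. This is a sketch using the normal approximation with a two-sided test; real calculators differ in their assumptions (pooled vs. unpooled variance, one- vs. two-sided), which is why published numbers vary.

```typescript
// Sample size per variant for detecting a difference between two proportions.
// Normal approximation, two-sided test. Defaults: 95% confidence, 80% power.
function sampleSizePerVariant(
  baseline: number, // control conversion rate, e.g. 0.28
  mde: number,      // minimum detectable effect in absolute points, e.g. 0.07
  zAlpha = 1.96,    // two-sided 95% confidence
  zBeta = 0.84,     // 80% power
): number {
  const p2 = baseline + mde;
  const variance = baseline * (1 - baseline) + p2 * (1 - p2);
  return Math.ceil(((zAlpha + zBeta) ** 2 * variance) / (mde * mde));
}

const perVariant = sampleSizePerVariant(0.28, 0.07); // roughly 690
const eligiblePerDay = 500 * 0.7; // 350 eligible users/day enter the test
const daysToSample = Math.ceil((2 * perVariant) / eligiblePerDay);
console.log({ perVariant, daysToSample });
```

Plug in your own baseline and MDE before committing to a test duration; the quadratic `mde * mde` term in the denominator is why small lifts are so expensive to detect.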
4. Implement with feature flags
Feature flags are the cleanest way to split traffic for tour variants. They keep test logic out of your component tree and make cleanup straightforward when the test ends.
```tsx
// src/components/OnboardingTour.tsx
import { useTour } from '@tourkit/react';
import { useFeatureFlag } from './your-flag-provider';
import { trackEvent } from './your-analytics'; // thin wrapper around your analytics SDK

export function OnboardingTour() {
  const variant = useFeatureFlag('onboarding-tour-experiment');
  // variant: 'control' | 'short-contextual' | undefined

  const controlSteps = [
    { target: '#sidebar-nav', content: 'Start by exploring the sidebar navigation.' },
    { target: '#create-btn', content: 'Click here to create your first dashboard.' },
    { target: '#template-picker', content: 'Pick a template to get started quickly.' },
    { target: '#widget-panel', content: 'Drag widgets from this panel.' },
    { target: '#save-btn', content: 'Save your dashboard when you are done.' },
  ];

  const shortSteps = [
    { target: '#create-btn', content: 'Create your first dashboard in under a minute.' },
    { target: '#template-picker', content: 'Templates handle the layout. Pick one.' },
    { target: '#save-btn', content: 'Hit save. You can always edit later.' },
  ];

  const steps = variant === 'short-contextual' ? shortSteps : controlSteps;

  const tour = useTour({
    tourId: `onboarding-${variant ?? 'control'}`,
    steps,
    onComplete: () => {
      // Fire analytics event with variant for segmentation
      trackEvent('tour_completed', { variant: variant ?? 'control' });
    },
  });

  return <>{tour.render()}</>;
}
```

Both PostHog and GrowthBook support this pattern with their React SDKs. The flag decides which steps array to use. The tour component itself doesn't know it's being tested; it just renders whatever steps it receives.
5. Run, wait, and don't peek
The peeking problem is the most common cause of invalid A/B test results. Checking results daily and stopping the test when it "looks good" inflates your false-positive rate from 5% to as high as 30%.
Set the test duration upfront based on your sample size calculation. Don't check intermediate results. If your tool shows a "significance" badge before the planned end date, ignore it.
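The inflation is easy to demonstrate with a simulation. The sketch below runs many A/A tests, where both variants share an identical 28% activation rate so any "significant" result is a false positive, and compares a single fixed-horizon analysis against peeking at a z-test every day for two weeks. The traffic numbers and the tiny seeded RNG are illustrative.

```typescript
// Monte Carlo illustration of the peeking problem using A/A tests.
// Small seeded LCG so runs are reproducible.
function makeRng(seed: number) {
  let s = seed >>> 0;
  return () => {
    s = (Math.imul(1664525, s) + 1013904223) >>> 0;
    return s / 2 ** 32;
  };
}

// Two-proportion z-score with pooled variance.
function zScore(convA: number, nA: number, convB: number, nB: number): number {
  const pPool = (convA + convB) / (nA + nB);
  const se = Math.sqrt(pPool * (1 - pPool) * (1 / nA + 1 / nB));
  return se === 0 ? 0 : (convA / nA - convB / nB) / se;
}

function simulate(sims: number, seed = 42) {
  const rng = makeRng(seed);
  const usersPerDay = 175; // per variant (350/day total entering the test)
  const days = 14;
  const rate = 0.28; // identical in both variants: any "winner" is a false positive
  let peekingFalsePositives = 0;
  let fixedFalsePositives = 0;
  for (let i = 0; i < sims; i++) {
    let convA = 0;
    let convB = 0;
    let n = 0;
    let peekedSignificant = false;
    for (let d = 0; d < days; d++) {
      for (let u = 0; u < usersPerDay; u++) {
        if (rng() < rate) convA++;
        if (rng() < rate) convB++;
      }
      n += usersPerDay;
      // Daily peek: stop-and-declare if |z| crosses 1.96 on any day.
      if (Math.abs(zScore(convA, n, convB, n)) > 1.96) peekedSignificant = true;
    }
    if (peekedSignificant) peekingFalsePositives++;
    // Fixed horizon: one analysis at the planned end date only.
    if (Math.abs(zScore(convA, n, convB, n)) > 1.96) fixedFalsePositives++;
  }
  return { peeking: peekingFalsePositives / sims, fixed: fixedFalsePositives / sims };
}

console.log(simulate(1000)); // fixed stays near the nominal 0.05; peeking comes out noticeably higher
```

The fixed-horizon analysis holds its promised 5% error rate; the daily-peek strategy, looking at the same data, declares far more false winners.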
Implementation patterns for React SPAs
React single-page applications introduce three A/B testing challenges that server-rendered pages don't face: hydration timing causes variant flickering, route changes can reset tour state, and dead test code accumulates across your component tree. Each one can silently corrupt your experiment data if you don't account for it upfront.
Hydration timing. If your flag provider hasn't loaded when the tour mounts, users see a flash of the wrong variant. Wrap the tour in a loading check:
```tsx
// src/components/SafeTour.tsx
import { useTour, type TourStep } from '@tourkit/react';
import { useFeatureFlagLoading } from './your-flag-provider';

export function SafeTour({ steps }: { steps: TourStep[] }) {
  const flagsReady = useFeatureFlagLoading();

  const tour = useTour({
    tourId: 'onboarding',
    steps,
    enabled: flagsReady, // Don't start until flags resolve
  });

  if (!flagsReady) return null;
  return <>{tour.render()}</>;
}
```

Route-change persistence. Tours that span multiple pages need state that survives React Router navigations. Tour Kit stores progress in localStorage by default, but your flag provider must also maintain the same variant assignment across routes. Sticky bucketing (assigning a user to a variant once and remembering it) is non-negotiable for SPA tour tests.
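Flag services give you sticky bucketing for free, but the mechanics are worth understanding. A minimal sketch: hash the user ID so assignment is deterministic, and cache it in storage so it survives route changes and reloads. The storage interface is injected (pass `window.localStorage` in the browser); the experiment key and variant names are illustrative.

```typescript
// Sticky bucketing without a flag service. Deterministic per user:
// the hash re-derives the same variant even if storage is cleared.
interface KVStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

type Variant = 'control' | 'short-contextual';

// 32-bit FNV-1a string hash.
function hashString(s: string): number {
  let h = 2166136261;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 16777619);
  }
  return h >>> 0;
}

function getVariant(userId: string, storage: KVStore): Variant {
  const key = `exp:onboarding-tour-experiment:${userId}`;
  const cached = storage.getItem(key);
  if (cached === 'control' || cached === 'short-contextual') return cached;
  const variant: Variant = hashString(userId) % 2 === 0 ? 'control' : 'short-contextual';
  storage.setItem(key, variant); // persists across route changes and reloads
  return variant;
}
```

Because the hash is deterministic, the same user ID buckets identically on every device, which is the cross-device consistency that plain random assignment plus localStorage loses.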
Cleanup after tests conclude. One developer on dev.to described the accumulation problem well: "A/B testing is a powerful tool, but if you do not pay enough attention, your code transforms in a spaghetti restaurant" (bgadrian, dev.to). When a test ends, remove the losing variant's code, delete the feature flag, and update the tour to use the winning steps directly. Don't leave dead test branches in your codebase.
Accessibility compliance across both variants
Every A/B test variant shown to real users must independently meet WCAG 2.1 AA compliance, a requirement most guides to product tour experimentation never mention. Running an accessibility-broken variant on production traffic isn't just bad UX; in regulated industries like fintech and healthcare, it's a compliance risk that can trigger audit failures regardless of which variant wins the test.
Before launching any tour experiment, verify these five criteria against each variant independently:
- Focus moves to the tooltip when a step activates (not trapped on the underlying page)
- Users can advance, go back, and dismiss with keyboard alone (Tab, Enter, Escape)
- Step changes are announced to screen readers via ARIA live regions
- All text meets 4.5:1 contrast ratios against tooltip backgrounds
- Animations respect `prefers-reduced-motion` in both variants
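The last criterion is the one most often forgotten when a variant changes animation timing. A small sketch of gating tour animations on the media query; `matchMedia` is injected so the check is testable outside a browser (in the app you'd pass `window.matchMedia`), and the helper names are illustrative.

```typescript
// Zero out tour animation durations for reduced-motion users, in BOTH variants.
type MediaQueryFn = (query: string) => { matches: boolean };

function allowTourAnimations(matchMedia: MediaQueryFn): boolean {
  return !matchMedia('(prefers-reduced-motion: reduce)').matches;
}

// Each variant can pick its own default duration, but the reduced-motion
// override applies regardless of which variant the user was bucketed into.
function tooltipTransitionMs(variantDefault: number, matchMedia: MediaQueryFn): number {
  return allowTourAnimations(matchMedia) ? variantDefault : 0;
}
```

Gating at one choke point like this means a new visual variant can't accidentally ship animations that ignore the preference.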
Tour Kit handles focus management, keyboard navigation, and ARIA announcements at the component level, so they don't change between variants. But if your variant B uses different colors, layout, or animation timing, you need to audit those independently.
A headless architecture makes this easier. Since the tour logic (step sequencing, focus trapping, ARIA attributes) lives in hooks, and the visual layer is your own components, changing the visual variant doesn't risk breaking the accessibility layer. Opinionated libraries that couple logic and UI make this harder because changing the appearance means potentially changing the accessibility behavior.
Tour Kit doesn't have a visual builder, which means you can't hand variant creation to a non-technical team. That's a real limitation. But it also means every variant goes through your component tree, your linter, and your accessibility tests before it reaches users.
Common mistakes that invalidate results
Five failure modes account for the majority of invalid product tour A/B tests. We've seen each one in real codebases, and they all share a common trait: the team trusted their tooling's "significant" badge instead of auditing their experimental design. Here's what to watch for.
Testing too many things at once. Changing the step count, the copy, and the visual style simultaneously makes it impossible to know which change caused the result. Change one variable per test.
Not accounting for new vs. returning users. A user who saw variant A on Monday and variant B on Wednesday pollutes both groups. Use sticky bucketing so that once a user is assigned to a variant, they stay there permanently.
Running tests during anomalous periods. Product launches, holidays, and marketing campaigns all skew onboarding traffic. Run tests during normal traffic patterns only.
Ignoring the novelty effect. A new tour variant will always outperform the old one initially because users pay more attention to something unfamiliar. Run tests for at least two full weeks to let the novelty wear off.
Optimizing for completion when activation is flat. If variant B shows 50% completion versus variant A's 34%, but activation rates are identical, variant B didn't win. It just produced a tour that users clicked through faster. Check the primary metric.
Tools for running product tour A/B tests
You don't need a dedicated tour testing platform. Any feature flag service with sticky bucketing and a statistical engine gives you everything required to run product tour experiments in a React app. Here are the four most common choices for developer-led teams, with pricing current as of April 2026.
| Tool | React SDK | Sticky bucketing | Statistical engine | Free tier |
|---|---|---|---|---|
| PostHog | Yes | Yes | Bayesian | 1M events/month |
| GrowthBook | Yes | Yes | Frequentist + Bayesian | Open source (self-host) |
| LaunchDarkly | Yes | Yes | Frequentist | No (starts at $8.33/seat/month) |
| Statsig | Yes | Yes | Bayesian | Yes (limited) |
PostHog and GrowthBook are the most popular picks in this group. Both integrate with Tour Kit's analytics callbacks (`onStepView`, `onStepComplete`, and `onTourEnd`) so you can pipe tour events directly into your experimentation dashboard without extra instrumentation. For the full integration walkthrough, see our PostHog + Tour Kit guide. If you want to compare tools with built-in A/B testing, our onboarding tools with A/B testing roundup covers seven options.
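The wiring is a thin adapter. A sketch, assuming any analytics client that exposes a `capture(event, properties)` method (posthog-js does); the callback signatures and event names here are illustrative, not Tour Kit's exact API.

```typescript
// Adapter from tour lifecycle callbacks to an analytics client.
// Attaching the variant to every event lets the experimentation
// dashboard segment results without extra joins.
type Capture = (event: string, props: Record<string, unknown>) => void;

function tourAnalytics(capture: Capture, variant: string) {
  return {
    onStepView: (step: number) => capture('tour_step_viewed', { step, variant }),
    onStepComplete: (step: number) => capture('tour_step_completed', { step, variant }),
    onTourEnd: (completed: boolean) => capture('tour_ended', { completed, variant }),
  };
}
```

In the browser you'd construct this once per tour, e.g. `tourAnalytics((e, p) => posthog.capture(e, p), variant)`, and spread the callbacks into the tour config.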
Key takeaways
- Your primary A/B test metric should be the downstream activation event, not tour completion rate. A tour nobody finishes but that drives 40% activation is better than a tour everyone completes that drives nothing.
- Calculate sample size before you start. A 500-DAU SaaS app needs roughly 1,400 users total, and at least two weeks of runtime, to detect a 7-point lift at 95% confidence. Halving the detectable effect roughly quadruples the required sample.
- Use feature flags for variant assignment. They keep test logic separated from your component tree and make cleanup straightforward.
- Audit both variants for WCAG 2.1 AA compliance before launching the test. Accessibility isn't optional for either group.
- Don't peek at results mid-test. Set the duration upfront and wait.
Get started with Tour Kit. Install `@tourkit/core` and `@tourkit/react` from npm, or check the docs for the full API reference.
FAQ
How long should I run a product tour A/B test?
Test duration depends on your daily traffic and the effect size you want to detect. For a SaaS app with 500 daily active users testing a 7-percentage-point lift, plan for at least two full weeks: the sample itself arrives within days, but the novelty effect needs time to wear off. Detecting a 3-point lift at the same traffic takes roughly five times the sample, around three weeks of pure data collection. Never stop a test early because intermediate results look promising.
What's a good completion rate for a product tour?
The median completion rate for a 5-step product tour is 34% (Product Fruits, 2026). But completion alone is misleading: high completion with low activation means the tour isn't working. Use completion as a secondary metric and measure whether users performed the action the tour taught. No universal benchmark exists because context varies too much.
Can I A/B test product tours without a feature flag service?
Yes, but it's harder to maintain. You can randomize with a hash of the user ID and store the assignment in localStorage. The tradeoff: you lose cross-device consistency and automatic significance calculation. PostHog (free tier: 1M events/month) or GrowthBook (open source, self-hosted) provide sticky bucketing and statistical engines out of the box.
Should I A/B test the number of steps or the content?
Test one variable at a time. Changing both step count and copy simultaneously makes it impossible to attribute the result. Start with the highest-impact variable (typically step count or information order) and test content changes in a follow-up experiment.
How do I keep my A/B test accessible for screen reader users?
Both tour variants must meet WCAG 2.1 AA independently. Verify focus moves to each tooltip, keyboard navigation works for advancing and dismissing, ARIA live regions announce step changes, and contrast meets 4.5:1. Tour Kit handles focus and ARIA at the hook level, so visual variant changes don't break accessibility.
JSON-LD Schema
```json
{
  "@context": "https://schema.org",
  "@type": "TechArticle",
  "headline": "How to A/B test product tours (complete guide with metrics)",
  "description": "Learn how to A/B test product tours with the right metrics. Covers experiment setup, sample size calculation, and feature flag integration for React apps.",
  "author": {
    "@type": "Person",
    "name": "Domi",
    "url": "https://usertourkit.com"
  },
  "publisher": {
    "@type": "Organization",
    "name": "Tour Kit",
    "url": "https://usertourkit.com",
    "logo": {
      "@type": "ImageObject",
      "url": "https://usertourkit.com/logo.png"
    }
  },
  "datePublished": "2026-04-09",
  "dateModified": "2026-04-09",
  "image": "https://usertourkit.com/og-images/ab-test-product-tour.png",
  "url": "https://usertourkit.com/blog/ab-test-product-tour",
  "mainEntityOfPage": {
    "@type": "WebPage",
    "@id": "https://usertourkit.com/blog/ab-test-product-tour"
  },
  "keywords": ["ab test product tour", "onboarding ab testing", "product tour experiment", "product tour metrics"],
  "proficiencyLevel": "Intermediate",
  "dependencies": "React 18+, TypeScript 5+",
  "programmingLanguage": {
    "@type": "ComputerLanguage",
    "name": "TypeScript"
  }
}
```

Internal linking suggestions
- Link FROM best-onboarding-tools-ab-testing: add "For methodology on how to run these tests, see our A/B testing guide"
- Link FROM feature-flag-product-tour: the A/B testing section there references experimentation
- Link FROM track-product-tour-completion-posthog-events: analytics setup feeds into A/B test measurement
- Link TO best-onboarding-tools-ab-testing: for readers who want tool recommendations
- Link TO product-tour-antipatterns-kill-activation: complements the "common mistakes" section
Distribution checklist
- Dev.to: full cross-post with canonical URL
- Hashnode: full cross-post with canonical URL
- Reddit r/reactjs: "How we A/B test product tours in our React app" (discussion format, not promotional)
- Reddit r/ProductManagement: the metrics framework angle resonates with PMs
- Hacker News: only if paired with a Show HN or original benchmark data
Related articles

The aha moment framework: mapping tours to activation events
Map product tours to activation events using the aha moment framework. Includes real examples from Slack, Notion, and Canva with code patterns for React.
Onboarding for AI products: teaching users to prompt
Build onboarding flows that teach AI product users to prompt. Covers the 60-second framework, template activation, and guided tour patterns with React code.
How to onboard users to a complex dashboard (2026)
Build dashboard onboarding that cuts cognitive load and drives activation. Role-based tours, progressive disclosure, and empty-state patterns with React code.
Contextual tooltips vs linear tours: when to use each
Data-backed decision framework for contextual tooltips vs linear product tours. Includes completion rate benchmarks, React code examples, and hybrid patterns.