
How to Measure Real Impact (Not Just Winning Experiments)

"If every team at your company is winning — but revenue isn't growing — your impact measurement is broken."

I've seen this more times than I can count. Product teams run experiments and report lifts. Marketing runs campaigns and reports conversions. Growth runs tests and reports engagement gains. Every team is winning. Every quarterly review is full of green arrows. And yet, when you look at the topline, something isn't adding up.

The problem isn't a lack of effort or rigor at the team level. The problem is that local wins don't automatically translate into global wins. And most companies are measuring the former while assuming the latter.

Why local experiments mislead

Interaction effects and overlapping tactics

Every time a user visits your product, they are simultaneously exposed to dozens of experiments, campaigns, and product changes — all running at the same time across different teams. When you measure an experiment in isolation, you're measuring the effect of that change inside a controlled vacuum. But users don't live in a vacuum.

Experiment A might show a 5% lift in retention. Experiment B might show a 3% lift in conversion. But run them simultaneously on overlapping user populations, and the combined effect on the metric they both feed might be 4% rather than the 8% a naive sum implies, or even negative, because the tactics interact with each other in ways the individual experiments couldn't detect.

Most companies don't measure this. They report each experiment's result independently and sum up the lifts as if they were additive. They're not.
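To make the non-additivity concrete, here is a minimal sketch of a 2x2 factorial readout. The conversion rates are made-up illustrative numbers chosen to mirror the 5%, 3%, and 4% figures above, and the assumption is that both experiments are read on the same metric.

```python
# Hypothetical conversion rates for a 2x2 factorial layout (illustrative only).
# Keys are (arm in experiment A, arm in experiment B).
rates = {
    ("control", "control"): 0.100,  # user saw neither change
    ("A", "control"): 0.105,        # user saw only experiment A's change
    ("control", "B"): 0.103,        # user saw only experiment B's change
    ("A", "B"): 0.104,              # user saw both changes at once
}


def relative_lift(treated: float, baseline: float) -> float:
    """Relative lift of a treated rate over a baseline rate."""
    return (treated - baseline) / baseline


baseline = rates[("control", "control")]
lift_a = relative_lift(rates[("A", "control")], baseline)
lift_b = relative_lift(rates[("control", "B")], baseline)
lift_both = relative_lift(rates[("A", "B")], baseline)

# What gets reported when each team's lift is simply added up.
naive_sum = lift_a + lift_b
# The gap between what actually happened and the naive sum.
interaction = lift_both - naive_sum

print(f"A alone: {lift_a:.1%}, B alone: {lift_b:.1%}")
print(f"Naive sum: {naive_sum:.1%}, measured together: {lift_both:.1%}")
print(f"Interaction effect: {interaction:+.1%}")
```

The only way to see the interaction term is to observe the cell where users received both changes, which is exactly the cell isolated experiment reports never look at.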

Topline revenue vs. profit

A metric going up isn't the same as the business making more money. A promotional campaign might drive a measurable conversion lift while costing more in discounts than it recovers in gross margin. A product experiment might increase engagement while engineering hours and infrastructure costs outpace any revenue upside.

Measuring real impact means looking past the topline and accounting for costs: the operational cost of running a promotion, the engineering cost of implementing a change, and the long-term effect on customer value of short-term behavioral shifts.
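A rough sketch of that accounting for a single promotion follows. The field names and numbers are hypothetical, and the model is deliberately simple: the discount is paid on every order that used the promotion, while only the incremental orders count as new margin.

```python
from dataclasses import dataclass


@dataclass
class PromoResult:
    """Hypothetical inputs for one promotion (all names are illustrative)."""
    incremental_orders: int    # orders the experiment attributes to the promo
    total_promo_orders: int    # every order that used the discount
    avg_order_value: float     # pre-discount order value
    gross_margin_rate: float   # pre-discount gross margin as a fraction
    discount_per_order: float  # average discount given per promo order
    fixed_cost: float          # operational cost of running the promotion


def net_incremental_profit(p: PromoResult) -> float:
    """Margin gained from incremental orders minus discount and fixed costs."""
    incremental_margin = (
        p.incremental_orders * p.avg_order_value * p.gross_margin_rate
    )
    # The discount is paid on all promo orders, including the ones that
    # would have happened anyway.
    discount_cost = p.total_promo_orders * p.discount_per_order
    return incremental_margin - discount_cost - p.fixed_cost


promo = PromoResult(
    incremental_orders=500,
    total_promo_orders=4000,
    avg_order_value=60.0,
    gross_margin_rate=0.4,
    discount_per_order=5.0,
    fixed_cost=2000.0,
)
print(f"Net incremental profit: {net_incremental_profit(promo):,.0f}")
```

With these example inputs the promotion shows a real conversion lift (500 extra orders) and still loses money, because most of the discount went to customers who would have bought anyway.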

Time erodes incrementality

A cohort that converts at a higher rate in the first 30 days of an experiment may behave identically to the control group by day 90. Incrementality isn't a permanent property of a tactic; it fades. When companies extrapolate short-term experiment results forward indefinitely, they can overcount impact several times over.

This is especially common with retention and engagement tactics. A push notification campaign that re-engages lapsed users in week one often shows strong early lift — but the re-engaged users frequently return to their pre-campaign behavior within 60 days. The short-term experiment result was real. The implied annualized impact was fiction. Any honest measurement system needs to track incrementality over time, not just at the experiment's conclusion date.

The practical fix is to run longer holdout windows and to report incrementality at multiple time horizons — 30-day, 60-day, and 90-day — rather than closing out an experiment the moment it reaches statistical significance.
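As a sketch of what multi-horizon reporting could look like, the snippet below computes relative lift at each horizon from cohort counts. The data shape (days mapped to active users and cohort size) and the numbers are assumptions made for illustration.

```python
from typing import Dict, Tuple

# Horizon in days -> (users still active or converted, cohort size).
Counts = Dict[int, Tuple[int, int]]


def lift_by_horizon(treatment: Counts, control: Counts) -> Dict[int, float]:
    """Relative lift of treatment over control at each measurement horizon."""
    lifts = {}
    for days in sorted(treatment):
        t_active, t_total = treatment[days]
        c_active, c_total = control[days]
        t_rate = t_active / t_total
        c_rate = c_active / c_total
        lifts[days] = (t_rate - c_rate) / c_rate
    return lifts


# Hypothetical cohort readouts at 30, 60, and 90 days.
treatment = {30: (1200, 10000), 60: (1100, 10000), 90: (1010, 10000)}
control = {30: (1000, 10000), 60: (1000, 10000), 90: (1000, 10000)}

lifts = lift_by_horizon(treatment, control)
for days, lift in lifts.items():
    print(f"day {days}: {lift:+.1%}")
```

In this made-up example the lift shrinks from roughly 20% at day 30 to roughly 1% at day 90, so annualizing the day-30 number would badly overstate the tactic's value.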

What real impact measurement looks like

Global holdout groups

The most reliable way to measure the collective, incremental impact of your business activities is a global holdout group: a set of users who are systematically excluded from all business tactics and messaging. Compare their behavior to your active users over time, and you have a clean measure of the total incremental impact your teams are driving.

This is different from a standard A/B test control group, which only measures the effect of a single change. A global holdout measures everything — every experiment, every campaign, every product change — simultaneously.
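One simple way to read a global holdout is to compare revenue per user between active users and the holdout, with a rough confidence interval. The sketch below uses a normal approximation and a handful of invented values; a real analysis would use far larger samples and probably a more careful method for skewed revenue data.

```python
import math
from statistics import mean, variance
from typing import List, Tuple


def holdout_lift(
    active: List[float], holdout: List[float], z: float = 1.96
) -> Tuple[float, Tuple[float, float]]:
    """Difference in mean revenue per user, with an approximate 95% interval.

    Uses a normal approximation with unpooled sample variances.
    """
    diff = mean(active) - mean(holdout)
    std_err = math.sqrt(
        variance(active) / len(active) + variance(holdout) / len(holdout)
    )
    return diff, (diff - z * std_err, diff + z * std_err)


# Hypothetical revenue per user over the measurement window.
active_revenue = [10.0, 12.0, 14.0, 16.0]
holdout_revenue = [8.0, 10.0, 12.0, 14.0]

diff, (low, high) = holdout_lift(active_revenue, holdout_revenue)
print(f"Incremental revenue per user: {diff:.2f} (approx. 95% CI {low:.2f} to {high:.2f})")
```

With a sample this small the interval spans zero, which is the point of reporting it: a holdout that is too small cannot distinguish real incrementality from noise.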

Holdout groups need to be designed carefully. They must be representative of your overall customer base, which requires thoughtful stratification. Simple random assignment works for large, homogeneous user bases; for smaller or more segmented populations, techniques like stratified sampling or k-nearest-neighbor matching on key customer attributes produce a more reliable holdout composition.

Critically, holdout group assignment should be managed centrally — not by individual teams. When marketing assigns its own control groups, product assigns its own, and growth assigns its own, you get confounded experiments and inflated results. Central management is the only way to maintain integrity across the measurement system.
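A common way to keep assignment central and consistent is to derive it deterministically from the user ID, so every team's system gets the same answer for the same user. The sketch below is one possible approach, not a prescribed implementation; the salt string and the 5% holdout fraction are hypothetical values.

```python
import hashlib

# Hypothetical settings owned by the central experimentation team.
HOLDOUT_SALT = "global-holdout-v1"
HOLDOUT_FRACTION = 0.05


def in_global_holdout(
    user_id: str,
    salt: str = HOLDOUT_SALT,
    fraction: float = HOLDOUT_FRACTION,
) -> bool:
    """Return True if this user belongs to the global holdout.

    The same user_id and salt always produce the same answer, so any
    service can call this without coordinating state.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return bucket < fraction


users = [f"user-{i}" for i in range(20000)]
share = sum(in_global_holdout(u) for u in users) / len(users)
print(f"Share of users in holdout: {share:.2%}")
```

Every campaign tool and experiment platform would check this one function before targeting a user, instead of drawing its own control group. Note that plain hashing does not stratify; it is a fit for the large, homogeneous case described above.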

Revenue per user as the north star

Most operational experiments — marketing campaigns, product changes, retention tactics — don't directly measure revenue. They measure proxies: acquisition rates, engagement metrics, retention percentages. These are useful signals, but they're not impact.

To understand true impact, you need to connect operational metrics to revenue, accounting for the full customer lifetime. A marketing experiment that acquires low-LTV users at a high conversion rate isn't a win if those users churn before they cover their acquisition cost. A product experiment that improves engagement among a segment that was never going to convert isn't moving the business.

Using revenue-per-user (and ideally, LTV-adjusted revenue per user) as the connecting metric forces teams to think about the downstream consequences of their experiments, not just the immediate measurement.
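Here is a back-of-the-envelope sketch of an LTV-adjusted comparison. It assumes the simplest textbook LTV model (monthly margin divided by monthly churn, no discounting), and the segment numbers are invented for illustration.

```python
def simple_ltv(
    monthly_revenue: float, gross_margin_rate: float, monthly_churn: float
) -> float:
    """Simplified lifetime value: monthly margin divided by monthly churn.

    Assumes constant revenue and churn and ignores discounting.
    """
    return monthly_revenue * gross_margin_rate / monthly_churn


def net_value_per_user(ltv: float, acquisition_cost: float) -> float:
    """Lifetime margin left over after paying to acquire the user."""
    return ltv - acquisition_cost


# Hypothetical segments: one converts cheaply but churns fast,
# the other costs more to acquire but stays.
fast_churn_ltv = simple_ltv(monthly_revenue=20.0, gross_margin_rate=0.5, monthly_churn=0.25)
sticky_ltv = simple_ltv(monthly_revenue=20.0, gross_margin_rate=0.5, monthly_churn=0.05)

fast_churn_net = net_value_per_user(fast_churn_ltv, acquisition_cost=50.0)
sticky_net = net_value_per_user(sticky_ltv, acquisition_cost=80.0)

print(f"Fast-churn segment net value per user: {fast_churn_net:.2f}")
print(f"Sticky segment net value per user: {sticky_net:.2f}")
```

Under these assumptions the cheaper-to-acquire segment is the losing one, which a conversion-rate readout alone would never show.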

Centralized experimentation infrastructure

Individual teams owning their own experimentation creates the conditions for exactly the problems described above: overlapping populations, inconsistent randomization, misattributed results. Centralized experimentation infrastructure — a shared platform for experiment assignment, tracking, and analysis — is not a luxury for mature organizations. It's the foundation for any measurement system that can be trusted.

Centralization doesn't mean experimentation becomes a bottleneck. It means the rules of the road are consistent, the randomization is sound, and the results can be compared and combined across teams without methodological inconsistencies.

The organizational incentive problem

Understanding why impact gets overcounted isn't just an academic exercise. It matters because the incentive structures inside most companies actively produce this outcome.

Analytics and growth teams are typically evaluated on experiment velocity and lift metrics. Run more experiments, show more wins, report more impact. The KPIs that govern team performance are the exact same KPIs that the measurement problems above distort most severely. The result is a system that rewards the appearance of progress over actual progress — and nobody has to be dishonest for that to happen. It emerges naturally from the structure.

This creates a particularly difficult dynamic for the people who want to fix it. The analyst or data scientist who advocates for global holdouts and longer measurement windows is, in the short term, making the team's impact look smaller. The experiments that previously showed 8% lifts now show 3%, once interaction effects are accounted for. The annualized projections that supported last quarter's planning numbers look optimistic when you apply proper incrementality decay. Pointing this out is not a career-accelerating move in most organizations.

The fix has to come from above. Leadership has to decide that accurate measurement matters more than impressive-looking dashboards — and then change the incentives to reflect that. This means evaluating teams on business outcomes, not experiment counts. It means rewarding the discovery that a tactic isn't working as much as the discovery that one is. And it means treating global holdout data as a check on local results, not a threat to them.

None of this is easy. But organizations that make this shift stop chasing green arrows and start moving the business.

The close

Real impact isn't about winning experiments. It's about moving the business.

That distinction matters because the entire incentive structure of most analytics organizations is oriented around the former: teams are rewarded for showing lifts, reporting wins, and hitting experiment velocity targets. The collective result, too often, is a portfolio of locally successful experiments that add up to flat or declining business performance.

The fix is measurement infrastructure that operates at the level of the business, not the team: global holdout groups that capture total incrementality, revenue-based north-star metrics that account for costs and lifetime value, and centralized experimentation that prevents the cross-contamination that makes individual results meaningless.

It's more complex to build and maintain. But it's the only kind of measurement that actually tells you whether the work is working.

The practical starting point for most organizations isn't a fully-built global holdout program on day one. It's a single, well-designed holdout group for one high-priority business function — retention marketing, say, or product-led growth. Run it for a full quarter. Report the results alongside the team's local experiment results. Let the comparison speak for itself.

In most cases, the gap between local results and global incrementality is eye-opening. Teams that thought they were driving 12% revenue lift discover the real number is closer to 4%. That's a hard conversation — but it's the conversation that leads to better prioritization, more honest planning, and a measurement system the business can actually trust. You can't fix what you aren't measuring honestly. And honest measurement, even when the numbers are smaller than expected, is ultimately more valuable than optimistic accounting that quietly misleads the entire organization.
