Summary

  • Activity is not impact: Most enterprises track easy-to-measure metrics (tasks completed, bugs fixed) while ignoring what matters: whether AI actually improves outcomes and human judgment.
  • Perception gap: CTOs report high confidence in AI ROI, but workers closest to implementation tell a different story.
  • A better approach: Benchmark outcomes against professional practice (not activity), distinguish outcome metrics from activity metrics, start with the actual work being transformed, and acknowledge measurement limits.

Imagine a restaurant where a single waiter is running from table to table, checking in, filling waters, placing new baskets of bread.

Now look closer. No one has their food. There’s a flurry of activity, but no one is getting what they came for – a meal. Right now, Enterprise AI initiatives are a bit like that restaurant. Lots of activity, little fulfillment of outcomes.

But being fed isn’t the only benchmark of a restaurant. There’s the quality of the food, the atmosphere, the satisfaction, the service. Being fed is simply the easiest thing to measure.

When it comes to benchmarking the success of Enterprise AI, it’s important to remember that activity does not equal impact.

If Enterprise AI is about developing a strategic, professional practice of infusing AI into your enterprise operations, that means it’s full of complexity. And like all the parts of a good restaurant experience, complexity is hard to measure.

Let’s talk about AI performance benchmarks

There’s no shortage of metrics touting the success of new AI initiatives and pilots across industries (and just as many about their failures). But on closer inspection, it’s easy to wonder whether these metrics were chosen simply because they’re easy to capture.

When so much time, effort, and money are being poured into these initiatives, everyone has a vested interest in showing positive results. Hence, the focus on easily measured tasks – things like bug fixes, ticket automations, and data parsing.

It’s simpler to measure the success of bolt-on tech solutions (which is how many enterprises are treating AI right now) by the tasks completed and activity generated. Dashboards look good. Anecdotal reports are positive.

Lots of busy waiters with hungry patrons.

According to a report by Economist Enterprise, CTOs have far greater confidence in AI returns than those closest to the work where AI is being integrated. That means the people monitoring tasks and activities are touting results. The people actually doing that work have a much different story.

So what should be benchmarked?

How should the impact of AI be measured in the enterprise?

Benchmarks are meant to create accountability. They’re not simply the recording of completed tasks, but a marker of how the company’s defined goals are affected. That means AI pilots and initiatives should ladder up in support of established company goals, not become goals themselves.

For most knowledge work, benchmarking AI in this way requires isolating AI's contribution from everything else that affects business outcomes. It means having a clear plan for AI implementation and investing in its foundation. And it means making investments in the people who are expected to work with it.

All of this is hard: the judgment calls necessary for driving pilots, the expertise that needs to be developed among staff, the workers who need to be upskilled, the decisions nobody tracked but that eventually created clarity. Much of it is more closely tied to culture than technology. And most of it resists being turned into a clean, simple metric.

The implication here can be uncomfortable. Enterprises have built AI performance frameworks around what their systems can demonstrate, not what their organizations and employees may actually need to demonstrate success. The scorecard gets filled with the time saved while the ability of core team members to exercise better judgment goes unexamined.

None of this is a straight line. It’s more of a meandering river. Benchmarks should reflect that.

What Leaders Are Getting Right — And Where Metrics Miss

The organizations pulling ahead with Enterprise AI implementation are doing work around data infrastructure, cross-functional ownership, and refining processes. These investments are tangible and measurable, and they have significant impact. But there are also limits to what some of these measurements can capture.

For instance, many leaders say they’re redesigning workflows around AI. Org charts change. Titles are awarded. Promotions given. These are discrete, tangible, and easy to measure, making them easy to chalk up as successes.

But what about retraining? How are employees expected to work with AI to redesign their work? What new skills will they need? And how will their development be measured?

Whether people are developing better judgment alongside AI, and whether human capability is actually growing, produces no visible, measurable artifacts. It accrues slowly, shows up indirectly, and almost never appears on a scorecard.

Let’s say a company starts using AI to help with sales pipeline management. They report that their AI lead-scoring pilot recommended 300 prospects this month, so the sales team is now spending 30% more time on outreach. On the surface, it seems great!

But a closer look reveals that while the sales team is reaching out to more people thanks to AI, the quality of their conversations hasn't improved. The metrics that really matter – deal sizes and close rates – are unchanged.

The thing that’s harder to measure, but is arguably more important, is whether these salespeople are learning to recognize signals of genuine fit so they can move deals through the pipeline faster. Real impact would be: "Our reps now consistently identify customer fit 2-3 calls earlier because they're trained to recognize what the AI is surfacing – and close rates on qualified deals are up 15%."
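To make the distinction concrete, here is a minimal sketch in Python that puts an activity metric next to two outcome metrics for the sales example above. The figures, the PeriodStats structure, and the function names are all hypothetical, invented for illustration rather than taken from any real pilot.

```python
from dataclasses import dataclass


@dataclass
class PeriodStats:
    """Hypothetical monthly figures for a sales team."""
    outreach_calls: int   # activity: how many prospects were contacted
    deals_closed: int     # outcome: how many deals were won
    revenue: float        # outcome: total value of the won deals


def close_rate(p: PeriodStats) -> float:
    """Share of outreach calls that ended in a closed deal."""
    return p.deals_closed / p.outreach_calls if p.outreach_calls else 0.0


def avg_deal_size(p: PeriodStats) -> float:
    """Average revenue per closed deal."""
    return p.revenue / p.deals_closed if p.deals_closed else 0.0


def pct_change(before: float, after: float) -> float:
    """Percentage change from before to after (NaN if there is no baseline)."""
    return (after - before) * 100 / before if before else float("nan")


def compare(before: PeriodStats, after: PeriodStats) -> dict:
    """Report the activity metric and the outcome metrics side by side."""
    return {
        "outreach_change_pct": pct_change(before.outreach_calls, after.outreach_calls),
        "close_rate_change_pct": pct_change(close_rate(before), close_rate(after)),
        "avg_deal_size_change_pct": pct_change(avg_deal_size(before), avg_deal_size(after)),
    }


# Made-up numbers: outreach rises 30% after the AI pilot starts.
before_pilot = PeriodStats(outreach_calls=200, deals_closed=20, revenue=400_000.0)
after_pilot = PeriodStats(outreach_calls=260, deals_closed=26, revenue=520_000.0)

report = compare(before_pilot, after_pilot)
for name, value in report.items():
    print(f"{name}: {value:.1f}")
```

With these made-up numbers, outreach rises by 30% while close rate and average deal size do not move. When a report looks like that, the dashboard is showing activity, not impact.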

It's not that leaders don’t think these things are important. They do. But measuring them requires confronting the idea that much of what matters about AI deployment cannot be cleanly observed at all.

And this complexity is not all about culture. Other issues abound. AI governance, for instance, often lags behind AI development initiatives. It becomes critical to create frameworks and benchmarks that ensure responsible scalability and ethical actions while pursuing rapid innovation. Yet, too often, this work comes much later, if at all.

What Rigorous Practice Actually Looks Like

Accepting the idea that the most important things resist measurement doesn't mean abandoning the attempt to benchmark them.

It means practicing a harder discipline than most benchmarking guides describe. Here are some suggestions:

Start with the work being transformed, not the metric. And be honest about what the work actually involves. Map the professional judgment in a task before deciding what to measure. Some of what you map will be measurable. Some won't. Name both.

Distinguish activity metrics from outcome metrics. Hold outcome metrics to a higher standard of validity. Adoption rates tell you something is being used. They say nothing about whether using it is making the work better. Be suspicious of any framework where the easily measured metrics are doing all the evidential work.

Build baselines against professional practice, not against doing nothing. The question isn't "is AI faster than no AI?" It's "is AI-assisted judgment better than unassisted judgment?" Answering that question honestly requires taking seriously the difficulty of the judgment, not just timing how long it takes.
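One way to picture that last suggestion is a small Python sketch that scores the same kind of task twice: once done unassisted, once done with AI. It assumes an expert reviewer has already marked each judgment as sound or not, which is itself the hard part. The case data and the summarize function are hypothetical and exist only to show the shape of the comparison.

```python
from statistics import mean


def summarize(cases):
    """Summarize reviewed cases.

    Each case is a pair: (judgment_was_sound, minutes_taken).
    The soundness flag is assumed to come from expert review.
    """
    return {
        "accuracy": mean(1.0 if sound else 0.0 for sound, _ in cases),
        "avg_minutes": mean(minutes for _, minutes in cases),
    }


# Made-up review results for four comparable tasks in each condition.
unassisted_cases = [(True, 30.0), (True, 28.0), (False, 35.0), (True, 32.0)]
assisted_cases = [(True, 12.0), (False, 10.0), (False, 11.0), (True, 15.0)]

unassisted_summary = summarize(unassisted_cases)
assisted_summary = summarize(assisted_cases)

print("unassisted:", unassisted_summary)
print("assisted:  ", assisted_summary)

# Faster is only a win if judgment quality holds up against the baseline.
faster = assisted_summary["avg_minutes"] < unassisted_summary["avg_minutes"]
better = assisted_summary["accuracy"] >= unassisted_summary["accuracy"]
print("faster:", faster, "| judgment at least as good:", better)
```

In this invented data the AI-assisted work is much faster but less often sound. A time-saved metric alone would score that as a success.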

Sources and references