AI generates larger pull requests. Larger pull requests bring more bugs.

When companies start tracking engineer token consumption on internal leaderboards, something has gone wrong in the measurement chain.
Stephen Poletto, Field CTO at Span, used his CTO Craft Con talk in Toronto to argue that the AI tooling wave has arrived with a familiar problem attached: organizations are reaching for the most legible metric available rather than the most meaningful one.
The result is a rerun of every previous failed attempt to quantify developer productivity, but this time with a pretty substantial compute bill.
Burn, baby, burn
Poletto opened with a data point that frames the problem neatly. Uber and ServiceNow both burned through their entire annual AI token budgets within the first five months of the year. That pace of consumption is being held up in some quarters as a sign of healthy adoption.
Poletto’s position is that it mostly signals a measurement vacuum.
“Just because you’re spending money and using these things doesn’t necessarily mean that you’re producing better outcomes. That’s the issue that I have with tokenmaxxing.”
The leaderboard dynamic at Meta, where engineers were reportedly running expensive jobs purely to rank well on an internal token-consumption meter, illustrates the trap cleanly. Poletto named it directly: Goodhart’s Law.
Coined by British economist Charles Goodhart in 1975, it applies directly to the 2026 problem: when a measure becomes a target, it ceases to be a good measure. Or, in today’s terms: set token usage as a goal and people will optimize for token usage, not for shipping software that works.
This isn’t a new failure mode. Poletto traced the same pattern through lines of code, pull request counts, and story points, each of which generated its own gaming behavior when elevated to headline metric status. Tokenmaxxing, in his framing, is:
“The same old pitfalls of trying to quantify developer productivity all over again.”
The alternative he proposed is a ratio: customer value delivered against the total cost of producing it, headcount, tooling, and token spend included. DORA metrics and PR throughput are not useless, he argued, but they measure the inside of the system, not its output. Treating them as primary goals disconnects engineering effort from the outcomes the business actually cares about.
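As a rough illustration of that ratio, here is a minimal Python sketch. It is not Span's or Poletto's formula: the `PeriodCosts` fields and the `value_per_cost` name are hypothetical, and how "customer value" gets measured (revenue, retained accounts, something else) is left open because the talk, as reported here, does not define it.

```python
from dataclasses import dataclass


@dataclass
class PeriodCosts:
    """Hypothetical cost breakdown for one reporting period, in one currency."""
    headcount: float
    tooling: float
    token_spend: float

    def total(self) -> float:
        # Total cost of producing the period's output.
        return self.headcount + self.tooling + self.token_spend


def value_per_cost(customer_value: float, costs: PeriodCosts) -> float:
    """Customer value delivered per unit of total engineering cost."""
    total = costs.total()
    if total <= 0:
        raise ValueError("total cost must be positive")
    return customer_value / total


# Placeholder numbers, chosen only to show the arithmetic.
baseline = value_per_cost(300.0, PeriodCosts(headcount=100.0, tooling=20.0, token_spend=30.0))
more_tokens = value_per_cost(300.0, PeriodCosts(headcount=100.0, tooling=20.0, token_spend=80.0))
```

With the same value delivered, the second period's ratio is lower than the first: extra token spend only helps this metric if customer value rises with it, which is the opposite of what a token leaderboard rewards.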
What about controlling the PR size?

Span’s own benchmark data, drawn from its customer base, puts some numbers around where teams currently sit. About half to two-thirds of net new code is now AI-generated, up from 10 to 20 percent a year ago. PR throughput is running at roughly 1.7 times pre-AI rates. Neither figure, by itself, says anything about whether those teams are delivering more value to customers.
The quality picture is more nuanced than the headline defect numbers suggest. Span’s analysis found that when controlling for pull request size, AI has a negligible independent effect on defect rates. The actual driver is that AI generates larger pull requests, and larger pull requests correlate with more bugs. That is, in principle, actionable: the thing to focus on is not the AI-generated code itself but PR scope discipline.
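To make "controlling for pull request size" concrete, here is a minimal sketch of one simple way to do it: compare defect rates within size buckets instead of across all PRs at once. This is not Span's method, which the talk as reported does not describe. The bucket thresholds are arbitrary assumptions, and the toy data is invented purely to show the arithmetic; it is not a real result.

```python
from collections import defaultdict


def size_bucket(lines_changed: int) -> str:
    # Arbitrary illustrative thresholds, not taken from the talk.
    if lines_changed < 100:
        return "small"
    if lines_changed < 500:
        return "medium"
    return "large"


def defect_rates(prs, stratify: bool):
    """Defect rate per group.

    prs: iterable of (lines_changed, ai_assisted, had_defect) tuples.
    stratify=False groups by ai_assisted only (the pooled comparison);
    stratify=True groups by (size bucket, ai_assisted).
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [defects, total]
    for lines_changed, ai_assisted, had_defect in prs:
        key = (size_bucket(lines_changed), ai_assisted) if stratify else ai_assisted
        counts[key][0] += int(had_defect)
        counts[key][1] += 1
    return {key: defects / total for key, (defects, total) in counts.items()}


# Invented toy data: the AI-assisted PRs skew large, but within each size
# bucket the defect rate is identical for both groups.
toy_prs = (
    [(50, False, i < 2) for i in range(8)]     # human, small: 2 of 8 defective
    + [(800, False, i < 1) for i in range(2)]  # human, large: 1 of 2 defective
    + [(50, True, i < 1) for i in range(4)]    # AI, small: 1 of 4 defective
    + [(800, True, i < 4) for i in range(8)]   # AI, large: 4 of 8 defective
)

pooled = defect_rates(toy_prs, stratify=False)
by_size = defect_rates(toy_prs, stratify=True)
```

In this toy set the pooled AI-assisted defect rate is higher than the human one, yet the per-bucket rates match exactly. The whole gap comes from the mix of PR sizes, which is the pattern the paragraph above describes.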
Different approaches to reduce human review burden

Code review is absorbing the strain more visibly. Poletto cited 30 percent more rework time on AI-generated code compared to human-generated code, along with more review round trips. Teams that are navigating this well are doing so through process changes rather than raw tooling: pre-review automation gates, semantic routing of review assignments by code ownership and reviewer availability, and environment-level QA that lets agents validate their own output before a pull request opens.
Stripe, Ramp, and WorkOS all came up as examples of teams that have built cloud environments where agents can run tasks more autonomously, with the explicit goal of disqualifying broken work before it reaches a human reviewer. Ramp’s approach to screenshots, attaching before-and-after visuals to PRs so reviewers can see what changed at a glance, is a small example of the same principle: reduce the human review burden by doing more verification earlier.
Fin from Intercom took a different angle, capturing agent-human interaction logs from development sessions and using them to provide personalized coaching to engineers on how to work more effectively with AI tools. Poletto noted they also tracked which agent skills were actively used versus deprecated, applying the same funnel analysis logic that product teams use for user flows.
The thread connecting all of these examples is that the teams seeing compounding returns are treating their development workflow as a system to be instrumented and optimized, not a collection of individual contributors to be nudged toward higher token counts.
Writing code is no longer a bottleneck
Poletto said in closing:
“You should treat your development system as a system that can be optimized. Telemetry, observability, helping understand where those dynamics are, can help you be more confident in where you’re investing.”
The bottleneck, his data suggests, is no longer primarily writing code but deciding what to build and validating that it works once built. That shift puts pressure on skills that most engineering hiring and evaluation frameworks are not set up to reward.


