Tokens and PR velocity won’t tell you if your AI investment is paying off

CTO Craft Con

Your AI metrics might be measuring the wrong thing. Dr. Catherine Hicks explains how to fix the theory, not the dashboards.

At CTO Craft Con, Dr. Catherine Hicks (founder of Catharsis Consulting and researcher on open-science work involving 15,000+ developers) argued that most engineering measurement failures aren’t really data problems.

No matter how many dashboards you vibe-code (we all do that nowadays, don’t we?), the problem is theoretical: it lies in understanding what causal model your metrics are actually testing.

Are you building spike lines?

Hicks opened with a historical detour that turned out to be the sharpest part of her argument. It may sound a bit brutal from today’s point of view, but early electricity measurement looked straightforward: count the deaths, count the lumens, measure the hours of labor saved.

Why the number of deaths? Those numbers made sense relative to what came before. Gas infrastructure occasionally caused sidewalks to explode. Zero to ten electrocution deaths in a neighborhood looked different once you remembered that twenty people you knew had died from the sidewalk going up.

Measurement always encodes a comparison, and if you do not make that comparison explicit, your metric will mislead you.

The more cutting example was the spike line. During early U.S. rural electrification, large power companies would build what was called a spike line: a single wire run through a region, sometimes connected to nothing more useful than a light in a shed, in order to legally claim the territory and block rural co-ops from building real infrastructure. A federal government that measured electrification by the presence of a line was not measuring access. It was counting the very thing that blocked it.

Hicks argued that engineering organizations are building spike lines right now, and mostly do not know it.

Measurement failures usually come back to theory failures.

The loudest thing (on Slack) might not be the most important

A change theory, in her framing, is a causal model: if I give engineers more ramp time on this language, they will engage differently with these processes, and that will produce a business outcome. Everyone operates inside such models, but most teams never make them explicit.

The result is that they default to measuring what is visible, what is fast to collect, or what is loudest on Slack. None of those is necessarily connected to the outcome they care about.

On AI specifically, she identified three recurring failure modes:

Treating developers as interchangeable units producing identical work, when they’re actually using AI in highly individual contexts.
Reading early spikes in code volume as a stable signal, when research on open-source repositories shows those spikes often normalize into different patterns within weeks.
Measuring only at the individual level while ignoring the interaction between the tool, the team, the task, the project goals, and organizational culture – a system Hicks said has too many interaction terms for any person to hold at once.
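The second failure mode, treating an early spike as a stable signal, can be checked with very little code. What follows is a minimal sketch, not a method Hicks presented: the function name `spike_persists`, the four-week early window, the 20% tolerance, and the weekly counts are all assumptions made up for illustration.

```python
from statistics import mean

def spike_persists(weekly_counts, adoption_week, early_weeks=4, tolerance=0.2):
    """Return True if the lift seen right after adoption is still
    present (within tolerance) in the weeks after the early window."""
    baseline = weekly_counts[:adoption_week]
    early = weekly_counts[adoption_week:adoption_week + early_weeks]
    later = weekly_counts[adoption_week + early_weeks:]
    if not baseline or not early or not later:
        raise ValueError("need data before, during, and after the early window")

    early_lift = mean(early) - mean(baseline)
    later_lift = mean(later) - mean(baseline)
    if early_lift <= 0:
        return False  # there was no early spike to begin with

    # The spike persists only if most of the early lift is still there later.
    return later_lift >= (1 - tolerance) * early_lift

# Made-up weekly code-volume counts; the AI tool is adopted at week index 4.
counts = [100, 110, 90, 100, 180, 190, 170, 180, 115, 105, 110, 100]
print(spike_persists(counts, adoption_week=4))
```

With these invented numbers, the early weeks look like an 80% jump over baseline, but the later weeks settle back close to it, so the check reports that the spike did not persist. A real analysis would need more than a comparison of means, but even this crude version forces the comparison to be stated explicitly.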

Her response to that complexity was deliberate: stop trying to measure everything.

Your job is not to measure everything that’s happening. What you really need to do is think about what in your organization are the key signals to measure if they give you two things.

Those two things are levers (predictors of large behavioral patterns, or blockers you can actually remove) and shared outcomes you can hold the organization accountable to over time.

Correlation between learning culture and team effectiveness

Hicks drew on her own research to illustrate what a well-theorized lever looks like.

In a 2024 study of over 3,000 working developers, her team tested whether learning culture could reduce the identity threat that developers report feeling from AI. Teams with strong learning cultures, defined as believing they have organizational support to learn, and rejecting fixed-mindset assumptions about what makes someone technically capable, cut their AI-related identity threat by 50% or more. She has since replicated the correlation between learning culture and team effectiveness across multiple engineering organizations, and found that reaching the highest effectiveness tier requires clearing a threshold of learning culture, not just improving incrementally.

The practical upshot for leaders: if you want to measure whether your AI investment is working, you probably should not start with token consumption or PR velocity. You should ask whether developers believe their organization treats learning as legitimate work.

What if your metrics are broken?

On the question of what to actually do when you realize your metrics are broken, Hicks was direct. Describe how you got there. Find the argument you lost, or the organizational story that made a bad measure feel safe to commit to. Then build a change theory first, and derive the measures from that, not the other way around.
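One way to make "theory first, measures second" concrete is to write the change theory down as data and then ask which dashboard metrics it actually claims. This is a hypothetical sketch under stated assumptions: the `ChangeTheory` fields, the `orphaned_metrics` helper, and every metric name below are invented for illustration and are not Hicks’s instruments.

```python
from dataclasses import dataclass, field

@dataclass
class ChangeTheory:
    lever: str      # what the organization changes
    mechanism: str  # how behavior is expected to shift as a result
    outcome: str    # the shared outcome the change should produce
    measures: list[str] = field(default_factory=list)  # derived from the theory

def orphaned_metrics(dashboard_metrics, theories):
    """Return dashboard metrics that no change theory claims."""
    claimed = {m for theory in theories for m in theory.measures}
    return [m for m in dashboard_metrics if m not in claimed]

# Hypothetical theory, loosely modeled on the learning-culture example.
theory = ChangeTheory(
    lever="protected learning time",
    mechanism="developers believe learning is treated as legitimate work",
    outcome="team effectiveness",
    measures=["learning_culture_survey", "team_effectiveness_survey"],
)

# Hypothetical dashboard contents.
dashboard = ["token_consumption", "pr_velocity", "learning_culture_survey"]
print(orphaned_metrics(dashboard, [theory]))
```

Any metric this returns is one nobody can connect to a stated causal model, which makes it a candidate spike line worth questioning before it drives a decision.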

She closed with a framing that captured her broader position: you are not trying to map every variable in a complex system. You are trying to create enough shared understanding that people will tell you what is actually happening.

You will become the person who can hear what is really happening. You will be the person who will start to hear: I think we’re measuring spike lines over here.

That, she argued, is more useful than any dashboard.