How Google got better at measuring developer productivity
Developer productivity is a touchy subject. Those who are putting in the effort don’t want to be treated like machines, and those who fall short of expectations, whatever the reason, don’t want to be singled out or blamed.
But productivity doesn’t revolve just around output—it’s a baseline for general developer experience and satisfaction.
Yes, you can measure developer work output with exact numbers, counting lines of code or commits, but that won’t tell you anything about the mental effort behind the work, whether a commit actually solved a technical or user problem, or whether the code was well designed and comprehensive.
Measuring the hard-to-measure things, including those that never show up in developer log data, is actually the most important part of understanding developer satisfaction and productivity.
Google’s approach
Google has approached its research on measuring Developer Experience with that in mind, and we can all learn from it. A recently published paper on the topic covers their findings from the Engineering Satisfaction (EngSat) survey, which they have used since 2018 to measure and understand developer productivity.
Having that insight helped them not only track the impact of extreme situations on developers—such as the COVID-19 pandemic—but also reduce technical debt, which a survey showed to be Google’s main productivity hindrance. It helped them pinpoint the problems more easily, and they partnered with internal teams to develop best practices, management plans, and initiatives aimed at addressing technical debt.
How did it all start? They assembled a team, now known as the Engineering Productivity Research team, made up of software engineers and UX researchers from diverse fields such as behavioral economics, psychology, and public health. The team’s main goal is to improve developer productivity through various research methods, including diary studies, surveys, interviews, other qualitative research, and log analysis.
How to measure: Speed, ease, and quality
Ciera Jaspan and Collin Green, who lead the Google team, shared their experiences in the Engineering Enablement podcast. They emphasized that no single metric can capture developer productivity.
So, for the research setup, they’ve decided to separate developer productivity into three main categories: speed, ease, and quality, and develop multiple metrics for each one. For instance, they assess speed using logs and developers’ perceptions of their own speed and validate this through diary studies and interviews.
We don’t rely on just one speed metric but use multiple metrics and validate them against each other.
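To make that idea concrete, here is a minimal sketch of what “multiple metrics per dimension” can look like in practice. The signal names are illustrative assumptions, not Google’s actual metric catalog.

```python
# A minimal sketch: each productivity dimension is backed by several
# independent signals, never a single number. Signal names are illustrative
# assumptions, not Google's actual metrics.
PRODUCTIVITY_SIGNALS = {
    "speed": [
        "log: median build latency (human-initiated builds)",
        "log: code review turnaround time",
        "survey: self-reported velocity",
        "diary study: time lost to interruptions",
    ],
    "ease": [
        "survey: perceived toil and friction",
        "interview: pain points in everyday workflows",
    ],
    "quality": [
        "log: post-release defect rate",
        "survey: confidence in code quality",
    ],
}

# When signals within one dimension disagree, the disagreement itself is a finding.
for dimension, signals in PRODUCTIVITY_SIGNALS.items():
    print(f"{dimension}: {len(signals)} independent signals")
```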
This layering of metrics is necessary because each measurement has many nuances. A great example is build times. The team found that log data included automated background builds, which skewed the measured build latency:
Well, it turned out that the log data included not just builds that engineers kicked off themselves but also a bunch of robotic builds that were happening in the background that the developer wasn’t even paying attention to.
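Filtering those automated builds out before computing latency is conceptually simple. The sketch below assumes each log record carries an initiator field distinguishing builds a developer kicked off from robotic background builds; the field names and values are hypothetical, not Google’s actual log schema.

```python
from statistics import median

# Hypothetical build log records; field names are assumptions, not a real schema.
build_logs = [
    {"initiator": "user",  "duration_s": 420},
    {"initiator": "robot", "duration_s": 35},   # automated background build
    {"initiator": "user",  "duration_s": 610},
    {"initiator": "robot", "duration_s": 28},
]

# Naive estimate: every build counts, so fast robotic builds drag the number down.
naive_median = median(b["duration_s"] for b in build_logs)

# Filtered estimate: keep only builds a developer actually started and waited on.
human_median = median(b["duration_s"] for b in build_logs if b["initiator"] == "user")

print(f"median over all builds: {naive_median}s, human-initiated only: {human_median}s")
```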
Be mindful of…
Human factors inevitably complicate the relationship between objective metrics and subjective experiences. Using build time as an example, Collin noted that reducing build time doesn’t necessarily result in linear productivity gains.
If a build takes longer than a certain threshold, developers switch tasks, meaning that reducing build time from, say, 30 to 20 minutes doesn’t directly translate to a 10-minute productivity gain. The real impact depends on what else the developer did during the build and how long it takes to resume the original task. He concludes that these non-linearities and qualitative shifts, common in psychology, are crucial to understanding developer productivity and reactions to changes in build latency.
In addition, the aforementioned Google paper states two primary problems that are likely to affect any survey program: increasing survey length and decreasing response rate.
To address survey length, Google kept a core set of high-level questions focused on productivity, satisfaction, velocity, quality, and flow, and periodically shortened the survey by removing redundant or obsolete questions. The biggest change affected surveys for smaller developer groups: their more specific questions were dropped in favor of general questions that better support company-wide decisions.
To combat decreasing response rates, they employed two strategies: sampling and transparency. By splitting the developer population into three random groups and surveying each group quarterly, they reduced the frequency with which each engineer is surveyed, ensuring a large enough sample size for statistical analysis. Additionally, they enhanced transparency and continued participation by distributing a widely accessible quarterly summary report detailing what is measured and how feedback is utilized.
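As a rough illustration of the sampling side, the sketch below splits a hypothetical engineer population into three stable random cohorts, each surveyed in a different wave. The function name, seed, and IDs are made up for the example.

```python
import random

def assign_survey_cohorts(engineers, num_cohorts=3, seed=42):
    """Randomly split the engineer population into fixed cohorts.

    Each cohort is surveyed in a rotating wave (e.g. one cohort per quarter),
    so an individual engineer sees the survey less often while each wave's
    sample stays large enough for statistical analysis.
    """
    rng = random.Random(seed)   # fixed seed keeps assignments stable across waves
    shuffled = engineers[:]
    rng.shuffle(shuffled)
    return [shuffled[i::num_cohorts] for i in range(num_cohorts)]

# Hypothetical usage with placeholder engineer IDs.
cohorts = assign_survey_cohorts([f"eng-{i:04d}" for i in range(9000)])
for wave, cohort in enumerate(cohorts, start=1):
    print(f"Cohort surveyed in wave {wave}: {len(cohort)} engineers")
```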
How to get better at measuring dev productivity?
So, what are the key points to take from their experience?
- Use mixed methods. As we already covered, use multiple methods to measure each aspect of developer productivity. No data source is the ultimate truth; it is necessary to use both quantitative and qualitative data to get comprehensive and validated results. Quantitative data includes logs and behavioral data, while qualitative data comes from diary studies, surveys, and interviews.
- Enable collaboration and a multidisciplinary approach. Make sure the people who conduct research are equipped with knowledge of both quantitative and qualitative methods. Also, enable them to consult freely with other teams and internal experts when necessary. Checking in with developers is a must!
- Watch out for discrepancies. Sometimes, even using different metrics to measure the same thing won’t help if developers’ input or the log data was off in the first place. You will, however, notice that something doesn’t add up. Investigate properly when you see discrepancies; there is usually a reason (see the sketch after this list).
- Take reliability and scalability into account. Cross-check methods and their findings to ensure reliability across different data sources, especially for qualitative methods. Lean on quantitative methods for scale, since log data can be collected passively from all developers without additional effort.
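As a concrete illustration of cross-checking data sources, the sketch below correlates a hypothetical log-derived speed metric (review turnaround) with survey self-reports of speed, and flags the pair for investigation if they disagree. The data, metric names, and threshold are illustrative assumptions only.

```python
import statistics

def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length series."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical paired observations per team: a log-derived speed metric
# (median review turnaround in hours) and a survey self-report of perceived
# speed on a 1-5 scale. Values are illustrative only.
log_turnaround_hours = [4.0, 6.5, 3.2, 12.0, 5.1]
perceived_speed      = [4.2, 3.5, 4.6, 2.1, 3.9]

r = pearson_r(log_turnaround_hours, perceived_speed)
print(f"correlation between log metric and self-report: {r:.2f}")

# Faster turnaround should feel faster, so we expect a clearly negative r.
# A weak or positive correlation is a signal to investigate, not to average away:
# either the logs or the survey is measuring something unintended.
if r > -0.3:
    print("discrepancy: log data and developer perception disagree; dig into why")
```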