AI Hasn’t Made Developers Faster, It’s Made Their Review Queues Longer!

ShiftMag — Thu, 02 Apr 2026 09:53:30 +0000

A developer uses Copilot to write 30 lines of code in 10 minutes, but then spends 45 minutes reviewing it – checking for bugs, edge cases, and code that doesn’t match team standards.

The time saved during writing gets completely eaten up during validation. And this is exactly what happens repeatedly across teams trying to adopt AI at scale.

At the Pragmatic Summit, Laura Tacho (CTO at DX) shared some interesting research on AI in coding:

Almost 93% of developers use AI assistants every month, and about 27% of production code now comes from AI. Yet, despite all this, overall productivity has barely budged – staying around a 10% boost since AI tools arrived.

AI adoption is everywhere…

The numbers are clear:

92.6% of developers use AI coding assistants monthly
75% use them weekly
26.9% of production code contains AI-authored segments

84% of developers use AI tools, according to Stack Overflow’s 2025 survey. Adoption is now standard – the numbers are probably even bigger now.

…Yet work isn’t moving any quicker

The gap between adoption and productivity appears first as a trust problem.

46% of developers don’t fully trust the output, and that skepticism has a reason: reviewing AI-generated code frequently requires more effort than reviewing human-written code.

The DX AI Measurement Framework (published by vendor DX but structured as an industry standard) identifies this directly:

Code generated by AI may be less intuitive for human developers to understand, potentially creating bottlenecks when issues arise or modifications are needed.

This is why productivity hasn’t jumped. Developers might write code faster with AI, but they end up spending the same time checking, fixing, and making sense of what AI produces. In the end, the overall development cycle doesn’t get any shorter.

Sonar’s research confirms the pattern at scale: 42% of committed code now includes AI assistance, yet 96% of developers say they don’t fully trust AI-generated code. And this is exactly what we see: output is everywhere, but the confidence in it is not.

Why has productivity stalled?

That 10% productivity bump comes down to a workflow mismatch.

Teams started using AI to write code faster, but didn’t adjust how they review, test, or integrate it. In other words, writing got quicker, but everything that comes after stayed just as slow.

The DX research notes a broader context relevant here: most organizations see their biggest bottlenecks not in code generation, but:

In the outer loop, or in human factors like collaboration, alignment, and the ability to do deep, focused work.

AI addresses one specific problem, and that’s code-writing speed. But, as we can see, the overall development cycle has other constraints.

Teams that actually see productivity gains from AI usually do two things: they figure out exactly where AI adds value, and they tweak their workflows to make the most of it. Teams that just deploy AI without changing how they work? They get adoption, but no real boost in productivity.

The 10% productivity ceiling sticks because the time spent validating AI-written code cancels out the speed gains. Most teams focus on writing faster, but few have optimized for faster validation.

It’s an obvious obstacle, but maybe also an opportunity.

The post AI Hasn’t Made Developers Faster, It’s Made Their Review Queues Longer! appeared first on ShiftMag.

Want to build a more accurate Copilot with fewer hallucinations? Move from prompting to fine-tuning.

Tena Šojer Keser — Tue, 07 May 2024 13:12:32 +0000

Is prompting enough? Emanuel Lacić asked this question on the stage of the Shift Conference in Miami as he explored the process of creating a Copilot for a UI-based chatbot builder. 

The chatbot builder in question, Answers Copilot, is a GenAI feature that enables end users to design a chatbot based on their natural language input. GenAI drafts an outline of how the chatbot should behave, automating the chatbot building process to a degree, and the end user then customizes it to meet their requirements.

Starting with prompting

The initial process relied on prompting: Emanuel and his team described what the underlying code looked like, had OpenAI generate the code blocks representing visual elements, and then plugged them in to be rendered in the UI. The goal was as few hallucinations as possible (i.e., generated code that leads to an error when rendering) and output that is as predictable as possible.

They tested different prompt engineering strategies with Microsoft’s API for GPT-3.5 Turbo. By testing different techniques ranging from zero-shot to few-shot prompting with domain-specific instructions, they managed to lower the percentage of hallucinations to 12.63% on average. Accuracy was measured using HitRate – how often the generated code block matched 100% of what was expected – which peaked at 2.13%.
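The two metrics described above can be sketched in a few lines. This is only an illustration, not the team's evaluation code: the `Sample` structure, the `renders` flag (standing in for whatever rendering check the team ran), and the whitespace-insensitive exact match are all assumptions, and the sample data is made up.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One evaluation case (hypothetical schema)."""
    generated: str  # code block produced by the model
    expected: str   # reference code block
    renders: bool   # False if the generated block errors when rendered


def hallucination_rate(samples):
    """Percentage of generated blocks that fail to render."""
    failed = sum(1 for s in samples if not s.renders)
    return 100.0 * failed / len(samples)


def hit_rate(samples):
    """Percentage of generated blocks that match the expected block exactly.

    Assumption: leading and trailing whitespace is ignored.
    """
    hits = sum(1 for s in samples if s.generated.strip() == s.expected.strip())
    return 100.0 * hits / len(samples)


# Made-up examples, for illustration only
samples = [
    Sample('{"type": "text"}', '{"type": "text"}', True),    # exact match
    Sample('{"type": "menu"}', '{"type": "text"}', True),    # renders, but wrong block
    Sample('{"type": ', '{"type": "text"}', False),          # fails to render
    Sample('{"type": "input"}', '{"type": "input"}', True),  # exact match
]

print(f"Hallucination rate: {hallucination_rate(samples):.2f}%")
print(f"HitRate: {hit_rate(samples):.2f}%")
```

Note that the two metrics are independent: a block can render without errors and still miss the expected output, which is why a low hallucination rate does not guarantee a high HitRate.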

With the Copilot built using different prompting strategies, it was time to answer Emanuel’s titular question: Is prompting enough? The team turned to fine-tuning to test the hypothesis that LLMs trained on context-specific data might yield a lower percentage of hallucinations and higher accuracy (as measured by HitRate).

Bigger is not always better

As end users can task the Answers Copilot with creating a chatbot for a variety of use cases, fine-tuning it required the team to know what input users might provide, as well as what the desired output is. Since real-world data was not available, GenAI was used to create synthetic data.

The data was then used to fine-tune LLMs of various sizes: OpenAI GPT-3.5 Turbo (large), Mistral 7B Instruct (mid), LLaMa 3B (small), and Sheared LLaMa 1.3B (tiny). In addition to training the models with relevant data, the team used LoRA to fine-tune visual element generation. 

The fine-tuning process did yield the desired results: LLMs trained on relevant data had a significantly lower number of hallucinations, with 0.04% as the lowest achieved hallucination rate. Accuracy also improved significantly, with the HitRate climbing to 26.72%.

Interestingly, Emanuel notes the best performing models were Sheared LLaMa (in terms of hallucinations) and Mistral 7B Instruct (when it came to HitRate):

Sometimes you don’t need the largest, best performing LLM. But the only way to know which one performs best is to experiment – you can’t know beforehand.

What’s next?

There are always ways to polish Copilots, with user feedback being the logical next step. To that end, Emanuel showed the KTO method (Kahneman-Tversky Optimization). Because it requires only a binary signal (desirable/undesirable outcome), the user feedback data is more abundant, cheaper, and faster to collect than data based on user preference between two different outputs, which is used in other popular methods like Reinforcement Learning from Human Feedback (RLHF). KTO is also a good choice when there is a marked imbalance between the number of desirable and undesirable examples.
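To see why a binary signal yields more usable data than pairwise preferences, consider the sketch below. It does not implement the KTO loss; it only shows the shape of the feedback data. The `FeedbackEvent` schema, the field names, and the example prompts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FeedbackEvent:
    """One thumbs-up / thumbs-down click from a user (hypothetical schema)."""
    prompt: str
    completion: str
    thumbs_up: bool


def to_binary_examples(events):
    """Binary-signal training data: every single click is usable on its own."""
    return [
        {"prompt": e.prompt, "completion": e.completion, "label": e.thumbs_up}
        for e in events
    ]


def to_preference_pairs(events):
    """Pairwise training data: needs a liked AND a disliked completion
    for the same prompt, so many clicks cannot be used."""
    liked, disliked = {}, {}
    for e in events:
        bucket = liked if e.thumbs_up else disliked
        bucket.setdefault(e.prompt, []).append(e.completion)
    return [
        {"prompt": p, "chosen": c, "rejected": r}
        for p in liked if p in disliked
        for c in liked[p]
        for r in disliked[p]
    ]


# Made-up feedback log
events = [
    FeedbackEvent("Build a pizza ordering bot", "flow A", True),
    FeedbackEvent("Build a pizza ordering bot", "flow B", False),
    FeedbackEvent("Build a support bot", "flow C", True),
    FeedbackEvent("Build a booking bot", "flow D", False),
]

binary_examples = to_binary_examples(events)
preference_pairs = to_preference_pairs(events)
print(len(binary_examples), "binary examples vs", len(preference_pairs), "preference pair(s)")
```

In this toy log, only the prompt that happened to receive both a positive and a negative rating produces a preference pair, while every click counts as a binary example.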

To take user feedback a step further, a multi-armed bandit algorithm can be used, as Emanuel demonstrated, to determine which of the LLMs produces the most favorable results while running in production and, consequently, to choose the LLM automatically.
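The article does not say which bandit algorithm Emanuel demonstrated, so the sketch below uses epsilon-greedy, one of the simplest variants, purely as an illustration. The class name, model names, and success rates are all made up; the reward is the same binary desirable/undesirable signal described above.

```python
import random


class EpsilonGreedyRouter:
    """Picks which model serves a request, based on binary user feedback."""

    def __init__(self, model_names, epsilon=0.1, seed=None):
        self.epsilon = epsilon          # share of traffic used for exploration
        self.rng = random.Random(seed)
        self.pulls = {name: 0 for name in model_names}
        self.wins = {name: 0 for name in model_names}

    def mean_reward(self, name):
        """Observed share of desirable outcomes for one model."""
        return self.wins[name] / self.pulls[name] if self.pulls[name] else 0.0

    def choose(self):
        # Try every model at least once before exploiting
        untried = [n for n, count in self.pulls.items() if count == 0]
        if untried:
            return untried[0]
        # Explore: occasionally pick a random model
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.pulls))
        # Exploit: pick the model with the best observed feedback so far
        return max(self.pulls, key=self.mean_reward)

    def update(self, name, desirable):
        self.pulls[name] += 1
        self.wins[name] += int(desirable)


# Simulated production traffic with hypothetical "true" satisfaction rates
true_rates = {"model-a": 0.55, "model-b": 0.70, "model-c": 0.40}
router = EpsilonGreedyRouter(list(true_rates), epsilon=0.1, seed=0)
users = random.Random(1)

for _ in range(2000):
    name = router.choose()
    router.update(name, users.random() < true_rates[name])

for name in true_rates:
    print(name, router.pulls[name], round(router.mean_reward(name), 3))
```

Over enough requests, most traffic typically drifts toward the model with the highest satisfaction rate, while the small exploration share keeps checking whether another model has become the better choice.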

You can find Emanuel’s slides here or find out more about his work on his personal website.

The post Want to build a more accurate Copilot with fewer hallucinations? Move from prompting to fine-tuning. appeared first on ShiftMag.
