Teaching AI Agents to Test 1,000 Java Libraries – and Letting Them Run While You Sleep

1,000 libraries, 90% coverage, $1,700 in API tokens. Nobody typed a single test by hand.

When humans maintained the GraalVM native image reflection metadata repository, coverage sat at just 14%. Tests were often stubs that technically compiled but covered nothing meaningful, nobody wanted to write them for someone else’s code, and the results showed.

At Devoxx UK, Vojin Jovanovic (Principal Researcher, Oracle Labs) and Mihailo Markovic (Software Engineer, Oracle) presented how they replaced that process with an autonomous AI agent pipeline.

The result is 90% dynamic access coverage across more than 1,000 JVM libraries, roughly 2 billion tokens spent, and a GitHub repository receiving thousands of commits per week, with agents still committing while Vojin was at his hotel the night before the conference.

The problem with GraalVM reflection

GraalVM Native Image takes a Java application, performs static analysis, and compiles it ahead of time (AOT) into a single binary. The benefits are significant: startup roughly 10x faster than a standard JVM and a dramatically lower memory footprint.

But static analysis has a fundamental limitation: when a method calls Class.forName(className) with an argument that is only computed at run time, the analyser cannot determine at build time which class will be needed. Such reflection calls break the closed-world assumption, the premise that all code reachable at run time is known when the image is built.
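To make the distinction concrete, here is a minimal sketch contrasting a reflective lookup with a constant class name against one whose name arrives at run time. The class DynamicAccessDemo and its method names are hypothetical and exist only for illustration; the only real APIs used are Class.forName and getDeclaredConstructor().newInstance().

```java
// Illustrative sketch; DynamicAccessDemo and its methods are hypothetical names.
class DynamicAccessDemo {

    // The class name is a constant in the source, so a static analyser can
    // see which class this call will load.
    static Class<?> loadConstant() {
        try {
            return Class.forName("java.util.ArrayList");
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // The class name is only known at run time (configuration, user input,
    // a descriptor file), so build-time analysis cannot tell which class and
    // constructor must be kept accessible.
    static Object instantiate(String className) {
        try {
            Class<?> type = Class.forName(className);
            return type.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + className, e);
        }
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : "java.util.LinkedList";
        System.out.println(instantiate(name).getClass().getName());
    }
}
```

On a regular JVM both calls succeed. The second form is the kind of call site that needs metadata before it can work in a native image.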

The solution is reachability metadata – a JSON file that tells the native image compiler which classes, methods, and fields need to be accessible at runtime. Writing this metadata requires running tests that exercise all the relevant code paths.

For a library like Hibernate Core, that means covering 264 individual reflection call sites. For Tomcat, 205. Across the JVM ecosystem, the number is enormous, and until recently, it was almost entirely a manual process that humans were not doing well.

Start simple, then add feedback

The first approach was straightforward: give an LLM the library source code, tell it to generate comprehensive Java tests, and collect the metadata via a JVMTI agent.

The results were not impressive: 5.7% coverage for logback and 2.9% for H2. Vojin remarked that this did not feel like AGI.

The shift came from adding GraalVM’s static analysis directly to the agent’s context. Instead of asking the LLM to guess which code paths matter, the pipeline runs a static analysis pass that identifies every dynamic access call site (the exact class, method, and line number) and feeds that report directly to the agent. With this addition, logback coverage jumped to 97%, H2 to 84.3%, in five iterations.
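The talk did not show the report format, so the following is only a sketch of how a list of call sites (class, method, line number) might be turned into text for the agent's context. CallSiteReport, CallSite, toPrompt, and the wording of the prompt are all assumptions made for illustration. The sketch assumes Java 17 or later for records.

```java
import java.util.List;

// Hypothetical sketch: the real pipeline's report format is not public in the talk.
class CallSiteReport {

    // One dynamic access call site as identified by static analysis.
    record CallSite(String className, String methodName, int line, String api) {}

    // Renders the call sites as a plain-text block for the agent's context.
    static String toPrompt(String library, List<CallSite> sites) {
        StringBuilder sb = new StringBuilder();
        sb.append("Library: ").append(library).append('\n');
        sb.append("Write tests that execute each dynamic access call site below.\n");
        for (CallSite s : sites) {
            sb.append("- ")
              .append(s.className()).append('#').append(s.methodName())
              .append(" (line ").append(s.line()).append("): ")
              .append(s.api()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Example data is invented for the demo.
        List<CallSite> sites = List.of(
                new CallSite("com.example.Loader", "load", 42, "Class.forName"));
        System.out.print(toPrompt("demo-lib", sites));
    }
}
```

The point of the design is that the agent receives a concrete target list instead of having to infer from the source which code paths matter.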

The next layer was JaCoCo integration. After each generation round, the pipeline correlates coverage data with the remaining uncovered call sites and feeds only the uncovered ones back into the next iteration. The agent knows exactly what it hit and what it missed. Vojin noted:

We always create a checkpoint in those systems so we can go back to it if something goes wrong. And in these LLM-driven workflows, something is always going wrong.

With this feedback loop: logback reached 100%, H2 reached 96.1%.
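The loop described above can be sketched as a set difference repeated over a bounded number of rounds. This is an assumed shape, not the team's code: CoverageFeedbackLoop is a hypothetical name, and the generateAndRun function stands in for "the agent writes tests, the pipeline runs them under coverage, and the covered call sites come back".

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch of a coverage feedback loop.
class CoverageFeedbackLoop {

    // Returns the call sites that are still uncovered after at most maxIterations rounds.
    // generateAndRun receives only the uncovered sites and returns the ones newly covered.
    static Set<String> run(Set<String> allSites,
                           Function<Set<String>, Set<String>> generateAndRun,
                           int maxIterations) {
        Set<String> remaining = new LinkedHashSet<>(allSites);
        for (int i = 0; i < maxIterations && !remaining.isEmpty(); i++) {
            Set<String> covered = generateAndRun.apply(Set.copyOf(remaining));
            remaining.removeAll(covered); // the next round sees only what is still missing
        }
        return remaining;
    }

    public static void main(String[] args) {
        // Stub "agent" that covers site A first, then site B on later rounds.
        Set<String> left = run(Set.of("A", "B", "C"),
                uncovered -> uncovered.contains("A") ? Set.of("A") : Set.of("B"),
                2);
        System.out.println("Still uncovered: " + left);
    }
}
```

Feeding back only the remaining sites keeps each round's context small and focused on what was missed.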

Coverage sometimes still isn’t enough

For larger, more complex libraries (Guava, Tomcat, MongoDB), even the feedback loop left gaps. The team added a third technique: PGO (Profile-Guided Optimization) profiling from GraalVM’s Graal compiler. The profiler samples execution and produces a call trace, which can be correlated with static analysis to identify exactly where a test nearly reached a reflection call but diverged.

The profiling feedback tells the agent not just what’s uncovered, but where in the call stack a test went in the wrong direction and what it would need to do differently. Results: Guava went from 50% to 72%, Tomcat from 45% to 83%, MongoDB reached 100%.
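One way to picture the correlation step: take a call path from static analysis that leads to a reflection call, compare it with the methods the profile actually observed, and report the first step that was never executed. The sketch below assumes both inputs are available as simple method-name collections, which is a simplification. DivergenceFinder and Divergence are hypothetical names, and Java 17 or later is assumed for records.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch: finds where a test left the path to a reflection call.
class DivergenceFinder {

    // lastReached is null when even the first method on the path was never executed.
    record Divergence(String lastReached, String firstMissed) {}

    // pathToCallSite: methods from the entry point to the reflection call, in call order.
    // executedMethods: methods observed in the execution profile of the test run.
    static Optional<Divergence> find(List<String> pathToCallSite, Set<String> executedMethods) {
        String lastReached = null;
        for (String method : pathToCallSite) {
            if (!executedMethods.contains(method)) {
                return Optional.of(new Divergence(lastReached, method));
            }
            lastReached = method;
        }
        return Optional.empty(); // the whole path was executed
    }

    public static void main(String[] args) {
        // Method names are invented for the demo.
        List<String> path = List.of("A.entry", "B.parse", "C.load");
        Set<String> executed = Set.of("A.entry", "B.parse", "X.other");
        System.out.println(find(path, executed));
    }
}
```

A result such as "reached B.parse, never entered C.load" is the kind of hint that tells the agent which branch its next test needs to take.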

The feedback also tells the agent (and the engineers) why certain calls cannot be covered: a security service only available on Java 6, a cleaner class incompatible with the current JVM. “If you cannot reach it, tell us why,” the prompt instructs, and the agent does.

Cost, agents and model selection

Codex was the first agent framework the team tried. For logback, a library with 33 dynamic access calls, Codex spent $35:

If we’re spending $35 per library for a thousand libraries, we’re not replacing humans.

The alternative was P, a minimal agent that starts with a 200-token context describing basic file operations and bash execution. It produced the same results at roughly a tenth of the cost. The lesson is straightforward:

Simple task, use a simple agent. You already give it a lot of rules, a lot of context, and you’ve grounded it enough so it can perform on the level of these big agents.
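The arithmetic behind that argument is easy to check with the figures quoted in this article. The sketch below extrapolates the single logback data point ($35) to 1,000 libraries, as the speaker does rhetorically, and compares it with the reported project total of approximately $1,700 for 1,021 libraries. PipelineCost is a hypothetical helper name, and the linear extrapolation is an assumption, since real per-library cost varies with library size.

```java
import java.util.Locale;

// Back-of-the-envelope cost arithmetic using figures quoted in the article.
class PipelineCost {

    static double total(double costPerLibrary, int libraries) {
        return costPerLibrary * libraries;
    }

    static double perLibrary(double totalCost, int libraries) {
        return totalCost / libraries;
    }

    public static void main(String[] args) {
        // $35 per library, extrapolated linearly to 1,000 libraries (assumption).
        System.out.printf(Locale.ROOT, "Heavy agent, 1,000 libraries: $%.0f%n", total(35.0, 1000));
        // "Roughly 10x cheaper" applied to the same extrapolation.
        System.out.printf(Locale.ROOT, "Minimal agent, 1,000 libraries: $%.0f%n", total(35.0 / 10, 1000));
        // Reported project total divided by the reported library count.
        System.out.printf(Locale.ROOT, "Reported average per library: $%.2f%n", perLibrary(1700.0, 1021));
    }
}
```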

On model selection, the team compared GPT 5.5 against several open-source alternatives – GLM, Kimi K2, DeepSeek, Gemma. GPT 5.5 consistently outperformed them on coverage. The counterintuitive finding was this: a more expensive model that makes the right decision in one shot can cost less overall than a cheaper model that wastes tokens going in the wrong direction.
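A small model of that trade-off: total spend is the number of attempts times the tokens per attempt times the token price. The prices, token counts, and attempt counts below are invented purely to illustrate the shape of the argument and are not measurements from the talk. ModelCostComparison is a hypothetical name.

```java
// Hypothetical illustration; all numbers in main are invented, not measured.
class ModelCostComparison {

    // Total spend = attempts * tokens per attempt * price per million tokens.
    static double totalCost(int attempts, long tokensPerAttempt, double pricePerMillionTokens) {
        return attempts * (tokensPerAttempt / 1_000_000.0) * pricePerMillionTokens;
    }

    public static void main(String[] args) {
        // Pricier model that succeeds on the first attempt (made-up figures).
        double strong = totalCost(1, 2_000_000L, 10.0);
        // Cheaper model that needs many attempts to get there (made-up figures).
        double cheap = totalCost(8, 2_000_000L, 2.0);
        System.out.println("Strong model: $" + strong + ", cheap model: $" + cheap);
    }
}
```

With these made-up inputs the model that is five times more expensive per token still ends up cheaper overall, because it needs one attempt instead of eight.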

The architecture that lets it run without you

The pipeline now operates as a third-generation system. When a user opens an issue requesting a library, the agent fetches the issue, runs the generation workflow, verifies the output, creates a pull request, reviews it, and merges or escalates to human review – automatically. The “human intervention” label on GitHub still exists, but its queue has shrunk dramatically.
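The control flow of that workflow can be sketched as a short chain of gated steps, where any failed gate routes the request to a human. This is an assumed simplification: IssuePipeline, Outcome, and the three predicates are hypothetical stand-ins for the real generation, verification, and review stages, and the GitHub interactions are left out.

```java
import java.util.function.Predicate;

// Hypothetical sketch of the issue-to-merge flow; no real GitHub calls are made.
class IssuePipeline {

    enum Outcome { MERGED, HUMAN_INTERVENTION }

    // Each predicate returns true when its stage succeeded for the given library.
    static Outcome handle(String library,
                          Predicate<String> generate,
                          Predicate<String> verify,
                          Predicate<String> review) {
        if (!generate.test(library)) {
            return Outcome.HUMAN_INTERVENTION; // generation failed, escalate
        }
        if (!verify.test(library)) {
            return Outcome.HUMAN_INTERVENTION; // output did not verify, escalate
        }
        // A real pipeline would open the pull request at this point.
        return review.test(library) ? Outcome.MERGED : Outcome.HUMAN_INTERVENTION;
    }

    public static void main(String[] args) {
        Outcome outcome = handle("demo-lib", lib -> true, lib -> true, lib -> true);
        System.out.println(outcome);
    }
}
```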

Documentation, not smarter prompting, was what made the difference.

Vojin outlined what he calls the key context layers:

raison d’être (why does this project exist, in two sentences),
state of direction (where the architecture stands today),
functional specification (how the system behaves),
architectural specification (how it is built),
decision records (what major choices were made and why), and
comprehensive logs that serve as checkpoints for recovery.

When you do all of these things, it takes almost a few days for a very big project. You will reduce your work by 50%, 60%, 70%.

The payoff is that agents with this context can diagnose failures, trace them through logs, and fix the underlying system, not just the immediate problem.

The RAID system (an automated issue-resolution agent) was built in four prompts on a Sunday morning. It sweeps human intervention tickets, classifies them, performs deep analysis using the project logs, and either opens a GitHub issue for humans or attempts a fix in a forked branch with review. Jovanovic added:

Never work on the problem, always work on the system. You never go and fix a ticket. You always go fix the rules.

Where things stand

The repository currently supports 1,021 libraries. Excluding five large Hibernate libraries that predate the automated pipeline, dynamic access coverage across the ecosystem is 90%.

The GitHub repository has accumulated 2,977 branches. In the week before Devoxx, it logged approximately 8,000 to 9,000 commits, with agents committing every few minutes around the clock.

Total cost for the project: approximately $1,700 in API tokens, plus personal compute on Jovanovic’s home desktop, running around the clock because the Oracle compliance process for cloud infrastructure takes time. The key point is simple:

Start with neural, simplest thing, get results, and then slowly chop off things and put them into algorithms, because they are much cheaper and faster.

We caught Vojin Jovanovic for a few more questions!

After the talk, we sat down with Vojin for a few follow-up questions.

You tested over 1,000 libraries. What broke first when you tried to scale?

Vojin: Basically everything broke. We had mostly infrastructure issues, all kinds of GitHub failures. When you build a system at this scale, you need to assume that everything will fail and needs to recover. We broke GitHub rate limits. My machine was broken because it was running so many things. The key takeaway is that you need to build a system in a way that you can always continue. When things fail, you always checkpoint and continue from a checkpoint. We do work in sizable chunks, and when something fails, you just restart the chunk.
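A minimal sketch of the "work in chunks, checkpoint, restart the chunk" idea follows. It is an assumed design, not the team's implementation: CheckpointedRunner, CheckpointStore, and InMemoryStore are hypothetical names, and a real system would persist the checkpoint to a file or database rather than keep it in memory.

```java
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of chunked work with restart from the last checkpoint.
class CheckpointedRunner {

    // Stores the index of the next chunk to process.
    interface CheckpointStore {
        int load();
        void save(int nextChunk);
    }

    // In-memory store for the demo; swap in durable storage for real use.
    static class InMemoryStore implements CheckpointStore {
        private int next;
        public int load() { return next; }
        public void save(int nextChunk) { next = nextChunk; }
    }

    // Processes chunks starting at the checkpoint. If work throws, the
    // checkpoint still points at the failed chunk, so a rerun restarts there.
    static void run(List<String> chunks, CheckpointStore store, Consumer<String> work) {
        for (int i = store.load(); i < chunks.size(); i++) {
            work.accept(chunks.get(i));
            store.save(i + 1); // advance only after the chunk completed
        }
    }

    public static void main(String[] args) {
        InMemoryStore store = new InMemoryStore();
        run(List.of("lib-a", "lib-b", "lib-c"), store, chunk -> System.out.println("done " + chunk));
        System.out.println("next chunk index: " + store.load());
    }
}
```

Because the checkpoint only advances after a chunk finishes, the work inside a chunk should be safe to repeat.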

Is just asking the LLM enough?

Vojin: If you had asked me four weeks ago, I would say no. Now I would say you need to know how to ask it, and it will be enough. I was like, “GitHub is failing with a 504, abstract away all GitHub calls and retry.” It did it in two minutes. With today’s models, it’s a matter of minutes, not hours.
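What "abstract away all calls and retry" might look like in its simplest form is a single wrapper that every remote call goes through. The sketch below is generic and makes no GitHub API calls. RetryingCall and withRetry are hypothetical names, and the exponential backoff policy is an assumption rather than something described in the interview.

```java
import java.util.function.Supplier;

// Hypothetical sketch of a retry wrapper with exponential backoff.
class RetryingCall {

    // Runs call up to maxAttempts times, doubling the delay after each failure.
    // A production version would retry only on transient errors such as a 504.
    static <T> T withRetry(Supplier<T> call, int maxAttempts, long initialDelayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException last = null;
        long delay = initialDelayMillis;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt < maxAttempts) {
                    sleep(delay);
                    delay *= 2;
                }
            }
        }
        throw last; // every attempt failed; surface the final error
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting to retry", e);
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        String result = withRetry(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("simulated 504");
            }
            return "ok";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

Routing every remote call through one wrapper means the retry policy lives in a single place, which fits the "fix the rules, not the ticket" approach described earlier.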

What did you learn about the trade-off between cost, speed, and coverage?

Vojin: I haven’t seen a situation when doing something with an LLM is more expensive than doing that by a human typing on the keyboard. Build a system that uses the most efficient LLM for the job — you’re going to get far and not cost much money at all.

When does using multiple agents make sense?

Vojin: Where I use it is for decisions and research. I use Claude Opus 4.7, Gemini 3.1, and GPT 5.5. I ask them all, let them discuss, and I discuss together with them. Each brings something to the table. Before, it was always Claude who was the smartest. Now GPT 5.5 is second and close to the first. Things are changing. The most important bit is getting the system designed right. Once you do that, coding, I don’t care who does it.