Probabilistic Software Engineering, Demystified

Traditional software engineering is deterministic: we act like traffic controllers, owning the roads, the signals, and the rules. We decide exactly where data flows and when. To make this possible, we must maintain complete control over the entire software development life cycle (SDLC).
This approach to building software is rapidly losing relevance.
Agentic engineering is probabilistic. We are dispatchers. We give instructions to a driver (an LLM) who might take a shortcut, get lost, or decide to drive on the sidewalk because it “seemed faster.”
This doesn’t mean we’ll suddenly stop looking at our code, debugging, writing tests, or making sure quality is maintained. All of our quality and security checks will still be here. Hell, there may even be more of them – each new “problem” usually spawns an entire product stack to solve it. The adjustment is what’s painful.
But the genie is out of the bottle. Everyone knows you have your agentic coding tool at your disposal and that you can ship faster. It’s uncomfortable, but as the pressure mounts, we’ll – just as we always do – learn to live with it.
When probabilistic thinking kicks in…
For products where you want to embed dynamic choice selection – a brain or, as it’s trendy to call it, an agent – traditional hardcoding is out the door. Your booleans start to look like “it depends” rather than simply True or False.
In a world where your function is executed by a model that “reasons” through the problem and makes its own choices, you have to shift your mindset to “What can I control?”. You still have control – it’s just tangled in a mess of statements and linguistic trickery designed to make it work.
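To make that concrete, here is a minimal sketch of the same decision made the traditional way and the agentic way. call_llm is a hypothetical stand-in for whatever model client you use; nothing here comes from a specific library.

def should_escalate_deterministic(ticket: dict) -> bool:
    # Traditional: the rule is hardcoded and evaluates the same way every time.
    return ticket["priority"] == "high" and ticket["age_hours"] > 24


def should_escalate_agentic(ticket: dict, call_llm) -> bool:
    # Probabilistic: the model weighs the ticket and "decides".
    # call_llm(prompt) -> str is a placeholder; the answer can vary from run to run.
    answer = call_llm(
        "Should this support ticket be escalated? Answer only 'yes' or 'no'.\n"
        f"Ticket: {ticket}"
    )
    return answer.strip().lower().startswith("yes")

The first version always returns the same answer for the same input; the second one only probably does.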
Your comments and docstrings suddenly become critical – they serve as guidance for an artificial agent to which you delegate work. For example:
from typing import List

def fetch_user_emails(userId: str) -> List[dict]:
    """
    Use this tool when your task requires you to take a look at the user's emails.

    Args:
        userId: str, get this from the environment

    Returns:
        List[dict]: A list of user emails
    """
    # Implementation goes here
In this case, your Agent will decide when to execute this function based on its description. Even if you wrote the best implementation in the world, it might not matter. You could have failed in the description (giving the wrong clue for when to execute it) or forgotten to define the arguments clearly. In that case, the Agent will try to guess and may call the function with something that breaks the whole thing.
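For illustration, here is roughly the shape of the tool definition the model actually sees once fetch_user_emails is registered. This is a sketch of an OpenAI-style function schema; the exact field names vary by framework and are not taken from the original example.

# A sketch of what the model "sees" for fetch_user_emails once it is registered
# as a tool. The exact schema layout varies by framework; this follows an
# OpenAI-style function definition and is illustrative, not prescriptive.
fetch_user_emails_tool = {
    "type": "function",
    "function": {
        "name": "fetch_user_emails",
        "description": (
            "Use this tool when your task requires you to take a look "
            "at the user's emails."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "userId": {
                    "type": "string",
                    "description": "The ID of the user whose emails to fetch.",
                },
            },
            "required": ["userId"],
        },
    },
}

The model never reads your implementation – only this description and parameter list – which is exactly why a vague description or an undocumented argument is enough to derail the call.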
Failure is probabilistic, too
The underlying software we write (and generate) today is still just software. As the trend shifts toward AI-generated code, we no longer worry about writing unit tests – these are now often beautifully handled by AI and are becoming less critical in fully GenAI-dependent applications. By that, I mean apps whose success relies entirely on an agent delivering output to the user, whether through a UI or via an API call.
This comes from our experience delivering large, complex JSON payloads that represent a building block of the user’s experience in one popular communication channel – a pipeline built on self-healing logic. Since the models we use are generative, they generate – so why wouldn’t they be able to regenerate?

It’s as simple as looping back the error with a bit more context so the agent can regenerate the output. The critical part is actually providing enough context for the agent to fix it properly.
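As a rough sketch of that loop – generate and validate are hypothetical placeholders for your model call and your output checks, not anything from our actual stack:

def generate_with_self_healing(generate, validate, prompt: str, max_attempts: int = 3) -> str:
    """Call the model; on failure, loop the error back so it can regenerate."""
    for _ in range(max_attempts):
        output = generate(prompt)
        try:
            validate(output)  # e.g. parse the JSON and check it against your schema
            return output
        except Exception as err:
            # The critical part: give the agent enough context to actually fix it.
            prompt = (
                "Your previous output was rejected.\n"
                f"Error: {err}\n"
                f"Previous output:\n{output}\n"
                "Regenerate the full output, fixing only what the error describes."
            )
    raise RuntimeError(f"No valid output after {max_attempts} attempts")

The more specific the error you feed back, the better the regeneration – a bare “invalid JSON” gives the agent far less to work with than the exact validation failure.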
What you see is not what you get in the long term
Building with GenAI means working in a field that evolves rapidly, so you must be ready for anything.
Don’t clash with the titans
More than once in the past few years, teams have built a feature only to see it rendered obsolete when the big model providers (OpenAI, Google, Anthropic) started offering a bypass to that feature via a simple API call.
It happened with context windows (in most cases, there’s no need to chunk the data anymore…), and it happened with OCR (transformers do this extremely well nowadays).
It will happen to any horizontal use case that isn’t specific to your domain.
In a nutshell, stick to your vertical, domain-specific use cases and weigh your options.
Build, then buy
It sounds counterintuitive, but please stay with me for a moment. Vibe coding is real, the products that support it keep getting better, and there’s no way around it.
Sure, there will be cases where folks don’t know exactly what to do, and sometimes they do something brutal like rm -rf their whole home directory. These cases are a minority and usually end up as learning by doing. Brutal, but still.
The fact is, you don’t need an ensemble of teams and departments to test out your theory. You can easily build a proof of concept by spending just a few hundred dollars on tokens, and then:
- Buy if you don’t feel comfortable going through the whole SDLC and maintaining it.
- Don’t buy if you haven’t validated your proof of concept or if it doesn’t make sense.
Much better than spending $40k on a yearly subscription for something you were sold on after seeing a cool demo (btw, never, ever trust a GenAI-powered app demo).
Prepare to change your architecture every six months (or less)
Frameworks come and go, and sometimes they evolve. Models do too, often even faster. This means that something you built in 2023 and then left alone probably looks laughable by today’s standards because:
- GenAI inference got cheaper by orders of magnitude.
- GenAI models now have context windows >100× what they had before.
- External integrations have changed completely (where was MCP in 2023? 2024?).
- Product use cases and user expectations have shifted.
However, adopting a mindset of constant change, and supporting it with tooling that takes away most of the work, keeps this churn manageable. Maybe even more manageable than ever before.
No way around context engineering
GenAI models only understand the context you provide, which means that context engineering is still absolutely critical for any task to succeed. Getting this right requires understanding the impact of each piece of the LLM context, because today it typically includes:
- Message history
- System/developer prompt
- Tool definitions (if local)
- MCP tool definitions (if remote)
- MCP system prompts, schemas, argument definitions… (see the problem here?)
- Results of all those tens or hundreds of tool calls
- Pieces retrieved from external sources (if using retrieval and similarity search through a vector database)
So this can escalate very quickly. The models that gave us bigger context windows have removed blockers for use cases that were previously unimaginable – but they’ve also introduced a new challenge: context management. In the end, if you accept that things are probabilistic, you’ll manage just fine.
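A back-of-the-envelope sketch of how those pieces add up – the four-characters-per-token heuristic and the category names are illustrative assumptions, not measurements from a real tokenizer:

def rough_tokens(text: str) -> int:
    # Crude approximation: real tokenizers differ, but ~4 characters per token
    # is close enough to see the trend.
    return max(1, len(text) // 4)


def context_budget(system_prompt: str,
                   message_history: list[str],
                   tool_definitions: list[str],
                   tool_results: list[str],
                   retrieved_chunks: list[str]) -> dict[str, int]:
    """Approximate how many tokens each piece of the context consumes."""
    budget = {
        "system_prompt": rough_tokens(system_prompt),
        "message_history": sum(rough_tokens(m) for m in message_history),
        "tool_definitions": sum(rough_tokens(t) for t in tool_definitions),
        "tool_results": sum(rough_tokens(r) for r in tool_results),
        "retrieved_chunks": sum(rough_tokens(c) for c in retrieved_chunks),
    }
    budget["total"] = sum(budget.values())
    return budget

Run something like this over a real agent trace and the tool results and retrieved chunks will usually dwarf everything you wrote by hand – which is where the context management work begins.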


