How Infobip’s Infrastructure Team Handled 10 Billion Messages in a Day

At the recent Shift Conference in Zadar, Infobip Senior Software Engineer Josip Antoliš shared the story of the company’s rise. He revealed how a decade of challenges — from unexpected opportunities to unfortunate outages — forged the technology that keeps the world connected.
His talk, “10 Billion Messages in a Day, How We Built the Infrastructure That Delivers,” wasn’t a dry technical rundown. Instead, it was a journey through Infobip’s history, a candid look at the moments that broke their systems and forced them to rebuild something more substantial.
It isn’t just a tale of scaling up; it’s a playbook for turning crises into opportunities, using each failure as a blueprint for the next big leap forward.
This article explores the key takeaways from that talk, the pivotal events that shaped Infobip’s on-premise and cloud infrastructure, and the core philosophies that guide its engineering teams today.
The Challenge of Scale
Building and maintaining a robust technical infrastructure is a constant balancing act, especially for a company operating at the scale of Infobip. The company’s infrastructure is a sprawling, dynamic environment with 61 data centers, around 40,000 virtual machines, and 1,300 physical servers.
Every day, they handle billions of messages, ranging from SMS and voice calls to emails and instant chat, all while supporting hundreds of in-house development teams that manage over two thousand distinct microservices.
They need a resilient and adaptable infrastructure to handle this high-velocity environment, where they constantly ship new features and products. But how did they get here? This story isn’t just one of planned growth, but also of learning from and adapting to a series of crises that have fundamentally shaped their infrastructure.
Crisis 1: The Email Tsunami (2008)
In 2008, a new client came on board with a daunting request: they needed to send billions of emails. While Infobip was used to handling thousands of SMS messages per second, email was a different beast. Emails often contain large attachments, and the sheer network throughput threatened to overwhelm their existing internet connections.
Solution: Rather than waiting for new connections, Infobip looked to the cloud. They provisioned virtual machines in AWS and quickly deployed a few key microservices. They set up an API service that accepted incoming email traffic, cached attachments in local S3 storage, and forwarded only the message metadata through their network.
The large attachments never had to cross Infobip’s limited connections, solving the core problem.
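The pattern is simple enough to sketch in a few lines. The following is an illustrative Python version, assuming boto3 for S3 access; the forwarding helper and bucket name are hypothetical stand-ins for Infobip’s internal pipeline, which the talk didn’t detail.

```python
# Minimal sketch of the attachment-caching pattern, assuming boto3 for S3.
# forward_metadata() and BUCKET are hypothetical stand-ins for Infobip's
# internal pipeline, which the talk did not detail.
import json
import uuid

import boto3

s3 = boto3.client("s3")
BUCKET = "email-attachment-cache"  # hypothetical cache bucket


def forward_metadata(payload: str) -> None:
    """Hypothetical stand-in for forwarding metadata over the internal network."""
    ...


def handle_incoming_email(sender: str, recipient: str, body: str,
                          attachment: bytes) -> dict:
    # Park the heavy payload in S3, close to where it arrived, so it never
    # crosses the limited on-premise internet connections.
    key = f"attachments/{uuid.uuid4()}"
    s3.put_object(Bucket=BUCKET, Key=key, Body=attachment)

    # Only a small metadata record travels onward through the network.
    metadata = {
        "sender": sender,
        "recipient": recipient,
        "body": body,
        "attachment_ref": f"s3://{BUCKET}/{key}",
    }
    forward_metadata(json.dumps(metadata))
    return metadata
```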
Lesson: This crisis taught Infobip the value of a hybrid cloud strategy. They realized they could treat cloud resources as an extension of their on-premise infrastructure. They maintained the same developer experience by deploying the same virtual machine (VM) images and extending internal tools to work with both cloud and on-premises VMs.
This strategy proved invaluable during the 2020 hardware shortages, allowing them to rapidly deploy new data centers in the cloud instead of waiting months for new physical servers to arrive.
Crisis 2: The Labor Day Disaster (2018)
In 2018, a significant incident occurred over the European Labor Day weekend: both of their data centers in Frankfurt went down. The cause was a memory leak in the network switch software. The faulty software had been applied to both redundant switches during the same maintenance window, and since memory leaked at the same rate, both switches crashed at roughly the same time. This triggered a chain reaction that took down their storage and all connected VMs.
Solution: The post-mortem for this incident led to a complete re-architecture of their on-premise virtualization and storage. They migrated from their old hypervisor to VMware, gaining better security, stability and powerful APIs. They also decoupled their storage from the network infrastructure, ensuring it wouldn’t fail in a similar incident.
Lesson: This event highlighted the critical importance of a robust disaster recovery procedure. Infobip started practicing recovery in a staging environment that mirrored their production setup. Every Monday, they would simulate a disaster, forcing applications to recover and ensuring they could handle failures. They also fixed circular dependencies in their microservices, allowing them to start even if their dependencies were unhealthy.
Today, they regularly perform production-level failovers to test their recovery procedures, which has made them incredibly resilient.
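The dependency fix boils down to a “start degraded, retry in the background” pattern. Here is a minimal sketch of that general technique (an assumption on my part, not Infobip’s actual code):

```python
# Sketch of the "start degraded, retry in the background" pattern (an
# assumption of the general technique, not Infobip's code). The service
# comes up immediately and reports itself unhealthy instead of crash-looping,
# which breaks circular startup dependencies after a disaster.
import threading
import time


class LazyDependency:
    def __init__(self, connect, retry_seconds: float = 5.0):
        self._connect = connect          # e.g. a function returning a DB client
        self._client = None
        self._retry_seconds = retry_seconds
        threading.Thread(target=self._reconnect_loop, daemon=True).start()

    def _reconnect_loop(self):
        while self._client is None:
            try:
                self._client = self._connect()
            except Exception:
                time.sleep(self._retry_seconds)  # dependency still down; keep retrying

    @property
    def healthy(self) -> bool:
        return self._client is not None
```

A health endpoint backed by this wrapper can answer “degraded” rather than failing outright, so two services that depend on each other can both boot after a total outage.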

Crisis 3: The WhatsApp Business Boom (2019)
In 2019, Facebook (now Meta) launched WhatsApp for business users, but with a catch. Businesses had to self-host the solution using a couple of provided Docker containers. It was a massive opportunity for Infobip to help, but their existing model of provisioning a separate VM for each customer was unsustainable. Manually onboarding hundreds of new customers was too slow and inefficient.
Solution: Infobip turned to Kubernetes. They deployed a cluster and automated onboarding of new customers, immediately solving their scalability problem.
However, hardware shortages soon made scaling their on-premise Kubernetes clusters difficult. They quickly migrated to Azure’s managed Kubernetes service, allowing them to provision new nodes easily.
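The heart of that automation can be sketched with the official Kubernetes Python client; the image, names, and namespace scheme below are hypothetical, since the talk didn’t detail Infobip’s actual manifests.

```python
# Minimal sketch of per-customer onboarding on Kubernetes, using the official
# `kubernetes` Python client. The image, names, and namespace scheme are
# hypothetical, not Infobip's actual manifests.
from kubernetes import client, config


def onboard_customer(customer_id: str) -> None:
    config.load_kube_config()  # or load_incluster_config() inside the cluster
    ns = f"wa-{customer_id}"

    # Each customer gets an isolated namespace instead of a dedicated VM.
    client.CoreV1Api().create_namespace(
        client.V1Namespace(metadata=client.V1ObjectMeta(name=ns))
    )

    labels = {"app": "whatsapp-business", "customer": customer_id}
    deployment = client.V1Deployment(
        metadata=client.V1ObjectMeta(name="whatsapp-business", labels=labels),
        spec=client.V1DeploymentSpec(
            replicas=1,
            selector=client.V1LabelSelector(match_labels=labels),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels=labels),
                spec=client.V1PodSpec(containers=[
                    client.V1Container(
                        name="whatsapp-business",
                        image="registry.example.com/whatsapp-business:latest",  # hypothetical
                    )
                ]),
            ),
        ),
    )
    client.AppsV1Api().create_namespaced_deployment(namespace=ns, body=deployment)
```

Turning “provision a VM per customer” into “create a namespace and a deployment per customer” is what made onboarding hundreds of businesses tractable.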
Lesson: This crisis led Infobip to a deep understanding of Kubernetes. They eventually developed their own in-house, fully automated “Infobip Kubernetes Service”.
This system leverages the VMware APIs that were rolled out in the aftermath of the second crisis, allowing them to deploy and manage on-premise clusters. This hybrid approach to Kubernetes enables them to choose the best solution for each use case.
Crisis 4: The Decommissioning Mistake (2024)
Sometimes, a crisis is self-inflicted. In 2024, an engineer decommissioned a data center in Rome. The engineer noticed that all VMs in that data center had the string “-ro” in their hostnames and wrote a regex to delete them. Unfortunately, the regex also matched some of Infobip’s critical routing VMs with “-rout” in their names.
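The failure mode is easy to reproduce. A small illustration (with hypothetical hostnames; the exact pattern from the incident wasn’t shared in the talk):

```python
# Why the cleanup regex over-matched (illustrative; the exact pattern from
# the incident was not shared). "-ro" is a substring of "-rout", so an
# unanchored search catches the routing hosts too. Hostnames are hypothetical.
import re

hostnames = ["sms-gw-ro-01", "api-ro-02", "msg-rout-01"]

naive = re.compile(r"-ro")          # matches all three, including msg-rout-01
safer = re.compile(r"-ro(?:-|$)")   # only matches "-ro" as a whole segment

print([h for h in hostnames if naive.search(h)])
# ['sms-gw-ro-01', 'api-ro-02', 'msg-rout-01']  <- routing VM swept up too
print([h for h in hostnames if safer.search(h)])
# ['sms-gw-ro-01', 'api-ro-02']
```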
Solution: After the incident, they implemented a new data center decommissioning procedure to prevent similar issues. Most importantly, Infobip’s tooling now automatically delays destructive actions. When a VM is “deleted,” it is simply stopped and scheduled for deletion a day later, allowing for quick recovery from accidental actions.
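The safeguard itself can be sketched in a few lines (an assumption of the idea, with hypothetical hypervisor calls, not Infobip’s actual implementation):

```python
# Sketch of the "destructive actions delay" safeguard (an assumption of the
# idea, not Infobip's actual implementation). A "delete" only stops the VM
# and records a purge time; a separate reaper destroys it a day later.
import datetime

GRACE_PERIOD = datetime.timedelta(days=1)
pending_deletions: dict[str, datetime.datetime] = {}


def stop_vm(vm_id: str) -> None: ...     # hypothetical hypervisor calls
def start_vm(vm_id: str) -> None: ...
def destroy_vm(vm_id: str) -> None: ...  # the only irreversible step


def delete_vm(vm_id: str) -> None:
    stop_vm(vm_id)  # the VM stays on disk and remains recoverable
    pending_deletions[vm_id] = datetime.datetime.now() + GRACE_PERIOD


def undelete_vm(vm_id: str) -> None:
    if pending_deletions.pop(vm_id, None):
        start_vm(vm_id)


def reap() -> None:
    # Run periodically; only VMs past their grace period are truly destroyed.
    now = datetime.datetime.now()
    for vm_id, purge_at in list(pending_deletions.items()):
        if purge_at <= now:
            destroy_vm(vm_id)
            del pending_deletions[vm_id]
```

The key design choice is that only the reaper ever performs the irreversible step, so a bad regex now costs a restart, not a rebuild.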
Lesson: This incident validated a previous decision to depart from GitOps and develop custom services for handling infrastructure. While previous attempts to migrate all infrastructure management to GitOps failed because of the complex tooling and slow developer adoption, in this case, having in-house services allowed Infobip to update its tools and implement the delayed deletion feature quickly.
The Next Frontier: The Agentic AI Crisis
What’s the next challenge for Infobip’s infrastructure? The ongoing revolution in AI, specifically agentic AI. Instead of fearing it, Infobip is embracing it as an opportunity. They are developing AI agents that can interact with their infrastructure management APIs. An AI agent can now provision a new VM, resize storage, or troubleshoot an incident, all based on a conversation with the user.
No more hostnames, regex, or Prometheus queries — just natural language questions and answers.
Using the same well-structured and validated APIs with safeguards like “destructive actions delay” gives them the confidence to let AI agents manage their infrastructure. The AI can suggest actions based on error logs and metrics, helping human incident responders resolve issues faster without manually sifting through data.
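One way to picture this, as a hedged sketch rather than Infobip’s actual agent stack: each API becomes a tool the agent may call, and anything flagged destructive is routed through the same delayed-execution path described above.

```python
# Hedged sketch of exposing infrastructure APIs to an AI agent as tools;
# the tool names, schema, and helpers are hypothetical, not Infobip's
# actual agent stack. The agent never touches hosts directly: it can only
# call the same validated APIs humans use, so the safeguards still apply.
TOOLS = {
    "provision_vm": {"destructive": False},
    "resize_storage": {"destructive": False},
    "delete_vm": {"destructive": True},
}


def call_api(tool_name: str, args: dict):
    """Hypothetical client for the validated infrastructure API."""
    ...


def schedule_with_delay(tool_name: str, args: dict):
    """Hypothetical: route destructive actions through the delayed path."""
    ...


def dispatch(tool_name: str, **kwargs):
    spec = TOOLS[tool_name]
    if spec["destructive"]:
        # Destructive requests go through the soft-delete path, leaving a
        # day-long window to undo whatever the agent decided.
        return schedule_with_delay(tool_name, kwargs)
    return call_api(tool_name, kwargs)
```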
The evolution of Infobip’s infrastructure is a testament to the power of learning from mistakes. Each crisis, no matter how large or small, has catalyzed innovation and made Infobip’s systems more resilient, flexible, and ready for the challenges of tomorrow.
Want to know what else the speakers discussed at Infobip Shift in Zadar 2025? Find out here!
