How to turn peak season chaos into a developer’s success story

Everyone experiences high-pressure days at work, but for our team Black Friday takes the crown.

When high-demand periods hit, our clients inundate us with millions of messages, creating both objective and subjective stress.

On the objective side, the SMS processes we manage are vital to the company’s success, directly impacting revenue during critical moments. Infobip is a global B2B communication platform, so much of the communication from brands to their clients flows through us.

As you can imagine, the number of marketing SMS messages can skyrocket during promotional campaigns, holiday seasons, and product launches – not just on Black Friday. Handling tens of thousands of requests per second is no small feat.

Let’s explore how systematizing analysis, planning, and testing has eased the process – during peak times and throughout the year.

Expect the unexpected!

Perhaps it sounds cliché, but there’s a lot of truth to that phrase. If you don’t anticipate possible scenarios that could impact your system, many things will catch you off guard.

The goal is to reduce these surprises to a manageable level.

We should all be familiar with our systems’ core functionalities and main building blocks. These are typically described through happy path scenarios, which we test and monitor. While this is a good starting point, real stress doesn’t come from happy paths – it comes from the problems and failures that arise when things go wrong.

This is especially true for large-scale live products like ours, where maintaining top-tier availability and reliability is crucial. Managing risk means identifying the critical components in our application machinery, analyzing the impact of their failure, and determining how quickly we can recover or replace them.

‘We need to think about the unexpected’ doesn’t mean we should anticipate a billion different problems. Instead, we should focus on groups of issues that could impact our critical processes – like network outages, database problems, or downstream service failures.

This approach helps us create a matrix of processes, potential issues, and their impact.
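
To make the idea concrete, here is a minimal sketch of such a matrix in Python. The process names, issue groups, scoring scale, and fallbacks are illustrative assumptions, not our actual risk assessment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    process: str     # critical process being protected
    issue: str       # a group of issues, not a single bug
    impact: int      # 1 (minor) to 5 (severe), hypothetical scale
    likelihood: int  # 1 (rare) to 5 (frequent), hypothetical scale
    fallback: str    # planned response for this scenario

    @property
    def score(self) -> int:
        # Simple impact x likelihood ranking; other weightings are possible.
        return self.impact * self.likelihood


# Hypothetical entries for illustration only.
RISKS = [
    Risk("message delivery", "downstream service failure", 5, 3, "persist and retry"),
    Risk("message delivery", "network outage", 5, 2, "reroute traffic"),
    Risk("throughput", "database problems", 4, 2, "in-memory path"),
]


def prioritized(risks):
    """Return risks ordered from highest to lowest score."""
    return sorted(risks, key=lambda r: r.score, reverse=True)


if __name__ == "__main__":
    for r in prioritized(RISKS):
        print(f"{r.score:>2}  {r.process:<18} {r.issue:<28} -> {r.fallback}")
```

Sorting by score gives a simple order in which to write and rehearse the fallback plans.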

Now that we know what we’re protecting and what we’re protecting against, we should develop plans for these scenarios and establish fallback solutions. This turns our initial uncertainty and fears into a sense of safety and confidence, even in situations beyond our control.

An important step in this process is to quantify everything. By making problems measurable, we can more easily automate both alerts and recovery processes.
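
As a rough illustration, once a problem is expressed as a number, an alert becomes a simple comparison. The metric names and limits in this sketch are assumptions chosen for the example, not our production values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str
    limit: float
    higher_is_worse: bool = True  # False for metrics such as throughput

    def breached(self, value: float) -> bool:
        if self.higher_is_worse:
            return value > self.limit
        return value < self.limit


# Hypothetical thresholds for illustration only.
THRESHOLDS = [
    Threshold("p99_latency_ms", 500.0),
    Threshold("retry_queue_size", 10_000),
    Threshold("throughput_per_s", 20_000, higher_is_worse=False),
]


def evaluate(sample: dict) -> list:
    """Return the names of metrics whose current value breaches its threshold."""
    return [
        t.metric
        for t in THRESHOLDS
        if t.metric in sample and t.breached(sample[t.metric])
    ]
```

The list returned by `evaluate` could feed an alerting channel or trigger an automated recovery step.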

Real-life problems for SMS

We identified that our crucial processes and properties include reliable message delivery, high throughput, and low latency.

We achieve high throughput with an architecture in which nodes operate independently of one another, which allows seamless scaling. We handle traffic peaks through careful analysis and forecasting, so that adequate resources are in place ahead of time.

We adopted a reactive programming approach to effectively address throughput and latency issues. This approach improves system responsiveness and scalability by efficiently handling asynchronous data streams.
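
The sketch below illustrates the underlying idea with Python's asyncio rather than a reactive streams library: messages are processed without blocking, and a bounded queue makes the producer wait when consumers fall behind. The `send` function, the concurrency level, and the buffer size are hypothetical.

```python
import asyncio


async def send(message: str) -> str:
    """Hypothetical stand-in for a non-blocking call to a downstream service."""
    await asyncio.sleep(0.01)
    return f"sent:{message}"


async def worker(queue: asyncio.Queue, results: list) -> None:
    # Runs until cancelled, handling one message at a time.
    while True:
        message = await queue.get()
        try:
            results.append(await send(message))
        finally:
            queue.task_done()


async def process(messages, concurrency: int = 4, buffer_size: int = 8) -> list:
    queue = asyncio.Queue(maxsize=buffer_size)  # bounded buffer
    results = []
    workers = [
        asyncio.create_task(worker(queue, results)) for _ in range(concurrency)
    ]
    for message in messages:
        await queue.put(message)  # suspends here while the buffer is full
    await queue.join()            # wait until every message has been handled
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results


if __name__ == "__main__":
    handled = asyncio.run(process([f"m{i}" for i in range(20)]))
    print(len(handled), "messages handled")
```

With several workers, results may complete in a different order from the input, which is acceptable when messages are independent of each other.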

Reliable message delivery is essential.

However, we identified several issues that could compromise this reliability. As mentioned earlier, we did not attempt to address every potential issue but instead focused on the most critical problems, specifically the inability to send messages to the core processing system.

To mitigate these issues, we implemented persistent storage with retry mechanisms and established comprehensive monitoring to ensure prompt problem detection and resolution.
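
A minimal sketch of the persist-and-retry idea, using SQLite from the Python standard library as a stand-in for the persistent storage. The table layout, the `forward` callback, and the attempt limit are assumptions made for the example.

```python
import sqlite3


def open_outbox(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the store for messages that could not be forwarded."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS outbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "payload TEXT NOT NULL, "
        "attempts INTEGER NOT NULL DEFAULT 0)"
    )
    conn.commit()
    return conn


def submit(conn, payload, forward) -> bool:
    """Try to forward a message; persist it if the core system is unreachable."""
    try:
        forward(payload)
        return True
    except ConnectionError:
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (payload,))
        conn.commit()
        return False


def retry_pending(conn, forward, max_attempts: int = 5) -> int:
    """Retry stored messages in insertion order; return how many were delivered."""
    delivered = 0
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE attempts < ? ORDER BY id",
        (max_attempts,),
    ).fetchall()
    for row_id, payload in rows:
        try:
            forward(payload)
        except ConnectionError:
            conn.execute(
                "UPDATE outbox SET attempts = attempts + 1 WHERE id = ?", (row_id,)
            )
        else:
            conn.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
            delivered += 1
    conn.commit()
    return delivered


def pending(conn) -> int:
    """Number of stored messages, a natural metric to monitor and alert on."""
    return conn.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]
```

In this sketch, messages that reach the attempt limit stay in the table instead of being dropped, so they remain visible to monitoring.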

Rely on tests, not assumptions

Assumptions can catch you off guard, especially when the system – and you – are pushed to the limit. If something seems to work but you’re unsure why, don’t rely on coincidences. Instead, address any uncertainties, assumptions, or potential coincidences by testing them while you still have time.

We use a suite of tests to validate the application’s functionalities through unit and integration testing, providing a solid foundation. However, since the application operates within a larger ecosystem, many challenges stem from its interactions with other systems.

End-to-end tests developed by our team can be executed as needed, which is highly beneficial. However, Black Friday brings significantly higher traffic and system stress. To prepare, we conduct intensive load testing well in advance.

This testing brings together people from various sectors and applications, resulting in valuable insights that may not be obvious in daily operations. Much like frogs in gradually heated water fail to notice the danger, our applications can experience gradual performance degradation as they are built piece by piece. These issues often only surface during load testing.

Expectations continue to rise each year, driven by an increasing number of clients and higher system demands. One such challenge arose with message throughput: last year, expectations surged by 50%, and system performance dipped slightly due to various hardware and software changes. However, through timely load testing, we identified and resolved all issues, ensuring clients experienced seamless performance.

Real-life bottlenecks for SMS

We encountered a limitation in our storage solution that impacted performance during Black Friday. After research and testing, we successfully optimized the data flow logic to enhance efficiency.

We made better use of in-memory solutions and kept the storage that had previously been primary only as a fallback. This change was logical, but altering the flow carried the risk of unexpected problems.
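
One possible shape of such a flow is sketched below. It assumes the fallback is used only when the in-memory buffer is full; the class name, the capacity, and the list standing in for slower persistent storage are all hypothetical.

```python
from collections import deque


class BufferWithFallback:
    """Keeps messages in memory and spills to a fallback store when full."""

    def __init__(self, capacity: int, fallback: list):
        self._memory = deque()
        self._capacity = capacity
        self._fallback = fallback  # stand-in for slower persistent storage

    def put(self, message: str) -> str:
        """Store a message and report which path it took."""
        if len(self._memory) < self._capacity:
            self._memory.append(message)
            return "memory"
        self._fallback.append(message)
        return "fallback"

    def take(self):
        """Return the next message, preferring the fast in-memory path."""
        if self._memory:
            return self._memory.popleft()
        if self._fallback:
            return self._fallback.pop(0)
        return None
```

Note that this simple version does not guarantee global ordering once the fallback has been used. That is exactly the kind of behavior change that tests need to cover before such a flow goes live.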

What was crucial in this process was our confidence in the testing we performed, including unit, integration, end-to-end, and load tests. This thorough testing ensured that we could make the change with assurance.

Work once, benefit many times

Don’t discard your tests and the effort behind them, just as you wouldn’t throw away your main code. Special load tests designed for Black Friday scenarios are stored as projects in our repositories and can be run as customizable Jenkins jobs. Load test servers can be created on demand, allowing us to run automated tests whenever needed on the data centers of our choice.
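
To illustrate what a reusable, customizable load test can look like, here is a small sketch with its parameters gathered in one config object, which a CI job could fill in from job parameters. The `fake_request` target and the default values are assumptions; a real test would call the system under test.

```python
import asyncio
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadTestConfig:
    total_requests: int = 200  # must be at least 1
    concurrency: int = 20      # maximum requests in flight


async def fake_request() -> None:
    """Hypothetical stand-in for a call to the system under test."""
    await asyncio.sleep(0.005)


async def run_load_test(config: LoadTestConfig, request=fake_request) -> dict:
    semaphore = asyncio.Semaphore(config.concurrency)
    latencies = []

    async def one_call() -> None:
        async with semaphore:
            start = time.perf_counter()
            await request()
            latencies.append(time.perf_counter() - start)

    started = time.perf_counter()
    await asyncio.gather(*(one_call() for _ in range(config.total_requests)))
    elapsed = time.perf_counter() - started

    latencies.sort()
    p99_index = min(len(latencies) - 1, int(len(latencies) * 0.99))
    return {
        "requests": len(latencies),
        "throughput_per_s": len(latencies) / elapsed,
        "p50_ms": latencies[len(latencies) // 2] * 1000,
        "p99_ms": latencies[p99_index] * 1000,
    }


if __name__ == "__main__":
    print(asyncio.run(run_load_test(LoadTestConfig())))
```

Because the scenario lives in code next to the application, rerunning it with different parameters costs almost nothing.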

Organize the team with interchangeable roles

The team was maintaining a strong and reliable system, but at the time it consisted mostly of relatively new employees.

By relying on one another and sharing knowledge, we built a flexible team with interchangeable roles. This reduced stress, enhanced collaboration, and allowed us to divide tasks, ensuring no one became overwhelmed.

Continuous testing brings continuous confidence

Think of it like this: ‘It’s easier to do one pushup a day than to try and do 365 all at once.’ Similarly, continuous monitoring throughout the year reduces stress by keeping our system in shape over time. Rather than trying to boost performance after a year of development, regular checks and tests ensure steady progress.

By testing every new feature as it’s implemented and conducting daily or ad-hoc load tests, we prevent performance bottlenecks from piling up. This approach allows for continuous development and avoids lengthy code freezes before major events.
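
A regular check of this kind can be as small as comparing the latest load test result with a stored baseline. The function below is a sketch under assumed numbers; the tolerance and the growth target are hypothetical parameters.

```python
def check_throughput(
    measured: float,
    baseline: float,
    expected_growth: float = 0.0,
    tolerance: float = 0.05,
) -> bool:
    """Return True if measured throughput meets the target.

    The target is the baseline scaled by the expected growth,
    minus a tolerance for normal run-to-run variation.
    """
    required = baseline * (1 + expected_growth) * (1 - tolerance)
    return measured >= required


if __name__ == "__main__":
    # Hypothetical numbers: baseline of 10,000 messages per second,
    # with expectations 50% higher than last year.
    print(check_throughput(14_500, 10_000, expected_growth=0.5))
```

Running such a check after every load test turns a gradual slowdown into an immediate, visible failure instead of a surprise before a major event.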

Consistency in testing and monitoring transforms uncertainty into confidence, keeping systems resilient and the team prepared to handle future needs.