From a feature to a problem – from a problem to an opportunity

Denis Gerić

Our feature for sending SMS to "Trusted MSISDNs" hit distribution snags in our microservices setup. Fixing this hiccup not only got us back on track but also sparked ideas for turbocharging our entire system.

Houston, we have a problem!

What the hell is going on?! Alerts started popping up everywhere, multiple services crashed, and mass confusion arose among colleagues trying to pinpoint the source of the problem.

This does not happen very often, and when it does, it usually involves a bigger problem that is out of our reach (Kafka problems, datacenter issues, etc.) and has nothing to do with our code. This time, that was not the case.

Let me present the feature

Our feature allowed our customers to import a large number of mobile numbers and whitelist them, ensuring that SMS sent to those numbers is never blocked, even if our analysis flagged them as fraudulent. We call these “Trusted MSISDNs”.
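As a rough illustration of the whitelist semantics (the class and method names below are hypothetical, not our actual code), the decision boils down to a single set lookup that overrides the fraud verdict:

```java
// Minimal sketch of the "Trusted MSISDN" rule: trusted numbers are never
// blocked, even when the fraud analysis flags them.
import java.util.Set;

public class SmsPolicy {
    private final Set<String> trustedMsisdns;

    public SmsPolicy(Set<String> trustedMsisdns) {
        this.trustedMsisdns = trustedMsisdns;
    }

    public boolean shouldSend(String msisdn, boolean flaggedAsFraudulent) {
        // Whitelist wins over the fraud verdict.
        if (trustedMsisdns.contains(msisdn)) {
            return true;
        }
        return !flaggedAsFraudulent;
    }
}
```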

The problem was how we distributed those numbers to other services, which needed to be aware of this information. 

Our architecture wasn’t ready

Since we are working with a microservice architecture, we use a configuration service to send configuration updates to other services over HTTP (RMI). The configuration service also received information about Trusted MSISDNs, stored it in the database, and then sent it as part of the configuration file to other services. This is where the problem started.
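To make the pain point concrete, here is a simplified sketch of the kind of configuration object involved; the class name and fields are assumptions for illustration, not our real code. Because the whole Trusted MSISDN set travels inside this one object, every configuration push grows with every import.

```java
// Simplified sketch of a configuration payload that the configuration service
// serialized and pushed to every other service.
import java.util.Map;
import java.util.Set;

public class ServiceConfiguration {
    private Map<String, String> settings;   // ordinary, small configuration values
    private Set<String> trustedMsisdns;     // millions of entries ended up here

    public Map<String, String> getSettings() { return settings; }
    public void setSettings(Map<String, String> settings) { this.settings = settings; }

    public Set<String> getTrustedMsisdns() { return trustedMsisdns; }
    public void setTrustedMsisdns(Set<String> trustedMsisdns) { this.trustedMsisdns = trustedMsisdns; }
}
```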

We did not expect that a huge amount of Trusted MSISDNs would become a burden to our configuration file: 

1. Configuration files became much larger – resulting in HTTP timeouts   

2. Serialization and de-serialization took more time and used much more RAM 

3. Configurations were stored in memory and updated often, eventually growing too big for our virtual machine 

As a result, when services tried to fetch the configuration, they either timed out constantly and kept working with a stale configuration, or they fetched it and crashed because the service did not have enough RAM to handle de-serialization.
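A quick back-of-envelope calculation shows why this bites; the figures below are illustrative assumptions, not our actual volumes:

```java
// Rough estimate of what ten million 15-digit MSISDNs cost when shipped inside
// a JSON configuration file and then held as Java strings after de-serialization.
public class ConfigSizeEstimate {
    public static void main(String[] args) {
        long count = 10_000_000L;               // assumed number of Trusted MSISDNs
        long jsonBytesPerNumber = 15 + 4;       // 15 digits + quotes, comma, spacing
        long heapBytesPerNumber = 15 * 2 + 40;  // UTF-16 chars + rough String/object overhead

        System.out.printf("Payload on the wire: ~%d MB%n", count * jsonBytesPerNumber / 1_000_000);
        System.out.printf("Heap after de-serialization: ~%d MB%n", count * heapBytesPerNumber / 1_000_000);
    }
}
```

Under these assumptions, the phone numbers alone add roughly 190 MB to every configuration transfer and several hundred megabytes of heap in every service that de-serializes them, on every refresh.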

You cannot think of everything

This functionality was first developed as a proof of concept and eventually improved to the point where we used it in production. We followed the architecture we already had, and it worked well without apparent problems.

Millions of numbers were imported, the UI was tested, and the feature had been in production for many weeks. Sometimes you can't think of everything, and in our case, our estimate of how many numbers would be imported was not big enough, even though we thought big.

We recognized that as our product evolves, our architecture must also adapt to accommodate our expanding customer base. This functionality received far more use than we initially anticipated, and we needed to change our approach.

Solving HTTP issues and optimizing future features

We moved away from RMI and started using Kafka to resolve our HTTP (RMI) issues, and the in-memory configuration was switched to Redis as a fast-responding database.
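Below is a minimal sketch of what such a flow can look like; the topic name trusted-msisdn-updates, the Redis set trusted:msisdns, and Jedis as the Redis client are assumptions for illustration, not our actual implementation, and the sketch only covers additions, not removals.

```java
// Consume Trusted MSISDN updates from Kafka and store them in a Redis set,
// instead of shipping the whole list inside a configuration file.
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import redis.clients.jedis.Jedis;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class TrustedMsisdnSync {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");
        props.put("group.id", "trusted-msisdn-sync");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             Jedis redis = new Jedis("redis", 6379)) {

            consumer.subscribe(List.of("trusted-msisdn-updates"));

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Each record carries a single MSISDN.
                    redis.sadd("trusted:msisdns", record.value());
                }
            }
        }
    }
}
```

With this in place, a service checks a single number with one Redis SISMEMBER call instead of holding the whole whitelist in its own heap.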

The problem that occurred opened our eyes and made us take the first step in changing how we handle configurations. It also changed how we think about future features: for some of them, we have already taken different approaches, and a huge amount of data no longer represents a problem.

This alert exposed a problem that would only have grown over time and would have been much harder to handle had it hit us all at once. It was time for a change.

Crisis -> Opportunity

Problems sometimes lead us to resolve more than one issue and tackle core problems that, due to our busy schedules, never get the spotlight.

Sometimes, unforeseen challenges can drive significant improvements, transforming crises into opportunities. You never know. 🙂 Maybe the problem you have today will make your life easier tomorrow.  

Embrace problems and good luck!

