From zero to full ownership of a legacy system in under 2 months
One Monday afternoon, I received a call from my manager, which went along these lines: “Hi Dan, how are you? Would you like to travel to the US on Wednesday and take over some legacy services?” I had a lot of questions and not a lot of time to discuss everything, but I still agreed. This call changed what I would be working on for a year or more.
The next day, I got more context about the work and our end goal: learn enough about the legacy system to keep it running until the replacement system was built and customers were migrated. The one caveat was that we had just 2 months to learn from the engineers currently working on it, so the schedule was tight.
Fast forward one year: we have built a completely new system and are migrating customers to it. In the meantime, the legacy system has kept running, and we haven’t experienced any major issues. Not that there weren’t problems or a few surprises along the way, but we must have done something right, as we managed to go from zero to enough knowledge to maintain the system in less than two months.
Getting the lay of the land
I knew next to nothing about the domain and the workload the system was built to automate. Luckily for me, most of it was written in Java, so I didn’t have to learn a new language on top of a new domain.
To get a good start, we organized a few in-person days with the current engineering owners, business stakeholders, and internal users. There, we got an overview of the whole system and started diving into the details of both the business processes and the implementations of the various services.
This is when we realized that we were really dealing with two subsystems. One consisted of two large monoliths that were over a decade old; the other was a newer microservices-oriented architecture that was supposed to replace the monoliths, feature by feature. This meant that both subsystems were handling users’ requests, and we had to master both of them.
We started by collecting the important information about each service involved in handling users’ requests. That consisted of high-level container diagrams showing most of the services, databases, and queues, and how they communicated with each other.
For each service, we gathered the locations of the following:
- Source code
- Configuration files
- Build jobs
- Deployment jobs
- Any other useful links or documentation
When we were done, there were 17 services and 5 databases to take over. Although it seems like you could just search for that information when needed, having everything in one place proved very useful, especially later on when we needed to onboard new people. Also, knowing where everything was located meant we could check that we had permissions for the tools and access to the necessary virtual machines and databases.
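To make this concrete, here is a minimal sketch of the kind of per-service entry we kept, written in Java purely for illustration. The ServiceEntry fields and the example URLs are hypothetical, not the actual inventory; the point is simply that each service got one entry with all of its locations in one place.

```java
import java.util.List;

// Illustrative only: a minimal shape for a per-service inventory entry.
// The field names and the example values below are hypothetical.
public class ServiceInventory {

    record ServiceEntry(
            String name,
            String sourceRepo,      // location of the source code
            String configLocation,  // configuration files
            String buildJob,        // build job
            String deployJob,       // deployment job
            List<String> docs       // any other useful links or documentation
    ) {}

    public static void main(String[] args) {
        ServiceEntry billing = new ServiceEntry(
                "billing-service",
                "https://git.example.com/legacy/billing-service",
                "https://git.example.com/legacy/billing-config",
                "https://ci.example.com/job/billing-build",
                "https://ci.example.com/job/billing-deploy",
                List.of("https://wiki.example.com/billing-runbook"));

        System.out.println(billing);
    }
}
```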
Rolling up our sleeves
While waiting for all the permissions mentioned above, it was time to do what we did best: read the code.
During the sessions with the business stakeholders, we gained an understanding of which functionalities were more important and which could be handled manually if needed. That enabled us to cover the most valuable ones first.
We focused on the happy paths, because going through every line of the code wasn’t an option. I have to say that those monolith services were quite easy to read, as everything important was inside a single class or a single method, without too many abstractions. On the other hand, we didn’t want to make any changes to the monoliths later, as we couldn’t know exactly what consequences even a simple change would have.
One thing I wish I had done from the start was bookmark those important classes and methods so I could find them faster later. I am sure I spent too much time locating the controller method that handled an HTTP request and following the method calls until I got to the actual business logic I was interested in.
Still, we started to understand the main functionality of each service and how to build and deploy it, so we could get involved in day-to-day operations and start troubleshooting real issues. It was crucial to see how the current owners handled those situations, as most issues have similar causes and need similar steps to fix.
For example, when a job failed, the procedure was to change its status to “pending”, and the service would then retry it. Another common issue was errors when storing certain Unicode characters in a database that wasn’t configured to use a Unicode character set. Situations like those could fill a whole other article.
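To illustrate the kind of fix this was, here is a minimal JDBC sketch of resetting a failed job so the service picks it up again. It assumes a relational jobs table with id and status columns; the connection URL, credentials, table, and column names are all hypothetical, not the actual schema.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// A sketch of the "reset a failed job so the service retries it" fix.
// The JDBC URL, credentials, table, and column names are hypothetical.
public class RetryFailedJob {

    public static void main(String[] args) throws SQLException {
        String jdbcUrl = "jdbc:postgresql://localhost:5432/legacy"; // assumption
        long jobId = Long.parseLong(args[0]);

        try (Connection conn = DriverManager.getConnection(jdbcUrl, "ops_user", "secret");
             PreparedStatement stmt = conn.prepareStatement(
                     "UPDATE jobs SET status = 'pending' WHERE id = ? AND status = 'failed'")) {
            stmt.setLong(1, jobId);
            int updated = stmt.executeUpdate();
            System.out.println(updated == 1
                    ? "Job " + jobId + " reset to pending; the service will retry it."
                    : "Job " + jobId + " was not in a failed state; nothing changed.");
        }
    }
}
```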
The takeover
Once the time came for us to take full ownership of the system, we felt confident we could handle it, because we were already troubleshooting issues and had even deployed a few bug fixes. With that sorted, our focus shifted to designing and developing the replacement system so we could migrate the customers and shut down the legacy one.
Although this wasn’t a project with all the newest bells and whistles, I learned quite a lot from working on it. The most important lesson was that most systems will become legacy at some point, and we must keep that in mind while designing and developing them. That way, the next group of engineers that comes along to maintain them won’t need to ask a million questions.