Root cause analysis
Problem: Slowness and errors while working in TOPdesk between December 14th and December 18th.
The exact root cause remains unknown. Investigation performed in collaboration with Microsoft Azure and Linux experts suggests a load-related problem with Red Hat Linux-based proxy servers hosted on Microsoft Azure. This problem causes unexpected CPU blocking, effectively crashing the server at random moments.
Replacing the operating system with an alternative Linux distribution has stopped the problem from occurring.
Mitigating actions and future steps
We replaced all proxy servers in the EU1 and BR1 hosting locations with proxy servers running on a different operating system family. The proxy server crashes have not occurred since. A team will investigate whether the current alternative operating system is the best long-term choice for our proxy servers. We’re continuing our investigation into the root cause of this problem with the help of external experts.
While investigating, and before arriving at the current solution, we were able to mitigate the impact by redirecting traffic* through servers hosted in other hosting locations. We’re documenting where this alternative could also be beneficial, and preparing our network where feasible, so we can quickly implement a similar solution should it be needed in the future.
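To illustrate the redirect concept, here is a minimal TypeScript sketch of a forwarder that tries the local proxy pool first and falls back to a pool in another hosting location. It is an illustration only, not our actual setup: the hostnames are hypothetical placeholders, and in production this kind of redirect happens at the network and load-balancer level rather than in application code.

```typescript
import http from "node:http";

// Hypothetical placeholder hostnames for the two proxy pools.
const LOCAL_POOL = "proxy.eu1.example.com";    // pool in the affected location
const FALLBACK_POOL = "proxy.nl3.example.com"; // pool in another location

// Forward a request to the given upstream host; call onError on failure.
function forward(
  req: http.IncomingMessage,
  res: http.ServerResponse,
  host: string,
  onError: () => void,
): void {
  const upstream = http.request(
    { host, port: 80, path: req.url, method: req.method, headers: req.headers },
    (upstreamRes) => {
      res.writeHead(upstreamRes.statusCode ?? 502, upstreamRes.headers);
      upstreamRes.pipe(res);
    },
  );
  upstream.on("error", onError);
  req.pipe(upstream);
}

http
  .createServer((req, res) => {
    // Try the local pool first; if it is unreachable, retry via the
    // fallback location. (In this simplified sketch the retry is only
    // safe for requests without a body.) Customer data stays in the
    // original location; only the route the request takes differs.
    forward(req, res, LOCAL_POOL, () => {
      forward(req, res, FALLBACK_POOL, () => {
        res.writeHead(502).end("Both proxy pools unreachable");
      });
    });
  })
  .listen(8080);
```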
Several possible improvements in our internal and external communication have been identified. A team is working to improve our communication procedures, adding to our continuous effort to make sure our communication during major incidents is concise, timely, and in line with customer expectations.
We’re investigating whether we can improve the detection of availability problems by having the browser itself check the availability of the TOPdesk server. This would allow us to detect problems at an earlier stage, improving the reliability of our product and helping us gather more relevant error messages.
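As a sketch of this idea, the browser could periodically probe the server and report failures to a separate collector, giving us client-side signals even when the server itself cannot log anything. The /health path and the telemetry URL below are hypothetical placeholders, not an existing TOPdesk API:

```typescript
interface AvailabilityReport {
  kind: "http-error" | "unreachable";
  status?: number;
  detail?: string;
  at: number;
}

function report(r: AvailabilityReport): void {
  // sendBeacon keeps working during page unloads and partial failures;
  // the collector URL is a placeholder.
  navigator.sendBeacon("https://telemetry.example.com/availability", JSON.stringify(r));
}

async function checkAvailability(baseUrl: string): Promise<void> {
  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), 5000); // 5-second budget
  try {
    const res = await fetch(`${baseUrl}/health`, { signal: controller.signal });
    if (!res.ok) {
      report({ kind: "http-error", status: res.status, at: Date.now() });
    }
  } catch (err) {
    // Network failure or timeout: exactly the class of problem that is
    // hard to observe from the server side when a proxy malfunctions.
    report({ kind: "unreachable", detail: String(err), at: Date.now() });
  } finally {
    clearTimeout(timeout);
  }
}

// Poll every 30 seconds while a TOPdesk page is open.
setInterval(() => void checkAvailability(window.location.origin), 30_000);
```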
Timeline
From October 22nd our proxy servers in Brazil (BR1) started malfunctioning occasionally. The problems could usually be resolved quickly, and only a few customers were affected. We started investigating the issue, but since the problems were infrequent, couldn’t be reproduced, and didn’t occur in other hosting locations with similarly configured and hosted proxy servers, the problem remained unresolved for a long time.
We opened a support ticket with the hosting provider (Microsoft Azure) and worked with their Linux experts to find the root cause of the problem. We tested several changes to our software and operating systems, but the root cause had not yet been found.
On December 14th at 12:00 CET a proxy server in the pool of active proxy servers in EU1 started malfunctioning. Following the recovery plans drawn up for the malfunctioning Brazilian proxy servers, engineers tried restarting the server. When this didn’t help, we removed the proxy server from the active pool, mitigating all customer impact from 12:50 onwards.
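The removal from the active pool was a manual action at the time. As an illustration (not our actual tooling), the same mechanism in automated form would health-check each proxy and take failing servers out of rotation; the hostnames and /health path below are hypothetical:

```typescript
// Map from proxy hostname to its current health status.
const pool = new Map<string, boolean>([
  ["proxy-1.eu1.example.com", true], // hypothetical hostnames
  ["proxy-2.eu1.example.com", true],
]);

// Probe every server; unhealthy or unreachable ones leave the rotation,
// and they return automatically once the probe succeeds again.
async function healthCheck(): Promise<void> {
  for (const host of pool.keys()) {
    const controller = new AbortController();
    const timeout = setTimeout(() => controller.abort(), 3000);
    try {
      const res = await fetch(`https://${host}/health`, { signal: controller.signal });
      pool.set(host, res.ok);
    } catch {
      pool.set(host, false);
    } finally {
      clearTimeout(timeout);
    }
  }
}

// A load balancer would only route requests to the servers returned here.
function activeServers(): string[] {
  return [...pool].filter(([, healthy]) => healthy).map(([host]) => host);
}

setInterval(() => void healthCheck(), 10_000);
```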
An update of the operating system for the proxy servers was identified as a probable root cause, and we started a project to replace the proxy servers in EU1 and BR1 with servers running on a different version of the same operating system.
On December 15th at 15:00 CET another proxy server in EU1 malfunctioned. The server was removed from the pool of active proxy servers, and we continued with our project to replace the proxy servers.
On December 16th early in the morning we noticed that several proxy servers in EU1 were malfunctioning. Since the impact was much bigger this time, the problem was escalated and several actions were started simultaneously:
- The status of our internal investigation was escalated to assign more resources to finding the root cause of this problem;
- A project was started to investigate whether re-routing traffic to the EU1 hosting location via proxy servers in our NL3 hosting location was a feasible workaround to mitigate the impact;
- A team started meticulously comparing the proxy servers in EU1 and BR1 to those in other Azure hosting locations where the problem didn’t occur. Since all (proxy) servers are deployed from a template and updated automatically, very few differences exist. We did find minor differences in installed updates and in the number of restarts since the last update, and started testing whether these could explain the malfunctions; a sketch of this kind of comparison follows below.
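The comparison itself boils down to diffing package inventories between a malfunctioning and a healthy server. A minimal sketch, assuming each server’s installed packages were exported to a text file (for example with `rpm -qa --queryformat '%{NAME} %{VERSION}\n'` on the Red Hat-based servers); the file names are hypothetical:

```typescript
import { readFileSync } from "node:fs";

// Parse an inventory file with one "name version" pair per line.
function packageMap(file: string): Map<string, string> {
  const map = new Map<string, string>();
  for (const line of readFileSync(file, "utf8").split("\n")) {
    const [name, version] = line.trim().split(/\s+/);
    if (name && version) map.set(name, version);
  }
  return map;
}

// Report every package whose presence or version differs.
function diff(a: Map<string, string>, b: Map<string, string>): string[] {
  const out: string[] = [];
  for (const name of new Set([...a.keys(), ...b.keys()])) {
    const [va, vb] = [a.get(name), b.get(name)];
    if (va !== vb) out.push(`${name}: ${va ?? "absent"} vs ${vb ?? "absent"}`);
  }
  return out;
}

// Compare a malfunctioning EU1 proxy against a healthy one elsewhere.
const differences = diff(packageMap("eu1-proxy.txt"), packageMap("nl3-proxy.txt"));
console.log(differences.length ? differences.join("\n") : "No differences found");
```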
On December 17th the investigation continued.
A project was started to recreate our proxy servers using a different operating system. We had some strong indications that the issue was related to the operating system, but so far we had been unsuccessful in finding and resolving the root cause.
On December 18th we were able to start redirecting traffic* for TOPdesk environments in the EU1 hosting location via proxy servers in the NL3 hosting location. The change was implemented gradually to prevent impact on customers in the NL3 hosting location.
Proxy servers using a different operating system were deployed in EU1, but it turned out these new proxy servers also ran into the same problem after some time. However, with most of the traffic redirected to proxy servers in the NL3 hosting location, the impact for customers was now minimal.
On December 24th, after testing several alternative proxy server solutions, we deployed the current proxy servers (based on a different Linux family) in production. The intermittent sudden load issues have not recurred on these new proxy servers. We’re back to our regular pool of proxy servers, and no more traffic is re-routed via the NL3 hosting location.
On December 29th we decided to close the major incident in My TOPdesk and to remove the disruption from our Status page, as the new proxy servers had now been running for several days and no customers had reported any problems for over a week.
Our investigation into the root cause of this problem continues. We still have two hosting locations where the proxy servers haven’t been replaced, and the issue has not occurred there. We continue to work with external experts and Azure hosting engineers to find the root cause.
* Note that while the redirect was in place, customer data was (and is) still stored in the same hosting location. Requests from customers to their TOPdesk SaaS environments took a different route, but ended up at the same destination.