Root cause analysis
Problem: Slowness and errors while working in TOPdesk between December 14th and December 18th.
The exact root cause remains unknown. Investigation performed in collaboration with Microsoft Azure and Linux experts suggests a load-related problem with Red Hat Linux-based proxy servers hosted on Microsoft Azure. This problem causes unexpected CPU blocking, effectively crashing the server at random moments.
Replacing the operating system with an alternative Linux distribution has stopped the problem from occurring.
Mitigating actions and future steps
We replaced all proxy servers in the EU1 and BR1 hosting locations with proxy servers running on a different operating system family. The proxy server crashes have not occurred since. A team will investigate whether the current alternative operating system is the best long-term choice for our proxy servers. We’re continuing our investigation into the root cause of this problem with the help of external experts.
While investigating, and before arriving at the current solution, we were able to mitigate the impact by redirecting traffic* through servers hosted in other hosting locations. We’re documenting where this alternative could also be beneficial, and preparing our network where feasible, so we can quickly implement a similar solution should it be needed in the future.
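To illustrate the redirect concept, here is a minimal TypeScript sketch of a forwarder that tries the local proxy pool first and falls back to a pool in another hosting location. It is an illustration only, not our actual setup: the hostnames are hypothetical placeholders, and in production this kind of redirect happens at the network and load-balancer level rather than in application code.

```typescript
import http from "node:http";

// Hypothetical placeholder hostnames for the two proxy pools.
const LOCAL_POOL = "proxy.eu1.example.com";    // pool in the affected location
const FALLBACK_POOL = "proxy.nl3.example.com"; // pool in another location

// Forward a request to the given upstream host; call onError on failure.
function forward(
  req: http.IncomingMessage,
  res: http.ServerResponse,
  host: string,
  onError: () => void,
): void {
  const upstream = http.request(
    { host, port: 80, path: req.url, method: req.method, headers: req.headers },
    (upstreamRes) => {
      res.writeHead(upstreamRes.statusCode ?? 502, upstreamRes.headers);
      upstreamRes.pipe(res);
    },
  );
  upstream.on("error", onError);
  req.pipe(upstream);
}

http
  .createServer((req, res) => {
    // Try the local pool first; if it is unreachable, retry via the
    // fallback location. (In this simplified sketch the retry is only
    // safe for requests without a body.) Customer data stays in the
    // original location; only the route the request takes differs.
    forward(req, res, LOCAL_POOL, () => {
      forward(req, res, FALLBACK_POOL, () => {
        res.writeHead(502).end("Both proxy pools unreachable");
      });
    });
  })
  .listen(8080);
```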
Several possible improvements in our internal and external communication have been identified. A team is working to improve our communication procedures, adding to our continuous effort to make sure our communication during major incidents is concise, timely, and in line with customer expectations.
We’re investigating whether we can improve the detection of availability problems by having the browser itself check the availability of the TOPdesk server. This would allow us to detect problems at an earlier stage, improving the reliability of our product and helping us gather more relevant error messages.
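As a sketch of this idea, the browser could periodically probe the server and report failures to a separate collector, giving us client-side signals even when the server itself cannot log anything. The /health path and the telemetry URL below are hypothetical placeholders, not an existing TOPdesk API:

```typescript
interface AvailabilityReport {
  kind: "http-error" | "unreachable";
  status?: number;
  detail?: string;
  at: number;
}

function report(r: AvailabilityReport): void {
  // sendBeacon keeps working during page unloads and partial failures;
  // the collector URL is a placeholder.
  navigator.sendBeacon("https://telemetry.example.com/availability", JSON.stringify(r));
}

async function checkAvailability(baseUrl: string): Promise<void> {
  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), 5000); // 5-second budget
  try {
    const res = await fetch(`${baseUrl}/health`, { signal: controller.signal });
    if (!res.ok) {
      report({ kind: "http-error", status: res.status, at: Date.now() });
    }
  } catch (err) {
    // Network failure or timeout: exactly the class of problem that is
    // hard to observe from the server side when a proxy malfunctions.
    report({ kind: "unreachable", detail: String(err), at: Date.now() });
  } finally {
    clearTimeout(timeout);
  }
}

// Poll every 30 seconds while a TOPdesk page is open.
setInterval(() => void checkAvailability(window.location.origin), 30_000);
```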
Timeline
From October 22nd our proxy servers in Brazil (BR1) started malfunctioning occasionally. The problems could usually be resolved quickly, and only a few customers were affected. We started investigating the issue, but since the problems were infrequent, couldn’t be reproduced, and didn’t occur in other hosting locations with similarly configured and hosted proxy servers, the problem remained unresolved for a long time.
We opened a support ticket with the hosting provider (Microsoft Azure) and worked with their Linux experts to find the root cause of the problem. We tested several changes to our software and operating systems, but the root cause had not yet been found.
On December 14th at 12:00 CET a proxy server in the pool of active proxy servers in EU1 started malfunctioning. Following the recovery plans drawn up for the malfunctioning Brazilian proxy servers, engineers tried restarting the server. When this didn’t help, we removed the proxy server from the active pool, mitigating all customer impact from 12:50 onwards.
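The removal from the active pool was a manual action at the time. As an illustration (not our actual tooling), the same mechanism in automated form would health-check each proxy and take failing servers out of rotation; the hostnames and /health path below are hypothetical:

```typescript
// Map from proxy hostname to its current health status.
const pool = new Map<string, boolean>([
  ["proxy-1.eu1.example.com", true], // hypothetical hostnames
  ["proxy-2.eu1.example.com", true],
]);

// Probe every server; unhealthy or unreachable ones leave the rotation,
// and they return automatically once the probe succeeds again.
async function healthCheck(): Promise<void> {
  for (const host of pool.keys()) {
    const controller = new AbortController();
    const timeout = setTimeout(() => controller.abort(), 3000);
    try {
      const res = await fetch(`https://${host}/health`, { signal: controller.signal });
      pool.set(host, res.ok);
    } catch {
      pool.set(host, false);
    } finally {
      clearTimeout(timeout);
    }
  }
}

// A load balancer would only route requests to the servers returned here.
function activeServers(): string[] {
  return [...pool].filter(([, healthy]) => healthy).map(([host]) => host);
}

setInterval(() => void healthCheck(), 10_000);
```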
An update of the operating system for the proxy servers was identified as a probable root cause, and we started a project to replace the proxy servers in EU1 and BR1 with servers running on a different version of the same operating system.
On December 15th at 15:00 CET another proxy server in EU1 malfunctioned. The server was removed from the pool of active proxy servers, and we continued with our project to replace the proxy servers.
On December 16th early in the morning we noticed that several proxy servers in EU1 were malfunctioning. Since the impact was much bigger this time, the problem was escalated and several actions were started simultaneously:
- The status of our internal investigation was escalated to assign more resources to finding the root cause of this problem;
- A project was started to investigate whether re-routing traffic to the EU1 hosting location via proxy servers in our NL3 hosting location was a feasible workaround to mitigate the impact;
- A team started meticulously comparing the proxy servers in EU1 and BR1 to those in other Azure hosting locations where the problem didn’t occur. Since all (proxy) servers are deployed from a template and updated automatically, very few differences exist. We did find minor differences in installed updates and in the number of restarts since the last update, and started testing whether these could explain the malfunctions; a sketch of this kind of comparison follows below.
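The comparison itself boils down to diffing package inventories between a malfunctioning and a healthy server. A minimal sketch, assuming each server’s installed packages were exported to a text file (for example with `rpm -qa --queryformat '%{NAME} %{VERSION}\n'` on the Red Hat-based servers); the file names are hypothetical:

```typescript
import { readFileSync } from "node:fs";

// Parse an inventory file with one "name version" pair per line.
function packageMap(file: string): Map<string, string> {
  const map = new Map<string, string>();
  for (const line of readFileSync(file, "utf8").split("\n")) {
    const [name, version] = line.trim().split(/\s+/);
    if (name && version) map.set(name, version);
  }
  return map;
}

// Report every package whose presence or version differs.
function diff(a: Map<string, string>, b: Map<string, string>): string[] {
  const out: string[] = [];
  for (const name of new Set([...a.keys(), ...b.keys()])) {
    const [va, vb] = [a.get(name), b.get(name)];
    if (va !== vb) out.push(`${name}: ${va ?? "absent"} vs ${vb ?? "absent"}`);
  }
  return out;
}

// Compare a malfunctioning EU1 proxy against a healthy one elsewhere.
const differences = diff(packageMap("eu1-proxy.txt"), packageMap("nl3-proxy.txt"));
console.log(differences.length ? differences.join("\n") : "No differences found");
```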
On December 17th the investigation continued.
A project was started to recreate our proxy servers using a different operating system. We had some strong indications that the issue was related to the operating system, but so far we had been unsuccessful in finding and resolving the root cause.
On December 18th we were able to start redirecting traffic* for TOPdesk environments in the EU1 hosting location via proxy servers in the NL3 hosting location. The change was implemented gradually to prevent impact on customers in the NL3 hosting location.
Proxy servers using a different operating system were deployed in EU1, but it turned out these new proxy servers also ran into the same problem after some time. However, with most of the traffic redirected to proxy servers in the NL3 hosting location, the impact for customers was now minimal.
On December 24th, after testing several alternative proxy server solutions, we deployed the current proxy servers (based on a different Linux family) in production. The intermittent sudden load issues have not recurred on these new proxy servers. We’re back to our regular pool of proxy servers, and no more traffic is re-routed via the NL3 hosting location.
On December 29th we decided to close the major incident in My TOPdesk and to remove the disruption from our Status page, as the new proxy servers had now been running for several days and no customers had reported any problems for over a week.
Our investigation into the root cause of this problem continues. We still have two hosting locations where the proxy servers haven’t been replaced, and the issue has not occurred there. We continue to work with external experts and Azure hosting engineers to find the root cause.
* Note that while the redirect was in place, customer data was (and is) still stored in the same hosting location. Requests from customers to their TOPdesk SaaS environments took a different route, but ended up at the same destination.