Root cause analysis (RCA) for major incident TDR20 09 6276.
Starting on September 21st, we experienced problems on our SaaS network: slowness while working in TOPdesk, short periods of unavailability, and the Reporting Service (Duration reports and OData feed) being disabled for nearly two days. This RCA aims to inform everyone affected by these problems about what went wrong, what we did to mitigate them, the lessons learned, and the actions we have set out to prevent future issues.
Timeline and technical details (all times are in CEST):
2020-09-21
09:00 We received the first report of a customer at our NL3 hosting location experiencing slowness while working in their TOPdesk environment.
09:21 An increase in the number of logged incidents classified as performance-related triggered an automated major incident process to track and investigate these issues. Our Technical Support and SaaS Operations teams began an investigation.
09:24 The SaaS Operations team investigated the issue and traced it to an unexpectedly high load on Kubernetes nodes running several customer-facing services. The affected services included the authentication service (passlayer), which handles all customer traffic.
During moments of high customer traffic, services were frequently competing for resources, delaying customer traffic and slowing responses to customer input. Engineers investigated why these services were using more resources than normal.
09:43 Services were moved to different Kubernetes nodes, and the problems for customers appeared to be resolved. The internal major incident was closed.
11:28 A new automated major incident was created. The Technical Support team and the SaaS Operations team continued their investigation. The teams determined that these issues were different from those earlier that day and started investigating other possible common causes.
16:20 The performance issues were published on the Status page.
2020-09-22
09:30 The SaaS Operations team continued to investigate the high CPU load on the Kubernetes cluster. The load appeared to be caused by the reporting service and the audit trail service, which were using significantly more CPU than expected. Meetings were started between Operations and Development to determine the cause of the high CPU load.
14:40 The responsible Development team and the SaaS Operations team noticed that the passlayer service (a key service for the proper functioning of the environment) was unavailable. The Operations team restarted the pods to resolve the issue.
14:50 The Operations team disabled the reporting service and the audit trail service, reducing the load on the systems and decreasing the performance impact for customers. Though the performance issues were alleviated, engineers continued to investigate the underlying cause.
15:05 The Operations team re-enabled the audit trail service.
16:00 Monitoring indicated that the load was at an acceptable level with the reporting service disabled. The team planned to continue monitoring the load and performance of the Kubernetes nodes on Wednesday morning.
2020-09-23
08:30 The investigation by Operations continued.
12:00 Operations discovered that the underlying infrastructure managed by our hosting partner was reporting resource constraints on the physical hardware, leaving the virtual machines with insufficient resources to handle the required traffic. Engineers made changes to the machine layout and reached out to our hosting partner to discuss the impact of potential configuration changes.
13:00 Operations determined this to be the root cause of the issue and reached out to the hosting partner's Engineering team to discuss implementing resource reservations.
21:00 The Operations team re-configured the virtual machines for a better fit between virtual hosts and physical CPU sockets.
2020-09-24
09:00 The Operations team implemented resource reservations for the machines providing compute resources to services, following advice from engineers at our hosting partner. This reservation mitigated the observed resource starvation that was causing the latency.
11:00 Engineers began implementing resource limits on certain services to prevent them from overloading machines at peak times.
13:00 The SaaS Operations team continued to observe the performance of services during the day; the changes appeared to have a positive effect.
13:30 CPU limits for the services were set and the reporting service was brought back online, making it available to our customers again.
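For context on what these reservations and limits mean in practice: on Kubernetes, a CPU request reserves a guaranteed share for a container and a CPU limit caps what it may consume, so a single busy service can no longer starve its neighbours. The snippet below is a minimal, purely illustrative sketch using the official Kubernetes Python client; the deployment name, namespace, and values are hypothetical and do not reflect our actual configuration.

```python
from kubernetes import client, config

# Illustrative only: deployment name, namespace, and resource values are hypothetical.
config.load_kube_config()
apps = client.AppsV1Api()

patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [{
                    "name": "reporting-service",  # hypothetical container name
                    "resources": {
                        # request = guaranteed share the scheduler reserves for this pod
                        "requests": {"cpu": "500m", "memory": "512Mi"},
                        # limit = hard ceiling, so one service cannot starve its neighbours
                        "limits": {"cpu": "2", "memory": "2Gi"},
                    },
                }]
            }
        }
    }
}

apps.patch_namespaced_deployment(
    name="reporting-service", namespace="saas", body=patch
)
```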
2020-09-25
09:00 The Operations team monitored and reviewed the metrics; they were all within the expected limits.
2020-09-28
18:00 After discussion with all involved teams, the major incident was closed.
Root cause
Due to a combination of high customer traffic, the reporting service being used more extensively than before, and a configuration limitation in our hosting platform, key services were starved of computing resources, leaving them unable to keep up with customer traffic.
Follow-up actions
We realize that this disruption included problems with the OData feed functionality, which is a vital part of the process for a number of TOPdesk users. Since TOPdesk is moving more towards a SaaS solution with separated services, we aim to communicate more proactively about the unavailability of specific services.
Development:
Some of the services did not meet the requirements set by our software architects, which hindered troubleshooting. Development will be made aware that all services are required to meet these standards before being used in production. Follow-up checks have also been scheduled to verify that all services have been adjusted accordingly.
Our Development department started an investigation into whether Development teams can take a more proactive role in troubleshooting problems in specific services.
We have started to investigate rate limiting or queuing for non-critical services; a sketch of one such approach follows below.
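To illustrate the kind of mechanism under consideration (no decision has been made, and the names and numbers below are hypothetical), a simple token-bucket limiter lets a non-critical endpoint serve short bursts while capping the sustained request rate:

```python
import threading
import time


class TokenBucket:
    """Minimal token-bucket limiter; purely illustrative, not production code."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate              # tokens added per second
        self.capacity = capacity      # maximum burst size
        self.tokens = float(capacity)
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def allow(self) -> bool:
        """Return True if a request may proceed, False if it should be rejected or queued."""
        with self.lock:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False


# Hypothetical usage: allow at most 5 report requests per second, with bursts of 10.
limiter = TokenBucket(rate=5, capacity=10)

def handle_report_request():
    if not limiter.allow():
        return "HTTP 429 Too Many Requests"  # or place the request in a queue instead
    return "generate the report"
```

Whether we end up rejecting excess requests or queuing them for later processing is part of the investigation; the point of the sketch is only that non-critical work can be throttled before it competes with key services for CPU.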
SaaS Support:
The communication strategy has been reviewed with representatives from several departments. A project has been started to review all communication procedures during a major incident.
SaaS Operations:
We've improved our monitoring to better detect resource starvation issues (an illustrative example of such a check follows after this list).
We're working to create resource reservations for system-critical virtual machines and all services.
The design choices for the virtual machines and physical hosts will be evaluated with our hosting provider to ensure they match the infrastructure in use.
A project to reduce the size (number of customers) of the NL3 hosting location has been scheduled. This will limit the impact of future disruptions.
Together with Development, we are investigating load tests on services to improve the stability of key services; a sketch of such a test also follows below.
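Regarding the improved monitoring mentioned above: one useful signal for resource starvation on Kubernetes is the CPU throttling ratio exposed by cAdvisor. The sketch below is a minimal illustration of querying it from Prometheus; the address and threshold are hypothetical and do not reflect our actual alerting configuration.

```python
import requests

PROMETHEUS_URL = "http://prometheus.monitoring.internal:9090"  # hypothetical address

# Fraction of CPU scheduling periods in which each pod was throttled over the
# last 5 minutes; a persistently high ratio means the pod wants more CPU than
# it is allowed to use, i.e. it is being starved.
QUERY = (
    "sum by (pod) (rate(container_cpu_cfs_throttled_periods_total[5m]))"
    " / sum by (pod) (rate(container_cpu_cfs_periods_total[5m]))"
)

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    pod = result["metric"].get("pod", "<unknown>")
    ratio = float(result["value"][1])
    if ratio > 0.25:  # hypothetical alerting threshold
        print(f"WARNING: {pod} was throttled in {ratio:.0%} of CPU periods")
```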
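Regarding the load tests being investigated with Development: as a minimal sketch of what such a test could look like (the tool, endpoint path, and traffic pattern are assumptions for illustration, not a final test plan), a Locust script could simulate users repeatedly requesting the OData feed:

```python
from locust import HttpUser, task, between


class ReportingUser(HttpUser):
    # Each simulated user waits 1-5 seconds between requests.
    wait_time = between(1, 5)

    @task
    def fetch_odata_feed(self):
        # Hypothetical endpoint path; the real feed URL differs per environment.
        self.client.get("/services/reporting/v1/odata/incidents", name="odata-feed")
```

Running it with, for example, `locust -f loadtest.py --host https://<test environment URL>` ramps up simulated users so the behaviour of the reporting service under its new CPU limits can be observed before changes reach production.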