RESOLVED: NL4 degraded performance

Incident Report for TOPdesk SaaS Status page

Postmortem

Incident Summary

On September 9th, 2024, at around 8:45 AM CEST, customers began reporting slowness in their TOPdesk SaaS environments. This prompted an immediate investigation by our SaaS team, which found an unexpected issue with CPU readiness in our NL4 cluster: certain virtual machines (VMs) showed high CPU readiness even though their host machines showed only moderate CPU usage. The anomaly quickly spread to more machines and required a comprehensive response.

Our response included initiating an emergency call, publishing an update on our status page to inform our customers, and redistributing workloads across different hosts to alleviate the issue. We logged a ticket with our hosting provider for escalation and engaged our internal support teams to identify any specific virtual machines contributing to the high CPU usage. Regular updates were maintained on our status page, and we communicated with affected customers throughout the incident. Specific VMs with unusually high CPU usage were identified, and further analysis and monitoring were conducted to ensure stability.
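For readers unfamiliar with the metric, the sketch below illustrates how CPU readiness is commonly expressed as a percentage. It assumes a VMware-style "CPU ready" summation counter reported in milliseconds per sampling interval; this report does not specify the hypervisor or the tooling used. The function names, the sample VM names, the 20-second interval, and the 5% threshold are illustrative assumptions, not values taken from the incident.

```python
# Minimal sketch, assuming a VMware-style "CPU ready" summation value
# (milliseconds a VM waited for a physical CPU during one sampling interval).
# All names, the 20 s interval, and the 5% threshold are assumptions.

def cpu_ready_percent(ready_ms: float, interval_s: float = 20.0, vcpus: int = 1) -> float:
    """Convert a CPU ready summation (ms) into a per-vCPU percentage."""
    if interval_s <= 0 or vcpus <= 0:
        raise ValueError("interval_s and vcpus must be positive")
    # Share of the interval (per vCPU) spent waiting to be scheduled.
    return ready_ms / (interval_s * 1000.0 * vcpus) * 100.0


def flag_contended(samples: dict[str, tuple[float, int]], threshold_pct: float = 5.0) -> list[str]:
    """Return the names of VMs whose CPU ready percentage exceeds the threshold.

    `samples` maps a VM name to (ready_ms, vcpus) for one sampling interval.
    """
    return sorted(
        name
        for name, (ready_ms, vcpus) in samples.items()
        if cpu_ready_percent(ready_ms, vcpus=vcpus) > threshold_pct
    )


if __name__ == "__main__":
    # Hypothetical sample data for two application servers.
    samples = {"app-01": (2000.0, 1), "app-02": (400.0, 2)}
    for name, (ready_ms, vcpus) in samples.items():
        print(name, round(cpu_ready_percent(ready_ms, vcpus=vcpus), 2))
    print(flag_contended(samples))
```

A VM can show a high value here while its host reports only moderate overall CPU usage, because the metric measures time spent waiting to be scheduled rather than time spent executing.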

Resolution

After continuous monitoring and adjustments, the CPU readiness issue showed signs of improvement. By the following day, September 10th, the environment had stabilized, and the incident was marked as resolved.

Root Cause Analysis (RCA)

The exact root cause remains undetermined. However, the results of our joint investigation suggest that a combination of high-workload virtual machines and possible misconfigurations in resource scheduling contributed to the CPU readiness issues. Our hosting provider suggested that the Distributed Resource Scheduler (DRS) may have initially caused the problem, with the situation stabilizing once the workload was redistributed appropriately.

Conclusion

To reduce the risk of recurrence, we will continue to work closely with our hosting provider, both to resolve any future issues promptly and to better understand the underlying infrastructure. We apologize for any inconvenience caused by this incident. Our teams are dedicated to ensuring the stability and reliability of our services and will continue to take proactive measures to that end. We appreciate your understanding and continued support. If you have any further questions or concerns, please don't hesitate to reach out to your local support department.

Posted Oct 23, 2024 - 10:08 CEST

Resolved

We are pleased to announce that, following a few hours of close observation, the system has shown consistent stability. We apologize for any inconvenience this may have caused and thank you for your patience and understanding.

The major incident will now be closed, and an internal evaluation will be conducted. Once the evaluation is complete, a detailed root cause analysis will be published.

Posted Sep 10, 2024 - 13:25 CEST

Update

We are pleased to inform you that our engineers have not detected any further issues this morning. The system is currently stable, and we will continue to monitor it closely.
We are still awaiting a response from Leaseweb regarding the investigation into the host machine.

If you experience any signs of degraded performance with TOPdesk, please do not hesitate to contact our Support department.

We will provide further updates as soon as more information becomes available.
Thank you for your patience and understanding.

Posted Sep 10, 2024 - 10:42 CEST

Update

While we have made some progress, the root cause has not yet been identified. Our investigation is ongoing as we carefully examine various potential factors that could be contributing to this issue. We are working closely with Leaseweb to ensure a comprehensive analysis.

For now, we will continue to monitor the situation and provide further updates as soon as more information becomes available.

Thank you for your patience and understanding.

Posted Sep 09, 2024 - 17:10 CEST

Update

Our team is actively working to determine the cause of the significantly higher CPU usage observed today compared to last week.

Concurrently, Leaseweb is conducting a thorough investigation of the affected Virtual Machines.

We appreciate your patience and understanding. Further updates will be provided as soon as more information becomes available.

Posted Sep 09, 2024 - 15:24 CEST

Monitoring

We have identified a potential cause for the disruption: a significant number of API calls were being made, which coincided with the period of high CPU usage.
Once these API calls ceased, the CPU usage on the affected application server dropped significantly, and the CPU usage on other application servers also returned to stable levels.

While this finding is promising, we are continuing to investigate and will be reaching out to relevant parties to gather more information to confirm if this is indeed the root cause.

We appreciate your patience and will provide further updates as soon as we have more details.

Posted Sep 09, 2024 - 13:27 CEST

Update

Our engineers are continuing to investigate the issue with degraded performance on NL4.

Around 11:55 CEST, we observed a significant drop in CPU usage on several application servers, and the system has appeared more stable since then.

We will remain in close contact with Leaseweb and continue to closely monitor the situation.

The next update will be posted as soon as we have more information.

Posted Sep 09, 2024 - 12:38 CEST

Update

Leaseweb is currently conducting an internal investigation on the affected machines.

While a few customers have reported that their TOPdesk is stable again, we cannot guarantee that this stability will persist until we identify and completely resolve the root cause of the issue.

We appreciate your patience and will continue to provide updates as we make progress.

Posted Sep 09, 2024 - 10:49 CEST

Update

Our engineers are actively investigating the cause of the unresponsiveness in TOPdesk environments on NL4. To mitigate the impact, we have re-allocated resources.

Additionally, we have reached out to Leaseweb to review the affected Virtual Machines.

We will continue to monitor closely and provide another update as soon as possible.

Posted Sep 09, 2024 - 10:09 CEST

Investigating

We are currently experiencing issues at the NL4 hosting location, which may result in your TOPdesk environment being slower than usual or temporarily unavailable.

We are aware of the problem and are working on a solution.

Our apologies for the inconvenience. At the time of writing, we cannot give you an estimate of when your environment will be available. We aim to update this status page every 30 minutes until the issue has been resolved.

E-mail updates will be sent when the issue has been resolved. You can subscribe on the status page (https://status.topdesk.com) for additional updates.

To inform TOPdesk you are affected by this issue, please visit https://my.topdesk.com/tas/public/ssp/ . Please refer to incident TDR24 09 2495.

Posted Sep 09, 2024 - 09:28 CEST

This incident affected: NL4 SaaS hosting location.