On September 9th, at around 8:45 AM CEST, customers began reporting slowness in their TOPdesk SaaS environments. Our SaaS team investigated immediately and found an unexpected CPU readiness issue in our NL4 cluster: certain virtual machines (VMs) showed high CPU readiness, meaning they spent a long time waiting to be scheduled on a physical CPU, even though their host machines showed only moderate CPU usage. The anomaly quickly spread to more machines and required a coordinated response.
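For readers unfamiliar with the metric, the sketch below shows how a raw CPU ready value is commonly turned into a percentage. It assumes the platform reports CPU ready as a millisecond summation per sampling interval, as VMware vSphere does (20 seconds for real-time charts). The function name, parameters, and sample numbers are illustrative and are not taken from our monitoring setup.

```python
def cpu_ready_percent(ready_ms: float, interval_s: float = 20.0, vcpus: int = 1) -> float:
    """Convert a CPU ready summation (ms) into an average per-vCPU percentage.

    Assumption: ready_ms is the total time the VM's vCPUs spent waiting for a
    physical CPU during one sampling interval of interval_s seconds.
    """
    if interval_s <= 0 or vcpus <= 0:
        raise ValueError("interval_s and vcpus must be positive")
    # Total schedulable time in the interval, in ms, across all vCPUs.
    available_ms = interval_s * 1000.0 * vcpus
    return ready_ms * 100.0 / available_ms


# Hypothetical sample: 2000 ms of ready time in a 20 s interval on a 1-vCPU VM.
print(cpu_ready_percent(2000))
```

A VM can report a high value here while its host's overall CPU usage looks moderate, because the metric measures time spent waiting for a physical CPU rather than how busy the host is.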
We initiated an emergency call, published an update on our status page to inform customers, and redistributed workloads across different hosts to alleviate the issue. We also logged a ticket with our hosting provider for escalation and engaged our internal support teams to identify any virtual machines contributing to the high CPU usage. Throughout the incident, we posted regular updates on our status page and communicated with affected customers. We identified specific VMs with unusually high CPU usage and carried out further analysis and monitoring to ensure stability.
After continued monitoring and adjustments, the CPU readiness issue showed signs of improvement. By the following day, the situation had stabilized, and we marked the incident as resolved.
The exact root cause remains undetermined. However, our collaborative investigation suggests that a combination of high-workload virtual machines and potential misconfigurations in resource scheduling contributed to the CPU readiness issues. Our hosting provider suggested that the Distributed Resource Scheduler (DRS) may have initially caused the problem, and that the situation stabilized once the workload was redistributed appropriately.
To prevent similar issues, we will continue to work closely with our hosting provider, both to resolve any new problems promptly and to better understand the underlying infrastructure. Our teams are dedicated to the stability and reliability of our services and will keep taking proactive measures to that end.
We apologize for any inconvenience caused by this incident, and we appreciate your understanding and continued support. For any further questions or concerns, please don't hesitate to reach out to your local support department.