Root Cause Analysis (RCA) Report
Summary:
On the morning of September 6th, 2024, at approximately 09:05 CEST, we began receiving reports from customers that their TOPdesk environments were unavailable or unreachable. An initial check by our Support department indicated that all of these reports originated from customer environments hosted in the NL4 datacenter, specifically within container NL4C01.
Timeline of Events:
At 09:10 CEST, monitoring showed that the SQL server was experiencing issues. By 09:15 CEST, our Operations team had noticed the issue. By 09:40 CEST, the Operations team had confirmed that restarting the secondary SQL node would resolve the issue. By 09:52 CEST, it was confirmed that all systems were operational again.
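For reference, the intervals implied by the timestamps in the summary and timeline can be worked out with a short Python sketch. The event labels are paraphrases chosen for illustration, not official event names, and the sketch assumes that all times are clock times on the same day in the same time zone.

```python
from datetime import datetime

# Timestamps taken from the summary and timeline (6 September 2024, CEST).
# The labels are illustrative paraphrases, not official event names.
EVENTS = [
    ("first customer reports", "09:05"),
    ("monitoring shows SQL server issues", "09:10"),
    ("Operations team has noticed the issue", "09:15"),
    ("fix identified", "09:40"),
    ("systems confirmed operational", "09:52"),
]


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes from start to end, both given as HH:MM on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def elapsed_since_first(events):
    """Pair each event label with the minutes elapsed since the first event."""
    first_time = events[0][1]
    return [(label, minutes_between(first_time, time)) for label, time in events]


if __name__ == "__main__":
    for label, minutes in elapsed_since_first(EVENTS):
        print(f"+{minutes:>2} min  {label}")
```

Measured this way, roughly 47 minutes passed between the first customer reports and the confirmation that systems were operational again.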
Resolution:
Our engineers determined that the reported issues were caused by a misbehaving host machine. To mitigate the impact on our customers, all environments within container NL4C01 were failed over to backup infrastructure at approximately 09:40 CEST.
Future Preventive Measures:
Moving forward, we will continue to respond promptly to similar incidents so that disruptions are kept as short as possible.
Additionally, we will review and improve our monitoring systems and escalation procedures so that similar issues are detected and resolved more quickly in the future.
Conclusion:
The incident was caused by issues with a host machine in the NL4 datacenter, specifically within container NL4C01. Swift action was taken to fail over to backup infrastructure and resolve the issue.