Incident Summary:
On the morning of February 23rd, 2024, at approximately 11:25 CET, we began receiving reports from customers of slowness in their TOPdesk environments hosted in the UK1 datacenter.
Upon receiving these reports, our Cloud Engineers initiated an investigation. They determined that the slowness was caused by long-running queries, which in turn resulted from outdated statistics on the authorization service database. A fix was implemented at 12:57 CET, which led to an immediate improvement in performance.
At the same time, we noticed a steady increase in the size of our message queue, with requests predominantly originating from the audit trail service. Our Cloud Engineers and the developers of the audit trail service took joint action at 13:07 CET. By restarting the pods to restore their connection to the messaging system and deploying additional pods to process the accumulated messages, we were able to clear the queue.
We are pleased to report that the main issue was fully resolved by 13:20 CET. The entire disruption, from the initial report to the final resolution, lasted approximately 1 hour and 55 minutes.
Following this incident, we have implemented measures to strengthen collaboration between our teams, which should improve our response time if a similar issue arises in the future.
We appreciate your understanding and patience as we continue to work towards providing a more reliable service.