RESOLVED: SaaS disruption UK1

Incident Report for TOPdesk SaaS Status page

Postmortem

Incident Summary:

On the morning of February 23rd, 2024, at approximately 11:25 CET, we began receiving reports from customers of slowness in their TOPdesk environments hosted in the UK1 datacenter.

Upon receiving these reports, our Cloud Engineers initiated an investigation. They determined that the reported slowness was triggered by long-running queries, a consequence of outdated statistics on the authorization service database. A solution was implemented at 12:57 CET, which led to an immediate improvement in performance.

At the same time, we noticed that our message queue was growing steadily, with messages predominantly originating from the audit trail service. Our Cloud Engineers and the developers of the audit trail service took joint action at 13:07 CET. By restarting the pods to restore the connection to the messaging system and deploying additional pods to process the accumulated messages, we were able to clear the queue.

We are pleased to report that the main issue was fully resolved by 13:20 CET. The entire disruption, from the initial report at approximately 11:25 CET to the final resolution, lasted approximately 1 hour and 55 minutes.

Following this incident, we have implemented measures to strengthen collaboration between our teams, so that we can respond faster should a similar issue arise in the future.

We appreciate your understanding and patience as we continue to work towards providing a more reliable service.

Posted Mar 22, 2024 - 09:34 CET

Resolved

Our SaaS and development teams have pinpointed an issue with slow queries originating from the authorization service. This led to messages piling up and not being processed in a timely manner.

We've implemented a solution in the SQL server and deployed extra pods to catch up with the message queue. Our monitoring shows that the performance of TOPdesk instances has returned to normal.

We will proceed to evaluate this issue internally. Upon its completion, a Root Cause Analysis (RCA) will be posted on our status page for your reference.

If you continue to experience any issues, kindly reach out to our support team for assistance. We appreciate your patience and understanding in this matter.

Posted Feb 23, 2024 - 13:16 CET

Update

Our SaaS Operations team is currently investigating issues related to our authorization service and is working on a solution.

We will send an additional update in approximately 30 minutes.

Posted Feb 23, 2024 - 12:50 CET

Investigating

We are currently experiencing problems at the UK1 hosting location. As a result, you may experience slowness in your TOPdesk environment.

We are aware of the problem and are working on a solution.

Our apologies for the inconvenience. At the time of writing, we are not able to give you an estimate of when your environment's performance will return to normal. We aim to update this status page every 30 minutes until the issue has been resolved.

E-mail updates will be sent when the issue has been resolved. You can subscribe on the status page (https://status.topdesk.com) for additional updates.

To inform TOPdesk that you are affected by this issue, please visit https://my.topdesk.com/tas/public/ssp/ and refer to incident TDR24 02 7637.

Posted Feb 23, 2024 - 12:12 CET

This incident affected: UK1 SaaS hosting location.