SaaS disruption UK1

Incident Report for TOPdesk SaaS Status page

Postmortem

While databases were being synchronized within the database cluster, the primary node became unresponsive and received an automated reboot signal from STONITH. The secondary node was still synchronizing the databases and therefore could not promote itself, which left the cluster unstable and unable to select a primary.
The first node could not recover automatically. It required manual intervention and was subsequently forced manually to become the primary node.

For the above situation to occur, the following conditions had to be met:

An unresponsive SQL server receives a reboot from the STONITH fencing agent, resulting in a failover
A minute prior to the failover, a new SQL database is created on the primary SQL server and is not reseeded to the secondary SQL server
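
The interaction of these two conditions can be illustrated with a simplified model. This is only a sketch under assumptions, not the actual cluster logic: the names Node, can_promote and elect_primary are hypothetical, and the rule that a secondary may promote itself only once every database is synchronized is inferred from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical model of one cluster node."""
    name: str
    responsive: bool = True
    # Databases for which this node holds a fully synchronized copy.
    synced: set = field(default_factory=set)


def can_promote(secondary, all_databases):
    # Assumption: promotion requires a responsive node that has
    # finished synchronizing every database in the cluster.
    return secondary.responsive and set(all_databases) <= secondary.synced


def elect_primary(primary, secondary, all_databases):
    """Return the name of the node that should be primary, or None."""
    if primary.responsive:
        return primary.name
    if can_promote(secondary, all_databases):
        return secondary.name
    # Neither node is eligible: the cluster cannot select a primary.
    return None


# The primary has been fenced and is rebooting; the newly created
# database "db_new" has not yet reached the secondary.
databases = {"db_a", "db_new"}
fenced_primary = Node("node1", responsive=False, synced={"db_a", "db_new"})
lagging_secondary = Node("node2", synced={"db_a"})
result = elect_primary(fenced_primary, lagging_secondary, databases)
```

In this model, result is None while "db_new" is missing from the secondary, which corresponds to the state that required manual intervention. Either condition on its own still leaves a node that can be selected as primary.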

Such a combination of conditions is exceedingly rare, so we do not expect similar outages to occur in the future.
In addition:
In the unlikely event of a similar outage, we have prepared a script that allows us to recover more quickly.
Lastly, we will train our operations staff on how to act during such outages.

Posted Feb 13, 2023 - 10:57 CET

Resolved

This incident has been resolved.

Posted Jan 31, 2023 - 17:03 CET

Monitoring

We have identified the cause of the disruption and have implemented a fix.
All services and environments are now available.

We continue to monitor the situation.

Posted Jan 31, 2023 - 15:27 CET

Update

Our engineers are currently investigating the cause of the outage.

Posted Jan 31, 2023 - 14:46 CET

Update

We are continuing to investigate this issue.

Posted Jan 31, 2023 - 14:19 CET

Investigating

We are currently experiencing problems on the UK1 hosting location. As a result, your TOPdesk environment may not be available.

We are aware of the problem and are working on a solution.

Our apologies for the inconvenience. At the time of writing, we are not able to give you an estimate of when your environment will be available. We aim to update this status page every 30 minutes until the issue has been resolved.

E-mail updates will be sent when the issue has been resolved. You can subscribe on the status page (https://status.topdesk.com) for additional updates.

To inform TOPdesk you are affected by this issue, please visit https://my.topdesk.com/tas/public/ssp/ . Please refer to incident TDR23 01 8932.

Posted Jan 31, 2023 - 14:17 CET

This incident affected: UK1 SaaS hosting location.