Disruption in NL3 hosting location
Incident Report for TOPdesk SaaS Status page
Postmortem

Issue summary

On Friday July 3rd at 19:02 (07:02 PM) CEST our monitoring system alerted the standby operator that several TOPdesk environments in the NL3 hosting location were unable to connect to their database. The standby operator investigated the issue and noticed that the primary database server for those environments was very slow to respond.

In line with our procedures, a failover to the secondary database server was started. This process took much longer than expected. During the database failover, environments on two other database servers also started to experience connection issues. All slow database servers were hosted on the same physical host machine. We contacted the hosting provider and the machine was placed in maintenance mode, moving all servers to different locations.
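The step of noticing that all slow database servers shared one physical host can be checked mechanically. The following is a minimal sketch, not the tooling used during this incident. It assumes a hypothetical inventory that maps each database server to its physical host. The server names, host names, and the function hosts_with_multiple_slow_servers are invented for illustration.

```python
from collections import defaultdict


def hosts_with_multiple_slow_servers(slow_servers, server_to_host, threshold=2):
    """Group slow servers by physical host and return the hosts that
    carry at least `threshold` of them, as {host: [servers]}."""
    by_host = defaultdict(list)
    for server in slow_servers:
        host = server_to_host.get(server)
        if host is not None:  # skip servers missing from the inventory
            by_host[host].append(server)
    return {
        host: sorted(servers)
        for host, servers in by_host.items()
        if len(servers) >= threshold
    }


# Hypothetical inventory: names are illustrative, not from the incident.
server_to_host = {
    "db-a": "host-1",
    "db-b": "host-1",
    "db-c": "host-1",
    "db-d": "host-2",
}
slow = ["db-a", "db-b", "db-c"]

suspects = hosts_with_multiple_slow_servers(slow, server_to_host)
print(suspects)
```

A host that appears in the result is a candidate for a shared underlying fault, such as the storage connectivity problem described under Root cause, rather than a fault in any single database server.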

Once the host had been emptied, the problematic database servers were restarted and all database connections were restored automatically.

Root cause

One machine in the NL3 hosting location experienced intermittent connectivity issues to its storage. This caused performance issues on all database servers hosted on that machine.

Action points

The primary/secondary database failover system currently in use had already been identified as a risk. TOPdesk has already started a project to migrate all database servers to SQL Enterprise with an availability cluster. This project is scheduled to be completed in Q1 2021.

Because there had been an earlier issue with the storage at the NL3 hosting location, the hosting provider was already investigating. To further troubleshoot the problem, all relevant cables and network cards on the hosts involved have been replaced. The physical host was also assigned different ports on the switch, and different switch ports on the storage.

The hosting provider identified network congestion as a possible root cause of both issues, although no congestion was detected during this disruption. Because congestion did occur during some nightly maintenance, the uplink capacity will be expanded to prevent future issues.

The hosting provider has improved their monitoring system alert rules so standby operators are immediately informed when storage is unreachable.
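As an illustration of what such an alert rule could look like, the sketch below pages the standby operator on the first failed storage probe, with no grace period. It is an assumption-laden example, not the hosting provider's actual monitoring configuration. ProbeResult, alerts_for, and the host names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    host: str                # physical host that ran the probe (hypothetical name)
    storage_reachable: bool  # outcome of the storage connectivity check


def alerts_for(probes):
    """Return one page message per host whose storage probe failed.
    A single failed probe is enough: no grace period is applied."""
    return [
        f"PAGE standby operator: storage unreachable from {p.host}"
        for p in probes
        if not p.storage_reachable
    ]


# Hypothetical probe results for two hosts.
probes = [
    ProbeResult(host="host-1", storage_reachable=False),
    ProbeResult(host="host-2", storage_reachable=True),
]
alerts = alerts_for(probes)
for message in alerts:
    print(message)
```

Alerting on the first failure favours fast notification over noise suppression, which suits a fault like intermittent storage connectivity that may not stay down long enough to trip a rule with a long grace period.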

TOPdesk will investigate options to show a clearer error page to end users in case of an issue with the database connection.

Posted Jul 13, 2020 - 15:07 CEST

Resolved
Problems on the database servers have been solved in cooperation with our hosting provider. All TOPdesk environments are available again.
Posted Jul 03, 2020 - 21:12 CEST
Update
The implemented solution worked only partially. Some database servers have not yet recovered.

We are working together with our hosting provider to find a solution.
Posted Jul 03, 2020 - 20:35 CEST
Update
We have detected a problem in a group of database servers and are implementing a solution.
Posted Jul 03, 2020 - 19:54 CEST
Investigating
We are currently experiencing problems at the NL3 hosting location. As a result, your TOPdesk environment may not be available.

We are aware of the problem and working on a solution.

Our apologies for the inconvenience. The current status can be found on our TOPdesk Status Page: https://status.topdesk.com

To inform TOPdesk you are affected by this issue, please visit https://my.topdesk.com/tas/public/ssp/ . Please refer to incident 20 07 1243.
Posted Jul 03, 2020 - 19:28 CEST
This incident affected: NL3 SaaS hosting location.