On Friday July 3rd at 19:02 (07:02 PM) CEST our monitoring system alerted the standby operator that several TOPdesk environments in the NL3 hosting location were unable to connect to their database. The standby operator investigated the issue and noticed that the primary database server for those environments was very slow to respond.
Following procedure, the operator started a failover to the secondary database server. This process took much longer than expected. During the failover, environments on two other database servers also started to experience connection issues. All of the slow database servers were hosted on the same physical host machine. We contacted the hosting provider, and the machine was placed in maintenance mode, moving all of its servers to different locations.
Once the host had been emptied, the problematic database servers were restarted and all database connections were restored automatically.
One machine in the NL3 hosting location experienced intermittent connectivity issues to its storage. This caused performance issues on all database servers hosted on that machine.
The primary/secondary database failover system currently in use had already been identified as a risk. TOPdesk has started a project to migrate all database servers to SQL Enterprise with an availability cluster. This project is scheduled to be complete in Q1 2021.
Because we had previously experienced another storage issue at the NL3 hosting location, the hosting provider was already investigating. To further troubleshoot the problem, all relevant cables and network cards on the hosts involved have been replaced. The physical host was also assigned different ports on the switch, and different switchports on the storage.
The hosting provider identified network congestion as a possible root cause of both issues, although no congestion was detected during this disruption. Because congestion was observed during some nightly maintenance, the uplink capacity will be expanded to prevent future issues.
The hosting provider has improved their monitoring system alert rules so standby operators are immediately informed when storage is unreachable.
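As an illustration only, an alert rule of this kind could be sketched as follows. This is not the hosting provider's actual monitoring configuration; the `StorageProbe` type, the `storage_alerts` function, the host names, and the 500 ms latency threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StorageProbe:
    """Result of one hypothetical storage reachability check from a host."""
    host: str
    reachable: bool
    latency_ms: Optional[float] = None  # None when no latency was measured


def storage_alerts(probes: List[StorageProbe],
                   latency_threshold_ms: float = 500.0) -> List[str]:
    """Return one alert message per host whose storage is unreachable or slow.

    An unreachable probe alerts immediately (no grace period), mirroring the
    goal that standby operators are informed as soon as storage is unreachable.
    """
    alerts = []
    for probe in probes:
        if not probe.reachable:
            alerts.append(f"CRITICAL: storage unreachable from {probe.host}")
        elif probe.latency_ms is not None and probe.latency_ms > latency_threshold_ms:
            alerts.append(
                f"WARNING: storage latency {probe.latency_ms:.0f} ms on {probe.host}"
            )
    return alerts
```

In this sketch, slow storage produces only a warning while unreachable storage is critical; a real rule set would choose its own severities and thresholds.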
TOPdesk will investigate options to show a clearer error page to end users in case of an issue with the database connection.
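One possible shape for such an error page is sketched below. It does not describe how TOPdesk handles requests; the `render_page` function, the `load_data` callback, and the message text are hypothetical, and Python's built-in `ConnectionError` stands in for whatever exception a real database driver would raise.

```python
from typing import Callable, Tuple

# Hypothetical message shown to end users when the database is unavailable.
DB_UNAVAILABLE_MESSAGE = (
    "The service is temporarily unable to reach its database. "
    "Your data is safe; please try again in a few minutes."
)


def render_page(load_data: Callable[[], str]) -> Tuple[int, str]:
    """Return (HTTP status, body) for a page backed by a database call.

    If the database connection fails, respond with 503 Service Unavailable
    and a readable explanation instead of a generic error.
    """
    try:
        body = load_data()
    except ConnectionError:
        return 503, DB_UNAVAILABLE_MESSAGE
    return 200, body
```

Returning 503 rather than 500 signals to users and to upstream monitoring that the condition is expected to be temporary.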