On Sunday, June 28th at 19:15 (07:15 PM) CEST our monitoring system alerted the standby operator that several servers in the NL3 hosting location were unavailable. The standby operator investigated and found that the servers were unreachable and that the datacenter management tools were not working.
The standby operator contacted the hosting provider to continue troubleshooting, and the hosting provider confirmed the issue. The hosting provider investigated with the highest priority and escalated the problem when it could not be resolved quickly. Additional engineers were called in, and after a long troubleshooting session the issue appeared to be fully resolved at 02:00 (AM) on Monday.
At 07:25 (AM) on Monday one of our engineers noticed that part of the infrastructure in the NL3 hosting location had not fully recovered. This was resolved within a few minutes, which means that some features in TOPdesk could have malfunctioned until 07:30 (AM).
Root cause
A port on the storage array used to host databases and files for TOPdesk SaaS environments was malfunctioning. This issue was hard to detect, as the hosting provider used the same storage array for its management systems and its monitoring system.
Action points
The hosting provider will make sure the monitoring system for the storage array uses a different storage system. A separate investigation has been started to find the root cause of the failing monitoring system.
TOPdesk has identified points of improvement in its communication with customers and with the hosting provider. We will follow up on the internal communication improvements, and will contact the hosting provider to further improve our collaboration during disruptions that affect our customers.
Update 13-07-2020
To further troubleshoot the problem, the hosting provider has replaced all relevant cables and network cards on the machine. The team also tested connecting a machine to different ports on the switch and to different switch ports on the storage.
As some network congestion was detected, the uplink capacity will be expanded to prevent future issues.
The hosting provider has added alert rules to its monitoring system so that its standby operators are informed immediately when storage is unreachable.
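
As an illustration only, an alert rule of this kind could look roughly like the sketch below. The hosting provider's actual monitoring system and rules are not described in this report, so everything here is an assumption: the names `storage_reachable` and `UnreachableAlertRule` are hypothetical, a plain TCP connection stands in for the reachability check, and the threshold of three consecutive failures is an arbitrary choice. For the check to be useful, it would have to run on infrastructure that does not depend on the storage array it monitors.

```python
import socket
from dataclasses import dataclass


def storage_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the storage endpoint succeeds.

    Hypothetical probe: a real setup might check the storage protocol
    itself rather than a bare TCP connection.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


@dataclass
class UnreachableAlertRule:
    """Signal an alert once the probe has failed `threshold` times in a row."""

    threshold: int = 3  # assumed value, chosen to ignore a single blip
    consecutive_failures: int = 0

    def observe(self, reachable: bool) -> bool:
        """Record one probe result; return True if the operator should be paged."""
        if reachable:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold
```

A scheduler would call `rule.observe(storage_reachable(host, port))` at a fixed interval and page the standby operator whenever it returns True. Requiring several consecutive failures trades a short delay for fewer false alarms.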