Timeline and actions taken:
On Friday, September 25th, at 15:54 (3:54 PM) CEST, our monitoring system informed us of several issues with the infrastructure in our NL3 hosting location. At the same time, TOPdesk Support received several calls reporting problems with the availability of TOPdesk SaaS environments. Within a few minutes we confirmed that nearly all virtual machines (VMs) in the NL3 hosting location had some kind of networking issue, and we contacted our hosting provider to troubleshoot it.
At 16:13 the hosting provider confirmed the issue and escalated our ticket to an engineer on site. About 15 minutes later the VMs were reachable again for management actions. We noticed that several servers were not working as expected and that many TOPdesk environments were no longer able to connect to their database.
At 16:40 the team working to resolve the issue split up the tasks to restart several key infrastructure servers and schedule restarts for all TOPdesk environments that lost their database connection. At 17:40 all key infrastructure servers were running again and we started restarting the affected TOPdesk environments.
At 18:30 most TOPdesk environments were working, but several issues still had to be resolved. During the evening we found further infrastructure issues. Because several customers whose environments had not yet been restarted reported mail import issues, we eventually decided to restart all TOPdesk environments in NL3 during the night.
Root cause:
As we had reported slowness on our infrastructure earlier that week, our hosting provider was executing some tests on the infrastructure. During one of these tests an error occurred, which caused networking issues preventing our servers from contacting each other.
What’s next?
TOPdesk will evaluate the disruption with the hosting provider. We will make sure the hosting provider adjusts its procedures so that maintenance that might affect the availability of TOPdesk environments is not carried out during production hours.
Uptime reports for customers lack information about this disruption because the monitoring server in the NL3 hosting location was also unavailable. We do have cross-datacenter checks that detected the networking issues, but their results do not show up in the uptime reports for individual environments. Resolving this will require either a complete rewrite of our monitoring system or a different uptime reporting tool, so it will take some time. A project to investigate replacements for the uptime calculation tool still has to be scheduled.
We have identified several improvements to the internal scripts we use to execute bulk actions. These will be implemented during regular maintenance.
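The reporting gap described above can be illustrated with a minimal sketch. It is not our reporting tool: the function name, the one-minute sample layout and the rule that a minute without any sample counts as down are all assumptions made for this example. The idea is that minutes without a local monitoring sample fall back to the cross-datacenter result instead of disappearing from the report.

```python
def uptime_percent(local_samples, cross_dc_samples, minutes):
    """Compute the uptime percentage over `minutes` one-minute slots.

    Both sample arguments are hypothetical dicts mapping a minute index to
    True (reachable) or False (unreachable). A minute missing from
    local_samples falls back to the cross-datacenter result. A minute
    missing from both is counted as down (an assumption of this sketch).
    """
    up = 0
    for minute in range(minutes):
        if minute in local_samples:
            status = local_samples[minute]
        else:
            status = cross_dc_samples.get(minute, False)
        if status:
            up += 1
    return 100.0 * up / minutes
```

With this rule, an outage that also takes down the local monitoring server still lowers the reported uptime, as long as the cross-datacenter checks recorded it.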
We will start an investigation into ways to make our infrastructure recover more reliably after disruptions. We also want to investigate tools that automate the recovery of virtual machines (restart or reset to the desired state, all in the correct order) to speed up recovery when a similar situation occurs.
TOPdesk environments lost their database connection during the outage. This caused additional downtime because the affected environments had to be restarted. We have started an investigation into the root cause of these database connection issues and into ways to prevent them from occurring again.
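As an illustration of restarting machines "in the correct order", the sketch below derives a restart order from a dependency map using Python's standard graphlib module (Python 3.9 or later). The server names and the dependency map are hypothetical, and the restart action is passed in as a function because the tooling to perform it has not been chosen yet.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each VM lists the VMs that must run first.
DEPENDENCIES = {
    "database": set(),
    "mail-gateway": set(),
    "app-environment-1": {"database", "mail-gateway"},
    "app-environment-2": {"database"},
}


def restart_order(dependencies):
    """Return VM names ordered so each VM comes after the VMs it depends on."""
    return list(TopologicalSorter(dependencies).static_order())


def recover(dependencies, restart):
    """Call restart(name) for each VM in dependency order; return that order."""
    order = restart_order(dependencies)
    for name in order:
        restart(name)
    return order
```

A circular dependency makes `static_order()` raise `graphlib.CycleError`, so a mistake in the map is reported before any machine is restarted.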
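One possible mitigation, shown here only as a sketch and not as an outcome of the investigation, is to let an application retry its database connection with an increasing delay instead of requiring a restart. The function name and parameters are hypothetical, and `connect` stands for whatever call opens the connection.

```python
import time


def connect_with_retry(connect, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call connect() until it succeeds, doubling the wait after each failure.

    Re-raises the last ConnectionError once `attempts` tries have failed.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts:
                raise
            sleep(delay)
            delay *= 2
```

Whether this helps depends on why the connections were lost. If the root cause lies elsewhere, retrying only hides the symptom.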