Disruption in NL3 hosting location

Incident Report for TOPdesk SaaS Status page

Postmortem

Timeline and actions taken:

On Friday, September 25th, at 15:54 (3:43PM) CEST, our monitoring system informed us of several issues with the infrastructure in our NL3 hosting location. At the same time, TOPdesk Support received several calls reporting problems with the availability of TOPdesk SaaS environments. Within a few minutes we confirmed that nearly all virtual machines (VMs) in the NL3 hosting location had some kind of networking issue, and we contacted our hosting provider to troubleshoot it.

At 16:13 the hosting provider confirmed the issue and escalated our ticket to an engineer on site. About 15 minutes later the VMs were again reachable for management actions. We noticed that several servers were not working as expected and that many TOPdesk environments were no longer able to connect to their database.

At 16:40 the team working to resolve the issue split up the tasks to restart several key infrastructure servers and schedule restarts for all TOPdesk environments that lost their database connection. At 17:40 all key infrastructure servers were running again and we started restarting the affected TOPdesk environments.

At 18:30 most TOPdesk environments were working, but several issues still had to be resolved. During the evening we found further infrastructure issues. Because several customers whose environment hadn’t been restarted reported mail import issues, we eventually decided to restart all TOPdesk environments in NL3 during the night.

Root cause:

As we had reported slowness on our infrastructure earlier that week, our hosting provider was executing some tests on the infrastructure. During one of these tests an error occurred, which caused networking issues preventing our servers from contacting each other.

What’s next?

TOPdesk will evaluate the disruption with the hosting provider. We’ll make sure procedures at the hosting provider are adjusted so that maintenance that might affect the availability of TOPdesk environments is not carried out during production hours.

Uptime reports for customers lack information about this disruption because the monitoring server in the NL3 hosting location was also unavailable. We do have cross-datacenter checks that detected the networking issues, but their results do not show up in uptime reports for individual environments. Resolving this will require a complete rewrite of our monitoring system or the use of a different uptime reporting tool, so it will take some time. A project to investigate replacements for the uptime calculation tool has to be scheduled.
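To illustrate the reporting gap, the sketch below shows one way an uptime figure could fall back on external check results when the local monitor has no data. It is an assumption for illustration only, not a description of TOPdesk's monitoring: the function name, the per-minute data layout, and the rule that a minute without any data counts as down are all hypothetical.

```python
def uptime_percentage(local_checks, external_checks, total_minutes):
    """Return the percentage of minutes counted as up.

    local_checks and external_checks map a minute index to True (up)
    or False (down). A local result wins; the external result fills
    gaps; a minute with no data from either source counts as down
    instead of being silently skipped. (Hypothetical rule.)
    """
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    up = 0
    for minute in range(total_minutes):
        if minute in local_checks:
            status = local_checks[minute]
        else:
            status = external_checks.get(minute, False)
        if status:
            up += 1
    return 100.0 * up / total_minutes
```

With local results for only the first two of four minutes and external checks reporting the last two as down, this returns 50.0, whereas a report that ignores minutes without local data would show 100%.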

Several improvements to the internal scripts we use to execute bulk actions have been identified. These will be implemented during regular maintenance.

We will start an investigation to identify ways to improve the stability of our infrastructure after disruptions. We also want to investigate tools to automate the recovery of virtual machines (restart or reset to the desired state, all in the correct order) to speed up recovery when a similar situation occurs.
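As a minimal sketch of what restarting "in the correct order" could mean, the example below derives a restart order from a dependency map using Python's standard graphlib module (Python 3.9+). The server names and the dependency map are hypothetical and do not describe the actual NL3 infrastructure.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each server lists the servers that
# must be running before it is restarted.
DEPENDS_ON = {
    "network-gateway": set(),
    "database": {"network-gateway"},
    "mail-import": {"database"},
    "topdesk-env": {"database", "network-gateway"},
}


def restart_order(depends_on):
    """Return servers ordered so every dependency precedes its dependents."""
    return list(TopologicalSorter(depends_on).static_order())


def recover(depends_on, restart):
    """Call restart(name) for every server in dependency order."""
    order = restart_order(depends_on)
    for name in order:
        restart(name)
    return order
```

Calling recover(DEPENDS_ON, print) would print the names in a valid restart order. A circular dependency raises graphlib.CycleError, which surfaces a broken map before any restart is attempted.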

TOPdesk environments lost their database connection during the outage. This caused additional downtime, as affected environments had to be restarted. We have started an investigation into the root cause of these database connection issues and ways to prevent them from occurring again.
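The investigation's outcome is not part of this report. Purely as an illustration of one common mitigation for lost connections, the sketch below retries a connection attempt with exponential backoff instead of requiring a full restart. The function and its parameters are hypothetical; nothing here describes how TOPdesk actually connects to its database.

```python
import time


def connect_with_retry(connect, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call connect() until it succeeds, doubling the wait after each failure.

    connect is any callable that returns a connection or raises
    ConnectionError. The last failure is re-raised once the attempts
    are used up. (Hypothetical helper for illustration.)
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts:
                raise
            sleep(delay)
            delay *= 2
```

The sleep parameter is injectable so the behavior can be checked without real waiting.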

Posted Oct 06, 2020 - 10:13 CEST

Resolved

All environments are back online. If you encounter any problems, please update your ticket.

Posted Sep 25, 2020 - 19:16 CEST

Update

Most of the environments are back online. We are still working on getting the remaining environments up and running.

Posted Sep 25, 2020 - 18:28 CEST

Update

An issue with our SaaS management tooling prevented the restarts from being executed. The restarts have been rescheduled and we now expect the last environments to come online around 18:15.

Posted Sep 25, 2020 - 17:46 CEST

Update

Restarts for all environments with database connectivity issues have been scheduled. We expect the last environment to be online at 18:00 CEST.

Posted Sep 25, 2020 - 17:28 CEST

Update

Some TOPdesk environments were disconnected from their database for too long. These environments will show the error message 'TOPdesk encountered an error during start-up' on the login page.

TOPdesk Support is working to restart all affected environments to restore the database connection.

Posted Sep 25, 2020 - 17:02 CEST

Update

Our hosting provider resolved the network issues in the NL3 hosting location. Most of the affected TOPdesk environments are currently back online.

We're still working to restore all services in the NL3 datacenter and to verify that all environments work as expected.

Posted Sep 25, 2020 - 16:40 CEST

Update

There are connectivity issues within the NL3 hosting location. Most TOPdesk environments in the NL3 hosting location are unreachable.

We're working with the hosting provider to resolve this issue with the highest priority.

Posted Sep 25, 2020 - 16:08 CEST

Investigating

We are currently experiencing problems in our NL3 hosting location. We are aware of the problem and are working on a solution.

Our apologies for the inconvenience. We aim to update this status page every 30 minutes until the issue has been resolved.

E-mail updates will be sent when the issue has been resolved. You can subscribe on the status page (https://status.topdesk.com) for additional updates.

To inform TOPdesk you are affected by this issue, please visit https://my.topdesk.com/tas/public/ssp/ . Please refer to incident TDR20 09 8356.

Posted Sep 25, 2020 - 15:59 CEST

This incident affected: NL3 SaaS hosting location.