RESOLVED: SaaS disruption NL4 environments unreachable

Incident Report for TOPdesk SaaS Status page

Postmortem

Incident Details

The disruption began on November 4, 2024, when environments hosted in the NL4 data center became unreachable. Our team responded promptly by starting an internal conference call and publishing the disruption on our status page to keep our customers informed.
Our initial investigation indicated that the hosting location itself was unreachable, so we contacted our hosting provider. They informed us that the primary edge firewall had been accidentally restarted during routine maintenance work on the secondary firewall. Because neither firewall was online, all NL4 instances became unreachable. Our hosting provider immediately began working to resolve the firewall issue. As soon as the hosting location was reachable again, services were partially restored, with key servers and environments coming back online. Our team monitored the situation closely and confirmed that the services were stable. We then updated our status page to the monitoring status and later confirmed that the issue was resolved.

On the following day, additional issues related to the NL4 disruption were reported, including slow email sending and certain functionalities within TOPdesk not working correctly. These issues were traced to the previous day's firewall incident: a VPN configuration had been lost during the accidental restart. Our hosting provider re-applied the correct configuration, which resolved the remaining issues. Although the logs of our SMTP servers showed no delays longer than two seconds, our engineers rebooted the SMTP servers as a precaution to rule them out as a contributor to the slow email sending.

By November 6, our customers had confirmed that everything was working as expected and that all issues had been resolved.

Root Cause

The disruption was caused by a combination of maintenance on the secondary firewall and an accidental restart of the primary firewall. Under normal circumstances, an accidental restart of the primary firewall would not cause issues, because the secondary firewall takes over. In this instance, however, the secondary firewall was also down for maintenance, so no fallback was available. The subsequent issues were caused by a VPN configuration that was lost during the restart and corrected the following day.

Actions Taken

In response to the incident, we took immediate action to restore services and communicate with our customers. Our team worked closely with our hosting provider to investigate and resolve the issue. We provided timely updates on our status page and confirmed the restoration of services with affected customers. Additionally, our engineers proactively rebooted the SMTP servers to rule them out as a cause of the slow email sending.

Next Steps

Having both firewalls offline at the same time is rare: under normal circumstances, at least one firewall is always available. While we cannot completely eliminate the potential for human error, we are taking measures to mitigate its impact and enhance the robustness of our systems. We have created a backlog incident to explore ways to make our software more robust and less dependent on VPN configurations. Solving this will help ensure that functionalities continue to work properly even in the event of an accidental firewall reboot.

We sincerely apologize for the inconvenience this disruption caused. We are dedicated to learning from this incident and improving our processes to ensure a more reliable service for our customers.

Posted Nov 15, 2024 - 15:27 CET

Resolved

After closely monitoring the situation, we are happy to report that everything is stable, and our customers have confirmed that TOPdesk is working as expected once again.

We will now close this major incident. An internal evaluation will be conducted soon, and we will share more details about the root cause in an upcoming post-mortem report.

If you have any questions, please feel free to reach out to our support team. We sincerely appreciate your patience and understanding during this time.

Posted Nov 04, 2024 - 15:34 CET

Monitoring

We have reached out to our service provider, Leaseweb, and they responded promptly. The issue with unreachable instances appears to be resolved now. We will continue to monitor the situation closely for a while to ensure everything remains stable. Thank you for your patience and understanding.

Posted Nov 04, 2024 - 15:18 CET

Investigating

We are currently experiencing problems on NL4. As a result, your TOPdesk environment may not be available.

We are aware of the problem and are working on a solution.

Our apologies for the inconvenience. At the time of writing, we are not able to give you an estimate of when your environment will be available. We aim to update this status page every 30 minutes until the issue has been resolved.

E-mail updates will be sent when the issue has been resolved. You can subscribe on the status page (https://status.topdesk.com) for additional updates.

To inform TOPdesk that you are affected by this issue, please visit https://my.topdesk.com/tas/public/ssp/ and refer to incident TDR24 11 676.

Posted Nov 04, 2024 - 15:00 CET

This incident affected: NL4 SaaS hosting location.