The disruption began on November 4, when environments hosted in the NL4 data center became unreachable. Our team responded promptly by initiating an internal conference call and publishing the disruption on our status page to keep customers informed.
Our initial investigation pointed towards the hosting location itself being unreachable, so we contacted our hosting provider, who informed us that the primary edge firewall had been accidentally restarted during routine maintenance on the secondary firewall. With neither firewall online, all NL4 instances became unreachable. Our hosting provider immediately began working on resolving the firewall issue, and as soon as the hosting location was reachable again, services were partially restored, with key servers and environments becoming accessible. Our team monitored the situation closely and confirmed that services were stable. We updated our status page to reflect the monitoring status and later confirmed that the issue was resolved.
On the following day, additional issues related to the NL4 disruption were reported, including slowness in sending emails and certain functionality within TOPdesk not working correctly. These issues were linked to the previous day's firewall incident, as a VPN configuration had been lost during the accidental restart. Our hosting provider re-applied the correct configuration, which resolved the remaining issues. Although we did not observe delays longer than two seconds in our SMTP server logs, our engineers decided to reboot the SMTP servers to ensure they were not contributing to the slow email delivery.
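As an illustration of the kind of check involved, the sketch below times an SMTP handshake against a mail server and compares it with a two-second threshold. The hostname, port, and threshold are illustrative assumptions, not our actual infrastructure or tooling.

```python
# Minimal sketch (illustrative only): timing the SMTP handshake against a
# mail server to see whether connection delays exceed a threshold.
# The hostname and threshold below are hypothetical assumptions.
import smtplib
import time

SMTP_HOST = "smtp.example.com"  # hypothetical server name
SMTP_PORT = 25
THRESHOLD_SECONDS = 2.0         # matches the two-second bound mentioned above

def measure_smtp_handshake(host: str, port: int) -> float:
    """Return the time taken to connect and exchange EHLO with the server."""
    start = time.monotonic()
    with smtplib.SMTP(host, port, timeout=10) as smtp:
        smtp.ehlo()
    return time.monotonic() - start

if __name__ == "__main__":
    elapsed = measure_smtp_handshake(SMTP_HOST, SMTP_PORT)
    status = "OK" if elapsed <= THRESHOLD_SECONDS else "SLOW"
    print(f"SMTP handshake took {elapsed:.2f}s [{status}]")
```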
By November 6, our customers had confirmed that everything was working as expected and all issues had been resolved.
The disruption was caused by a combination of maintenance on the secondary firewall and the accidental restart of the primary firewall. Under normal circumstances, an accidental restart of the primary firewall would not cause issues, because the secondary firewall would take over. In this instance, however, the secondary firewall was down for maintenance, so no fallback was available. The subsequent issues were caused by a VPN configuration that was lost during the restart and corrected the following day.
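To illustrate the redundancy principle described above, the following minimal sketch checks whether at least one firewall of a redundant pair is still reachable and flags the no-fallback condition. The addresses and the ping-based check are hypothetical placeholders, not a description of our hosting provider's setup.

```python
# Minimal sketch (illustrative only): verifying that at least one firewall
# in a redundant pair is reachable before and during maintenance windows.
# Addresses are documentation-range placeholders; ping flags are Linux-style.
import subprocess

FIREWALLS = {
    "primary": "192.0.2.1",    # hypothetical address
    "secondary": "192.0.2.2",  # hypothetical address
}

def is_reachable(address: str) -> bool:
    """Return True if a single ICMP ping to the address succeeds."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

if __name__ == "__main__":
    online = {name: is_reachable(addr) for name, addr in FIREWALLS.items()}
    if not any(online.values()):
        # The no-fallback condition described above: with both firewalls
        # offline, the whole location becomes unreachable.
        print("ALERT: no firewall reachable -- no fallback available")
    else:
        print("Redundancy intact:", online)
```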
In response to the incident, we took immediate action to restore services and communicate with our customers. Our team worked closely with our hosting provider to investigate and resolve the issue. We provided timely updates on our status page and confirmed the restoration of services with affected customers. Additionally, our engineers proactively rebooted the SMTP servers to rule out any contribution to the slow email delivery.
Both firewalls being offline at the same time is a rare situation; under normal circumstances at least one firewall is always available. While we cannot completely eliminate the potential for human error, we are taking several measures to mitigate its impact and make our systems more robust. To that end, we have created a backlog item to explore ways to make our software less dependent on VPN configurations, so that functionality continues to work even in the event of an accidental firewall reboot.
We sincerely apologize for the inconvenience this disruption caused. We are dedicated to learning from this incident and improving our processes to ensure a more reliable service for our customers.