Disruption in NL3 hosting location

Incident Report for TOPdesk SaaS Status page

Postmortem

On Sunday June 28th at 19:15 (07:15 PM) CEST our monitoring system alerted the standby operator that several servers in the NL3 hosting location were unavailable. The standby operator investigated the issue and noticed that the servers were unreachable, and datacenter management tools were not working.

The standby operator contacted the hosting provider to continue troubleshooting, and the hosting provider confirmed the issue. The hosting provider started investigating with the highest priority, and promptly escalated the problem when it could not be resolved quickly. Additional engineers were called in, and after a long troubleshooting session the issue appeared to be fully resolved at 02:00 (AM) on Monday.

At 07:25 (AM) on Monday one of our engineers noticed that part of the infrastructure in the NL3 hosting location had not fully recovered. The issue was resolved within a few minutes, meaning some features in TOPdesk may have malfunctioned until 07:30 (AM).

Root cause

A port on the storage array used to host databases and files for TOPdesk SaaS environments was malfunctioning. This issue was hard to detect, as the hosting provider used the same storage array for its management systems and monitoring system.

Action points

The hosting provider will make sure the monitoring system for the storage array uses a different storage system. A separate investigation has been started to find the root cause of the failing monitoring system.

TOPdesk has identified some points of improvement in the communication towards customers and with the hosting provider. We’ll follow up on the internal communication improvements, and will contact the hosting provider to further improve our collaboration during disruptions that affect our customers.

Update 13-07-2020

To further troubleshoot the problem, the hosting provider has replaced all relevant cables and network cards on the machine. The team also tested assigning a machine different ports on the switch and different switch ports on the storage.

As some network congestion was detected, the uplink capacity will be expanded to prevent future issues.

The hosting provider has added alert rules to its monitoring system so that its standby operators are immediately informed when storage is unreachable.
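
As an illustration only, an alert rule of this kind could look like the sketch below. It is not the hosting provider's actual configuration: the function names, the TCP-based probe and the threshold of three consecutive failures are all assumptions made for this example. In line with the root cause described above, such a check would need to run on a system that does not depend on the storage it is watching.

```python
import socket
from typing import Iterable


def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout.

    Hypothetical probe: the real storage array may need a different check.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def should_alert(results: Iterable[bool], threshold: int = 3) -> bool:
    """Return True once `threshold` consecutive probes have failed.

    Requiring several failures in a row (an assumed policy) avoids paging
    the standby operator for a single dropped probe.
    """
    consecutive_failures = 0
    for ok in results:
        consecutive_failures = 0 if ok else consecutive_failures + 1
        if consecutive_failures >= threshold:
            return True
    return False
```

A scheduler would call `tcp_reachable` at a fixed interval, collect the most recent results, and notify the standby operator when `should_alert` returns True.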

Posted Jul 08, 2020 - 14:20 CEST

Resolved

All environments are running and accessible. If any issues occur, please contact our support department.

Posted Jun 29, 2020 - 02:32 CEST

Update

Our hosting provider has reported finding a possible root cause. We will check all environments and make sure they are all accessible again. A root cause analysis will also follow.

Posted Jun 29, 2020 - 01:36 CEST

Update

Our hosting provider is still investigating with the highest priority. We will decrease the number of updates, since there is nothing noteworthy to report. Do note that we and our hosting provider are working non-stop to resolve this problem.

Posted Jun 29, 2020 - 01:06 CEST

Update

Our hosting provider is still investigating the problem with the highest priority.

Posted Jun 28, 2020 - 22:41 CEST

Update

Our hosting provider is still investigating and working on a solution. When more detailed information is available we will post it here.

Posted Jun 28, 2020 - 21:36 CEST

Update

Our hosting provider informed us that this outage still has the highest priority and that they are working on a solution.

Posted Jun 28, 2020 - 21:05 CEST

Update

Our hosting provider has been contacted and has acknowledged that there is a partial outage. They are working on a resolution.

Posted Jun 28, 2020 - 20:35 CEST

Investigating

We are currently experiencing problems on the NL3 hosting location. As a result your TOPdesk environment may not be available.

We are aware of the problem and working on a solution.

Our apologies for the inconvenience. The current status can be found on our TOPdesk Status Page: https://status.topdesk.com

To inform TOPdesk you are affected by this issue, please visit https://my.topdesk.com/tas/public/ssp/ . Please refer to incident 20 06 7893.

Posted Jun 28, 2020 - 19:34 CEST

This incident affected: NL3 SaaS hosting location.