SaaS disruption

Incident Report for TOPdesk SaaS Status page

Postmortem

On February 25, 2020, between 08:00 and 09:09 GMT (09:00 and 10:09 CET), several environments in the UK hosting location showed error pages when customers tried to connect.

The cause was a routing problem at a peering partner between our hosting provider and our Content Delivery Network (CDN) provider.

When monitoring systems alerted the engineer on duty, the problem was quickly pinpointed to connectivity issues between our CDN provider and the TOPdesk edge systems that accept its traffic.

To secure traffic in transit between our CDN and our hosting location, we use a tunneling system. It creates several site-to-site tunnels per environment to two hosting locations of our CDN provider. We run several instances, which allows for greater fault tolerance and minimal impact if one of the systems is disrupted. When a tunnel disconnects, its traffic is rerouted to the other tunnels and the original tunnel tries to reconnect.
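
The rerouting behaviour described above can be illustrated with a minimal Python sketch. It is not the actual tunneling software: the class names, tunnel names, and round-robin selection are assumptions made purely for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Tunnel:
    name: str               # hypothetical identifier, e.g. "cdn-a-1"
    connected: bool = True  # False while the tunnel is trying to reconnect


@dataclass
class TunnelPool:
    tunnels: list[Tunnel] = field(default_factory=list)
    _next: int = 0  # index where the next round-robin search starts

    def pick(self) -> Tunnel:
        """Return the next connected tunnel, skipping disconnected ones."""
        n = len(self.tunnels)
        for offset in range(n):
            candidate = self.tunnels[(self._next + offset) % n]
            if candidate.connected:
                self._next = (self._next + offset + 1) % n
                return candidate
        # This is the situation during the incident: every tunnel was down.
        raise ConnectionError("no tunnel available")

    def pending_reconnect(self) -> list[str]:
        """Names of tunnels that are currently down and should reconnect."""
        return [t.name for t in self.tunnels if not t.connected]


pool = TunnelPool([Tunnel("cdn-a-1"), Tunnel("cdn-a-2"),
                   Tunnel("cdn-b-1"), Tunnel("cdn-b-2")])
pool.tunnels[0].connected = False   # one tunnel drops
first_choice = pool.pick().name     # traffic moves to a remaining tunnel
```

With a single tunnel down, traffic keeps flowing through the remaining ones. Only when every tunnel is disconnected at once does `pick` fail, which corresponds to the error pages customers saw.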

On the morning of February 25 we discovered that all tunnels on all of our edge servers were disconnected, and that newly connected tunnels quickly disconnected again. Tunnels usually reconnected quickly enough for our monitoring to see the environment as online again, so it was incorrectly flagged as healthy in the status monitoring and in the uptime reporting.
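
One way a health check can avoid being fooled by tunnels that reconnect briefly is to count recent disconnects instead of looking only at the current state. The Python sketch below illustrates that idea. It is not a description of TOPdesk's monitoring, and the class name, window length, and threshold are hypothetical.

```python
from collections import deque


class FlapDetector:
    """Treats a connection as unhealthy if it disconnected too often recently."""

    def __init__(self, window_seconds: float = 300.0, max_disconnects: int = 3):
        # Both values are illustrative assumptions, not real settings.
        self.window_seconds = window_seconds
        self.max_disconnects = max_disconnects
        self._disconnects = deque()  # timestamps of recent disconnects

    def record_disconnect(self, timestamp: float) -> None:
        self._disconnects.append(timestamp)

    def is_healthy(self, now: float, currently_connected: bool) -> bool:
        # Forget disconnects that fall outside the sliding window.
        while self._disconnects and now - self._disconnects[0] > self.window_seconds:
            self._disconnects.popleft()
        if not currently_connected:
            return False
        # A connected but flapping tunnel is still reported as unhealthy.
        return len(self._disconnects) < self.max_disconnects


detector = FlapDetector(window_seconds=300.0, max_disconnects=3)
for ts in (0.0, 60.0, 120.0):
    detector.record_disconnect(ts)

flapping = detector.is_healthy(now=130.0, currently_connected=True)
recovered = detector.is_healthy(now=500.0, currently_connected=True)
```

In this example the tunnel is connected at the moment of the check but has dropped three times in the last five minutes, so it is reported as unhealthy until the window clears.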

Our engineering team reached out to the support team of our CDN provider, who reported no issues on their side but did see connectivity issues to the TOPdesk edge servers in the UK hosting location. Engineers decided to bypass the tunneling setup and allow our CDN to connect directly to our edge servers over HTTPS. This bypassed the issue and restored stable connectivity for our customers.

Later investigation showed that the connectivity problem was caused neither by our hosting provider nor by our CDN provider, but by a peering partner that was dropping TCP packets on the line between both sites. The packet loss caused the tunnels to be marked as unhealthy and to restart over and over again.

Follow-up actions:

Our engineering team is in contact with our partners to improve reliability and performance between both locations. Significant routing changes will need to be made to prevent this from happening again. In the meantime, TOPdesk will leave the bypass of the tunneling system in place to prevent the outage from recurring.

We plan to improve our monitoring system to detect connectivity issues earlier once we re-implement the tunneling system.

Posted Mar 06, 2020 - 17:37 CET

Resolved

The bypass is stable, but we'll continue monitoring the connection until our CDN provider has solved the root cause.

Posted Feb 25, 2020 - 10:28 CET

Monitoring

The bypass is operational and should mitigate the connectivity issues. We're still in touch with our CDN provider, which is working on solving the root cause.

Posted Feb 25, 2020 - 10:09 CET

Update

Our CDN provider is still investigating the issue. To mitigate it, our engineering team is currently creating a bypass of the service that is causing the connectivity issues.

Posted Feb 25, 2020 - 09:55 CET

Update

We are still working on resolving this issue.

Posted Feb 25, 2020 - 09:53 CET

Update

Our engineering team has traced the issue to connectivity problems with our CDN provider. We are reaching out to our CDN provider to resolve it.

Posted Feb 25, 2020 - 09:32 CET

Investigating

We are currently experiencing connectivity problems at our UK1 hosting location. As a result, your TOPdesk environment may not be available.

We are aware of the problem and working on a solution.

Our apologies for the inconvenience. We aim to update this status page at least every 30 minutes until the issue has been resolved.

If you are affected by this issue, please visit https://my.topdesk.com/tas/public/ssp/ to indicate you are affected. Please refer to incident 20 02 7575.

Posted Feb 25, 2020 - 09:18 CET

This incident affected: UK1 SaaS hosting location.