On February 25th, 2020, between 8:00 and 9:09 GMT (9:00 and 10:09 CET), several environments in the UK hosting location showed error pages when customers tried to connect.
The outage was caused by a routing problem at a peering partner between our hosting provider and our Content Delivery Network (CDN) provider.
When the engineer on duty was alerted by our monitoring systems, the problem was quickly traced to connectivity issues between our CDN provider and the TOPdesk edge systems that accept its traffic.
To secure traffic in transit between our CDN and our hosting location, we use a tunneling system. It creates several site-to-site tunnels per environment to two hosting locations of our CDN provider. We run several instances, which provides greater fault tolerance and minimizes the impact when one of the systems is disrupted. When a tunnel disconnects, traffic is rerouted to the other tunnels while the original tunnel tries to reconnect.
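The rerouting behaviour described above can be illustrated with a minimal sketch. This is not the actual tunneling software: the names `Tunnel`, `TunnelPool` and `pick` are hypothetical, and round-robin selection is an assumption, since the report only states that traffic moves to the remaining tunnels.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Tunnel:
    """Hypothetical stand-in for one site-to-site tunnel."""
    name: str
    connected: bool = True


class TunnelPool:
    """Hands out connected tunnels in round-robin order (assumed policy)."""

    def __init__(self, tunnels: List[Tunnel]) -> None:
        self.tunnels = list(tunnels)
        self._next = 0  # index where the next search starts

    def pick(self) -> Optional[Tunnel]:
        """Return the next connected tunnel, or None if every tunnel is down."""
        count = len(self.tunnels)
        for offset in range(count):
            index = (self._next + offset) % count
            tunnel = self.tunnels[index]
            if tunnel.connected:
                # Start the following search just after the tunnel we used.
                self._next = (index + 1) % count
                return tunnel
        return None
```

With one tunnel down, `pick()` simply skips it, which is the fault tolerance the design aims for. When every tunnel is disconnected at the same time, `pick()` returns `None`: there is no remaining path, which corresponds to the situation on February 25th.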
On the morning of February 25th we discovered that all tunnels on all of our edge servers were disconnected, and that newly connected tunnels were quickly disconnecting again. The tunnels usually reconnected quickly enough for our monitoring to detect the environment as online again, so it was incorrectly flagged as healthy in the status monitoring and in the uptime reporting.
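To illustrate why a rapidly reconnecting tunnel can look healthy, the sketch below compares a point-in-time check with a windowed check. It is an illustration under assumptions, not our monitoring implementation: the class name `FlapAwareHealth`, the window of 10 probes and the 90% success threshold are all hypothetical.

```python
from collections import deque


class FlapAwareHealth:
    """Tracks recent probe results for one environment (hypothetical)."""

    def __init__(self, window: int = 10, min_success_ratio: float = 0.9) -> None:
        # Only the most recent `window` probe results are kept.
        self.results = deque(maxlen=window)
        self.min_success_ratio = min_success_ratio

    def record(self, probe_ok: bool) -> None:
        """Store the outcome of one connectivity probe."""
        self.results.append(probe_ok)

    def naive_healthy(self) -> bool:
        """Point-in-time view: only the latest probe counts."""
        return bool(self.results) and self.results[-1]

    def healthy(self) -> bool:
        """Windowed view: a full window with a high success ratio is required."""
        if len(self.results) < self.results.maxlen:
            return False
        return sum(self.results) / len(self.results) >= self.min_success_ratio
```

If probes alternate between failure and success, `naive_healthy()` reports healthy after every successful probe, while `healthy()` stays false because only half of the recent probes succeeded. The trade-off is slower recovery reporting: the windowed check only turns healthy again after a sustained run of successful probes.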
Our engineering team reached out to the support team of our CDN provider, who reported that there were no issues on their side but that they did see connectivity issues to the TOPdesk edge servers in the UK hosting location. Our engineers decided to bypass the tunneling setup and allow our CDN to connect directly to our edge servers over HTTPS. This bypassed the issue and restored stable connectivity for our customers.
Later investigation showed that the connectivity issue was caused neither by our hosting provider nor by our CDN provider, but by a peering partner that was dropping TCP packets on the route between both sites. The packet loss caused the tunnels to be marked as unhealthy and to restart over and over again.
Follow-up actions:
Our engineering team is in contact with our partners to improve reliability and performance between both locations. Significant routing changes will need to be made to prevent this from happening again. In the meantime, TOPdesk will leave the bypass of the tunneling system in place to prevent the outage from recurring.
We plan to improve our monitoring system so that it detects connectivity issues earlier once we re-implement the tunneling system.