At 14:57 CET on November 3rd 2020 our monitoring system reported a problem in the authentication service used by TOPdesk in the EU1 hosting location. The service runs across multiple ‘pods’, but it couldn’t determine which of these pods were available. As a result, traffic was routed to the wrong place, and customers experienced connection issues while working in TOPdesk.
Between 15:10 and 15:25 all pods hosting the authentication service were restarted in order to resolve the problem. At 15:25 the problem was resolved.
We were unable to determine why some of the pods running the authentication service had become unavailable. Our investigation did identify 3 issues in the management system for the authentication service that could have caused this problem. 1 of these issues has now been resolved; for the other 2, a solution is being tested (last updated 27-11-2020). We do not expect this problem to recur once all 3 issues have been resolved.
We’re still working to resolve some of the issues identified during our investigation of the root cause of this problem. In addition, we’ve updated our documentation on how to resolve the problem, and we’re working to add metrics to our monitoring system so that we can detect the situation before it affects the customer experience.