Timeline
- 11:20 TOPdesk Support received phone calls about TOPdesk being slow or unreachable.
- 11:21 The SaaS team found an issue with the authentication service.
- 11:25 The SaaS team mitigated the issue by scaling up parts of the authentication service.
Root cause
Due to high load on NL3 and a shortage of available connections in the authentication service, new connections to various parts of the service became very unreliable. The system attempted automated restarts of parts of the service. These restarts succeeded, but the restarted parts could not immediately be used for new connections. While those parts were not yet back up, the rest of the system became overloaded, which caused the problems on the user side.
Points of improvement
We want to be alerted when critical parts crash or when health/readiness endpoints fail. Later investigation showed that the authentication service monitoring had already registered problems earlier that morning, but automated restarts fixed these problems before the monitoring surfaced them. An internal incident has been created to adjust the monitoring.
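
As an illustration of the monitoring adjustment described above, the following is a minimal sketch of an alert rule that counts failed readiness checks inside a time window and does not reset when a later check succeeds, so problems that an automated restart repairs are still reported. It is not the actual TOPdesk monitoring setup: the class name `ReadinessAlerter`, the part name `auth-part`, the 5-minute window, and the threshold of 3 failures are all assumptions chosen for the example.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ReadinessAlerter:
    """Hypothetical alert rule: fire when a part fails its readiness check
    too often within a window, even if restarts recover it in between."""

    window_seconds: float = 300.0  # assumed look-back window
    max_failures: int = 3          # assumed alert threshold
    _failures: dict = field(default_factory=dict, init=False)

    def record(self, part: str, healthy: bool, now: float) -> bool:
        """Record one check result and return True if the alert condition holds."""
        failures = self._failures.setdefault(part, deque())
        if not healthy:
            failures.append(now)
        # Forget failures that are older than the look-back window.
        while failures and now - failures[0] > self.window_seconds:
            failures.popleft()
        # A healthy result does not clear the history, so a part that keeps
        # failing and recovering still trips the alert.
        return len(failures) >= self.max_failures


# Usage: feed in readiness results (timestamps in seconds) as they arrive.
alerter = ReadinessAlerter()
checks = [(0, False), (10, True), (60, False), (70, True), (120, False)]
for timestamp, healthy in checks:
    if alerter.record("auth-part", healthy, timestamp):
        print(f"ALERT at t={timestamp}s: auth-part is failing repeatedly")
```

Keeping the failure history across successful checks is the design choice that matters here: a rule that only looks at the current state would stay silent as long as each restart succeeds.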