Description
On Monday (2021-03-22) from around 16:30 and on Tuesday (2021-03-23) from around 08:30 (all times GMT), a disruption impacted nearly 200 production environments in the UK1 data center.
Timeline (all times are in GMT)
2021-03-22
16:36 The first unusual results appear in the monitoring.
16:50 One of our SaaS Operators identifies an unresponsive demo environment.
16:55 Multiple phone calls are received and incidents are logged reporting unresponsive and/or inaccessible TOPdesk environments.
16:59 The Operations team is contacted and the major incident process and investigation begin.
17:00 A major incident is automatically created and published to My.TOPdesk.
17:00 A conference call is started. Databases are being failed over to their secondary node. At this stage the suspicion is that the issue relates to the physical hardware hosting the VMs.
17:19 Investigation continues; the virtual database servers are moved to a different host.
17:28 The issues appear to be alleviated; environments can be used again, monitoring returns to within normal parameters, and customers are no longer reporting issues.
18:00 The major ticket is updated. The situation is stable but not in a desirable state yet.
2021-03-23
08:24 A member of our Development team reports that the databases are experiencing problems again and are failing over to their secondary node. The Operations team restarts the investigation.
09:00 Customers report problems.
09:09 SaaS Operations force the failover of the affected databases.
09:36 The failovers are complete.
10:06 Conference call started with the Support tech team to investigate running queries.
10:30 A single environment causing a high load on the database servers is restarted.
10:51 The environments are recovering and working again.
Root cause
The SaaS Operations team has been in contact with the technicians at the UK1 data center. It appeared that a resource limit was reached for the machines in the data center. This was aggravated by an overload caused by API requests to a specific environment.
Infrastructural improvements
Monitoring on the mentioned resource limit is being implemented
Rate limiting on the front end of our SaaS infrastructure is being investigated
New resources have been ordered for the current setup in the UK1 data center
A long-term plan has been started to improve investigation of the database servers
Software improvements
Rate limiting for API calls is being investigated
Monitoring for the API usage is being investigated
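
The rate limiting and API usage monitoring mentioned above are still under investigation, so no design has been chosen. As an illustration only, the sketch below shows one common approach, a token bucket per environment, with a counter of rejected calls that monitoring could read. The class names, the environment identifiers, and the rate and capacity values are all hypothetical and do not describe the TOPdesk implementation.

```python
import time
from collections import Counter


class TokenBucket:
    """Allows bursts of up to `capacity` calls, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, now):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # a new bucket starts full
        self.updated = now

    def allow(self, now):
        # Refill according to the time elapsed since the last call.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class ApiRateLimiter:
    """Hypothetical per-environment limiter; values are illustrative assumptions."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.buckets = {}
        self.rejected = Counter()  # rejected calls per environment, for monitoring

    def allow(self, environment, now=None):
        now = time.monotonic() if now is None else now
        bucket = self.buckets.get(environment)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, now)
            self.buckets[environment] = bucket
        allowed = bucket.allow(now)
        if not allowed:
            self.rejected[environment] += 1
        return allowed


# Usage: 2 calls per second sustained, bursts of up to 3 calls.
limiter = ApiRateLimiter(rate=2.0, capacity=3.0)
first_burst = [limiter.allow("env-a", now=0.0) for _ in range(4)]
second_burst = [limiter.allow("env-a", now=1.0) for _ in range(3)]
other_env = limiter.allow("env-b", now=1.0)
```

Keeping one bucket per environment means a single environment that floods the API exhausts only its own allowance, while the rejection counter gives monitoring a simple signal of which environment is being throttled.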
Communication improvements
We promise to update the status page every 30 minutes; this promise was not kept. We will either adjust this promise or the update frequency.
If you have questions regarding this major incident, please contact your account manager or create a ticket for your support department.