Disruption in hosting location UK1

Incident Report for TOPdesk SaaS Status page

Postmortem

Description
On Monday (2021-03-22) from around 16:30 GMT, and again on Tuesday morning (2021-03-23), a disruption impacted nearly 200 production environments in the UK1 data center.

Timeline (all times are in GMT)
2021-03-22
16:36 The first unusual results appear in the monitoring.
16:50 One of our SaaS Operators identifies an unresponsive demo environment.
16:55 Multiple phone calls and incidents are logged by customers reporting unresponsive and / or inaccessible TOPdesk environments.
16:59 Operations team contacted and the Major incident process / investigation begins.
17:00 A major incident is automatically created and published to My.TOPdesk.
17:00 A conference call is started. Databases are being failed over to their secondary node. At this stage the suspicion is that the issue relates to the physical hardware hosting the VMs.
17:19 Investigation continues; the virtual database servers are moved to a different host.
17:28 The issues appear to be alleviated; environments can be used again, monitoring returns to within normal parameters and customers are no longer reporting issues.
18:00 The major ticket is updated. The situation is stable but not in a desirable state yet.

2021-03-23
08:24 A member of our Development team reports that the databases are experiencing problems again and are failing over to their secondary node. The Operations team restarts the investigation.
09:00 Customers report problems.
09:09 SaaS Operations force the failover of the affected databases.
09:36 The failovers are complete.
10:06 Conference call started with the Support tech team to investigate running queries.
10:30 A single environment causing a high load on the database servers is restarted.
10:51 The environments are recovering and working again.

Root cause
The SaaS Operations team has been in contact with the technicians at the UK1 data center. It appeared that a resource limit was reached for the machines in the data center. This was aggravated by an overload caused by API requests to a specific environment.

Infrastructural improvements
Monitoring on the mentioned resource limit is being implemented
Rate limiting on the front end of our SaaS infrastructure is being investigated
New resources have been ordered for the current setup on the UK1 data center
A long-term plan has been started to improve our ability to investigate issues on the database servers
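
As an illustration of the resource-limit monitoring mentioned above, the following is a minimal Python sketch of a threshold check. It is not our actual monitoring: the function name, the 80% and 95% thresholds and the example figures are assumptions chosen for illustration, and the specific resource involved is not named in this report.

```python
def classify_usage(used, limit, warn_ratio=0.8, critical_ratio=0.95):
    """Classify resource usage against a hard limit.

    All names and thresholds here are hypothetical; the idea is to
    raise a warning well before the limit itself is reached.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    ratio = used / limit
    if ratio >= critical_ratio:
        return "critical"
    if ratio >= warn_ratio:
        return "warning"
    return "ok"


# Example with made-up numbers: 90 of 100 units in use.
status = classify_usage(used=90, limit=100)
```

A warning level below the hard limit leaves time to add resources or move load before environments are affected.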

Software improvements
Rate limiting for API calls is being investigated
Monitoring for the API usage is being investigated
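
To illustrate what rate limiting for API calls could look like, the following is a minimal token-bucket sketch in Python with one bucket per environment. It is not a description of our implementation: the class, the function, the capacity and the refill rate are all assumptions chosen for illustration.

```python
import time


class TokenBucket:
    """Hypothetical limiter: allows bursts of up to `capacity` requests
    and a sustained rate of `refill_per_second` requests per second."""

    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = float(capacity)
        self.refill_per_second = float(refill_per_second)
        self.clock = clock
        self.tokens = float(capacity)  # start with a full bucket
        self.last = clock()

    def allow(self):
        """Return True if a request may proceed, False if it should be rejected."""
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Refill according to elapsed time, never beyond the capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# One bucket per environment, so that a single busy environment
# cannot consume the allowance of the others.
buckets = {}


def allow_request(environment, capacity=5, refill_per_second=1.0):
    """Check the (hypothetical) API allowance of a single environment."""
    if environment not in buckets:
        buckets[environment] = TokenBucket(capacity, refill_per_second)
    return buckets[environment].allow()
```

Rejected requests would typically be answered with HTTP status 429 (Too Many Requests), and counting them per environment gives a simple form of API usage monitoring.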

Communication improvements
We promise to update the status page every 30 minutes; this promise was not kept. We will either adjust this promise or the update frequency.

If you have questions regarding this major incident, please contact your account manager or create a ticket for your support department.

Posted Apr 07, 2021 - 15:21 CEST

Resolved

On Monday March 22nd at approximately 16:30 GMT / 17:30 CET, our monitoring detected two overloaded database servers and our Major incident process was started. This was followed shortly thereafter by a number of reports from customers of poor performance or inaccessibility of their TOPdesk system.

Our SaaS Operations team started an investigation and we moved customers to backup servers to alleviate the problem. At this time the cause was believed to be related to a resource issue on the physical machine.

On Tuesday March 23rd at approximately 09:00 GMT / 10:00 CET the behaviour we saw the previous day returned and we continued our investigation. Our investigation found that a faulty / poorly configured API setup caused the initial overload of one of our database servers, which in turn affected the second database server. We adjusted this setup and were able to confirm that the overloading was no longer taking place and the issue was resolved just after 10:30 GMT / 11:30 CET. This update was published to our status page once we were able to confirm this.

Moving forward with this issue, our major incident team will begin an evaluation process which will examine the information relating to this case, including why the standard safeguards that we have in place for resource reservation did not prevent this issue as expected. We will also research how we can prevent a faulty configuration from overloading our database servers in the future.

If you have any further questions or require any further information on this issue please contact Support, or your TOPdesk Account Manager and we will be happy to help.

Posted Mar 24, 2021 - 13:56 CET

Update

We have continued to work on the overloaded database server and a possible cause has been identified.

As a temporary fix we have restarted the component which overloaded the database server and the load has subsequently been reduced.

We are monitoring the situation closely and will update the major incident and the status page with new actions or findings. Please keep updating your ticket if you are experiencing problems.

Posted Mar 23, 2021 - 12:20 CET

Update

This morning at 10:00 CET (09:00 GMT) the database servers were overloaded.

As soon as this happened we moved the databases to a secondary server.

At this moment one of the database servers is still overloaded; we are working on getting this to a stable situation.

Posted Mar 23, 2021 - 10:47 CET

Monitoring

We have identified a server with a high load that caused the database servers to become unresponsive.

We have performed a failover to alleviate the situation and the TOPdesk environments should be working again.

In case you still experience problems, please update your ticket.

We are continuing to investigate the cause for this unexpected high load situation.

Posted Mar 22, 2021 - 19:15 CET

Investigating

We are currently experiencing problems on the UK1 hosting location. As a result your TOPdesk environment may not be available.

One of the database servers has become unresponsive; we are looking into this.

Our apologies for the inconvenience. We aim to update this status page every 30 minutes until the issue has been resolved.

E-mail updates will be sent when the issue has been resolved. You can subscribe on the status page (https://status.topdesk.com) for additional updates.

To inform TOPdesk you are affected by this issue, please visit https://my.topdesk.com/tas/public/ssp/ . Please refer to incident TDR21 03 7652.

Posted Mar 22, 2021 - 18:03 CET

This incident affected: UK1 SaaS hosting location.