Disruption in hosting location NL3

Incident Report for TOPdesk SaaS Status page

Postmortem

Timeline

On Sunday March 21st, maintenance was carried out to update management tooling on the machines hosting our virtual servers. During this maintenance, servers are moved to other machines to prevent impact on customer environments. As TOPdesk environments aren’t used much on Sundays, this can leave the network unbalanced: servers that are heavily used during workdays may end up on the same machine during the maintenance.

On Monday March 22nd, several reports came in early in the morning that TOPdesk environments on the NL3 hosting location were slow to respond. Our monitoring of incident save times for NL3 showed similar delays, but nothing else in our monitoring system pointed to a root cause.

One of the proxy servers in the NL3 hosting location had a high CPU load compared to other proxy servers. The CPU load was below our warning limits, but still noticeably higher. This proxy server was taken out of the pool of active servers, restarted, and added back to the pool. This resolved the performance issues for some customers, but for others the issue remained.
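The situation described above, a server that stays below the absolute warning limit but stands out from its peers, can be caught with a relative check. The following is a minimal sketch, not TOPdesk's actual monitoring: the function name, the server names, the `ratio` and `min_load` values, and the sample loads are all assumptions made for illustration.

```python
from statistics import median


def find_cpu_outliers(cpu_by_server, ratio=1.5, min_load=20.0):
    """Return servers whose CPU load is well above the pool median.

    cpu_by_server maps a server name to its CPU load in percent.
    ratio and min_load are illustrative values, not real settings.
    """
    if len(cpu_by_server) < 3:
        return []  # too few peers for a meaningful comparison
    pool_median = median(cpu_by_server.values())
    return sorted(
        name
        for name, load in cpu_by_server.items()
        if load >= min_load and load > ratio * pool_median
    )


# Hypothetical sample: every server is below a typical 80% warning limit,
# yet one of them clearly deviates from the rest of the pool.
loads = {"proxy-1": 22.0, "proxy-2": 25.0, "proxy-3": 61.0, "proxy-4": 24.0}
print(find_cpu_outliers(loads))
```

Comparing against the pool median rather than a fixed limit means the check keeps working when the overall load level of the pool changes.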

We noticed the Distributed Resource Scheduler (DRS), which moves virtual servers to the machine with the best match in available resources, was moving several servers around all morning. This was expected after Sunday’s maintenance, but was taking much longer than usual.

After further investigation, it turned out that one of the firewall servers was not receiving the resources it needed to complete its operations. Normally the DRS system takes care of this by moving the server to a different machine with sufficient resources, but this time it did not. Resource allocation logs for the firewall server showed that its resource shortages correlated with customer reports of slow performance and with the incident save time anomalies.
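The kind of correlation described above can be approximated by checking how many customer reports fall inside, or close to, a period of resource shortage. This is only a sketch under stated assumptions: the function name, the five-minute slack, and all timestamps below are invented for illustration and are not taken from the actual logs.

```python
from datetime import datetime, timedelta


def fraction_of_reports_explained(report_times, shortage_windows,
                                  slack=timedelta(minutes=5)):
    """Return the share of reports that fall within a shortage window.

    shortage_windows is a list of (start, end) datetimes. slack widens
    each window slightly because customers rarely report immediately.
    """
    if not report_times:
        return 0.0
    hits = sum(
        any(start - slack <= t <= end + slack for start, end in shortage_windows)
        for t in report_times
    )
    return hits / len(report_times)


# Hypothetical timestamps, for illustration only.
day = datetime(2021, 3, 22)
windows = [
    (day.replace(hour=9, minute=0), day.replace(hour=9, minute=20)),
    (day.replace(hour=10, minute=0), day.replace(hour=10, minute=10)),
]
reports = [
    day.replace(hour=9, minute=5),
    day.replace(hour=9, minute=23),
    day.replace(hour=11, minute=0),
]
print(fraction_of_reports_explained(reports, windows))
```

A high fraction supports the hypothesis but does not prove causation; it is a quick way to decide which component deserves a closer look.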

Our SaaS hosting engineers tested whether additional resources could be assigned to the firewall server without impacting the availability of customer environments. They then increased the resources and added a resource reservation to ensure the firewall would receive sufficient resources in the future.

Between March 23rd and March 28th we received very few reports of performance problems on the NL3 hosting location, and those we did receive were found to be unrelated to this problem.

While continuing our investigation into the DRS issues, we noticed there was very little spare capacity on the NL3 hosting location, despite a reduction in the number of customers hosted there. Capacity planning sessions are conducted regularly, but because the number of customers hosted in this location is decreasing, it was assumed that no additional resources would be needed. We contacted the hosting provider to order additional resources and to schedule an evaluation of the DRS settings.

On Monday March 29th several customers experienced a short period of slowness while working in TOPdesk as several SQL servers received insufficient resources. Moving the servers to another machine resolved the problem. We escalated the existing issues with our hosting provider to expedite the new resources. The resources have been ordered and are expected to be delivered soon.

Root Cause

Resource usage by existing customers grew faster than expected. This led to resource starvation, which caused customers to experience slowness while working in TOPdesk. Additional resources have been ordered, and we’ll improve our capacity planning system.

Follow-up actions

We’ve ordered additional resources for the NL3 hosting location and expect these to be delivered soon.

Resource reservations have been added to all firewall servers. A project to update the resource reservations for other servers has also been scheduled.

A meeting with engineers at the hosting provider has been scheduled. We’ll evaluate our current DRS settings to see whether the imbalance after weekend updates can be prevented, and whether additional alerts would help us spot performance problems sooner. After an update of the management tooling at the hosting provider, we’ll also start implementing alerts on resource starvation.

In addition to our monitoring of incident save times for individual environments, we’ll work on monitoring and alerting on average incident save times for groups of customers, so that we can spot performance issues faster.
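As an illustration of what group-level alerting on incident save times could look like, here is a minimal sketch. It is not the implementation we will deploy: the function name, the group labels, the baseline values, the `factor` of 2, and the sample timings are all assumptions.

```python
from statistics import mean


def slow_groups(save_times_ms, baseline_ms, factor=2.0, min_samples=5):
    """Return groups whose mean save time exceeds factor * baseline.

    save_times_ms maps a group label to recent save times in ms.
    baseline_ms maps the same labels to a normal mean save time in ms.
    Groups with too few samples or no baseline are skipped.
    """
    flagged = []
    for group, samples in save_times_ms.items():
        if len(samples) < min_samples or group not in baseline_ms:
            continue
        if mean(samples) > factor * baseline_ms[group]:
            flagged.append(group)
    return sorted(flagged)


# Hypothetical measurements for two customer groups.
recent = {
    "group-a": [900, 1100, 1300, 1000, 1200],
    "group-b": [300, 280, 310, 290, 320],
}
baseline = {"group-a": 400, "group-b": 300}
print(slow_groups(recent, baseline))
```

Averaging over a group smooths out the noise of a single environment, so a shared component such as a firewall or proxy shows up as a shift for the whole group.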

We’re improving our disaster recovery plans with more steps regarding troubleshooting performance issues. We’re also updating our capacity planning procedures to better incorporate the forecasted growth at current customers and the deployment of new services.
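To illustrate why forecasting from measured resource usage matters more than counting customers, the sketch below projects usage forward at a fixed monthly growth rate. It is a simplified assumption, not our capacity planning procedure: the function name, the 20% headroom, the 3% growth rate, and the usage figures are invented for the example.

```python
def months_until_exhausted(used, capacity, monthly_growth,
                           headroom=0.2, horizon=60):
    """Return the number of months until usage exceeds the safe limit.

    The safe limit is capacity * (1 - headroom). Returns 0 if usage is
    already above it, or None if it is not reached within horizon months.
    """
    limit = capacity * (1 - headroom)
    months = 0
    while used <= limit:
        if months >= horizon:
            return None
        used *= 1 + monthly_growth  # compound growth of existing usage
        months += 1
    return months


# Hypothetical figures: 700 of 1000 units in use, growing 3% per month.
print(months_until_exhausted(700, 1000, 0.03))
```

Even with a shrinking number of customers, a positive growth rate in usage per customer can exhaust the headroom within months, which is the pattern described in the root cause.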

Posted Apr 08, 2021 - 14:12 CEST

Resolved

We found a likely culprit for the performance issues that customers have been experiencing. A firewall was not receiving its assigned resources, and the firewall logs show problems that correlate with the times when customer performance was not as expected.

A temporary workaround has been implemented by moving several machines to another server. Resource reservations will be added later today to prevent this problem from occurring again. We're scheduling additional meetings with the hosting provider to verify our capacity planning against observed and forecasted growth, and we'll keep monitoring the performance of TOPdesk environments to verify the issue is completely resolved.

If you still experience slowness while working in TOPdesk, please log an incident in My TOPdesk and let us know which actions are slow, and at what times the slowness impacted your operations.

Posted Mar 22, 2021 - 16:22 CET

Update

There's still no clear indication of what's causing the performance issues. We're testing a few hypotheses and testing individual network components to find the root cause.

Posted Mar 22, 2021 - 14:38 CET

Update

Several customers reported slowness after the proxy server adjustments at 10:40. We're still working to investigate and resolve this issue.

Posted Mar 22, 2021 - 11:59 CET

Update

We identified a possible cause for the performance problems. One of the proxy servers in the NL3 hosting location wasn't performing as expected and was restarted at 10:40 CET.

If you've already logged an incident, please let us know whether you have still experienced slowness since 10:40 by updating your incident in My TOPdesk.

We'll continue to investigate the root cause of this problem, and will monitor the situation to verify that the performance is back to normal.

Posted Mar 22, 2021 - 10:59 CET

Update

Updated the status to indicate that environments are available, but slow to respond.

Posted Mar 22, 2021 - 10:32 CET

Investigating

We are currently experiencing problems on the NL3 hosting location. As a result, your TOPdesk environment may not be available.

We are aware of the problem and working on a solution.

Our apologies for the inconvenience. We aim to update this status page every 30 minutes until the issue has been resolved.

E-mail updates will be sent when the issue has been resolved. You can subscribe on the status page (https://status.topdesk.com) for additional updates.

To inform TOPdesk that you are affected by this issue, please visit https://my.topdesk.com/tas/public/ssp/. Please refer to incident TDR21 03 7251.

Posted Mar 22, 2021 - 10:16 CET

This incident affected: NL3 SaaS hosting location.