Timeline
On Sunday March 21st, maintenance was carried out to update management tooling on the machines hosting our virtual servers. During this maintenance, servers are moved to other machines to prevent impact on customer environments. Because TOPdesk environments aren’t used much on Sundays, this can leave the network unbalanced: servers that are heavily used on workdays may end up on the same machine during the maintenance.
On Monday March 22nd, several reports came in early in the morning that TOPdesk environments on the NL3 hosting location were slow to respond. Our monitoring of incident save times for NL3 showed similar delays, but nothing else in our monitoring system pointed to a root cause.
One of the proxy servers in the NL3 hosting location had a high CPU load compared to other proxy servers. The CPU load was below our warning limits, but still noticeably higher. This proxy server was taken out of the pool of active servers, restarted, and added back to the pool. This resolved the performance issues for some customers, but for others the issue remained.
We noticed the Distributed Resource Scheduler (DRS), which moves virtual servers to the machine with the best match in available resources, was moving several servers around all morning. This was expected after Sunday’s maintenance, but was taking much longer than usual.
After further investigation, it turned out that one of the firewall servers was not receiving the resources it needed to complete its operations. Normally the DRS system takes care of this by moving the server to a different machine with sufficient resources, but this time it did not. Resource allocation logs for the firewall server showed that its resource shortages correlated with customer reports of slow performance and with the incident save time anomalies.
Our SaaS hosting engineers tested whether additional resources could be assigned to the firewall server without impacting the availability of customer environments, increased the assigned resources, and added a resource reservation to ensure the firewall would receive sufficient resources in the future.
Between March 23rd and March 28th we received very few reports of performance problems on the NL3 hosting location, and those we did receive were found to be unrelated to this problem.
While continuing our investigation into the DRS issues, we noticed there was very little spare capacity on the NL3 hosting location, despite a reduction in the number of customers hosted there. Capacity planning sessions are held regularly, but because the number of customers hosted in this location is decreasing, it was assumed that no additional resources would be needed. We contacted the hosting provider to order additional resources and to schedule an evaluation of the DRS settings.
On Monday March 29th, several customers experienced a short period of slowness while working in TOPdesk because several SQL servers received insufficient resources. Moving the servers to another machine resolved the problem. We escalated the existing issues with our hosting provider to expedite delivery of the new resources. The resources have been ordered and are expected to be delivered soon.
Root Cause
Resource usage by existing customers grew faster than expected, which led to resource starvation and caused customers to experience slowness while working in TOPdesk. Additional resources have been ordered, and we’ll improve our capacity planning system.
Follow-up actions
We’ve ordered additional resources for the NL3 hosting location and expect these to be delivered soon.
Resource reservations have been added to all firewall servers. A project to update the resource reservations for other servers has also been scheduled.
A meeting with engineers at the hosting provider has been scheduled. We’ll evaluate our current DRS settings to see whether the imbalance after weekend updates can be prevented, and whether additional alerts would help us spot performance problems sooner. After an update of the management tooling at the hosting provider, we’ll also start implementing alerts on resource starvation.
In addition to our monitoring of incident save times for individual environments, we’ll work on monitoring and alerting on average incident save times for groups of customers, so that we can spot performance issues sooner.
We’re improving our disaster recovery plans with more steps for troubleshooting performance issues. We’re also updating our capacity planning procedures to better incorporate the forecast growth in resource usage by current customers and the deployment of new services.
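As an illustration of the group-level monitoring mentioned in the follow-up actions, the sketch below flags groups of environments whose average incident save time rises well above a baseline. It is only a minimal example under stated assumptions: the function name, the grouping of environments, the baseline values, and the factor of 2 are all hypothetical and do not describe our actual monitoring configuration.

```python
from statistics import mean


def groups_over_baseline(save_times_ms, groups, baseline_ms, factor=2.0):
    """Return the groups whose average save time exceeds factor * baseline.

    save_times_ms: environment name -> list of recent save times (ms)
    groups:        group name -> list of environment names (hypothetical grouping)
    baseline_ms:   group name -> normal average save time (ms), assumed known
    factor:        alert threshold multiplier (assumed value, not a real setting)
    """
    alerts = []
    for group, environments in sorted(groups.items()):
        # Pool all samples from the environments that belong to this group.
        samples = [t for env in environments for t in save_times_ms.get(env, [])]
        if not samples:
            continue  # no data for this group, so nothing to compare
        if mean(samples) > factor * baseline_ms[group]:
            alerts.append(group)
    return alerts
```

Averaging over a group can reveal a shared cause, such as one overloaded proxy or firewall server, even when each individual environment stays below its own alert threshold.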