Root cause analysis (RCA) for major incident TDR20 09 6276.
Starting on September 21st, we experienced problems on our SaaS network: slowness while working in TOPdesk, short periods of unavailability, and the Reporting Service (Duration reports and OData feed) being disabled for nearly two days. This RCA aims to inform everyone affected by these problems about what went wrong, what we did to mitigate them, the lessons learned, and the actions we have set out to prevent future issues.
Timeline and technical details (all times are in CEST):
2020-09-21
09:00 We received the first report of a customer at our NL3 hosting location experiencing slowness while working in their TOPdesk environment.
09:21 An increase in the number of logged incidents classified as performance-related triggered an automated major incident process to track and investigate these issues. Our Technical Support and SaaS Operations teams began an investigation.
09:24 The SaaS Operations team investigated the issue and traced it to an unexpectedly high load on Kubernetes nodes running several customer-facing services. The affected services included the authentication service (passlayer), which handles all customer traffic.
During moments of high customer traffic, services were frequently competing for resources, delaying customer traffic and slowing responses to customer input. Engineers investigated why these services were using more resources than normal.
09:43 Services were moved to different Kubernetes nodes, and the problems for customers appeared to be resolved. The internal major incident was closed.
11:28 A new automated major incident was created. The Technical Support team and the SaaS Operations team continued their investigation. The teams determined that these issues were different from those earlier that day and started investigating other possible common causes.
16:20 The performance issues were published on the Status page.
2020-09-22
09:30 The SaaS Operations team continued to investigate the high CPU load on the Kubernetes cluster. The load appeared to be caused by the reporting service and the audit trail service, which were using significantly more CPU than expected. Meetings were started between Operations and Development to determine the cause of the high CPU load.
14:40 The responsible Development team and the SaaS Operations team noticed that the passlayer service (a key service for the proper functioning of the environment) was unavailable. The Operations team restarted the pods to resolve the issue.
14:50 The Operations team disabled the reporting service and the audit trail service, reducing the load on the systems and decreasing the performance impact for customers. Though the performance issues were alleviated, engineers continued to investigate the underlying cause.
15:05 The Operations team re-enabled the audit trail service.
16:00 Monitoring indicated that the load was at an acceptable level with the reporting service disabled. The team planned to continue monitoring the load and performance of the Kubernetes nodes on Wednesday morning.
2020-09-23
08:30 The investigation by Operations continued.
12:00 Operations discovered that the underlying infrastructure managed by our hosting partner was reporting resource constraints on the physical hardware, leaving the virtual machines with insufficient resources to handle the required traffic. Engineers made changes to the machine layout and reached out to our hosting partner to discuss the impact of potential configuration changes.
13:00 Operations determined this to be the root cause of the issue and reached out to the hosting partner's Engineering team to discuss implementing resource reservations.
21:00 The Operations team re-configured the virtual machines for a better fit between virtual hosts and physical CPU sockets.
2020-09-24
09:00 The Operations team implemented resource reservations for the machines providing compute resources to services, following advice from engineers at our hosting partner. This reservation mitigated the observed resource starvation that was causing the latency.
11:00 Engineers began implementing resource limits on certain services to prevent them from overloading machines at peak times.
13:00 The SaaS Operations team continued to observe the performance of services during the day; the changes appeared to have a positive effect.
13:30 CPU limits for the services were set and the reporting service was brought back online, making it available to our customers again.
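For context on what these reservations and limits mean in practice: on Kubernetes, a CPU request reserves a guaranteed share for a container and a CPU limit caps what it may consume, so a single busy service can no longer starve its neighbours. The snippet below is a minimal, purely illustrative sketch using the official Kubernetes Python client; the deployment name, namespace, and values are hypothetical and do not reflect our actual configuration.

```python
from kubernetes import client, config

# Illustrative only: deployment name, namespace, and resource values are hypothetical.
config.load_kube_config()
apps = client.AppsV1Api()

patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [{
                    "name": "reporting-service",  # hypothetical container name
                    "resources": {
                        # request = guaranteed share the scheduler reserves for this pod
                        "requests": {"cpu": "500m", "memory": "512Mi"},
                        # limit = hard ceiling, so one service cannot starve its neighbours
                        "limits": {"cpu": "2", "memory": "2Gi"},
                    },
                }]
            }
        }
    }
}

apps.patch_namespaced_deployment(
    name="reporting-service", namespace="saas", body=patch
)
```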
2020-09-25
09:00 The Operations team monitored and reviewed the metrics; they were all within the expected limits.
2020-09-28
18:00 After discussion with all involved teams, the major incident was closed.
Root cause
Due to a combination of high customer traffic, the reporting service being used more extensively than before, and a configuration limitation in our hosting platform, key services were starved of computing resources, leaving them unable to keep up with customer traffic.
Follow-up actions
We realize that this disruption included problems with the OData feed functionality, which is a vital part of the process for a number of TOPdesk users. Since TOPdesk is moving more towards a SaaS solution with separated services, we aim to communicate more proactively about the unavailability of specific services.
Development:
Some of the services did not meet the requirements set by our software architects, which hindered troubleshooting. Development will be made aware that all services are required to meet these standards before being used in production. Follow-up checks have also been scheduled to verify that all services have been adjusted accordingly.
Our Development department started an investigation into whether Development teams can take a more proactive role in troubleshooting problems in specific services.
We have started to investigate rate limiting or queuing for non-critical services; a sketch of one such approach follows below.
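To illustrate the kind of mechanism under consideration (no decision has been made, and the names and numbers below are hypothetical), a simple token-bucket limiter lets a non-critical endpoint serve short bursts while capping the sustained request rate:

```python
import threading
import time


class TokenBucket:
    """Minimal token-bucket limiter; purely illustrative, not production code."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate              # tokens added per second
        self.capacity = capacity      # maximum burst size
        self.tokens = float(capacity)
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def allow(self) -> bool:
        """Return True if a request may proceed, False if it should be rejected or queued."""
        with self.lock:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False


# Hypothetical usage: allow at most 5 report requests per second, with bursts of 10.
limiter = TokenBucket(rate=5, capacity=10)

def handle_report_request():
    if not limiter.allow():
        return "HTTP 429 Too Many Requests"  # or place the request in a queue instead
    return "generate the report"
```

Whether we end up rejecting excess requests or queuing them for later processing is part of the investigation; the point of the sketch is only that non-critical work can be throttled before it competes with key services for CPU.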
SaaS Support:
The communication strategy has been reviewed with representatives from several departments. A project has been started to review all communication procedures during a major incident.
SaaS Operations:
We've improved our monitoring to better detect resource starvation issues (an illustrative example of such a check follows after this list).
We're working to create resource reservations for system-critical virtual machines and all services.
The design choices for the virtual machines and physical hosts will be evaluated with our hosting provider to ensure they match the infrastructure in use.
A project to reduce the size (number of customers) of the NL3 hosting location has been scheduled. This will limit the impact of future disruptions.
Together with Development, we are investigating load tests on services to improve the stability of key services; a sketch of such a test also follows below.
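Regarding the improved monitoring mentioned above: one useful signal for resource starvation on Kubernetes is the CPU throttling ratio exposed by cAdvisor. The sketch below is a minimal illustration of querying it from Prometheus; the address and threshold are hypothetical and do not reflect our actual alerting configuration.

```python
import requests

PROMETHEUS_URL = "http://prometheus.monitoring.internal:9090"  # hypothetical address

# Fraction of CPU scheduling periods in which each pod was throttled over the
# last 5 minutes; a persistently high ratio means the pod wants more CPU than
# it is allowed to use, i.e. it is being starved.
QUERY = (
    "sum by (pod) (rate(container_cpu_cfs_throttled_periods_total[5m]))"
    " / sum by (pod) (rate(container_cpu_cfs_periods_total[5m]))"
)

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    pod = result["metric"].get("pod", "<unknown>")
    ratio = float(result["value"][1])
    if ratio > 0.25:  # hypothetical alerting threshold
        print(f"WARNING: {pod} was throttled in {ratio:.0%} of CPU periods")
```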
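Regarding the load tests being investigated with Development: as a minimal sketch of what such a test could look like (the tool, endpoint path, and traffic pattern are assumptions for illustration, not a final test plan), a Locust script could simulate users repeatedly requesting the OData feed:

```python
from locust import HttpUser, task, between


class ReportingUser(HttpUser):
    # Each simulated user waits 1-5 seconds between requests.
    wait_time = between(1, 5)

    @task
    def fetch_odata_feed(self):
        # Hypothetical endpoint path; the real feed URL differs per environment.
        self.client.get("/services/reporting/v1/odata/incidents", name="odata-feed")
```

Running it with, for example, `locust -f loadtest.py --host https://<test environment URL>` ramps up simulated users so the behaviour of the reporting service under its new CPU limits can be observed before changes reach production.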