ROOT CAUSE ANALYSIS (RCA)
Disruption in NL3 Hosting Location Impacting Action Sequences
Dates of the incident
First symptoms: 21.03.2024 - Major TDR24 03 7280
Second round of symptoms: 26.03.2024 - Major TDR24 03 9021
What happened
On our NL3 hosting location, we experienced performance issues with the messaging system. These issues resulted in disruptions of RabbitMQ, which mainly affected the execution of action sequences.
Why it happened
The message queues of TOPdesk instances in the NL3 hosting location got into a state where they were multiplying their connections to the messaging server (from here on: RabbitMQ). How the TOPdesk instances got into that state was unclear, and we were unable to reproduce it deliberately. However, the instances kept increasing the number of connections. This initially caused RabbitMQ to run out of page files/file descriptors, which in turn caused the system to run out of memory and nodes to start falling over. When these limits were increased, the TOPdesk instances that were in the erroneous state started trying indefinitely to recreate connections, which in turn caused RabbitMQ to run out of memory again.
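To illustrate the failure mode described above: the sketch below is a purely hypothetical Java example built on the RabbitMQ Java client (the host name and structure are assumptions, not the actual TOPdesk code). It shows how a retry loop that keeps opening new connections without ever closing the previous ones drives up the connection count, and with it RabbitMQ's file descriptor and memory usage.

    // Hypothetical illustration only, not the actual TOPdesk code.
    import com.rabbitmq.client.Connection;
    import com.rabbitmq.client.ConnectionFactory;

    public class LeakyReconnectExample {
        public static void main(String[] args) {
            ConnectionFactory factory = new ConnectionFactory();
            factory.setHost("rabbitmq.example.internal"); // placeholder host

            while (true) {
                try {
                    // Every iteration opens a brand-new connection; the previous
                    // one is never closed, so connections (and the file
                    // descriptors RabbitMQ holds for them) keep piling up.
                    Connection connection = factory.newConnection();
                    doWork(connection);
                } catch (Exception e) {
                    // No close of the old connection and no backoff before the
                    // next attempt: the loop simply tries again immediately.
                }
            }
        }

        private static void doWork(Connection connection) {
            // Placeholder for publishing/consuming messages.
        }
    }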
Detailed timeline
- 21.03.2024 | RabbitMQ cluster on NL3 crashed for the first time. To get it out of the crash loop, memory was increased.
- 26.03.2024 | The cluster crashed for the second time. To stabilize the system, memory was increased even further.
- 27.03.2024 | The number of Erlang processes and the memory consumption were going up, approaching the maximum limits. To prevent the cluster from hitting them, system limits were increased one more time. As the memory consumption and the number of channels were still growing, the cluster broke again later in the night. Memory was increased again. Restarting the instances resulted in a drop in the number of channels and connections.
- 28.03.2024 | Even though the restarts from the previous day improved the situation, the number of channels, connections and Erlang processes was still steadily increasing. Due to a hiccup on the load balancer, the traffic dropped temporarily. Monitoring and investigation activities continued (a sketch of how these connection and channel totals can be polled is shown after this timeline).
- 29.03.2024 | Due to the growing number of channels and node sync issues, emergency maintenance was planned for the Easter weekend to minimize the potential customer impact.
- 30.03.2024 | Starting from the early morning, the number of connections was rapidly increasing. As part of the emergency maintenance, two nodes that seemed to be particularly busy were taken out of the RabbitMQ cluster and recreated. Following the maintenance, customer environments were restarted.
- 01.04.2024 | To stabilize the growing number of channels and connections, a limit on the maximum number of channels per connection was implemented.
- 02.04.2024 | Not all message queues were being replicated across all nodes. Measures were taken to ensure queue replication on all nodes.
- 03.04.2024 | Due to all the measures taken and limits implemented throughout the process, the system was stable again.
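As a concrete illustration of the trend watching described in the timeline: connection and channel totals are exposed by the RabbitMQ management plugin over HTTP (the /api/overview endpoint reports object totals such as connections and channels). The following Java sketch only shows how such totals can be polled; the host, port and credentials are placeholders, and this is not the monitoring implementation we actually run.

    // Hedged sketch: polling RabbitMQ's management API for an overview that
    // includes connection and channel totals. Placeholder host and credentials.
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.Base64;

    public class RabbitOverviewPoller {
        public static void main(String[] args) throws Exception {
            String credentials = Base64.getEncoder()
                    .encodeToString("monitoring-user:secret".getBytes()); // placeholder credentials

            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://rabbitmq.example.internal:15672/api/overview"))
                    .header("Authorization", "Basic " + credentials)
                    .GET()
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());

            // The JSON body contains an "object_totals" section with the current
            // number of connections, channels, queues and consumers; in practice
            // this would be parsed and fed into the alerting system rather than
            // printed.
            System.out.println(response.body());
        }
    }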
The measures we took
To mitigate the symptoms and implement a permanent solution, we gradually took a number of actions, both during and after the major incident, while closely monitoring the system. Below is an overview of the actions taken:
- Increased the capacity of our messaging system to handle the number of processes
- Restarted the environments in NL3 hosting location to eliminate the faulty connections within the messaging system
- Implemented a limit on the maximum number of channels per connection (see the sketch after this list)
- Implemented monitoring on disk usage, queue replication and system metrics
- In addition to the already existing alerts, implemented new alerts so that the early-warning mechanism is more comprehensive
- Improved and centralized the logging so that it is easier to track any odd behavior and troubleshoot issues
- Fixed the remaining queue replication issues even though there was no visible impact on the service’s ability to consume messages
- Updated RabbitMQ so that the servers run the most recent version
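Regarding the channel limit mentioned in the list above: on the RabbitMQ side such a cap is typically configured through the channel_max setting, and it can also be requested from the client. The sketch below shows the client-side variant with the RabbitMQ Java client; the value 64 and the host name are illustrative assumptions, not our actual configuration.

    // Hedged sketch: requesting a per-connection channel cap from the client
    // side with the RabbitMQ Java client. The value 64 is illustrative only;
    // server-side the equivalent knob is the channel_max setting.
    import com.rabbitmq.client.Connection;
    import com.rabbitmq.client.ConnectionFactory;

    public class ChannelCapExample {
        public static void main(String[] args) throws Exception {
            ConnectionFactory factory = new ConnectionFactory();
            factory.setHost("rabbitmq.example.internal"); // placeholder host
            factory.setRequestedChannelMax(64);           // cap channels per connection

            try (Connection connection = factory.newConnection()) {
                // Attempts to open more channels than the negotiated maximum on
                // this connection will now fail instead of growing without bound.
                System.out.println("Negotiated channel max: "
                        + connection.getChannelMax());
            }
        }
    }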
The following was also adjusted in TOPdesk code by development:
- Improved the messaging system so that it properly names its connections. This will help identify which instances are causing issues (see the sketch after this list).
- Fixed issues with reconnecting to the messaging system in TOS.
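As an illustration of the two code changes above (not the actual TOPdesk/TOS implementation): with the RabbitMQ Java client, a connection can be given a descriptive, client-provided name so it is recognizable in the RabbitMQ management UI, and automatic connection recovery can be enabled with a fixed retry interval. The host, connection name and interval below are placeholder assumptions.

    // Hedged sketch of a named connection with automatic recovery enabled,
    // using the RabbitMQ Java client. Placeholder host, name and interval.
    import com.rabbitmq.client.Connection;
    import com.rabbitmq.client.ConnectionFactory;

    public class NamedRecoveringConnectionExample {
        public static void main(String[] args) throws Exception {
            ConnectionFactory factory = new ConnectionFactory();
            factory.setHost("rabbitmq.example.internal");   // placeholder host
            factory.setAutomaticRecoveryEnabled(true);      // reconnect after failures
            factory.setNetworkRecoveryInterval(10_000);     // wait 10s between attempts

            // The client-provided connection name shows up in the RabbitMQ
            // management UI, making it clear which instance owns the connection.
            try (Connection connection =
                         factory.newConnection("topdesk-instance-example")) {
                System.out.println("Connected as: "
                        + connection.getClientProvidedName());
            }
        }
    }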
Final remarks
Through the infrastructure, development and monitoring/logging improvements described above, the messaging system was not only brought back to a stable state in which it can successfully process messages, but the overall health of the system has also improved, so that we can prevent similar issues from occurring in the future.