ROOT CAUSE ANALYSIS (RCA)
Disruption in NL3 Hosting Location Impacting Action Sequences
Dates of the incident
First symptoms: 21.03.2024 - Major TDR24 03 7280
Second round of symptoms: 26.03.2024 - Major TDR24 03 9021
What happened
On our NL3 hosting location, we experienced performance issues with the messaging system. These issues resulted in disruptions of RabbitMQ, which mainly affected the execution of action sequences.
Why it happened
The message queues of TOPdesk instances in the NL3 hosting location got into a state where they were multiplying their connections to the messaging server (from here on: RabbitMQ). How the TOPdesk instances got into that state was unclear, and we were unable to reproduce it deliberately. However, the instances kept increasing the number of connections. This initially caused RabbitMQ to run out of page files/file descriptors, which in turn caused the system to run out of memory and nodes to start falling over. When these limits were increased, the TOPdesk instances that were in the erroneous state started trying indefinitely to recreate connections, which in turn caused RabbitMQ to run out of memory again.
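To illustrate the failure mode described above: the sketch below is a purely hypothetical Java example built on the RabbitMQ Java client (the host name and structure are assumptions, not the actual TOPdesk code). It shows how a retry loop that keeps opening new connections without ever closing the previous ones drives up the connection count, and with it RabbitMQ's file descriptor and memory usage.

    // Hypothetical illustration only, not the actual TOPdesk code.
    import com.rabbitmq.client.Connection;
    import com.rabbitmq.client.ConnectionFactory;

    public class LeakyReconnectExample {
        public static void main(String[] args) {
            ConnectionFactory factory = new ConnectionFactory();
            factory.setHost("rabbitmq.example.internal"); // placeholder host

            while (true) {
                try {
                    // Every iteration opens a brand-new connection; the previous
                    // one is never closed, so connections (and the file
                    // descriptors RabbitMQ holds for them) keep piling up.
                    Connection connection = factory.newConnection();
                    doWork(connection);
                } catch (Exception e) {
                    // No close of the old connection and no backoff before the
                    // next attempt: the loop simply tries again immediately.
                }
            }
        }

        private static void doWork(Connection connection) {
            // Placeholder for publishing/consuming messages.
        }
    }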
Detailed timeline
- 21.03.2024 | RabbitMQ cluster on NL3 crashed for the first time. To get it out of the crash loop, memory was increased.
- 26.03.2024 | The cluster crashed for the second time. To stabilize the system, memory was increased even further.
- 27.03.2024 | The number of Erlang processes and the memory consumption were going up, approaching the maximum limits. To prevent the cluster from hitting them, system limits were increased one more time. As the memory consumption and the number of channels were still growing, the cluster broke again later in the night. Memory was increased again. Restarting the instances resulted in a drop in the number of channels and connections.
- 28.03.2024 | Even though the restarts from the previous day improved the situation, the number of channels, connections and Erlang processes was still steadily increasing. Due to a hiccup on the load balancer, the traffic dropped temporarily. Monitoring and investigation activities continued (a sketch of how these connection and channel totals can be polled is shown after this timeline).
- 29.03.2024 | Due to the growing number of channels and node sync issues, emergency maintenance was planned for the Easter weekend to minimize the potential customer impact.
- 30.03.2024 | Starting from the early morning, the number of connections was rapidly increasing. As part of the emergency maintenance, two nodes that seemed to be particularly busy were taken out of the RabbitMQ cluster and recreated. Following the maintenance, customer environments were restarted.
- 01.04.2024 | To stabilize the growing number of channels and connections, a limit on the maximum number of channels per connection was implemented.
- 02.04.2024 | Not all message queues were being replicated across all nodes. Measures were taken to ensure queue replication on all nodes.
- 03.04.2024 | Due to all the measures taken and limits implemented throughout the process, the system was stable again.
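As a concrete illustration of the trend watching described in the timeline: connection and channel totals are exposed by the RabbitMQ management plugin over HTTP (the /api/overview endpoint reports object totals such as connections and channels). The following Java sketch only shows how such totals can be polled; the host, port and credentials are placeholders, and this is not the monitoring implementation we actually run.

    // Hedged sketch: polling RabbitMQ's management API for an overview that
    // includes connection and channel totals. Placeholder host and credentials.
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.Base64;

    public class RabbitOverviewPoller {
        public static void main(String[] args) throws Exception {
            String credentials = Base64.getEncoder()
                    .encodeToString("monitoring-user:secret".getBytes()); // placeholder credentials

            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://rabbitmq.example.internal:15672/api/overview"))
                    .header("Authorization", "Basic " + credentials)
                    .GET()
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());

            // The JSON body contains an "object_totals" section with the current
            // number of connections, channels, queues and consumers; in practice
            // this would be parsed and fed into the alerting system rather than
            // printed.
            System.out.println(response.body());
        }
    }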
The measures we took
To mitigate the symptoms and implement a permanent solution, we gradually took a number of actions, both during and after the major incident, while closely monitoring the system. Below is an overview of the actions taken:
- Increased the capacity of our messaging system to handle the number of processes
- Restarted the environments in NL3 hosting location to eliminate the faulty connections within the messaging system
- Implemented a limit on the maximum number of channels per connection (see the sketch after this list)
- Implemented monitoring on disk usage, queue replication and system metrics
- In addition to the already existing alerts, implemented new alerts so that the early-warning mechanism is more comprehensive
- Improved and centralized the logging so that it is easier to track any odd behavior and troubleshoot issues
- Fixed the remaining queue replication issues even though there was no visible impact on the service’s ability to consume messages
- Updated RabbitMQ so that the servers run the most recent version
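Regarding the channel limit mentioned in the list above: on the RabbitMQ side such a cap is typically configured through the channel_max setting, and it can also be requested from the client. The sketch below shows the client-side variant with the RabbitMQ Java client; the value 64 and the host name are illustrative assumptions, not our actual configuration.

    // Hedged sketch: requesting a per-connection channel cap from the client
    // side with the RabbitMQ Java client. The value 64 is illustrative only;
    // server-side the equivalent knob is the channel_max setting.
    import com.rabbitmq.client.Connection;
    import com.rabbitmq.client.ConnectionFactory;

    public class ChannelCapExample {
        public static void main(String[] args) throws Exception {
            ConnectionFactory factory = new ConnectionFactory();
            factory.setHost("rabbitmq.example.internal"); // placeholder host
            factory.setRequestedChannelMax(64);           // cap channels per connection

            try (Connection connection = factory.newConnection()) {
                // Attempts to open more channels than the negotiated maximum on
                // this connection will now fail instead of growing without bound.
                System.out.println("Negotiated channel max: "
                        + connection.getChannelMax());
            }
        }
    }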
The following was also adjusted in TOPdesk code by development:
- Improved the messaging system so that it properly names its connections. This will help identify which instances are causing issues (see the sketch after this list).
- Fixed issues with reconnecting to the messaging system in TOS.
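As an illustration of the two code changes above (not the actual TOPdesk/TOS implementation): with the RabbitMQ Java client, a connection can be given a descriptive, client-provided name so it is recognizable in the RabbitMQ management UI, and automatic connection recovery can be enabled with a fixed retry interval. The host, connection name and interval below are placeholder assumptions.

    // Hedged sketch of a named connection with automatic recovery enabled,
    // using the RabbitMQ Java client. Placeholder host, name and interval.
    import com.rabbitmq.client.Connection;
    import com.rabbitmq.client.ConnectionFactory;

    public class NamedRecoveringConnectionExample {
        public static void main(String[] args) throws Exception {
            ConnectionFactory factory = new ConnectionFactory();
            factory.setHost("rabbitmq.example.internal");   // placeholder host
            factory.setAutomaticRecoveryEnabled(true);      // reconnect after failures
            factory.setNetworkRecoveryInterval(10_000);     // wait 10s between attempts

            // The client-provided connection name shows up in the RabbitMQ
            // management UI, making it clear which instance owns the connection.
            try (Connection connection =
                         factory.newConnection("topdesk-instance-example")) {
                System.out.println("Connected as: "
                        + connection.getClientProvidedName());
            }
        }
    }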
Final remarks
Through the infrastructure, development and monitoring/logging improvements described above, the messaging system was not only brought back to a stable state in which it can successfully process messages, but the overall health of the system has also improved, so that we can prevent similar issues from occurring in the future.