SaaS Action Sequences disruption NL3

Incident Report for TOPdesk SaaS Status page

Postmortem

Introduction:

Several TOPdesk functionalities need to work together and communicate. This communication runs through a messaging system, in our case Poschd. Poschd recently experienced an overload, resulting in issues with the Action Sequences and Knowledge Base functionalities. The main cause of these disruptions was the addition of more TOPdesk functionalities to the Poschd system.

Cause:

The move of the Action Sequences and Knowledge Base functionalities to the Poschd system, together with the growing number of environments in our datacenters, added substantial load to Poschd. Combined with the existing systems already using Poschd, this led to an unmanageable rise in the number of messages Poschd had to process. As a result, Poschd started to buckle under the increased pressure, eventually crashing in two of our datacenters.

Complications:

When the messaging service was restarted, several functionalities tried to offload their messages simultaneously, causing another overload. It took considerable time to work out how to limit the number of messages Poschd received, which was necessary for it to restart successfully.
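This restart pattern, where all producers flood a recovering broker at once, is commonly mitigated by rate-limiting intake so the system can catch up. As a minimal illustration (not a description of Poschd's actual implementation), a token-bucket limiter caps the accepted message rate and pushes back on the rest:

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: accept at most `rate` messages per
    second on average, with short bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should back off and retry later

# During a restart, the broker would accept a message only when allow()
# returns True; rejected producers back off instead of piling on.
bucket = TokenBucket(rate=100, capacity=10)
accepted = sum(1 for _ in range(1000) if bucket.allow())
```

With this kind of gate in place, the backlog drains at a rate the broker can sustain rather than arriving all at once.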

A contributing factor to the overload of Poschd was the increasing number of environments in our datacenters. Due to technical constraints, there was no capacity to distribute these environments across smaller datacenters.

Additional information:

We were already working on a new messaging system to replace Poschd. The new messaging service will not run into the same problems as Poschd; it is more robust and up to date. This was an extensive and time-consuming project, and at the moment the issues with Poschd started, the new system was not yet ready to be deployed.

Summary:

Our messaging system, Poschd, recently experienced an overload. The cause was primarily the growing number of functionalities that needed to communicate through the messaging system.

The transition of the Action Sequences and Knowledge Base functionalities to Poschd and the increasing number of environments in our datacenters added a significant and unexpected extra load on the messaging system. This resulted in a crash of the messaging system at two of our datacenters. The restart of the messaging service led to another overload as multiple functionalities tried to offload messages at once. The surge in datacenter environments and the lack of capacity to distribute these environments across smaller datacenters also contributed to the overload. Efforts to build a new messaging system to replace Poschd were underway but not yet completed.

Of course, as an organization we keep on learning and adapting our processes to better handle and prevent any possible issues in the future.

Follow-up actions:

  1. All environments have been moved to the new messaging service as of November 2023.
  2. Build more, smaller datacenters so that functionalities cannot be overloaded. The first new datacenter is being deployed, and we will start distributing environments as soon as it is ready. Our goal is to complete this project in 2024; the creation of these new datacenters has our highest priority.
  3. A dedicated project group has been created to update and further formalize our procedures around incidents that affect multiple customers. This way, customers will be updated in a timely manner and know what we are doing to solve the problem, and they will also receive updates on the symptoms of the problem.
Posted Dec 04, 2023 - 13:44 CET

Resolved

We would like to provide you with another update about the ongoing issues with Action Sequences at our NL3 hosting location.

After closely monitoring the performance of the environments that were moved to the new messaging system last night, our engineers will continue with the phased rollout and will move a second subset of environments to this new system this upcoming Sunday, starting at 22:30 CEST.

Our engineers monitored the performance of the current messaging system today and no issues arose. Therefore, we will close this major incident, but we'll continue to monitor the situation throughout the weekend.

Our apologies for any inconvenience.
Posted Sep 29, 2023 - 16:28 CEST

Monitoring

We would like to provide you with another update about the ongoing issues with Action Sequences at our NL3 hosting location.

Last night we successfully migrated a subset of customers that had issues with Action Sequences to the new messaging system. Our engineers are closely monitoring the situation, and we don't see any issues at the moment.

We will evaluate the situation later today and decide whether to move forward with the phased rollout and migrate another set of SaaS environments to the new messaging system.

We will keep you updated.
Posted Sep 29, 2023 - 11:50 CEST

Update

We have identified a subset of environments that were most impacted by the recent disruption and have scheduled a priority rollout to the new messaging system for these environments tonight.

Our engineers updated the configuration of our current messaging system, which will also be restarted this evening; we expect a positive impact.

During the service window, we will be restarting the subset of environments that are switching to the new messaging system.

We have a rollback scenario in place for the environments that will be switched to the new messaging system.

We will keep you updated on the progress.
Posted Sep 28, 2023 - 18:20 CEST

Identified

We’ve decided to move forward with the rollout of the new messaging system for all environments that have the required version which supports the messaging system (All customers within the continuous deployment update group).

Next week we will roll out more structural improvements to our current messaging system. Even though we are adding these structural improvements, our goal is to eventually phase out this messaging system and replace it with the new system.

We have executed a risk assessment on phasing out our current messaging system. We are currently mitigating the risks and therefore it will be a phased rollout.

We will keep you updated on the progress.

Our apologies for the inconvenience.
Posted Sep 28, 2023 - 15:42 CEST

Update

We would like to give a general update on what has been going on with the messaging system and where we are currently at:

This morning our messaging system, which is responsible for sending action sequences, was restarted.
During the restart, the queue within the messaging system grew. The system is designed so that queued messages will eventually be processed, but because of the backlog this took longer than expected. We have increased the capacity for forwarding messages, which should mitigate the backlog.
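The effect of increasing forwarding capacity on a backlog follows directly from the rates involved: a queue only shrinks when the consume rate exceeds the arrival rate. A small sketch with purely illustrative numbers (not actual Poschd figures):

```python
def drain_seconds(backlog: int, arrival_rate: float, consume_rate: float) -> float:
    """Time to clear a message backlog, assuming steady rates.
    The queue only shrinks when consume_rate exceeds arrival_rate."""
    if consume_rate <= arrival_rate:
        return float("inf")  # the backlog never clears
    return backlog / (consume_rate - arrival_rate)

# Hypothetical figures: a backlog of 600,000 messages with 1,000 msg/s
# arriving and 1,500 msg/s consumed drains in 1,200 seconds; doubling
# consume capacity to 3,000 msg/s cuts that to 300 seconds.
slow = drain_seconds(600_000, 1_000, 1_500)  # 1200.0
fast = drain_seconds(600_000, 1_000, 3_000)  # 300.0
```

This is why raising forwarding capacity shortens the recovery window so sharply: the drain rate is the margin between the two rates, not the consume rate itself.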

We are aware of certain drawbacks of our messaging system. We therefore have a new system ready to be rolled out, but due to old TOPdesk versions this cannot be done seamlessly.
Doing it seamlessly would require a few more weeks to implement the structural improvements to our infrastructure.

Today we'll discuss whether we want to shorten this timeline, accepting the downsides that this has for customers on older versions.

For now, our engineers are working with the highest priority to mitigate this issue.

Our apologies for the inconvenience.
Posted Sep 28, 2023 - 11:33 CEST

Investigating

We are currently experiencing problems with action sequences on the NL3 hosting location.
As a result, your TOPdesk environment may not be available.

We are aware of the problem and are working on a solution.

Our apologies for the inconvenience. At the time of writing this we are not able to give you an estimate on when your environment will be available. We aim to update this status page every 30 minutes until the issue has been resolved.

E-mail updates will be sent when the issue has been resolved. You can subscribe on the status page (https://status.topdesk.com) for additional updates.

To inform TOPdesk you are affected by this issue, please visit https://my.topdesk.com/tas/public/ssp/. Please refer to incident TDR23 09 7797.
Posted Sep 28, 2023 - 10:15 CEST
This incident affected: NL3 SaaS hosting location.