SaaS Action Sequences disruption NL3

Incident Report for TOPdesk SaaS Status page

Postmortem

Introduction:

Several TOPdesk functionalities need to work together and communicate. This communication runs through a messaging system, in our case Poschd. Poschd recently experienced an overload, resulting in issues with the Action Sequences and Knowledge Base functionalities. The main cause of these disruptions was the addition of more TOPdesk functionalities to the Poschd system.

Cause:

The move of the Action Sequences and Knowledge Base functionalities to the Poschd system and the increase in the number of environments in the datacenters added substantial load to Poschd. Combined with the load from the systems already using Poschd, this led to an unmanageable rise in the number of messages Poschd had to process. As a result, Poschd started to buckle under the increased pressure, eventually crashing in two of our datacenters.

Complications:

When the messaging service was restarted, several functionalities tried to offload their messages simultaneously, causing another overload. Poschd could only start again successfully once the number of messages it received was limited, and working out how to do that took considerable time.
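The restart overload described above can be illustrated with a small simulation. This is a minimal sketch, not a description of how Poschd was actually limited: the function `drain_backlogs`, the `max_per_tick` cap, and the backlog sizes are all hypothetical and chosen only to show why capping intake flattens the surge when several producers offload at once.

```python
from collections import deque


def drain_backlogs(backlogs, max_per_tick=None):
    """Simulate producers offloading buffered messages after a broker restart.

    backlogs: number of messages each (hypothetical) producer has buffered.
    max_per_tick: cap on messages the broker admits per tick; None means no cap.
    Returns a list with the number of messages admitted in each tick.
    """
    if max_per_tick is not None and max_per_tick <= 0:
        raise ValueError("max_per_tick must be positive")

    pending = deque(b for b in backlogs if b > 0)
    admitted_per_tick = []
    while pending:
        # Without a cap, everything still buffered arrives in a single tick.
        budget = max_per_tick if max_per_tick is not None else sum(pending)
        admitted = 0
        for _ in range(len(pending)):
            if budget == 0:
                break
            remaining = pending.popleft()
            take = min(remaining, budget)
            budget -= take
            admitted += take
            if remaining > take:
                # Producer keeps its leftover backlog and waits for a later tick.
                pending.append(remaining - take)
        admitted_per_tick.append(admitted)
    return admitted_per_tick


# Assumed backlog sizes, for illustration only.
print(drain_backlogs([500, 300, 200]))                    # [1000]
print(drain_backlogs([500, 300, 200], max_per_tick=250))  # [250, 250, 250, 250]
```

Without a cap, the whole backlog hits the broker in one tick. With a cap, the same messages are processed over several ticks, which trades a delay in processing for a bounded peak load.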

The increasing number of environments in our datacenters also contributed to the overload of Poschd. Due to technical difficulties, there was no room to redistribute these environments to smaller datacenters.

Additional information:

We were already working on a new messaging system to replace Poschd. The new messaging service is more robust and up to date, and it will not run into the same problems as Poschd. This was an extensive and time-consuming project, and when the issues with Poschd started, the new system was not yet ready to be implemented.

Summary:

Our messaging system, Poschd, recently experienced an overload. The primary cause was an increase in the number of functionalities that needed to communicate through the messaging system.

The transition of the Action Sequences and Knowledge Base functionalities to Poschd and the increasing number of environments in the datacenters added a significant and unexpected extra load on the messaging system. This resulted in a crash of the messaging system at two of our datacenters. The restart of the messaging service led to another overload as multiple functionalities tried to offload messages at once. The surge in datacenter environments and the lack of space to distribute these environments to smaller datacenters also contributed to the overload. Efforts to create a new messaging system to replace Poschd were underway but not completed.

Of course, as an organization we keep on learning and adapting our processes to better handle and prevent any possible issues in the future.

Follow-up actions:

  1. All environments have been moved to the new messaging service as of November 2023.
  2. Build more and smaller datacenters, so functionalities cannot be overloaded. The first new datacenter is being deployed. We will start distributing environments as soon as the datacenter is ready. Our goal is to complete this project in 2024. The creation of these new datacenters has our highest priority.
  3. A dedicated project group has been created in order to update and further formalize our procedures around incidents where multiple customers are affected. This way customers will be updated in a timely manner and know what we are doing to solve the problem. Also, customers will get updates on the symptoms of the problem.
Posted Dec 04, 2023 - 13:52 CET

Resolved

We would like to provide another update about the ongoing issues in our NL3 hosting location.

Over the course of the last 2 days, we have been migrating the majority of hosted production environments to our new messaging system, with positive results.

We are continuing the migration of all eligible environments as quickly as possible. This means that we will migrate all test environments to the new messaging system tonight.

Our engineers are monitoring the performance of the old and the new messaging system.

The Checklists feature (TOPdesk Labs), which is used by a small subset of customers, is still using the old messaging system but will be migrated in the coming weeks.

We are working on phasing out the old messaging system with the highest priority and we expect to finish this in the near future.

Should you still encounter issues with the Checklists or Action Sequences, please contact TOPdesk Support.

Once the ongoing issues are successfully resolved, we will publish a Root Cause Analysis (RCA) report on the Status page (https://status.topdesk.com).
Posted Oct 04, 2023 - 16:32 CEST

Update

After monitoring the situation, we can now conclude that the issues with our messaging system are resolved and that Action Sequences are operational again.

Starting tonight, over the course of the next 2 days, we will migrate all eligible environments to the new messaging system.
We expect this action will mitigate the main issues with messaging and action sequences that have occurred over the past period for those environments. 

We will continue to monitor the situation and keep this major incident open for further investigation until the mitigations have been successfully implemented.
Posted Oct 02, 2023 - 15:59 CEST

Monitoring

Our engineers have successfully rebooted the messaging system, which is necessary for the full functionality of Action Sequences.

Messages that were sent during the disruption are stored in the queue of the message system.

We are currently processing this queue at a fast pace.

We expect that Action Sequences will be processed, but in a delayed manner.
Since the messages stored in the message system are now being processed, we will start monitoring the system. 

Please reach out to TOPdesk Support if you still have issues within TOPdesk.
Posted Oct 02, 2023 - 11:05 CEST

Update

We are continuing to investigate this issue.
Posted Oct 02, 2023 - 10:38 CEST

Update

Our engineers are actively continuing their investigation of the issue.

We sincerely apologize for any inconvenience this may have caused.

The next update will be at 11:00.
Posted Oct 02, 2023 - 10:11 CEST

Investigating

We are currently experiencing problems with Action Sequences on the NL3 hosting location.
As a result, your TOPdesk environment may not be available.

We are aware of the problem and are working on a solution.

Our apologies for the inconvenience.

E-mail updates will be sent when the issue has been resolved. You can subscribe on the status page (https://status.topdesk.com) for additional updates.

To inform TOPdesk you are affected by this issue, please visit https://my.topdesk.com/tas/public/ssp/ . Please refer to incident TDR23 09 7797.
Posted Oct 02, 2023 - 09:57 CEST
This incident affected: NL3 SaaS hosting location.