SaaS disruption NL action sequences not triggering

Incident Report for TOPdesk SaaS Status page

Postmortem

Introduction:

Several TOPdesk functionalities need to work together and communicate. This communication runs through a messaging system, in our case Poschd. Poschd recently experienced an overload, resulting in issues with the Action Sequences and Knowledge Base functionalities. These disruptions were mainly caused by adding more TOPdesk functionalities to the Poschd system.

Cause:

The move of the Action Sequences and Knowledge Base functionalities to the Poschd system and the increase of the number of environments on the datacenters added substantial load to Poschd. This, coupled with the existing systems already using Poschd, led to an unmanageable rise in messages needing to be processed by Poschd. As a result, Poschd started to buckle under the increased pressure, eventually leading to a Poschd crash in two of our datacenters.

Complications:

When the messaging service was restarted, several functionalities tried to offload their messages simultaneously, causing another overload. Poschd could only start again successfully once the number of messages it received was limited, and it took a lot of time to figure out how to do that.
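To illustrate what limiting the number of incoming messages can look like, here is a minimal sketch of a token bucket, a common rate-limiting technique. It is not a description of how Poschd or TOPdesk handled this; the class name `TokenBucket` and its parameters are hypothetical and chosen only for this example.

```python
import time


class TokenBucket:
    """Hypothetical limiter: admits `rate` messages per second, bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)          # tokens added per second
        self.capacity = float(capacity)  # maximum burst size
        self.clock = clock               # injectable clock, useful for testing
        self.tokens = self.capacity      # start with a full bucket
        self.last = self.clock()

    def try_acquire(self):
        """Return True if one message may be sent now, otherwise False."""
        now = self.clock()
        # Refill in proportion to the elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A producer that gets `False` back would hold its message and retry later, so that a restarted messaging service receives a bounded flow instead of every backlog at once.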

The increasing number of environments on our datacenters also contributed to the overload of Poschd. Due to technical difficulties, there was no space to redistribute these environments to smaller datacenters.

Additional information:

We were already working on a new messaging system to replace Poschd. The new messaging service is more robust and up to date, and will not run into the same problems as Poschd. This was an extensive and time-consuming project. When the issues with Poschd started, the new system was not yet ready to be implemented.

Summary:

Our messaging system, Poschd, recently experienced an overload. This was primarily caused by an increase in the number of functionalities that communicate through the messaging system.

The transition of the Action Sequences and Knowledge Base functionalities to Poschd and the increasing number of environments on the datacenters added a significant and unexpected extra load on the messaging system. This resulted in a crash of the messaging system at two of our datacenters. The restart of the messaging service led to another overload as multiple functionalities tried to offload messages at once. The surge in datacenter environments and the lack of space to distribute these environments to smaller datacenters also contributed to the overload. Efforts to create a new messaging system to replace Poschd were underway but not completed.

Of course, as an organization we keep on learning and adapting our processes to better handle and prevent any possible issues in the future.

Follow-up actions:

  1. All environments have been moved to the new messaging service as of November 2023.
  2. Build more and smaller datacenters, so functionalities cannot be overloaded. The first new datacenter is being deployed. We will start distributing environments as soon as the datacenter is ready. Our goal is to complete this project in 2024. The creation of these new datacenters has our highest priority.
  3. A dedicated project group has been created in order to update and further formalize our procedures around incidents where multiple customers are affected. This way customers will be updated in a timely manner and know what we are doing to solve the problem. Also, customers will get updates on the symptoms of the problem.
Posted Dec 04, 2023 - 13:43 CET

Resolved

After a full day of monitoring the situation our engineers have concluded that this issue is resolved.
Therefore we will close this incident.

If you still experience any issues relating to Checklists or Action Sequences within your TOPdesk environment, please reach out to TOPdesk Support.

A postmortem will follow once the evaluation is completed.
Posted Sep 28, 2023 - 09:18 CEST

Update

Our engineers do not see any issues with our messaging system and Action Sequences are fully operational.
We will continue to monitor the situation.

Please reach out to TOPdesk Support if you still have issues within TOPdesk.
Posted Sep 27, 2023 - 11:09 CEST

Monitoring

Our engineers have been monitoring the situation and all issues seem to be mitigated.

The restart of the messaging system has been scheduled for 21:00 CEST this evening.
Once this restart has concluded, all TOPdesk environments will be restarted during the Service Window tonight.

Please reach out to TOPdesk Support if you still have issues within TOPdesk after the restart.

We will continue to monitor the situation.
Posted Sep 26, 2023 - 16:44 CEST

Update

Our engineers have investigated the messaging queue system and have concluded that all issues should be mitigated.
Any new issues will be investigated separately by our Support team.

This evening we will restart the messaging system to apply the changes made by our engineers. This might cause a delay within the messaging system.
Once the messaging system has rebooted, all environments will get a restart during the Service Window tonight.

Our apologies for the inconvenience.
Posted Sep 26, 2023 - 12:07 CEST

Update

We are still getting reports from customers that Action Sequences and the Checklists are encountering issues.

The scheduled restart of the TOPdesk environments last night has helped clear most of the messages in the queue.

Our engineers are working on this issue with the highest priority.

Please reach out to TOPdesk Support if you still have issues with Action Sequences or Checklists within TOPdesk.
Posted Sep 26, 2023 - 09:23 CEST

Update

The ongoing issues are currently only affecting the Checklists (Lab Feature) subsystem within TOPdesk. The issues with Action Sequences should be mitigated. If you still encounter issues with the Action Sequences please reach out to TOPdesk Support.

To address the problems with the Checklists feature, we will schedule a restart during our Service Window tonight of the TOPdesk environments in which we suspect difficulties with this subsystem (messages remaining in the queue).

We will provide another update tomorrow morning.
Posted Sep 25, 2023 - 16:34 CEST

Update

Our engineers have successfully rebooted the messaging system, which is necessary for the full functionality of Action Sequences. We still see a subset of messages in the queue, indicating that a small group of customers is still facing issues.

Please reach out to TOPdesk Support if you still have issues with Action Sequences or Checklists within TOPdesk.
Posted Sep 25, 2023 - 14:08 CEST

Update

We are currently reaching the limit of our internal message system, which is crucial for communicating and processing Action Sequences within TOPdesk.

Last weekend we rebooted the internal message system for all TOPdesk environments in our NL3 hosting location, which mitigated the issue for the majority of our customers. Due to peak load the issue persists in some TOPdesk environments. Our engineers are currently restarting this message system, but in a staged manner to prevent peak load again.
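As an illustration of the idea behind a staged restart, the sketch below restarts environments in small batches and waits for the message backlog to drain before starting the next batch. It is a simplified example under assumptions, not our actual tooling: the function names, the `restart` and `backlog` callables, and the thresholds are all hypothetical.

```python
import time


def batches(items, size):
    """Yield consecutive groups of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def staged_restart(environments, batch_size, restart, backlog, max_backlog,
                   poll_seconds=5.0, sleep=time.sleep):
    """Restart environments batch by batch (hypothetical helper).

    restart(env)  -- assumed callable that restarts one environment
    backlog()     -- assumed callable returning the current queue length
    max_backlog   -- the next batch starts only once backlog() <= max_backlog
    """
    restarted = []
    for batch in batches(environments, batch_size):
        for env in batch:
            restart(env)
            restarted.append(env)
        # Let the queue drain before adding the load of the next batch.
        while backlog() > max_backlog:
            sleep(poll_seconds)
    return restarted
```

Compared with restarting everything at once, the peak backlog is bounded by roughly one batch of load on top of `max_backlog`, at the cost of a longer total restart time.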

Messages that were sent during the disruption are stored in the queue of the message system. We are currently processing this queue at a fast pace. We expect that Action Sequences will be processed, but with a delay.

At the same time, we are implementing a more reliable and future-proof message system in a staged roll-out.

Our apologies for the inconvenience. We will provide an update around 2:00 PM CEST.
Posted Sep 25, 2023 - 12:05 CEST

Update

We are still experiencing intermittent issues with Action Sequences or Checklists (Labs feature) at our NL3 hosting location. Our engineers are working on this issue with the highest priority.

Our apologies for the inconvenience.
Posted Sep 25, 2023 - 09:50 CEST

Update

We are still investigating the issue.

We apologize for any inconvenience.

Next update: 16:30
Posted Sep 22, 2023 - 15:31 CEST

Update

We are still investigating the issue.

We apologize for any inconvenience.

Next update: 15:00
Posted Sep 22, 2023 - 14:17 CEST

Update

We are continuing to investigate this issue.
Posted Sep 22, 2023 - 13:29 CEST

Update

Our engineers are still investigating the issue.

We sincerely apologize for any inconvenience.

Next update will be 14:00
Posted Sep 22, 2023 - 12:52 CEST

Update

We are still investigating the issue.

We apologize for any inconvenience.

Next update will be 12:30
Posted Sep 22, 2023 - 11:42 CEST

Update

We are continuing the investigation.

Next update will be at 11:30
Posted Sep 22, 2023 - 11:00 CEST

Update

Our engineers are actively continuing their investigation of the issue.


We sincerely apologize for any inconvenience this may have caused.

Next update will be 11:00
Posted Sep 22, 2023 - 10:16 CEST

Investigating

We are currently experiencing issues with Action Sequences not triggering at our NL3 hosting location.

We are aware of the problem and are investigating.

Our apologies for the inconvenience. We aim to update this status blog at least every 30 minutes until the issue has been resolved.

To inform TOPdesk you are affected by this issue, please visit https://my.topdesk.com/tas/public/ssp/ . Please refer to incident TDR23 09 6210.
Posted Sep 22, 2023 - 09:56 CEST
This incident affected: NL3 SaaS hosting location.