Action sequences not triggering at NL3 hosting location

Incident Report for TOPdesk SaaS Status page

Postmortem

Introduction:

Several TOPdesk functionalities need to work together and communicate with each other. This communication runs through a messaging system, in our case Poschd. Poschd recently experienced an overload, resulting in issues with the Action Sequences and Knowledge Base functionalities. These disruptions were mainly caused by moving additional TOPdesk functionalities onto the Poschd system.
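To make this dependency concrete, the sketch below is a minimal, purely illustrative example of one functionality publishing a trigger message to a broker while another consumes it. It uses an in-memory Python queue as a stand-in for the broker; the names and payloads are assumptions for illustration only and do not reflect Poschd's actual interface.

    import queue
    import threading

    # Hypothetical stand-in for the internal message broker; a simple
    # in-memory queue is used here purely for illustration.
    broker = queue.Queue()

    def publish_trigger(event_type, payload):
        # One functionality (e.g. Knowledge Base) publishes a trigger message.
        broker.put({"type": event_type, "payload": payload})

    def action_worker():
        # Another functionality (e.g. Action Sequences) consumes and handles triggers.
        while True:
            message = broker.get()
            if message is None:  # sentinel used to stop the demo worker
                break
            print("handling", message["type"], message["payload"])

    worker = threading.Thread(target=action_worker)
    worker.start()
    publish_trigger("card_resolved", {"card_id": 42})
    broker.put(None)  # stop the worker after the demo message
    worker.join()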

Cause:

The move of the Action Sequences and Knowledge Base functionalities to the Poschd system, together with the growing number of environments in our datacenters, added substantial load to Poschd. Combined with the existing systems already using Poschd, this led to an unmanageable rise in the number of messages Poschd had to process. As a result, Poschd started to buckle under the increased pressure, eventually crashing in two of our datacenters.

Complications:

When the messaging service was restarted, several functionalities tried to offload their messages simultaneously, causing another overload. It took considerable time to work out how to limit the number of incoming messages so that Poschd could start up successfully again.
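In general terms, limiting intake after a restart means throttling how fast senders may offload their backlog. The sketch below shows a generic token-bucket rate limiter on the sending side; it is an assumption-laden illustration of that idea, not the mechanism that was actually applied to Poschd, and all names and numbers are hypothetical.

    import time

    class TokenBucket:
        # Allow at most `rate` messages per second, with bursts up to `capacity`.
        def __init__(self, rate, capacity):
            self.rate = rate
            self.capacity = capacity
            self.tokens = capacity
            self.last = time.monotonic()

        def acquire(self):
            # Refill tokens from elapsed time; block until one token is available.
            while True:
                now = time.monotonic()
                self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
                self.last = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                time.sleep((1 - self.tokens) / self.rate)

    def drain_backlog(messages, publish, limiter):
        # Offload a local backlog without flooding the broker after a restart.
        for message in messages:
            limiter.acquire()  # wait until this sender is allowed to publish again
            publish(message)

    # Example: re-send 200 queued messages at roughly 50 per second (illustrative numbers).
    limiter = TokenBucket(rate=50, capacity=10)
    drain_backlog(range(200), publish=lambda m: None, limiter=limiter)

Applying a limit like this on every sender keeps the combined restart traffic below what the broker can absorb.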

A contributing factor to the overload of Poschd was the increasing number of environments in our datacenters. Due to technical difficulties, there was no room to distribute these environments across smaller datacenters.

Additional information:

We were already working on a new messaging system to replace Poschd. The new messaging service will not run into the same problems as Poschd, as it is more robust and up to date. This was an extensive and time-consuming project, and at the moment the issues with Poschd started, this system was not yet ready to be implemented.

Summary:

Our messaging system, Poschd, recently experienced an overload. The primary cause was the growing number of functionalities that needed to communicate through the messaging system.

The transition of the Action Sequences and Knowledge Base functionalities to Poschd and the increasing number of environments in our datacenters placed a significant and unexpected extra load on the messaging system. This resulted in a crash of the messaging system at two of our datacenters. Restarting the messaging service led to another overload, as multiple functionalities tried to offload messages at once. The surge in datacenter environments and the lack of room to distribute them across smaller datacenters also contributed to the overload. Efforts to build a new messaging system to replace Poschd were underway, but not yet completed.

Of course, as an organization we keep on learning and adapting our processes to better handle and prevent any possible issues in the future.

Follow-up actions:

  1. All environments have been moved to the new messaging service as of November 2023.
  2. Build more, smaller datacenters so that functionalities cannot become overloaded. The first new datacenter is being deployed, and we will start distributing environments as soon as it is ready. Our goal is to complete this project in 2024; the creation of these new datacenters has our highest priority.
  3. A dedicated project group has been created to update and further formalize our procedures around incidents that affect multiple customers. This way, customers will be updated in a timely manner and know what we are doing to solve the problem. Customers will also receive updates on the symptoms of the problem.
Posted Dec 04, 2023 - 13:36 CET

Resolved

The fix is working as intended and queued actions have been processed.
We apologize for any inconvenience this issue may have caused.
Posted Sep 21, 2023 - 14:43 CEST

Monitoring

The fix has been implemented.
As it will take a while for all queued actions to be processed, delays may still be experienced for some time.
We'll monitor the situation for at least the next hour.
Posted Sep 21, 2023 - 13:52 CEST

Update

The fix has been implemented.
As it will take a while for all queued actions to be processed, delays may still be experienced for some time.
We'll monitor the situation for at least the next hour.
Posted Sep 21, 2023 - 13:36 CEST

Update

A fix is being implemented.
Posted Sep 21, 2023 - 12:09 CEST

Update

We have found the issue the messaging service is facing and are working on a fix.
Posted Sep 21, 2023 - 11:14 CEST

Identified

We have pinpointed the issue to a problem with the messaging service, which is not delivering triggers to the action service.
Posted Sep 21, 2023 - 10:37 CEST

Investigating

We are currently experiencing issues with action sequences not triggering at our NL3 hosting location.
We are aware of the problem and are investigating.

Our apologies for the inconvenience. We aim to update this status blog at least every 30 minutes until the issue has been resolved.
Posted Sep 21, 2023 - 10:18 CEST
This incident affected: NL3 SaaS hosting location.