During an upgrade, a few instances did not come back online, which led to the discovery that the TOPdesk messaging system had been allocated insufficient memory. Initial attempts to increase the memory did not resolve the problem, which affected customer Action Services. The issue was complicated by the database server running out of disk space and by the roll-out of a new version of Action Service to handle the queue of actions waiting to be executed. Despite these challenges, the team was able to resolve the issue temporarily. However, the change that increased the memory of the TOPdesk messaging system was unexpectedly reverted, causing further problems. After the team adjusted some settings and confirmed that messages were being picked up, the issue seemed resolved. Later in the day, however, problems with the messaging system led to a build-up of unsent messages. The team then identified a case in which the bugfix roll-out did not work as intended and applied a fix.
The incident highlighted the need for better system metrics, alerting mechanisms, and communication protocols. Various development teams have since taken measures to prevent a recurrence and to improve the system's response to such incidents.