On June 30th at 11:22 AM CEST, the first customer reported performance issues while working in their TOPdesk environment. An hour later, several customers had reported performance issues, and the issue was escalated to investigate a common root cause.
The TOPdesk SaaS hosting team identified hardware issues as a possible cause and contacted the hosting provider to verify this theory. At 13:05 (1:05 PM) the hosting provider confirmed there were issues with the router for the NL3 hosting location that caused the performance issues.
The problem was communicated to customers via our Status page and My TOPdesk. As there were previous indications the router was nearing its maximum capacity, a replacement router had already been ordered and was ready to be taken into production. Together with engineers at the hosting provider, emergency maintenance was scheduled to replace the router.
The router still had to be configured, so we scheduled the maintenance for the end of the next day to allow the hosting provider to properly test the new configuration. Several measures were taken to reduce the load on the router in the meantime. We chose a maintenance window with minimal customer impact, during which engineers from both teams were available to troubleshoot any issues.
The router was replaced at 22:00 (10:00 PM) on Wednesday, July 1st. Engineers were available during a late shift on Wednesday and an early shift on Thursday to remediate any remaining issues, but no further issues were found or reported. During the router replacement, customers may have experienced connection issues for a few seconds.
Root cause
The performance issues in the NL3 datacenter were caused by a router operating at its maximum capacity. Replacing the router was already planned, but had to be carried out sooner. The replacement router had already been ordered, but still had to be configured and taken into production.
Action points
The replacement router will not have any similar capacity issues in the foreseeable future. We've scheduled the creation of another hosting location in Europe to further reduce the load on the NL3 hosting location infrastructure.
Several possible improvements were identified in the way the incident for our internal investigation was initially created and communicated. We're scheduling refresher training for all teams that might create and publish a major incident in the future.