Disruption in NL3 hosting location

Incident Report for TOPdesk SaaS Status page

Postmortem

RCA for Wednesday 2020-09-09

Timeline
11:30 The SaaS Operations team started maintenance on the NL3 network. To estimate its impact, the same maintenance had first been carried out on the data centers in the Americas. Since it had no impact on network availability there, our engineers proceeded with the maintenance on NL3.
12:00 Our monitoring spotted the first signs of problems with the authentication services.
12:25 The automated process created a major incident in our system and we started the major incident procedure.
12:30 The SaaS Operations team started the investigation.
12:43 The pods for the authentication service were restarted.
12:46 Our monitoring showed a green status for the authentication service.

Root cause
During the maintenance, one of the authentication services was unable to communicate with a key part of the services infrastructure. This service did not recover automatically once the maintenance had finished, and it only resumed normal operation after its pods were restarted.
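
As an illustration of what automatic recovery can look like, the sketch below retries a failed connection with exponential backoff instead of giving up after the first error. It is a generic pattern, not a description of the TOPdesk services or of the fix under investigation; the function name `connect_with_backoff`, its parameters, and the `connect` callable are all hypothetical.

```python
import time
from typing import Callable, Optional


def connect_with_backoff(
    connect: Callable[[], object],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> object:
    """Call `connect` until it succeeds or `max_attempts` is reached.

    `connect` is a hypothetical callable that opens a connection to a
    dependency and raises ConnectionError when the dependency is unreachable.
    """
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError as exc:
            last_error = exc
            # Wait before the next attempt, doubling the delay each time
            # up to max_delay. No wait after the final attempt.
            if attempt < max_attempts - 1:
                sleep(min(base_delay * (2 ** attempt), max_delay))
    raise ConnectionError(f"gave up after {max_attempts} attempts") from last_error
```

In a containerized setup, a pattern like this is often combined with a health check that restarts a pod when retries are exhausted, so that a manual restart is not needed.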

Follow-up actions
- The major incident procedure will be updated to work with the automated major incident process.
- Development is investigating a fix to make the key part of the services infrastructure more redundant and resilient.
- The release strategy will be discussed; we aim for a balance between fast updates and stability.
- We will discuss whether there is a more representative test for maintenance changes, in order to better estimate the impact on production environments.

Posted Sep 16, 2020 - 13:49 CEST

Resolved

All the environments are reachable again.

We will evaluate the cause of this major incident in order to prevent similar issues in the future.
Posted Sep 09, 2020 - 13:39 CEST

Monitoring

We have identified an issue. A problem has been found with one of the services which handles authentication.

This has been fixed and we do not expect the problems to recur. Please update your incident if you continue to have problems accessing your environment.
Posted Sep 09, 2020 - 12:53 CEST

Investigating

There are problems reaching SaaS environments in the NL3 data center. We are investigating.

To inform TOPdesk that you are affected by this issue, please visit https://my.topdesk.com/tas/public/ssp/ . Please refer to incident TDR20 09 2570.
Posted Sep 09, 2020 - 12:33 CEST
This incident affected: NL3 SaaS hosting location.