RCA for Wednesday 2020-09-09
Timeline
11:30 The SaaS Operations team started maintenance on the NL3 network. To estimate the impact on the network, the same maintenance had first been performed on the data centers in the Americas. Since that showed no impact on network availability, our engineers proceeded with the maintenance.
12:00 Our monitoring spotted the first signs of problems with the authentication services.
12:25 The automated process created a major incident in our system and we started the major incident procedure.
12:30 The SaaS Operations team started the investigation.
12:43 The pods for the authentication service were restarted.
12:46 Monitoring showed the authentication service as green again.
Root cause
During the maintenance, one of the authentication services was unable to communicate with a key part of the services infrastructure, and the service did not recover automatically.
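As an illustration only, one common way to let a service recover from this kind of failure without manual intervention is to count consecutive failed dependency checks and report the service as unhealthy once a threshold is reached, so that the platform restarts it. The sketch below is not the actual authentication service code; the class name `DependencyWatchdog`, the `check` callable and the threshold of three failures are all assumptions made for the example.

```python
from typing import Callable


class DependencyWatchdog:
    """Hypothetical helper: tracks consecutive failures of a dependency check."""

    def __init__(self, check: Callable[[], bool], max_failures: int = 3) -> None:
        self.check = check                # assumed callable that pings the dependency
        self.max_failures = max_failures  # assumed threshold, not a real setting
        self.failures = 0

    def probe(self) -> bool:
        """Run one check; return False once the failure threshold is reached."""
        try:
            ok = self.check()
        except Exception:
            # A raised error counts as a failed check.
            ok = False
        # Any successful check resets the counter.
        self.failures = 0 if ok else self.failures + 1
        return self.failures < self.max_failures
```

If the result of `probe()` were exposed through a health endpoint that the platform polls, a service stuck without its dependency would be restarted automatically instead of waiting for a manual pod restart. The threshold keeps a single short network blip from triggering a restart.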
Follow-up actions
- The major incident procedure will be updated to work with the automated major incident system.
- Development is investigating a fix to make the key part of the services infrastructure more redundant and resilient.
- The release strategy will be discussed; we aim for a balance between fast updates and stability.
- We will discuss whether there is a more representative test for maintenance changes, in order to better estimate the impact on production environments.