On Thursday May 28th from 12:15 (12:15 PM) CEST we received several reports of customers having difficulties accessing their environments in our UK1 datacenter.
Our initial research into possible causes showed an outage at British Telecom at that time as well, which proved to be unrelated.
Further research pointed towards the authentication service being affected. Subsequently, at 13:30 (01:30 PM) CEST the issue was escalated to the responsible development team who confirmed an issue with our authentication service (the passlayer).
At 14:31 (02:31 PM) CEST the issue was resolved, and customers could log in again.
We continued to monitor the situation and no remaining issues were identified.
Root cause
During a routine rollover of a new version in our authentication system, part of the system became unresponsive due to a transient network error. This prevented the update to finish successfully.
Follow up actions
We have been working on improving the process around authentication service upgrades, in order to ensure that our system handles these problems in a more robust way.
Our apologies for the delay in publishing this RCA. This was due to a miscommunication between the team investigating and the team in charge of publishing the RCA. We have reassessed our major incident communication processes to ensure such delays will not occur in the future.