While the databases within the cluster were being synchronized, the primary node became unresponsive and received an automated reboot signal from STONITH. The secondary node was still synchronizing the databases and therefore could not promote itself, which left the cluster unstable and unable to elect a primary.
The primary node could not recover automatically; it required manual intervention and was subsequently forced, by hand, back into the master role.
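For context, the failure mode can be pictured as a promotion-eligibility guard along the lines of the sketch below. The state names and the check are a simplified assumption about how the cluster manager decides whether a node may be promoted, not its actual implementation.

```python
from enum import Enum


class NodeState(Enum):
    IN_SYNC = "in_sync"
    SYNCHRONIZING = "synchronizing"
    UNRESPONSIVE = "unresponsive"


def can_promote(state: NodeState) -> bool:
    """A node that is still copying data is not a safe promotion target,
    because its copy of the databases may be incomplete or inconsistent."""
    return state is NodeState.IN_SYNC


# During the outage the secondary was still SYNCHRONIZING, so the cluster
# had no eligible candidate and could not elect a new primary.
assert not can_promote(NodeState.SYNCHRONIZING)
```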
For the above situation to occur, all of the following conditions needed to be met:
Such a combination of conditions is exceedingly rare, so we do not expect similar outages to occur in the future.
In addition to the above:
In the unlikely event that a similar outage does occur, we have prepared a script to recover from it more quickly.
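As a rough illustration of the shape such a recovery helper could take, here is a minimal sketch. The command names (cluster-ctl sync-status, cluster-ctl promote) and the timing values are hypothetical placeholders standing in for the cluster's own tooling; they are not the commands actually used.

```python
#!/usr/bin/env python3
"""Illustrative outline of a recovery helper for the outage described above."""
import subprocess
import sys
import time

# Hypothetical placeholders for the cluster's own tooling.
CHECK_SYNC_CMD = ["cluster-ctl", "sync-status"]   # exits 0 once replication has caught up
PROMOTE_CMD = ["cluster-ctl", "promote"]          # forces this node to become primary
POLL_INTERVAL_S = 10
MAX_WAIT_S = 600


def wait_for_sync() -> bool:
    """Poll the cluster tooling until the node reports it is fully synchronized."""
    deadline = time.monotonic() + MAX_WAIT_S
    while time.monotonic() < deadline:
        result = subprocess.run(CHECK_SYNC_CMD, capture_output=True)
        if result.returncode == 0:
            return True
        time.sleep(POLL_INTERVAL_S)
    return False


def main() -> int:
    # Only promote once synchronization has finished; promoting a node that is
    # still catching up is exactly what the cluster refused to do automatically.
    if not wait_for_sync():
        print("node never reached a synchronized state; aborting", file=sys.stderr)
        return 1
    promote = subprocess.run(PROMOTE_CMD)
    if promote.returncode != 0:
        print("promotion failed; escalate to the on-call engineer", file=sys.stderr)
        return promote.returncode
    print("node promoted to primary")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```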
Lastly, we will train our operations staff on how to respond during such outages in the future.