On August 21st, 2024, we encountered the first in a series of performance issues affecting a subset of SaaS environments hosted in the UK1 data center. Over the subsequent weeks, our team conducted an extensive investigation, holding numerous internal discussions, communicating with affected customers, and collaborating with third-party service providers to address and resolve the disruption.
The initial performance issues were logged on August 21st, with a significant increase in reported cases on August 23rd. Throughout that day, our team continuously updated the status page and engaged in extensive internal communication and customer consultations. Multiple calls were held with various stakeholders to better understand the scope and impact of the issue.
As part of the investigation, we established new dashboards to display the volume of incoming requests at each ingress point. Using these metrics, we observed several sudden drops in requests at the London (LDN) ingress point, coinciding with the times our customers reported experiencing slowness.
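To illustrate the kind of check these dashboards enabled, the sketch below flags sudden drops in per-minute request counts against a short rolling baseline. The series, window size, and threshold are illustrative rather than our production values.

```python
# Illustrative sketch: flag sudden drops in per-minute request counts at an
# ingress point, relative to a short rolling baseline. Numbers are placeholders.
from statistics import mean

def find_request_drops(counts_per_minute, window=10, drop_ratio=0.5):
    """Return indices (minutes) where traffic fell below drop_ratio
    of the trailing `window`-minute average."""
    drops = []
    for i in range(window, len(counts_per_minute)):
        baseline = mean(counts_per_minute[i - window:i])
        if baseline > 0 and counts_per_minute[i] < drop_ratio * baseline:
            drops.append(i)
    return drops

if __name__ == "__main__":
    # Simulated LDN ingress counts: steady traffic with a sudden dip.
    series = [1200, 1180, 1210, 1190, 1205, 1195, 1188, 1202, 1198, 1207,
              1201, 430, 390, 1199, 1203]
    print(find_request_drops(series))  # -> [11, 12]
```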
In the following days, we ran MTR traceroutes to determine where requests were being lost and engaged directly with affected customers. We also restarted our edge proxies and closely monitored the situation. Despite these efforts, the cause remained undetermined.
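For context, a traceroute check of this kind can be scripted along the following lines. The target hostname is a placeholder, and while the flags shown are standard mtr report-mode options, this is a sketch rather than the exact invocation we used.

```python
# Illustrative sketch: periodically run mtr in report mode against an affected
# host and archive the output for comparison across incident windows.
import subprocess
import time
from datetime import datetime, timezone

TARGET = "example-ingress.uk1.example.com"  # placeholder, not a real endpoint

def run_mtr(target, cycles=100):
    # -r: report mode, -w: wide output, -n: skip DNS lookups,
    # -c: number of probe cycles
    result = subprocess.run(
        ["mtr", "-r", "-w", "-n", "-c", str(cycles), target],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

if __name__ == "__main__":
    for _ in range(3):  # take a few samples spaced out over time
        stamp = datetime.now(timezone.utc).isoformat()
        report = run_mtr(TARGET)
        with open("mtr_samples.log", "a") as f:
            f.write(f"--- {stamp} ---\n{report}\n")
        time.sleep(300)  # wait five minutes between samples
```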
By August 28th, we proposed migrating the affected customers to the UK2 data center to alleviate the problem, as the issues appeared to be specific to the infrastructure on UK1. This proposal was communicated to all affected customers, and upon receiving their approval, the migrations to UK2 were scheduled accordingly.
Throughout early September, we continued collaborating with our service providers. Extensive investigations revealed no packet loss within their networks. We conducted further tests and internal investigations, adding probes to track missing connections and continuously monitoring the situation.
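As an illustration of such a probe, the sketch below periodically attempts an HTTPS request to an ingress endpoint and records latency and failures. The URL, interval, and timeout values are placeholders rather than our actual probe configuration.

```python
# Illustrative sketch of a connection probe: attempt an HTTPS request on a
# fixed interval and log the outcome and latency. All values are placeholders.
import time
import urllib.request
import urllib.error

PROBE_URL = "https://probe.uk1.example.com/healthz"  # placeholder URL
INTERVAL_SECONDS = 30
TIMEOUT_SECONDS = 10

def probe_once(url):
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT_SECONDS) as resp:
            return ("ok", resp.status, time.monotonic() - start)
    except (urllib.error.URLError, TimeoutError) as exc:
        return ("failed", str(exc), time.monotonic() - start)

if __name__ == "__main__":
    while True:
        outcome, detail, elapsed = probe_once(PROBE_URL)
        print(f"{time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime())} "
              f"{outcome} detail={detail} latency={elapsed:.2f}s")
        time.sleep(INTERVAL_SECONDS)
```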
By mid-September, the issue had become intermittent, and no further significant problems were reported. This may have been because most of the affected customers had already been migrated to the other data center, leaving the ingress point with less traffic to handle; however, we could not rule out the possibility that the issue resolved on its own. After internal discussions, the team decided to finalize and close the investigation on September 18th, updating the status page to RESOLVED.
Despite our extensive efforts and the significant resources devoted to diagnosing the problem, we were unable to pinpoint the exact root cause of the performance issues. Our investigation did, however, allow us to narrow down the potential contributors.
Importantly, we were able to exclude all components within our own infrastructure as the culprits. Our comprehensive tests and investigations confirmed that our internal systems were operating correctly, reinforcing our focus on external factors as potential sources of the problem.
We narrowed the issue down to the perimeter between our CDN and IaaS providers. The complexity of the issue stems from the difficulty of determining the source of latency when we manage only a portion of the network path. Accurately pinpointing the cause is further complicated by the intricate and often unpredictable nature of the internet.
We understand the frustration and inconvenience these performance issues have caused our customers. Our team dedicated significant resources and time to diagnose and resolve the problem, engaging with multiple stakeholders and conducting in-depth technical investigations. We remain committed to ensuring the highest levels of service reliability and performance.
Thank you for your patience and understanding. If you have any further questions or concerns, please do not hesitate to reach out to our support team.