On August 21st, 2024, we encountered the first in a series of performance issues affecting a subset of SaaS environments hosted in the UK1 data center. Over the subsequent weeks, our team conducted an extensive investigation, holding numerous internal discussions, communicating with affected customers, and collaborating with third-party service providers to address and resolve the disruption.
The initial performance issues were logged on August 21st, with a significant increase in reported cases on August 23rd. Throughout that day, our team continuously updated the status page and engaged in extensive internal communication and customer consultations. Multiple calls were held with various stakeholders to better understand the scope and impact of the issue.
As part of the investigation, we established new dashboards to display the volume of incoming requests at each ingress point. Using these metrics, we observed several sudden drops in requests at the London (LDN) ingress point, coinciding with the times our customers reported experiencing slowness.
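To illustrate the kind of check these dashboards enabled, the sketch below flags sudden drops in per-minute request counts against a short rolling baseline. The series, window size, and threshold are illustrative rather than our production values.

```python
# Illustrative sketch: flag sudden drops in per-minute request counts at an
# ingress point, relative to a short rolling baseline. Numbers are placeholders.
from statistics import mean

def find_request_drops(counts_per_minute, window=10, drop_ratio=0.5):
    """Return indices (minutes) where traffic fell below drop_ratio
    of the trailing `window`-minute average."""
    drops = []
    for i in range(window, len(counts_per_minute)):
        baseline = mean(counts_per_minute[i - window:i])
        if baseline > 0 and counts_per_minute[i] < drop_ratio * baseline:
            drops.append(i)
    return drops

if __name__ == "__main__":
    # Simulated LDN ingress counts: steady traffic with a sudden dip.
    series = [1200, 1180, 1210, 1190, 1205, 1195, 1188, 1202, 1198, 1207,
              1201, 430, 390, 1199, 1203]
    print(find_request_drops(series))  # -> [11, 12]
```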
In the following days, we ran MTR traceroutes to determine where requests were being lost and engaged directly with affected customers. We also restarted our edge proxies and closely monitored the situation. Despite these efforts, the cause remained undetermined.
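For context, a traceroute check of this kind can be scripted along the following lines. The target hostname is a placeholder, and while the flags shown are standard mtr report-mode options, this is a sketch rather than the exact invocation we used.

```python
# Illustrative sketch: periodically run mtr in report mode against an affected
# host and archive the output for comparison across incident windows.
import subprocess
import time
from datetime import datetime, timezone

TARGET = "example-ingress.uk1.example.com"  # placeholder, not a real endpoint

def run_mtr(target, cycles=100):
    # -r: report mode, -w: wide output, -n: skip DNS lookups,
    # -c: number of probe cycles
    result = subprocess.run(
        ["mtr", "-r", "-w", "-n", "-c", str(cycles), target],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

if __name__ == "__main__":
    for _ in range(3):  # take a few samples spaced out over time
        stamp = datetime.now(timezone.utc).isoformat()
        report = run_mtr(TARGET)
        with open("mtr_samples.log", "a") as f:
            f.write(f"--- {stamp} ---\n{report}\n")
        time.sleep(300)  # wait five minutes between samples
```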
By August 28th, we proposed migrating the affected customers to the UK2 data center to alleviate the problem, as the issues appeared to be specific to the infrastructure on UK1. This proposal was communicated to all affected customers, and upon receiving their approval, the migrations to UK2 were scheduled accordingly.
Throughout early September, we continued collaborating with our service providers. Extensive investigations revealed no packet loss within their networks. We conducted further tests and internal investigations, adding probes to track missing connections and continuously monitoring the situation.
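As an illustration of such a probe, the sketch below periodically attempts an HTTPS request to an ingress endpoint and records latency and failures. The URL, interval, and timeout values are placeholders rather than our actual probe configuration.

```python
# Illustrative sketch of a connection probe: attempt an HTTPS request on a
# fixed interval and log the outcome and latency. All values are placeholders.
import time
import urllib.request
import urllib.error

PROBE_URL = "https://probe.uk1.example.com/healthz"  # placeholder URL
INTERVAL_SECONDS = 30
TIMEOUT_SECONDS = 10

def probe_once(url):
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT_SECONDS) as resp:
            return ("ok", resp.status, time.monotonic() - start)
    except (urllib.error.URLError, TimeoutError) as exc:
        return ("failed", str(exc), time.monotonic() - start)

if __name__ == "__main__":
    while True:
        outcome, detail, elapsed = probe_once(PROBE_URL)
        print(f"{time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime())} "
              f"{outcome} detail={detail} latency={elapsed:.2f}s")
        time.sleep(INTERVAL_SECONDS)
```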
By mid-September, the issue had become intermittent, and no further significant problems were reported. This may have been because most of the affected customers had already been migrated to the other data center, leaving the ingress point with less traffic to handle; however, we could not rule out the possibility that the issue resolved on its own. After internal discussions, the team decided to finalize and close the investigation on September 18th, updating the status page to RESOLVED.
Despite our extensive efforts and the significant resources devoted to diagnosing the problem, we were unable to pinpoint the exact root cause of the performance issues. Our investigation did, however, allow us to narrow down the potential contributors.
Importantly, we were able to exclude all components within our own infrastructure as the culprits. Our comprehensive tests and investigations confirmed that our internal systems were operating correctly, reinforcing our focus on external factors as potential sources of the problem.
We narrowed the issue down to the perimeter between our CDN and IaaS providers. The complexity of the issue stems from the difficulty of determining the source of latency when we manage only a portion of the network path. Accurately pinpointing the cause is further complicated by the intricate and often unpredictable nature of the internet.
We understand the frustration and inconvenience these performance issues have caused our customers. Our team dedicated significant resources and time to diagnose and resolve the problem, engaging with multiple stakeholders and conducting in-depth technical investigations. We remain committed to ensuring the highest levels of service reliability and performance.
Thank you for your patience and understanding. If you have any further questions or concerns, please do not hesitate to reach out to our support team.