Increased error rate in Cypress Dashboard

Incident Report for Cypress

Postmortem

At approximately 2:00 PM EST, our database connection pool was exhausted. All sessions were stuck performing expensive work, which caused cascading failures as new requests attempted to reach the database. Both reads and writes to the production database were affected, which was visible to end users viewing the Dashboard and to CI jobs attempting to record Cypress tests.
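As a rough illustration of this failure mode (not our actual service code), the sketch below models a fixed-size connection pool in Python. The names `ConnectionPool`, `PoolExhausted`, and `simulate`, the pool size, and the timeout are all hypothetical. It shows that once slow work holds every connection, each new request fails until those sessions are released.

```python
import queue


class PoolExhausted(Exception):
    """Raised when no connection becomes free within the timeout."""


class ConnectionPool:
    """A toy fixed-size pool; connections are plain integer placeholders."""

    def __init__(self, size):
        self._free = queue.Queue(maxsize=size)
        for conn_id in range(size):
            self._free.put(conn_id)

    def acquire(self, timeout=0.01):
        # Wait briefly for a free connection, then give up.
        try:
            return self._free.get(timeout=timeout)
        except queue.Empty:
            raise PoolExhausted("no free connection") from None

    def release(self, conn_id):
        self._free.put(conn_id)


def simulate(pool_size=3, incoming_requests=5):
    """Return (failed_requests, recovered_after_release)."""
    pool = ConnectionPool(pool_size)

    # Slow queries take every connection and do not give them back yet.
    held = [pool.acquire() for _ in range(pool_size)]

    # New requests now time out waiting for a connection.
    failed = 0
    for _ in range(incoming_requests):
        try:
            pool.release(pool.acquire())
        except PoolExhausted:
            failed += 1

    # Ending the slow sessions returns their connections to the pool.
    for conn_id in held:
        pool.release(conn_id)
    recovered = pool.acquire()
    pool.release(recovered)
    return failed, recovered is not None
```

Calling `simulate()` with the defaults reports that all five incoming requests fail while the pool is held, and that a connection can be acquired again once the held sessions are released.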

During the investigation, we identified a query that was a bottleneck for many requests. We flushed the sessions that were executing this query to give us more headroom to investigate and restore the service. Around 3:30 PM EST, we released a hotfix to mitigate the bottleneck, which restored service to acceptable levels.

After further assessing what led to the initial pool exhaustion, we opted to perform immediate maintenance on our production database. We were able to tune the system to prevent a similar failure from occurring in the near future. As of 4:00 PM EST, the service was fully operational again.

We're continuing to evaluate ways to further improve the performance of the Dashboard service. We plan to address many of these, such as serving more requests from our read replica, in our next development sprint.
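For readers unfamiliar with the read-replica approach, here is a minimal sketch of one way to route statements, written in Python. It is an assumption-laden illustration, not a description of how the Dashboard does or will do it: the function name `choose_target`, the `"primary"` and `"replica"` labels, and the prefix-based check are all hypothetical.

```python
# Statement prefixes treated as read-only in this sketch (an assumption).
READ_PREFIXES = ("select", "show")


def choose_target(sql, in_transaction=False):
    """Return "replica" for standalone read-only statements, else "primary"."""
    statement = sql.lstrip().lower()
    is_read = statement.startswith(READ_PREFIXES)

    # SELECT ... FOR UPDATE takes row locks, so it must run on the primary.
    if is_read and "for update" in statement:
        is_read = False

    # Reads inside a transaction stay on the primary so they see that
    # transaction's own writes.
    if is_read and not in_transaction:
        return "replica"
    return "primary"
```

For example, `choose_target("SELECT * FROM runs")` returns `"replica"`, while `choose_target("UPDATE runs SET status = 'passed'")` returns `"primary"`. A real router also has to account for replication lag, since a replica may briefly serve stale data after a write.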

We apologize for the inconvenience this caused you today. We know you rely on the Cypress Dashboard for your build processes, and we're focused on ensuring we interrupt those builds as little as possible.

Posted Feb 19, 2020 - 01:29 UTC

Resolved

This incident has been resolved.

Posted Feb 18, 2020 - 23:08 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Feb 18, 2020 - 21:55 UTC

Update

We've implemented a hotfix to address part of the issue. It's helped improve service quality slightly. We're continuing to work on improving system performance.

Posted Feb 18, 2020 - 21:22 UTC

Update

We are continuing to work on a fix for this issue.

Posted Feb 18, 2020 - 20:30 UTC

Identified

We have identified the source of the errors. We are working to address the issue now.

Posted Feb 18, 2020 - 19:58 UTC

Update

We're continuing to see an increased error rate from the Cypress Dashboard. This is impacting the Dashboard UI as well as CI recording.

Posted Feb 18, 2020 - 19:46 UTC

Investigating

We are currently investigating this issue.

Posted Feb 18, 2020 - 19:38 UTC

This incident affected: Cloud.