At approximately 2:00PM EST, our database connection pool was exhausted. All sessions were stuck performing expensive work, which caused cascading problems as new requests attempted to interact with the database. This impacted both reads and writes to the production database, which was apparent both to end users viewing the Dashboard and to CI jobs attempting to record Cypress tests.
During our investigation, we identified a query that was causing a bottleneck for many requests. We flushed the sessions that were executing this query to give us more headroom to investigate and restore the service. Around 3:30PM EST, we released a hotfix to mitigate this bottleneck, which restored service to acceptable levels.
After further assessing what led to the initial pool exhaustion, we opted to perform immediate maintenance on our production database. We were able to tune the system to prevent a similar failure from occurring in the near future. As of 4:00PM EST, the service was fully operational again.
We're continuing to evaluate ways to further improve the performance of the Dashboard service, such as serving more requests from our read replica. We plan to address many of these in our next development sprint.
We apologize for the inconvenience this caused you today. We know you rely on the Cypress Dashboard for your build processes, and we're focused on ensuring we interrupt those builds as little as possible.