Incident Summary: On July 22, 2024, our platform experienced three back-to-back periods of downtime. A partner's script ran several times concurrently, and each run triggered an inefficient database call. The combined load exhausted our database resources faster than they could scale automatically.
Customer Impact: During the outages, which lasted a combined twelve minutes, users encountered a maintenance page advising that we were offline.
Root Cause: A series of database queries increased load faster than our autoscaler could add database capacity. Queries timed out and returned errors, which caused the load balancer to mark app nodes as unhealthy and take them offline.
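To illustrate the failure chain described above, here is a minimal Python sketch of how repeated health check timeouts can take an app node out of rotation. The `Node` class, the `record_health_check` function, the 5-second timeout, and the threshold of three consecutive failures are all illustrative assumptions, not the platform's actual load balancer settings.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical app node as seen by a load balancer."""
    name: str
    consecutive_failures: int = 0
    in_rotation: bool = True


def record_health_check(node: Node, query_seconds: float,
                        timeout_seconds: float = 5.0,
                        unhealthy_threshold: int = 3) -> None:
    """Update a node's state from one health check result.

    A check fails when the node's database query exceeds the timeout.
    After enough consecutive failures the node is taken out of rotation;
    a single passing check puts it back (assumed policy, for illustration).
    """
    if query_seconds > timeout_seconds:
        node.consecutive_failures += 1
    else:
        node.consecutive_failures = 0
        node.in_rotation = True
    if node.consecutive_failures >= unhealthy_threshold:
        node.in_rotation = False


# One fast check, then three checks slowed by an overloaded database.
node = Node("app-1")
for latency in [0.2, 7.5, 9.0, 12.0]:
    record_health_check(node, latency)
print(node.in_rotation)
```

In this sketch the node itself is running normally. It is removed only because the database it depends on answers too slowly, which matches the pattern described in the root cause.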
Resolution Steps: The ATS team responded to the alerts, identified the root cause, and took steps to prevent a recurrence. App nodes were rebooted, and the issue was communicated to the team and posted to the Status Page while engineers investigated the resource issue and managed the scaling of the database.
Preventative Measures: In the short term, the database was significantly scaled up to handle the immediate demand. Moving forward, we will over-provision resources to ensure availability for large partner requests and avoid similar outages.
We deeply apologize for the inconvenience these outages caused you. We understand the importance of our ATS in your daily operations and are committed to preventing such incidents in the future. Our team is dedicated to continuously improving our systems to provide a reliable and robust service.
Thank you for your patience and understanding.
If you have any further questions or concerns, please do not hesitate to reach out to our support team.