Some Users May Not Be Able to Access the ATS

Incident Report for Applicant Tracking Software

Postmortem

Incident Summary: On July 22, 2024, our platform experienced three back-to-back periods of downtime due to an inefficient database call triggered by a partner's script running multiple times simultaneously. This overwhelmed our database resources faster than they could automatically scale.

Customer Impact: During the outages, which lasted a combined twelve minutes, users encountered a maintenance page advising that we were offline.

Root Cause: The root cause was a series of database queries that consumed resources faster than our autoscaler could add capacity. This led to query timeouts and subsequent errors, causing the load balancer to mark app nodes as unhealthy and take them offline.
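
The failure chain described above can be illustrated with a minimal Python sketch. It is not our production configuration: the names (AppNode, health_check, simulate), the one-second timeout, the two-failure threshold, and the latency figures are all hypothetical, chosen only to show how slow database queries can lead a load balancer to remove otherwise working app nodes.

```python
from dataclasses import dataclass

QUERY_TIMEOUT_S = 1.0      # hypothetical query timeout, in seconds
UNHEALTHY_THRESHOLD = 2    # hypothetical: consecutive failed checks before removal


@dataclass
class AppNode:
    name: str
    consecutive_failures: int = 0
    in_rotation: bool = True


def health_check(node: AppNode, db_latency_s: float) -> None:
    """Record one health check; the check fails if the query times out."""
    if db_latency_s > QUERY_TIMEOUT_S:
        node.consecutive_failures += 1
    else:
        node.consecutive_failures = 0
    if node.consecutive_failures >= UNHEALTHY_THRESHOLD:
        # The load balancer stops routing traffic to this node.
        # Recovery (for example, a reboot) is left out of this sketch.
        node.in_rotation = False


def simulate(latencies_s: list[float], node_count: int = 3) -> list[AppNode]:
    """Run every node through the same sequence of database latencies."""
    nodes = [AppNode(f"app-{i}") for i in range(node_count)]
    for latency in latencies_s:
        for node in nodes:
            health_check(node, latency)
    return nodes


if __name__ == "__main__":
    # Latency climbs as concurrent requests outpace database capacity.
    for node in simulate([0.2, 0.4, 1.5, 2.0]):
        print(node.name, "in rotation" if node.in_rotation else "offline")
```

In this sketch every node sees the same database latency, so all of them leave rotation together once two checks in a row time out. A single slow check followed by a fast one resets the counter and keeps the node in service.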

Resolution Steps: The ATS team responded to the alerts, identified the root cause, and took steps to prevent the issue from recurring. App nodes were rebooted, and the issue was communicated to the team and posted to the Status Page while Engineers investigated the resource issue and managed the scaling of the database.

Preventative Measures: In the short term, the database was significantly scaled up to handle the immediate demand. Moving forward, we will over-provision resources to ensure availability for large partner requests and avoid similar outages.

We deeply apologize for the inconvenience this outage caused our users. We understand the importance of our ATS in your daily operations and are committed to preventing such incidents in the future. Our team is dedicated to continuously improving our systems to provide a reliable and robust service.

Thank you for your patience and understanding.

If you have any further questions or concerns, please do not hesitate to reach out to our support team.

Posted Jul 22, 2024 - 16:21 PDT

Resolved

While we have plans to address the root cause of this issue through several additional fixes, we are confident that this first stage of repair will provide full system stability. The Applicant Tracking platform is operating as expected, and we do not anticipate further outages from this incident. Our team continues to monitor the platform to ensure a seamless experience and will be making further upgrades to improve the reliability of access. If you have any questions or need assistance, our support team is always here to help.

Posted Jul 22, 2024 - 12:05 PDT

Monitoring

Our Engineering team has implemented a solution and is actively monitoring the Applicant Tracking platform to ensure a smooth login experience. There is still work to do to prevent this issue from recurring, so we might see additional downtime over the next 24 hours. If you encounter any difficulties accessing the platform, we recommend clearing your cache and cookies and then attempting to access the platform again. Should the issue persist, feel free to reach out to our support team. We're here, ready to assist if needed.

Posted Jul 22, 2024 - 10:04 PDT

Investigating

Some users may be experiencing trouble when logging in to their Applicant Tracking platform and may be shown the message "We'll be back soon." Our Engineering team has put on their investigator hats and is currently rummaging around in the code, looking for clues.

Here's what we know so far:
-This is a high-priority incident
-We are investigating the cause of the issue and dedicating Engineering time and attention to solving it quickly
-The issue is affecting some users and appears to be intermittent

In the meantime, please:
-Refresh your Applicant Tracking System page periodically to see if the issue has been resolved.

Our team of tech wizards is diligently investigating the issue and working tirelessly to resolve it promptly. We appreciate your patience and understanding as we navigate this technical challenge.

Posted Jul 22, 2024 - 09:46 PDT

This incident affected: Account Access.