A deployment of our authentication service left some customers unable to log in to our application for 9 minutes. Customer Support reported the issue 5 minutes before our alerting system detected it. We rolled back the bad deployment, which restored service immediately. Approximately 50 users were impacted by the issue, and we proactively reached out to them directly to apologise.
<aside> 📆 All timestamps in UTC
</aside>
Date / Time | Event |
---|---|
2023-03-03 | |
09:36:09 | The incident was automatically opened from an alert triggered by our monitoring system, Prometheus. |
… | … |
… | … |
… | … |
10:12:10 | We confirmed that no additional errors or alerts were firing and closed the incident. |
<aside> ➕ What were the contributing factors for this incident? Think of these less as causes, and more as the set of conditions that had to manifest for this incident to occur, or reach the assigned severity.
</aside>
Summary | Details |
---|---|
The on-call engineer was on the subway at the time | We have a 15-minute response SLA for on-call engineers, and the engineer who was first responder for this incident was delayed on a subway train. This increased the time it took us to start resolving the incident. |
… | … |
<aside> ➖ What things prevented this incident from being worse? Think of these as the set of things that reduced the overall impact.
</aside>
Summary | Details |
---|---|
Deployment transparency | We recently connected our service deployments to the #deployments-pulse channel in Slack. Early in the incident, we noticed that the error timestamps correlated with the deployment of the authentication service, which helped us decide to roll back more quickly. |
… | … |
<aside> 🎓 What did we learn here, and what risks has this uncovered?
</aside>
Summary | Details |
---|---|
Our process for requesting on-call overrides is too cumbersome | The on-call engineer knew they’d be out of signal for more than 15 minutes, but the process for requesting an override was cumbersome enough that they decided to take the risk rather than hand over to someone else. |
… | … |