Summary

A deployment of our authentication service left some customers unable to log in to our application for 9 minutes. Customer Support reported the issue 5 minutes before our alerting system detected it. We rolled back the bad deployment, which restored service immediately. Approximately 50 users were impacted by the issue, and we proactively reached out to them directly to apologise.


🔗 Important Links


ℹ️ Key Information


👪 Team

⏱️ Key Timestamps


⏳ Durations


Incident Timeline

<aside> 📆 All timestamps in UTC

</aside>

| Date / Time | Event |
| --- | --- |
| 2023-03-03 | |
| 09:36:09 | The incident was automatically opened from an alert triggered by our monitoring system, Prometheus. |
| 10:12:10 | We confirmed no additional errors or alerts were firing and closed the incident. |
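
The gap between the two timeline entries above can be worked out with a short script. This is a minimal sketch, assuming the two timestamps are the only inputs. The helper name `parse_utc` and the variable names are illustrative and not part of any incident tooling. Note that it measures how long the incident was open, which is a different figure from the 9-minute customer impact window described in the summary.

```python
from datetime import datetime, timezone

# Format used for the combined date and time values from the timeline.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_utc(value: str) -> datetime:
    """Parse a timeline timestamp and mark it as UTC (hypothetical helper)."""
    return datetime.strptime(value, TIMESTAMP_FORMAT).replace(tzinfo=timezone.utc)


# Timestamps copied from the timeline table (all times in UTC).
opened_at = parse_utc("2023-03-03 09:36:09")
closed_at = parse_utc("2023-03-03 10:12:10")

# Subtracting two datetimes yields a timedelta.
open_duration = closed_at - opened_at
total_seconds = int(open_duration.total_seconds())
minutes, seconds = divmod(total_seconds, 60)

print(f"Incident open for {minutes}m {seconds:02d}s")  # Incident open for 36m 01s
```

The same subtraction works for any other pair of key timestamps once they are recorded.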

Contributors

<aside> ➕ What were the contributing factors for this incident? Think of these less as causes, and more as the set of conditions that had to manifest for this incident to occur, or reach the assigned severity.

</aside>

| Summary | Details |
| --- | --- |
| The on-call engineer was on the subway at the time | We have a 15-minute response SLA for on-call engineers, and the engineer who was first responder for this incident was delayed on a subway train. This increased the time it took for us to act on resolving this incident. |

Mitigators

<aside> ➖ What things prevented this incident from being worse? Think of these as the set of things that reduced the overall impact.

</aside>

| Summary | Details |
| --- | --- |
| Deployment transparency | We recently rolled out some changes to connect our service deployments to the #deployments-pulse channel in Slack. We noticed early in the incident that the timestamps of errors correlated with the deployment of the authentication service, which helped us decide to roll back more quickly. |

Learnings and Risks

<aside> 🎓 What did we learn here, and what risks has this uncovered?

</aside>

| Summary | Details |
| --- | --- |
| Our process for requesting on-call overrides is too cumbersome | The on-call engineer knew they'd be out of signal for more than 15 minutes, but the process for requesting an override was sufficiently cumbersome that they decided to take the risk rather than hand over to someone else. |