Summary

A deployment of our authentication service left some customers unable to log in to our application for 9 minutes. Customer Support reported the issue 5 minutes before our alerting system detected it. We rolled back the bad deployment, which restored service immediately. Approximately 50 users were impacted by the issue, and we proactively reached out to them directly to apologise.

🔗 Important Links

💬 Slack channel
🌐 Incident homepage

ℹ️ Key Information

Incident Type: Platform
Severity: Minor

👪 Team

Incident Lead: Katie Hewitt
Reporter: Lucy Jennings
Active participants: Katie Hewitt, Lawrence Jones, Martha Lambert

⏱️ Key Timestamps

Reported: March 3, 2023 9:36 AM
Impact start: March 3, 2023 9:36 AM
Fixed: March 3, 2023 9:45 AM
Closed: March 3, 2023 10:12 AM

⏳ Durations

Incident duration: 36 minutes
Time to resolve: 9 minutes
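
As a sanity check, the two durations above can be re-derived from the key timestamps. The sketch below assumes that "Time to resolve" is measured from Reported to Fixed and "Incident duration" from Reported to Closed. This reading reproduces the stated figures but is not confirmed elsewhere in this document, and the `minutes_between` helper is purely illustrative.

```python
from datetime import datetime

# Format of the key timestamps as written in this document, e.g. "March 3, 2023 9:36 AM".
FMT = "%B %d, %Y %I:%M %p"

# Values copied from the Key Timestamps section (minute precision).
reported = datetime.strptime("March 3, 2023 9:36 AM", FMT)
fixed = datetime.strptime("March 3, 2023 9:45 AM", FMT)
closed = datetime.strptime("March 3, 2023 10:12 AM", FMT)


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed between two timestamps (hypothetical helper)."""
    return int((end - start).total_seconds() // 60)


# Assumption: both durations are measured from the Reported timestamp.
time_to_resolve = minutes_between(reported, fixed)
incident_duration = minutes_between(reported, closed)

print(f"Time to resolve: {time_to_resolve} minutes")
print(f"Incident duration: {incident_duration} minutes")
```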

Incident Timeline

<aside> 📆 All timestamps in UTC

</aside>

Date / Time	Event
2023-03-03
09:36:09	The incident was automatically opened from an alert triggered by our monitoring system, Prometheus.
…	…
…	…
…	…
10:12:10	We confirmed that no additional errors or alerts were firing and closed the incident.

Contributors

<aside> ➕ What were the contributing factors for this incident? Think of these less as causes, and more as the set of conditions that had to manifest for this incident to occur, or reach the assigned severity.

</aside>

Summary	Details
The on-call engineer was on the subway at the time	We have a 15-minute response SLA for on-call engineers, and the engineer who was first responder for this incident was delayed on a subway train. This increased the time it took us to start resolving the incident.
…	…

Mitigators

<aside> ➖ What things prevented this incident from being worse? Think of these as the set of things that reduced the overall impact.

</aside>

Summary	Details
Deployment transparency	We recently connected our service deployments to the #deployments-pulse channel in Slack. Early in the incident, we noticed that the timestamps of the errors correlated with the deployment of the authentication service, which helped us decide to roll back more quickly.
…	…

Learnings and Risks

<aside> 🎓 What did we learn here, and what risks has this uncovered?

</aside>

Summary	Details
Our process for requesting on-call overrides is too cumbersome	The on-call engineer knew they’d be out of signal for more than 15 minutes, but the process for requesting an override was cumbersome enough that they decided to take the risk rather than hand over to someone else.
…	…