Postmortem Report: Load Balancer Configuration Inconsistency Causes System Outage

Issue Summary

On May 13, 2023, between 12:00 PM and 3:00 PM Eastern Daylight Time (EDT), our web stack experienced an outage that made our website inaccessible. During this window, approximately 80% of our users were unable to reach our services.

Timeline

  • 12:00 PM EDT: Our monitoring system detected that the website was down.

  • 12:01 PM EDT: The on-call engineer received an alert and began investigating.

  • 12:04 PM EDT: The investigation focused on the web server, as it was the most likely cause of the issue.

  • 12:20 PM EDT: The web server logs were analyzed; no issues were found.

  • 12:25 PM EDT: The database server was investigated as a potential source of the problem.

  • 12:45 PM EDT: The database server logs were analyzed; no issues were found.

  • 1:00 PM EDT: The network infrastructure was checked; no issues were found.

  • 1:15 PM EDT: The load balancer was investigated, as it was the only remaining component that had not been checked.

  • 1:30 PM EDT: The load balancer logs were analyzed, and it was discovered that a recent update had caused an unexpected configuration change, resulting in the outage.

  • 1:55 PM EDT: The incident was escalated to the senior engineering team for resolution.

  • 2:30 PM EDT: The load balancer configuration was rolled back to the previous version, and the website was restored.

  • 3:00 PM EDT: The fix was confirmed, and the website returned to normal operation.

Root Cause and Resolution

The root cause of the outage was an unexpected configuration change in the load balancer introduced by a recent update. The change caused the load balancer to stop distributing traffic evenly among the web servers; a disproportionate share of requests was routed to a single server, which eventually crashed under the load.
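
The exact configuration diff is not reproduced in this report. Purely as an illustration of the failure mode, the sketch below (with hypothetical server names and weights) simulates how a skewed weight assignment concentrates traffic on one backend instead of splitting it evenly:

```python
import random
from collections import Counter

# Hypothetical backend pool; names and weights are illustrative only.
HEALTHY_WEIGHTS = {"web-01": 1, "web-02": 1, "web-03": 1}   # intended: even split
DRIFTED_WEIGHTS = {"web-01": 8, "web-02": 1, "web-03": 1}   # after a bad update: skewed

def simulate(weights, requests=10_000):
    """Pick a backend per request using the given weights and count the hits."""
    servers = list(weights)
    picks = random.choices(servers, weights=[weights[s] for s in servers], k=requests)
    return Counter(picks)

if __name__ == "__main__":
    print("intended:", simulate(HEALTHY_WEIGHTS))
    print("drifted :", simulate(DRIFTED_WEIGHTS))  # roughly 80% of requests land on web-01
```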

To resolve the issue, the senior engineering team rolled back the load balancer configuration to the previous known-good version, which restored even traffic distribution and prevented further server crashes.
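
The rollback itself was performed through our own tooling. As a minimal sketch of the general procedure only, assuming an HAProxy-style load balancer with a versioned backup of the configuration (the paths and service name are assumptions, not a record of the actual commands used):

```python
import shutil
import subprocess

# Hypothetical paths; the actual load balancer and file layout may differ.
LIVE_CONFIG = "/etc/haproxy/haproxy.cfg"
LAST_GOOD_CONFIG = "/etc/haproxy/haproxy.cfg.last-good"

def rollback():
    """Restore the last known-good config, validate it, then reload the service."""
    shutil.copy2(LAST_GOOD_CONFIG, LIVE_CONFIG)
    # `haproxy -c -f <file>` only validates the config file and exits.
    subprocess.run(["haproxy", "-c", "-f", LIVE_CONFIG], check=True)
    # Graceful reload so existing connections are not dropped.
    subprocess.run(["systemctl", "reload", "haproxy"], check=True)

if __name__ == "__main__":
    rollback()
```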

Corrective and Preventative Measures

To prevent similar incidents in the future, the following measures will be implemented:

  • Change management procedures will be updated to require more rigorous testing of load balancer configuration updates before they are applied to production.

  • The load balancer will be configured to send an alert whenever its configuration changes unexpectedly (a minimal detection sketch follows this list).

  • Additional monitoring will be added to the web servers to detect and respond to abnormal increases in traffic (a request-rate check sketch also follows this list).
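
For the configuration-change alert, one lightweight approach (a sketch only; the config path, check interval, and alerting hook are assumptions) is to hash the live configuration on a schedule and raise an alert whenever the hash no longer matches the approved baseline:

```python
import hashlib
import time

CONFIG_PATH = "/etc/haproxy/haproxy.cfg"   # hypothetical path
CHECK_INTERVAL_SECONDS = 60

def file_digest(path):
    """Return the SHA-256 digest of a file's contents."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def send_alert(message):
    """Placeholder: wire this to the team's real paging/alerting system."""
    print(f"ALERT: {message}")

def watch_config():
    baseline = file_digest(CONFIG_PATH)
    while True:
        time.sleep(CHECK_INTERVAL_SECONDS)
        current = file_digest(CONFIG_PATH)
        if current != baseline:
            send_alert(f"Load balancer config changed (expected {baseline[:12]}, got {current[:12]})")
            baseline = current   # avoid re-alerting every interval for the same change

if __name__ == "__main__":
    watch_config()
```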
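
For the web server monitoring, a simple per-server request-rate check against a threshold would have flagged this incident quickly. The sketch below assumes the per-minute request counts are already being collected somewhere (agent, metrics endpoint, or access logs); the threshold and hostnames are hypothetical:

```python
REQUESTS_PER_MINUTE_THRESHOLD = 5_000   # hypothetical; tune to the real traffic baseline

def check_traffic(rates_per_server):
    """Return alert messages for any server whose request rate exceeds the threshold.

    `rates_per_server` maps hostname -> requests handled in the last minute,
    however those numbers are actually collected.
    """
    return [
        f"{server}: {rate} req/min exceeds {REQUESTS_PER_MINUTE_THRESHOLD}"
        for server, rate in rates_per_server.items()
        if rate > REQUESTS_PER_MINUTE_THRESHOLD
    ]

if __name__ == "__main__":
    # Example reading: one server absorbing nearly all traffic, the others mostly idle.
    print(check_traffic({"web-01": 9_400, "web-02": 310, "web-03": 290}))
```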

In conclusion, this outage was caused by an unexpected configuration change in the load balancer and was resolved by rolling the configuration back to the previous version. To prevent similar incidents, we will tighten our change management procedures, alert on unexpected load balancer configuration changes, and expand monitoring of the web servers.