Problem
Between 07:21 and 08:47, our API was unable to scale up to meet a temporary spike in demand, resulting in degraded performance for approximately 10% of our users. Affected users experienced slower response times or intermittent access to the API.
Cause
The issue was traced to a misconfiguration in our autoscaling settings: an incorrect resource allocation limit prevented the API from provisioning additional server instances to handle the surge in traffic. As the existing instances saturated, the system was unable to scale up in a timely manner.
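For illustration only, the sketch below shows the class of misconfiguration described above as a minimal Python-style rendering. The platform, parameter names, and values are hypothetical (the actual configuration is not included in this report); the point is that a low instance ceiling combined with a late scaling trigger leaves no headroom for a spike.

    # Hypothetical autoscaling settings, illustrating the misconfiguration class
    # described above (names and values are examples, not our real config).
    AUTOSCALING_BEFORE = {
        "min_instances": 4,
        "max_instances": 4,              # ceiling equals the floor: no instances can be added
        "target_cpu_utilization": 0.90,  # trigger fires only near saturation, too late
        "scale_up_cooldown_s": 600,      # long cooldown further delays any scale-up
    }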
Solution
We revised the autoscaling configuration to allow additional capacity and fine-tuned the autoscaling triggers to respond more quickly to sudden increases in load. We also implemented enhanced monitoring to detect scaling issues earlier, allowing us to respond to traffic surges faster. These updates are intended to improve system resilience and reduce the risk of future disruptions.
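As a rough sketch of the changes described above, the hypothetical settings below raise the instance ceiling, lower the scaling trigger, and add a simple saturation check of the kind the new monitoring could rely on. Names and values are illustrative, not our actual configuration.

    # Hypothetical revised settings and a simple saturation alert
    # (names and values are illustrative, not our actual configuration).
    AUTOSCALING_AFTER = {
        "min_instances": 4,
        "max_instances": 16,             # raised ceiling so extra instances can be provisioned
        "target_cpu_utilization": 0.65,  # earlier trigger: scaling starts well before saturation
        "scale_up_cooldown_s": 120,      # shorter cooldown for faster reaction to surges
    }

    def scaling_headroom_alert(current_instances: int,
                               max_instances: int,
                               cpu_utilization: float) -> bool:
        """Return True when the pool is pinned at its ceiling while CPU stays high,
        i.e. the system needs to scale but cannot, and an operator should be paged."""
        return current_instances >= max_instances and cpu_utilization > 0.80

The intent of a check like this is to surface the failure mode from this incident directly: it alerts on the gap between demand and available headroom rather than waiting for user-facing latency to degrade.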