During a scale-up of the cluster (+20% capacity), one core service was flooded, generating a 50x increase in HTTP errors for our S3 High Perf and Standard users in the GRA region.
Our technical teams observed a ~2% error rate from January 4th, 13:11 UTC to January 5th, 14:52 UTC, which was mitigated by the operations team. A snowball effect then led to several internal service crashes between 14:52 UTC and 16:41 UTC, with the error rate peaking at ~80%. From 16:41 UTC to 18:18 UTC, the average error rate decreased to ~7%, until a configuration change resolved the issue.
The incident was caused by a cache timeout set too low in a configuration widely used across the cluster.
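To illustrate the failure mode (not our actual service code, and all names below are hypothetical): when a cache TTL is shorter than the interval between requests, effectively every request misses the cache and falls through to the backend, multiplying its load. A minimal sketch:

```python
import time


class TTLCache:
    """Toy cache: an entry is valid for `ttl` seconds after being stored."""

    def __init__(self, ttl):
        self.ttl = ttl
        self.store = {}  # key -> (value, stored_at)

    def get(self, key, now):
        entry = self.store.get(key)
        if entry is not None and now - entry[1] < self.ttl:
            return entry[0]
        return None  # expired or absent

    def put(self, key, value, now):
        self.store[key] = (value, now)


def serve(cache, requests, compute):
    """Replay timestamped requests; count how many reach the backend."""
    backend_hits = 0
    for t, key in requests:
        if cache.get(key, t) is None:
            backend_hits += 1  # cache miss: backend does the work
            cache.put(key, compute(key), t)
    return backend_hits


# 100 requests for the same key, one every 0.1 s
requests = [(i * 0.1, "bucket-meta") for i in range(100)]

# TTL shorter than the arrival gap: every request misses
low_ttl_hits = serve(TTLCache(ttl=0.05), requests, lambda k: "meta")
# Generous TTL: only occasional refreshes reach the backend
high_ttl_hits = serve(TTLCache(ttl=5.0), requests, lambda k: "meta")

print(low_ttl_hits, high_ttl_hits)  # prints: 100 2
```

With the 0.05 s TTL the backend absorbs all 100 requests; raising it to 5 s cuts that to 2. On a real cluster, the same amplification compounds as overloaded backends respond more slowly, which is consistent with the snowball effect described above.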
Once the issue was identified and the parameter fixed, the service quickly returned to its nominal state, and performance is now better than before the scale-up, as expected.
We would like to point out that the resolution was not a workaround, but a stable, permanent and definitive solution to the problem.
This previously unseen issue is now specifically monitored to prevent further disruption.