During a scale-up of the cluster (+20% capacity), one core service was flooded, generating a 50x increase in HTTP errors for our S3 High Perf and Standard users in the GRA region.
Our technical teams observed a ~2% error rate from January 4th, 13:11 UTC to January 5th, 14:52 UTC, which was mitigated by the operations team. A snowball effect then led to several internal service crashes between 14:52 UTC and 16:41 UTC, with the error rate peaking at ~80%. From 16:41 UTC to 18:18 UTC, the average error rate decreased to ~7%, until a configuration change resolved the issue.
The incident was caused by a cache timeout set too low in a configuration widely used across the cluster.
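To illustrate the failure mode (not our actual service code, and all names below are hypothetical): when a cache TTL is shorter than the interval between requests, effectively every request misses the cache and falls through to the backend, multiplying its load. A minimal sketch:

```python
import time


class TTLCache:
    """Toy cache: an entry is valid for `ttl` seconds after being stored."""

    def __init__(self, ttl):
        self.ttl = ttl
        self.store = {}  # key -> (value, stored_at)

    def get(self, key, now):
        entry = self.store.get(key)
        if entry is not None and now - entry[1] < self.ttl:
            return entry[0]
        return None  # expired or absent

    def put(self, key, value, now):
        self.store[key] = (value, now)


def serve(cache, requests, compute):
    """Replay timestamped requests; count how many reach the backend."""
    backend_hits = 0
    for t, key in requests:
        if cache.get(key, t) is None:
            backend_hits += 1  # cache miss: backend does the work
            cache.put(key, compute(key), t)
    return backend_hits


# 100 requests for the same key, one every 0.1 s
requests = [(i * 0.1, "bucket-meta") for i in range(100)]

# TTL shorter than the arrival gap: every request misses
low_ttl_hits = serve(TTLCache(ttl=0.05), requests, lambda k: "meta")
# Generous TTL: only occasional refreshes reach the backend
high_ttl_hits = serve(TTLCache(ttl=5.0), requests, lambda k: "meta")

print(low_ttl_hits, high_ttl_hits)  # prints: 100 2
```

With the 0.05 s TTL the backend absorbs all 100 requests; raising it to 5 s cuts that to 2. On a real cluster, the same amplification compounds as overloaded backends respond more slowly, which is consistent with the snowball effect described above.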
Once the issue was identified and the parameter fixed, the service quickly returned to its nominal state, and performance is now better than before the scale-up, as expected.
We would like to point out that the resolution was not a workaround, but a stable, permanent and definitive solution to the problem.
This previously unseen issue is now specifically monitored to prevent further disruption.