[Global][Public Cloud] Autoscaling feature and nodepool CRD management in degraded state

Incident Report for Public Cloud

Resolved

Start Time : 05/07/2023 04:00 UTC
End Time : 05/07/2023 07:07 UTC

On July 5th at 06:03 am CEST, OVHcloud identified a partial unavailability of its services affecting Public Cloud and VPS customers.

During this time, the management of the following services was degraded: Compute, Storage, Network, Containers & orchestration. Access to data on Standard Object Storage Swift & Cloud Archive was not possible. Access to the Public Cloud segment of the OVHcloud manager was also unavailable. No data was lost during the incident.

For VPS, ordering, upgrade, reinstallation and unsubscription were unavailable, as were the Snapshot and Automated backup features.

Resources of both Public Cloud and VPS remained available during the course of the incident, excluding Standard Object Storage Swift and Cloud Archive.

The incident was fixed at 9:07 am CEST by our fully mobilized teams, and all affected services progressively returned to a nominal state by 11:00 am CEST.

We sincerely apologize to all affected customers.

Posted Jul 07, 2023 - 09:22 UTC

Update

We are continuing to monitor for any further issues.

Posted Jul 05, 2023 - 16:25 UTC

Monitoring

On July 5th at 06:03 am CEST, OVHcloud identified a partial outage of its services affecting customers of the Public Cloud universe. Services including Control Panel, Kubernetes, Private Registry and VPS were partially unavailable.

At 10:05 am CEST, following actions from our fully mobilized technical teams, the Control Panel, Kubernetes, Private Registry and VPS services were back to a nominal state.

At 11:00am CEST, Cold Archive, Object Storage and PCI services were restored to nominal status.

We continue to actively monitor the situation with impacted services. We will communicate more information on the cause of the incident as our investigations progress.

We sincerely apologize to all affected customers.

Posted Jul 05, 2023 - 16:24 UTC

Identified

Updates : Management of OpenStack resources is back to normal. The cluster autoscaler component and nodepool custom resource management are still in a degraded state.

Our team is still working to stabilize these components.

Posted Jul 05, 2023 - 14:43 UTC

Monitoring

Start time : 2023/07/05 00:16 UTC
End time : 2023/07/05 09:00 UTC
Ongoing actions : Monitoring
Our technical teams deployed a solution for the issue. We are monitoring the situation for the time being.

Posted Jul 05, 2023 - 09:24 UTC

Update

During investigations, Public Cloud shared the following information:
Existing OpenStack resources (load balancers, volumes, instances, etc.) are not impacted unless a modification request is sent to the OpenStack API.
Creation of new OpenStack resources and deletion of existing ones are impossible at the moment.
Our team is actively monitoring all MKS services and will fix them as soon as possible if they are affected by this incident.

Posted Jul 05, 2023 - 08:11 UTC

Identified

Start time : 2023/07/05 00:16 UTC
End time : In progress
Service impact : The cluster autoscaler component and nodepool custom resource management are currently in a degraded state.
Root cause : Keystone API temporarily unreachable: https://public-cloud.status-ovhcloud.com/incidents/1shkj36zsphs
Ongoing actions : Waiting for the Public Cloud team to resolve the outage.

Posted Jul 05, 2023 - 07:08 UTC

This incident affected: Containers & Orchestration || Managed Kubernetes Service (BHS5, EU-WEST-LIM, GRA5, GRA7, GRA9, SBG5, SGP1, SYD1, WAW1, UK1, US-EAST-VA, US-WEST-OR) and Containers & Orchestration || Managed Private Registry (BHS, DE, GRA).