OVHcloud Public Cloud Status


[EU/CA/APAC][Public Cloud] - Keystone API incident notification

Incident Report for Public Cloud

Postmortem

SUMMARY

On March 12th 2025 at 18:15 UTC, Public Cloud customers became unable to administer their OVHcloud services, both through the manager and through the API. This incident affected our clients globally.
OVHcloud US infrastructure was not affected by the incident and remained available.

The incident was caused by saturation of our authentication service, following the removal of entries and of the service itself as part of the Heat service end of life.

The following impacts were identified:

User Access:

Authentication failures for the OpenStack dashboard (Horizon) prevented users from logging in and managing their instances.

Service Disruption:

API requests that require valid tokens failed, affecting resource management operations such as instance creation, deletion, and modification, as well as S3 bucket access through the API and the OVHcloud manager (see the sketch after this list).

Automated systems and orchestration tools that rely on the authentication service were unable to function properly.

Existing resources, such as running instances, were not affected by this incident.

Operational Delays:

Administrative tasks and management operations, for example volume snapshots or backups, were delayed because the necessary services could not be reached.

Some scheduled operations or scripts depending on the authentication service may have encountered errors or failed to execute.

Inter-Service Communication:

Services that depend on the authentication service to validate identities and trust relationships encountered communication issues, potentially leading to partial service degradation.
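
To illustrate the failing path named above, the sketch below shows the kind of Keystone v3 token request that clients, scripts, and dependent services perform before any other API call, wrapped in a simple retry with backoff. It is a generic example only, not OVHcloud tooling: the endpoint URL, user, and project values are placeholders, and it assumes the standard keystoneauth1 Python library.

    # Illustrative sketch only: endpoint, user, and project values are placeholders.
    # It shows the Keystone v3 token issuance that every other OpenStack API call
    # depends on, with a basic retry/backoff for transient authentication outages.
    import time

    from keystoneauth1 import exceptions as ks_exc
    from keystoneauth1 import session
    from keystoneauth1.identity import v3

    def issue_token_with_retry(max_attempts=5, base_delay=2.0):
        """Request a Keystone token, retrying with backoff on 5xx or connection errors."""
        auth = v3.Password(
            auth_url="https://auth.example.net/v3",  # placeholder Keystone endpoint
            username="user-xxxxxxxx",                # placeholder OpenStack user
            password="********",
            project_id="0123456789abcdef",           # placeholder project
            user_domain_name="Default",
        )
        sess = session.Session(auth=auth)

        for attempt in range(1, max_attempts + 1):
            try:
                # Performs POST /v3/auth/tokens against Keystone; during the
                # incident this call returned errors or timed out.
                return sess.get_token()
            except (ks_exc.ConnectFailure, ks_exc.ConnectTimeout, ks_exc.HttpServerError) as exc:
                if attempt == max_attempts:
                    raise
                delay = base_delay * 2 ** (attempt - 1)
                print(f"Keystone unavailable ({exc}); retrying in {delay:.0f}s")
                time.sleep(delay)

    if __name__ == "__main__":
        print("Token issued:", bool(issue_token_with_retry()))

In practice, credentials would come from an openrc file or clouds.yaml rather than being hard-coded; the point is that once token issuance fails, every downstream call that needs a scoped token fails with it.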

Our team identified the issue quickly and began working on it right away. A rollback procedure was activated and put in place within the first 30 minutes.

Unfortunately, the rollback did not succeed: an invalid configuration in our database, resulting from the rollback, caused the authentication service to remain unavailable for another hour.

Our team's investigation confirmed the invalid configuration, and a fix was put in place at 19:45 UTC. After this fix, all regions except the Gravelines physical datacenter (GRA*) started to accept authentications, allowing services to work as expected.

The Gravelines region experienced a further connectivity issue for 1 hour and 30 minutes: our load balancers were unable to handle connections between our services properly, causing this region to remain unavailable.

At 21:25 UTC, all services were available again in all regions.
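
As a side note, confirming that a region's Keystone endpoint is answering requests again can be done with an unauthenticated probe of its version document, as in the hedged sketch below; the URL is a placeholder and this is not OVHcloud's actual monitoring tooling.

    # Minimal, illustrative availability probe against a Keystone endpoint.
    # The auth_url is a placeholder; a healthy Keystone answers the unauthenticated
    # version discovery request on /v3 with HTTP 200 and a JSON "version" document.
    import requests

    def keystone_is_up(auth_url="https://auth.example.net/v3", timeout=5):
        try:
            resp = requests.get(auth_url, timeout=timeout)
            return resp.status_code == 200 and "version" in resp.json()
        except (requests.RequestException, ValueError):
            return False

    if __name__ == "__main__":
        print("Keystone reachable:", keystone_is_up())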

TIMELINE

12/03/2025 - 18:15 UTC - Beginning of incident

12/03/2025 - 19:00 UTC - Load balancer configuration change to reduce request pressure on our authentication service

12/03/2025 - 19:29 UTC - Rollback of the change with valid configuration

12/03/2025 - 19:43 UTC - Root cause identified and fix in progress

12/03/2025 - 20:14 UTC - All regions except the Gravelines physical datacenter (GRA*) are available again. The load balancer configuration change from 19:00 UTC generated a further connectivity issue in the Gravelines regions

12/03/2025 - 20:15 UTC - Issue escalated to the Network Team

12/03/2025 - 21:24 UTC - End of Incident following network configuration change

ACTION PLAN

All required analyses are currently in progress.

Further action items will be added to this page as soon as possible.

Posted Mar 17, 2025 - 16:49 UTC

Resolved

We would like to inform you that the incident on our Public Cloud offering, which was causing temporary latency issues in the EU/CA/APAC regions, has now been resolved.

Here are the details for this incident:
Start time: 12/03/2025 18:15 UTC
End time: 12/03/2025 21:25 UTC
Root cause: A malfunction was detected in a service used by many products.

We thank you for your understanding and patience throughout this incident.
Posted Mar 12, 2025 - 23:14 UTC

Monitoring

A fix has been implemented, and our teams are actively monitoring the results.
Network traffic has returned to nominal conditions.

We thank you for your patience during the course of this incident.
Posted Mar 12, 2025 - 21:40 UTC

Update

Update: The authentication service is temporarily unavailable. Clients attempting to connect to their services on the Public Cloud offering may encounter errors; the issue has been ongoing since 12/03/2025 18:15 UTC.
Ongoing Actions: The incident has been identified and our teams are mobilised to restore service as quickly as possible.

We will keep you updated on the progress and resolution.
We apologize for any inconvenience caused and appreciate your understanding.
Posted Mar 12, 2025 - 19:55 UTC

Identified

We are currently experiencing an incident affecting our Public Cloud offering, which is causing temporary availability issues in the EU/CA/APAC regions.
Update: The Keystone (authentication) API is temporarily unavailable for OpenStack - Object Storage - Managed Kubernetes Service - Load Balancer - AI & Machine Learning - Managed Private Registry.
The Managed Private Registry service itself is operational, but pull and push operations are temporarily unavailable.

Here are the details for this incident:
Start time: 12/03/2025 18:15 UTC
Impacted Service(s): The Keystone (authentication) API is temporarily unavailable for OpenStack - Object Storage - Managed Kubernetes Service - Load Balancer - AI & Machine Learning - Managed Private Registry.
Customers Impact: Clients using the Keystone (authentication) API are temporarily unable to access OpenStack - Object Storage - Managed Kubernetes Service - Load Balancer - AI & Machine Learning - Managed Private Registry in the EU/CA/APAC regions.
Ongoing Actions: The incident has been identified and our teams are mobilised to restore service as quickly as possible.

We will keep you updated on the progress and resolution.
We apologize for any inconvenience caused and appreciate your understanding.
Posted Mar 12, 2025 - 19:40 UTC

Update

We are currently experiencing an incident affecting our Public Cloud offering, which is causing temporary latency issues in the EU/CA/APAC regions.
Update: The Keystone (authentication) API has an increased error response rate for OpenStack - Object Storage - Managed Kubernetes Service - Load Balancer and AI & Machine Learning.

Here are the details for this incident:
Start time: 12/03/2025 18:15 UTC
Impacted Service(s): The Keystone (authentication) API has an increased error response rate for OpenStack - Object Storage - Managed Kubernetes Service - Load Balancer and AI & Machine Learning.
Customers Impact: Clients using the Keystone (authentication) API are temporarily experiencing an increased error response rate for OpenStack - Object Storage - Managed Kubernetes Service - Load Balancer and AI & Machine Learning in the EU/CA/APAC regions.
Ongoing Actions: Our teams are investigating to determine the origin of the incident and fix it.

We will keep you updated on the progress and resolution.
We apologize for any inconvenience caused and appreciate your understanding.
Posted Mar 12, 2025 - 19:08 UTC

Update

We are currently experiencing an incident affecting our Public Cloud offering, which is causing temporary latency issues in the EU/CA/APAC regions.

Here are the details for this incident:
Start time: 12/03/2025 18:15 UTC
Impacted Service(s): The Keystone (authentication) API has an increased error response rate for OpenStack - Object Storage and Managed Kubernetes Service.
Customers Impact: Clients using the Keystone (authentication) API are temporarily experiencing an increased error response rate for OpenStack - Object Storage and Managed Kubernetes Service in the EU/CA/APAC regions.
Ongoing Actions: Our teams are investigating to determine the origin of the incident and fix it.

We will keep you updated on the progress and resolution.
We apologize for any inconvenience caused and appreciate your understanding.
Posted Mar 12, 2025 - 18:50 UTC

Update

We are continuing to investigate this issue.
Posted Mar 12, 2025 - 18:49 UTC

Investigating

We are currently experiencing an incident affecting our Public Cloud offering, which is causing temporary latency issues in the EU/CA/APAC regions.

Here are the details for this incident:
Start time: 12/03/2025 18:15 UTC
Impacted Service(s): The Keystone (authentication) API has an increased error response rate for OpenStack - Object Storage and Managed Kubernetes Service.
Customers Impact: Clients using the Keystone (authentication) API are temporarily experiencing an increased error response rate for OpenStack - Object Storage and Managed Kubernetes Service in the EU/CA/APAC regions.
Ongoing Actions: Our teams are investigating to determine the origin of the incident and fix it.

We will keep you updated on the progress and resolution.
We apologize for any inconvenience caused and appreciate your understanding.
Posted Mar 12, 2025 - 18:40 UTC
This incident affected: Containers & Orchestration || Managed Private Registry (GRA, DE, BHS, VA), Compute - Instance || SGP (SGP1, SGP2), Containers & Orchestration || Managed Kubernetes Service (BHS5, DE1, GRA5, GRA7, GRA9, GRA11, SBG5, SGP1, SYD1, WAW1, UK1), AI & Machine Learning || AI Notebooks (BHS, GRA), Compute - Instance || WAW (WAW1), AI & Machine Learning || AI Deploy (BHS, GRA), Compute - Instance || GRA (GRA1, GRA3, GRA5, GRA7, GRA9, GRA11), Compute - Instance || LIM (DE1), Compute - Instance || SYD (SYD1, AP-SOUTHEAST-SYD-2), Compute - Instance || ERI (UK1), Compute - Instance || BHS (BHS1, BHS3, BHS5), AI & Machine Learning || AI Training (BHS, GRA), Storage || Object storage (BHS, GRA, DE, RBX, SBG, SGP, SYD, UK, WAW, YYZ, LIM, EU-WEST-PAR-A, EU-WEST-PAR-B, EU-WEST-PAR-C), AI & Machine Learning || AI Dashboard, Containers & Orchestration || Load Balancer (BHS5, DE1, GRA5, GRA7, GRA9, GRA11, SBG5, SGP1, SYD1, WAW1, UK1), Storage || Cold Archive (RBX-ARCHIVE), Compute - Instance || RBX (RBX-A), Compute - Instance || India (AP-SOUTH-MUM-1), Compute - Instance || EU-WEST-PAR (EU-WEST-PAR-A, EU-WEST-PAR-B, EU-WEST-PAR-C), and Compute - Instance || SBG (SBG5, SBG7).