On March 12th, 2025 at 18:15 UTC, Public Cloud customers were unable to administer their OVHcloud services through both the OVHcloud manager and the API. This incident impacted our customers globally.
The OVHcloud US infrastructure was isolated from the incident and remained available.
The cause of this incident was a saturation of our authentication service following the removal of entries and of the service itself as part of the Heat service end of life.
The following impacts were identified:
User Access:
Authentication failures for the OpenStack dashboard (Horizon) prevented users from logging in and managing their instances.
Service Disruption:
API requests requiring valid tokens failed, affecting resource management operations such as instance creation, deletion, and modification, as well as S3 bucket access through the API and the OVHcloud manager.
Automated systems and orchestration tools that rely on the authentication service were unable to function properly.
Existing resources, such as running instances, were not affected by this incident.
Operational Delays:
Administrative tasks and management operations, such as volume snapshots or backups, were delayed due to the inability to access the necessary services.
Some scheduled operations or scripts depending on the authentication service may have encountered errors or failed to execute (see the sketch after this list).
Inter-Service Communication:
Services that depend on the authentication service to validate identities and trust relationships encountered communication issues, potentially leading to partial service degradation.
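As an illustration only, and not an official OVHcloud recommendation, the sketch below shows one way client-side automation could tolerate a temporary authentication-service outage such as this one: retrying Keystone token issuance with exponential backoff instead of failing on the first error. It uses the standard keystoneauth1 library; the authentication URL, application credential values, and retry parameters are placeholders.

```python
import time

from keystoneauth1 import exceptions as ks_exc
from keystoneauth1 import session
from keystoneauth1.identity import v3


def authenticated_session(auth_url, cred_id, cred_secret,
                          max_attempts=5, base_delay=2.0):
    """Return a Keystone-authenticated session, retrying with exponential
    backoff while the authentication service is unreachable or overloaded."""
    auth = v3.ApplicationCredential(
        auth_url=auth_url,                      # placeholder Keystone endpoint
        application_credential_id=cred_id,      # placeholder credential ID
        application_credential_secret=cred_secret,
    )
    for attempt in range(1, max_attempts + 1):
        sess = session.Session(auth=auth)
        try:
            sess.get_token()  # forces token issuance against Keystone
            return sess
        except (ks_exc.ServiceUnavailable, ks_exc.GatewayTimeout,
                ks_exc.ConnectFailure) as exc:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Authentication service unavailable ({exc}); "
                  f"retrying in {delay:.0f}s")
            time.sleep(delay)
```

Scheduled jobs that wrap their token acquisition this way would have paused and resumed once the service recovered, rather than failing outright.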
Our team identified the issue quickly and started working on it right away. A rollback procedure was activated and put in place within the first 30 minutes.
Unfortunately, the rollback was not successful. An invalid configuration in our database, introduced during the rollback, caused the authentication service to remain unavailable for another hour.
Our team's investigation confirmed the invalid configuration, and a fix was put in place at 19:45 UTC. After this fix, all regions except the Gravelines physical datacenter (GRA*) started to accept authentications, allowing services to work as expected.
The Gravelines region experienced a further connectivity issue for 1 hour and 30 minutes. Our load balancers were not able to handle connections between our services properly, causing this region to remain unavailable.
At 21:25 UTC, all services were available again in all regions.
12/03/2025 - 18:15 UTC - Beginning of incident
12/03/2025 - 19:00 UTC - Load balancer configuration change to reduce the request pressure on our authentication service
12/03/2025 - 19:29 UTC - Rollback of the change with a valid configuration
12/03/2025 - 19:43 UTC - Root cause identified and fix in progress
12/03/2025 - 20:14 UTC - All regions excluding the Gravelines physical datacenter (GRA*) are now available. The load balancer configuration change from 19:00 UTC generated a further connectivity issue in the Gravelines region
12/03/2025 - 20:15 UTC - Issue escalated to the Network Team
12/03/2025 - 21:24 UTC - End of Incident following network configuration change
All required analyses are currently in progress.
Further action items will be added to this page as soon as possible.