Major Incident
Dear Customers,
As you may be aware, our hosting service provider, Heart Internet, recently suffered a major incident due to a power outage at their Leeds data centre on Wednesday afternoon.
Emergency maintenance work was being carried out on the load transfer module, which feeds power from their external energy supplies to the data centre hall that holds the majority of servers. The data centre has two dual-feed uninterruptible power supplies, both backed by diesel generators in case of National Grid outages.
Unfortunately, a safety mechanism within the device triggered incorrectly, resulting in a power outage lasting less than 9 minutes. This caused approximately 15,000 servers to be hard rebooted. Short of a fire, this is the worst event a hosting company can face. A full post-mortem is currently being carried out to determine how power was lost on both supplies, despite the work being performed alongside an external engineer from the hardware manufacturer.
What happens when servers hard reboot?
Web servers and virtual servers typically perform database and file system writes at a very high rate, so when a hard reboot cuts power mid-write, the risk of database or file system corruption is high.
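To make that concrete, here is a minimal, purely illustrative Python sketch of why an abrupt power cut can corrupt data, and the write-then-rename pattern that journalled databases and file systems rely on to avoid it. The file names and records are invented for the example and have nothing to do with Heart Internet's actual systems.

```python
# Illustrative only: why an abrupt power loss can corrupt data, and how
# writing to a temporary file and atomically renaming it avoids that.
import json
import os
import tempfile

def unsafe_save(path: str, record: dict) -> None:
    # If power is lost part-way through this write, the file is left
    # truncated or half-written: the kind of corruption seen after a hard reboot.
    with open(path, "w") as f:
        json.dump(record, f)

def safer_save(path: str, record: dict) -> None:
    # Write to a temporary file, flush it to disk, then atomically rename it
    # over the original, so a crash leaves either the old or the new copy.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(record, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise

if __name__ == "__main__":
    safer_save("customer.json", {"domain": "example.com", "status": "active"})
```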
Following the restoration of power, the first priority was to get the primary infrastructure boxes back online, then the managed and unmanaged platforms. The managed platforms are built to be resilient, so although a number of servers were lost in the reboot, the majority of the platforms came up cleanly. Some issues were encountered with the Premium Hosting load balancers, which needed repairing, so some customer sites were offline for longer than we would have hoped. As an extra precaution for customers, Heart Internet are adding additional redundant load balancers and modifying the failover procedure over the next 7 days.
On the shared hosting platform, a number of NAS drives, which sit behind the front-end web servers and hold customer website data, crashed and could not be recovered. However, they are set up in fully redundant pairs, and each NAS drive itself contains an 8+ disk RAID 10 array. In every case but one, at least one server in each pair came back up cleanly, or in an easily repairable state, and customer websites were back online within 2-3 hours.
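For readers unfamiliar with RAID 10, the back-of-the-envelope sketch below shows roughly what an 8-disk array provides. The per-disk capacity is an assumed figure for illustration only, not a published specification.

```python
# Back-of-the-envelope sketch of an 8-disk RAID 10 array.
# The 4 TB per-disk size is an assumption for illustration only.
DISKS = 8
DISK_TB = 4.0  # assumed capacity of each disk

mirror_pairs = DISKS // 2            # RAID 10 = striping across mirrored pairs
raw_tb = DISKS * DISK_TB
usable_tb = mirror_pairs * DISK_TB   # half the raw capacity is usable

print(f"Raw capacity:    {raw_tb:.0f} TB")
print(f"Usable capacity: {usable_tb:.0f} TB")
print(f"Survives the loss of any single disk, and up to {mirror_pairs} disks "
      f"if each failed disk sits in a different mirror pair.")
```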
In a single case, on the cluster containing web 75-79 (just under 2% of the shared platform), both NAS drives failed to come back up. Following the disaster recovery procedure, work began on restoring the drives while new NAS drives were built in parallel, in case they were required. Unfortunately, the servers gave a strong, but false, indication that they could be brought back into a functioning state, so repairing the file system was prioritised.
Regrettably, following a ‘successful’ repair, performance was incredibly poor due to the damage to the file system, requiring a step up to the next rung of the disaster recovery procedure. The further into the disaster recovery process, the greater the recovery time, and in this case it meant a full 4TB restore from on-site backups to new NAS drives. (For your information, the steps following that are to restore from off-site backup and, finally, from tape backup, although neither was needed.) At this point, it became apparent that the issue would take days rather than hours to resolve, and the status page was updated with an ETA. Sites were restored to the new NAS drives alphabetically in a read-only state, and the restoration was completed late on Sunday afternoon.
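As rough context for the "days rather than hours" estimate, the arithmetic below shows how long a 4TB restore takes at different effective throughputs. The rates are assumptions for illustration only; restoring millions of small website files, with their metadata and verification, is far slower than a raw sequential copy.

```python
# Rough arithmetic on why a full 4 TB restore can take days rather than hours.
# The throughput figures are assumptions for illustration only.
TOTAL_MB = 4 * 1024 * 1024  # 4 TB expressed in MB

for label, mb_per_s in [("sequential copy, large files", 200),
                        ("effective rate, many small files", 20)]:
    hours = TOTAL_MB / mb_per_s / 3600
    print(f"{label:>32}: {mb_per_s:>3} MB/s -> ~{hours:.0f} h ({hours / 24:.1f} days)")
```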
A full shared cluster restore from backups to new NAS drives is a critical incident, and Heart routinely train their engineers on the disaster recovery steps. The disaster recovery process functioned correctly, but because this failure occurred as part of a much larger incident rather than in isolation, they were unable to offer the level of individual service that might otherwise have been expected.
Given the magnitude of this event, Heart are currently investigating plans to split the platform and infrastructure servers across two data centre halls, which would allow continuous running in the event of complete power loss to one. This added redundancy is an extra precaution to help ensure that this never happens to customers again.
Support and Communications
During this incident, customers were kept informed of progress through the status page at www.webhostingstatus.com.
Given the scale of the issue, the load on Heart Internet’s Customer Services team was far in excess of normal levels. On a standard day, the service handles approximately 800 support tickets, but following the incident, in excess of 5,000 new tickets were received every day. Heart took immediate steps to reduce the support load by sending automated updates to affected customers, but most of the remaining tickets involved in-depth investigation and server repairs requiring a high level of technical capability, so they could only be addressed by second-line and sysadmin staff. It will take some time to clear the entire ticket backlog and restore normal ticket SLAs.
We would like to apologise to our customers. We know as well as anyone how important staying online is to your business. Heart Internet have assured us that they will strive to offer reliable, uninterrupted service long into the future, as their utmost priority.