Enduring Lightning Strikes and Service Outages in the Cloud
In early August, a lightning strike at a transformer owned by one of
Amazon’s power suppliers caused a major failure in Amazon Web Services’
European cloud. The power outage affected, among others, the Elastic
Compute Cloud (EC2) and Relational Database Service (RDS) cloud services.
According to the Amazon Web Services status page, lightning struck a transformer belonging to an energy supplier for one of the availability zones in Dublin (EU-WEST-1 region). An availability zone is a set of hardware that supports cloud services and functions independently of other zones. According to the status page, the cloud services were degraded by the resulting loss of power. Because the cloud services are composed of complex software components, Amazon had to assign more hardware to bring them back even after the power had been restored.
One option, depending on a client’s business requirements, is for the
cloud provider to distribute the client’s servers over multiple
availability zones (AZs). This redundancy provides strong protection from
localized emergencies such as a lightning strike, a power surge or
another severe weather-related event.
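As a rough illustration of why spreading servers over zones helps, the sketch below assigns servers to zones round-robin and then lists what is still running when one zone goes down. The server names, zone names and both function names are hypothetical, chosen only for this example; a real provider would do the placement through its own tooling.

```python
from itertools import cycle


def spread_across_zones(servers, zones):
    """Assign servers to zones round-robin so no single zone holds them all."""
    if not zones:
        raise ValueError("at least one availability zone is required")
    zone_cycle = cycle(zones)
    return {server: next(zone_cycle) for server in servers}


def surviving_servers(placement, failed_zone):
    """Return the servers that are not located in the failed zone."""
    return sorted(s for s, zone in placement.items() if zone != failed_zone)


# Hypothetical fleet of four web servers spread over three zones.
placement = spread_across_zones(
    ["web-1", "web-2", "web-3", "web-4"],
    ["zone-a", "zone-b", "zone-c"],
)

# If zone-a loses power, the servers in zone-b and zone-c keep serving.
still_up = surviving_servers(placement, "zone-a")
print(still_up)
```

With a single zone, the same failure would have taken every server offline at once.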
For other customers, monitoring and frequent backups can mitigate problems when incidents occur. With careful monitoring, engineers are informed immediately when servers become unreachable. Then, using snapshots of all servers taken on an hourly basis, they can boot the affected servers in AZs that are still available.
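The monitor-then-restore approach can be sketched as a small planning step: for every server that monitoring reports as unreachable, pick its most recent snapshot and a zone that is still healthy. This is a minimal sketch under stated assumptions; the `Snapshot` type, the `recovery_plan` function and all server and zone names are made up for illustration, and actually booting a server from a snapshot would go through the cloud provider's own API.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Snapshot:
    server: str
    taken_at: datetime


def latest_snapshot(snapshots, server):
    """Return the most recent snapshot for a server, or None if it has none."""
    candidates = [s for s in snapshots if s.server == server]
    return max(candidates, key=lambda s: s.taken_at) if candidates else None


def recovery_plan(reachable, home_zone, snapshots, healthy_zones):
    """For each unreachable server, choose a snapshot and a healthy target zone."""
    plan = []
    for server, is_up in reachable.items():
        if is_up:
            continue
        snapshot = latest_snapshot(snapshots, server)
        targets = [z for z in healthy_zones if z != home_zone[server]]
        if snapshot is None or not targets:
            continue  # nothing to restore from, or nowhere to restore to
        plan.append((server, snapshot.taken_at, targets[0]))
    return plan


# Hypothetical monitoring result: db-1 in zone-a has become unreachable.
reachable = {"db-1": False, "web-1": True}
home_zone = {"db-1": "zone-a", "web-1": "zone-b"}
snapshots = [
    Snapshot("db-1", datetime(2011, 8, 7, 16, 0)),
    Snapshot("db-1", datetime(2011, 8, 7, 17, 0)),  # most recent hourly snapshot
    Snapshot("web-1", datetime(2011, 8, 7, 17, 0)),
]

plan = recovery_plan(reachable, home_zone, snapshots, ["zone-b", "zone-c"])
print(plan)
```

With hourly snapshots, the restored server can be missing up to an hour of changes, so the snapshot interval should match how much data loss the business can tolerate.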
Several physical redundancies can also help to mitigate problems. These include redundant power supplies and backup generators that are tested to ensure they will kick in when the power fails, unlike Amazon’s generators in Dublin. Running redundant Internet connections simultaneously provides a backup if one provider fails or performs poorly. Redundant hardware, such as multiple hard drives and other components, can be arranged so that if one fails, another immediately and seamlessly takes its place.
An outage like this is the type of event that causes some
businesses to doubt the benefits of cloud computing and to believe that
it is not a safe or reliable method of hosting data. While these doubts
are understandable, there are a variety of techniques a
cloud computing company can apply to mitigate disruption caused by
weather events and the resulting power outages.