Power & Frailty
Part 1: Why do cloud computing services fail?
Cloud computing is, to put it plainly, complex. It has more moving parts than you might initially believe, and anything with many components has multiple points of failure. That makes it vulnerable to human error, software bugs, hardware degradation, deliberate sabotage, and even natural disasters.
The infamous February 2017 AWS S3 outage (lasting more than 4 hours) is just one (recent) example of how one service in one region from one service provider "broke the internet." Thousands of websites and technology services companies were affected because they relied on AWS S3 for their storage needs. And when the service failed, these businesses were at the mercy of AWS S3's recovery. That's right, they could only wait (we'll come back to this later, or you can jump to Part 2).
For those of you who follow the news, you already know that S3 in the US East region went down because of nothing more than human error. Apparently an unassuming employee mistyped a single command and took the service down across the entire region. That's all it required.
But is it really surprising? At the end of the day, the cloud is just a bunch of computer hardware strung together over a network and placed somewhere far away from your business. It's a bunch of server farms (to simplify). And on this farm, there are farmers who have to make sure everything runs smoothly. There are routine updates to software and upgrades to hardware. There is a ton of migration work, scheduled maintenance, and retirement of legacy equipment. Oh, and all this fancy technology runs on electricity, which means even a failure in the power grid and generators can shut it all down.
Now consider what happens when this gigantic server farm needs to serve hundreds of thousands of customers, each with unique needs and different volumes of traffic drawing on these resources. Add to that the fact that each business uses multiple services that need to coordinate with each other as well. We think you get the picture.
There's no reason to belabor the point that the cloud, for all its wonders and value, is ultimately a fragile system. And that is before accounting for hacking or natural disasters like earthquakes, which can destroy cloud infrastructure, causing downtime and ultimately interrupting business for customers.
If nothing else, know that while public cloud computing has made the lives of businesses a lot easier than running their IT infrastructure locally, cloud computing is ultimately a technology with vulnerabilities.
*Cloud providers usually designate availability ratings using a "9s" system. Actual downtime, however, can depart significantly from what is advertised, and research shows that outages can run as much as 15% to 20% longer in practice.
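99% availability ("two nines"): 14m 24.0s of downtime per day, 1h 40m 48.0s per week, 7h 18m 17.5s per month, 3d 15h 39m 29.5s per year
99.9% availability ("three nines"): 8h 45m 57.0s of downtime per year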
Data courtesy of CloudEndure and Hosting Manual 2017.
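If you want to see where downtime figures like these come from, the arithmetic is simple: take the fraction of time a service is allowed to be unavailable and multiply it by the length of the period. Here is a minimal sketch in Python. The function names are our own, and we assume an average year of 365.2425 days (with a month as one twelfth of that), which happens to reproduce the figures above. Providers may define their measurement periods differently, so treat this as an illustration rather than how any SLA is actually calculated.

```python
from datetime import timedelta

# Assumption: an average Gregorian year; a month is one twelfth of it.
DAYS_PER_YEAR = 365.2425
PERIODS = {
    "day": timedelta(days=1),
    "week": timedelta(days=7),
    "month": timedelta(days=DAYS_PER_YEAR / 12),
    "year": timedelta(days=DAYS_PER_YEAR),
}


def allowed_downtime(availability_percent: float, period: str) -> timedelta:
    """Maximum downtime an availability rating permits over one period."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    unavailable_fraction = (100 - availability_percent) / 100
    return PERIODS[period] * unavailable_fraction


def format_duration(delta: timedelta) -> str:
    """Render a duration as e.g. '3d 15h 39m 29.5s' (tenths of a second)."""
    tenths = round(delta.total_seconds() * 10)
    days, tenths = divmod(tenths, 24 * 36000)
    hours, tenths = divmod(tenths, 36000)
    minutes, tenths = divmod(tenths, 600)
    parts = []
    if days:
        parts.append(f"{days}d")
    if days or hours:
        parts.append(f"{hours}h")
    if days or hours or minutes:
        parts.append(f"{minutes}m")
    parts.append(f"{tenths / 10:.1f}s")
    return " ".join(parts)


if __name__ == "__main__":
    for rating in (99, 99.9):
        for period in PERIODS:
            downtime = format_duration(allowed_downtime(rating, period))
            print(f"{rating}% over one {period}: {downtime}")
```

Running it for a 99% rating gives 14m 24.0s per day and 3d 15h 39m 29.5s per year, and for 99.9% it gives 8h 45m 57.0s per year. Keep in mind that these are the advertised ceilings, not what you will necessarily experience.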