Power & Frailty
Part 2: The Myth of Painless Operational Recovery
When the cloud services that your business runs on suffer outages, the ultimate ramification is that certain parts of your operations are disrupted. Naturally, assuming you have set up your cloud architecture using best practices, you'll go through a set of recovery and failover protocols. And boom, you're back up and running, right? Well... not exactly.
Operational recovery isn't done at the flip of a magic switch. The truth is, when the cloud fails and a business's operations are interrupted, it actually takes quite some time to identify whether the cloud service provider is at fault or whether operations stopped for other reasons. After eliminating self-induced issues, someone in your business will finally call up your cloud service provider to ask whether the cloud services you use are actually down.
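To make that triage step concrete, here is a minimal sketch of how a team might sort probe results into "our fault" versus "their fault" before picking up the phone. Everything in it is an assumption for illustration: the `CheckResult` type, the `triage` function, and the probe names are hypothetical, and real probes (deploy status, DNS, provider status pages) would have to feed it.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CheckResult:
    name: str      # hypothetical label for the probe, e.g. "latest-deploy"
    scope: str     # "internal" (your own stack) or "provider" (cloud dependency)
    healthy: bool  # True if the probe passed


def triage(results: List[CheckResult]) -> Tuple[str, List[str]]:
    """Classify an incident from probe results.

    Returns a verdict plus the names of the failing probes.
    """
    internal_down = [r.name for r in results
                     if r.scope == "internal" and not r.healthy]
    provider_down = [r.name for r in results
                     if r.scope == "provider" and not r.healthy]

    if internal_down and provider_down:
        # Both sides look broken: internal failures may simply be
        # symptoms of the provider outage, so a human has to dig in.
        return "unclear", internal_down + provider_down
    if internal_down:
        return "self-induced", internal_down
    if provider_down:
        return "provider", provider_down
    return "healthy", []


# Illustrative usage with made-up probe names.
verdict, failing = triage([
    CheckResult("latest-deploy", "internal", True),
    CheckResult("object-storage", "provider", False),
])
```

Even with a helper like this, the "unclear" branch is the common one in practice, which is a big part of why diagnosis eats up so much time.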
On the other side of the world, there's already a scramble happening to bring the service back up. Trust us when we tell you that it's sheer chaos (we used to work at major cloud computing companies). Unfortunately, the cloud engineers don't immediately know what exactly went wrong to cause the service outage. It takes time and diagnosis to identify the root cause of the problem. Then it takes even more time to fix the problem, which may include rebooting instances or sometimes even resetting a whole region or sub-region of servers. That's why downtime can last anywhere from a few unnoticeable minutes to record-setting weeks.
And what are businesses doing while their cloud service providers are troubleshooting? Waiting. That's it. All they can do is wait. And bleed cash.
But wait, what about high-availability architecture? Isn't that enough to thwart the pains of potential downtimes? Read on, my friend.
[Figure: Cloud Outage Root-Cause Breakdown. Categories shown: UPS System Failure; Cyber Crimes (DDoS, etc.); Water, Cooling, CRAC Failure; IT Equipment Failure. Data courtesy of Uptime Institute, 2017.]