Cloud Computing: Power & Frailty
For all its operational, scale, and cost advantages, cloud computing is an (un)surprisingly fragile technology.
Part 3: High-availability Cloud Architecture Just Isn't Enough
To mitigate the pain of outages, businesses often set up high-availability (HA) cloud architecture. Typically this means running cloud services across multiple instances in different sub-regions and even regions. Some go as far as spanning their architecture across multiple cloud service providers, though this is very uncommon. This way, when a cloud service fails in one area, the business can shift over to a backup geography.
HA architecture is complex, and it is difficult to achieve affordably. Higher availability requires more sophisticated architectural components that work in sync, communicate with one another, monitor performance, detect problems, and fail over gracefully to other instances and geographies when cloud outages occur. This isn’t free. The more resilient a cloud architecture, the more expensive it is to implement. In other words, resiliency and cost rise together. Note, however, that they do not rise proportionally. At some point, additional spending on HA stops paying off, and that drop-off becomes dramatically steeper over time and with scale.
[Figure labels: HA Expenditures, HA Benefit]
This creates tension in business decisions. Does your business overspend on HA, incurring higher costs without experiencing enough outages each year to realize the benefits of that spending? Or does your business underspend on HA, leaving your operations exposed to the financial risks of unanticipated downtime? Most professionals agree that this optimization question is difficult to answer. The reason is fairly obvious: it is hard to foresee what types of outages will occur and how long they will last, which makes the optimal point an unpredictable, ever-moving target on the spectrum of cost management.
Spending on HA is usually a fixed cost that scales with the overall size of your cloud infrastructure. But the benefits don’t scale that way. In fact, benefits are realized only when outages happen. So you can easily overspend on HA and disaster recovery (DR) in periods when outages never disrupt your business. On the other hand, you can badly underestimate the number of outages and the length of downtime after you’ve already committed your HA and DR budget. When an outage occurs, your business cannot simply buy and implement more HA and DR on the spot, and the redundancies and failovers already in place may not be enough to deliver a painless financial and operational recovery.
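The mismatch between fixed HA spending and outage-dependent benefits can be illustrated with a minimal sketch. All figures below (annual HA spend, loss per hour of downtime, hours per outage, and the fraction of loss that HA avoids) are hypothetical assumptions chosen for illustration, not data from this article.

```python
def net_ha_value(ha_spend, outages_per_year, hours_per_outage,
                 loss_per_hour, mitigated_fraction):
    """Return avoided outage losses minus HA spend for one year.

    ha_spend is paid whether or not outages happen; the benefit only
    exists when outages_per_year is greater than zero.
    """
    avoided_loss = (outages_per_year * hours_per_outage
                    * loss_per_hour * mitigated_fraction)
    return avoided_loss - ha_spend


# Hypothetical inputs: $50,000/year on HA, $10,000 lost per hour of
# downtime, 2 hours per outage, HA assumed to avoid 75% of the loss.
quiet_year = net_ha_value(50_000, 1, 2, 10_000, 0.75)
bad_year = net_ha_value(50_000, 6, 2, 10_000, 0.75)

print(quiet_year)  # -35000.0: the HA spend exceeded the benefit
print(bad_year)    # 40000.0: the same spend paid for itself
```

The same HA budget looks like waste in the quiet year and like a bargain in the bad year, and the business has to commit to it before knowing which year it will get.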
1) Over-pay for HA without reaping the commensurate benefits of a safer cloud operation, thus incurring unnecessary expenditures in the long run.
As more businesses migrate to the cloud, the competitive advantages it affords will start to converge. Businesses must therefore move towards optimizing their cloud expenditures, including how they deal with the financial fallout of cloud outages.
As cloud computing becomes more commonly used to run businesses, and reliance on it increases, we ask: can our economy and society really afford to dismiss the financial costs of outages? Today, one hour of downtime here or there may not cost very much (actually, it does), but tomorrow that same hour of downtime will cost even more (and it will).
It's time we complement the technical and preventative measures with a financial solution that makes business sense.

[Figure: High-availability Cost-Benefit Model. Axes: $ versus Scale. Labeled regions: Value Surplus, Inefficiencies. Marked point: Optimal?]
There is a point at which additional spending on HA no longer delivers matching value. In practice, it is exceedingly difficult to pinpoint where the optimal scale is. As a result, most businesses either overspend on HA, generating waste, or underutilize HA, exposing themselves to downtime and financial risk.
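To make the idea of an optimal point concrete, here is a rough sketch that searches for the HA spend with the lowest total cost. It assumes the expected outage loss shrinks exponentially as HA spend grows; that curve shape and every number in it (`unprotected_loss`, `decay_scale`, the candidate grid) are hypothetical. In practice the loss curve is exactly the part that is hard to know.

```python
import math


def expected_outage_loss(ha_spend, unprotected_loss=200_000.0,
                         decay_scale=40_000.0):
    """Assumed model: each extra dollar of HA removes a shrinking
    share of the expected yearly outage loss (diminishing returns)."""
    return unprotected_loss * math.exp(-ha_spend / decay_scale)


def total_cost(ha_spend):
    """HA spend plus the outage loss still expected at that spend."""
    return ha_spend + expected_outage_loss(ha_spend)


def cheapest_spend(candidates):
    """Return the candidate HA spend with the lowest total cost."""
    return min(candidates, key=total_cost)


# Try HA budgets from $0 to $200,000 in $5,000 steps.
best = cheapest_spend(range(0, 200_001, 5_000))
print(best, round(total_cost(best)))
```

Under these assumed numbers the total cost is higher on both sides of `best`: spending less leaves more outage loss than it saves, and spending more buys less protection than it costs. Change the assumed loss curve and the optimum moves, which is why the point is so hard to pin down.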

Short RPO + Short RTO = Highest Cost
Long RPO + Long RTO = Lowest Cost
Long RPO + Short RTO = Medium Cost
Short RPO + Long RTO = Medium Cost
Recovery Time Objective (RTO) is the duration of time and a service level within which a business process must be restored after a disaster in order to avoid unacceptable consequences associated with a break in continuity.
Recovery Point Objective (RPO) describes the interval of time that might pass during a disruption before the quantity of data lost during that period exceeds the Business Continuity Plan’s maximum allowable threshold or “tolerance.”
The shorter your RTO and RPO, the lower the business impact you experience from a cloud outage. However, achieving a shorter RTO and RPO requires spending more on both backup systems and administrative overhead.
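The four RPO/RTO combinations listed above can be expressed as a small lookup. This is only a sketch: the one-hour cutoff between "short" and "long" is a hypothetical assumption, and a real business would set it from its own continuity plan.

```python
def cost_tier(rpo_hours, rto_hours, short_threshold_hours=1.0):
    """Classify an RPO/RTO target into a relative cost tier.

    short_threshold_hours is an assumed cutoff: objectives at or
    below it count as "short", anything above counts as "long".
    """
    short_rpo = rpo_hours <= short_threshold_hours
    short_rto = rto_hours <= short_threshold_hours
    if short_rpo and short_rto:
        return "Highest Cost"
    if not short_rpo and not short_rto:
        return "Lowest Cost"
    # Exactly one of the two objectives is short.
    return "Medium Cost"


print(cost_tier(rpo_hours=0.25, rto_hours=0.5))  # Highest Cost
print(cost_tier(rpo_hours=24, rto_hours=48))     # Lowest Cost
print(cost_tier(rpo_hours=24, rto_hours=0.5))    # Medium Cost
```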
[Figure: Costs of Disaster Recovery. Axes: Cost ($) versus Time. Marked event: Outage Incident]
Consequently, two behaviors manifest, depending on what type or size of business makes the decision. Enterprises typically have deeper pockets and would rather knowingly over-pay for HA and mitigate whatever cloud-related downtimes may occur. Small and medium businesses (SMBs), on the other hand, have more constrained financial resources, and existing research tells us that most SMBs actually under-pay for HA, leaving their operations exposed to the financial risks deriving from cloud service outages.
2) Under-spend on HA/DR due to resource constraints, thus leaving business operations overexposed to the financial risks of cloud outages.
[Figure labels: Overall Costs, Costs of Outage, HA/DR Expenses, Risk of Outage, Inefficiencies, Optimal, Over Exposure]
