The cost of cloud resilience: are there alternatives?
If you want to run cloud applications fail-safe, you have to think about their architecture. A study compares different scenarios, and we highlight insurance as a cost-effective alternative.
Companies that move their applications to the cloud expect not only lower costs but above all greater reliability. Cloud-native applications, at least in theory, run on some virtual machine somewhere in the cloud.
If the VM dies, the application restarts in another VM, or several instances of the application run on different VMs so that one is always available somewhere. In practice, things do not always run so smoothly, as the major AWS failure in December last year showed.
The Uptime Institute, which specializes in consulting, research and certification in the field of data centers, took a closer look at the reliability of a distributed, cloud-native application. The question was: how much effort does it take to achieve a given level of fail-safety, and how much does that cost? Amazon's AWS served as the example cloud, but the data should be roughly transferable to other large hyperscalers.
Failsafe at all costs?
The study is based on the division into zones and regions, as used by all major cloud providers: regions such as us-east-1 or eu-central-1 are completely separate from other regions, and the failure of one region should not affect other regions.
Resources are not automatically replicated across different regions. Zones (e.g. us-east-1a) are isolated locations within a region and are designed to take over workloads from one another in the event of an instance failure.
The Uptime Institute considers three scenarios:
- a virtual machine failure,
- a zone failure, and
- an entire region failure.
The report contains relevant compensation levels as well as the cost of the scenario:
- no protection - 0% cost premium vs. 29% compensation for 2-day outage
- zone failure protection - 43% cost premium vs. 46% compensation for 2-day outage
- region failure protection - 111% cost premium vs. 62% compensation for 2-day outage
Thus, in the region protection scenario, cloud costs would more than double.
It is important to note that the cloud provider only compensates for the unavailable service usage, i.e. the customer only gets cloud credits back. Cloud SLAs do not cover business interruption losses or reputational damage.
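To put the figures from the list above side by side, here is a minimal Python sketch. The baseline bill of 10,000 is a made-up number, and treating the compensation percentage as a share of that scenario's monthly bill is an assumption; the article does not state what the percentage is applied to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    cost_premium: float   # extra cost relative to the unprotected setup
    compensation: float   # share credited by the provider for a 2-day outage


# Figures as quoted in the article.
SCENARIOS = [
    Scenario("no protection", 0.00, 0.29),
    Scenario("zone failure protection", 0.43, 0.46),
    Scenario("region failure protection", 1.11, 0.62),
]


def monthly_bill(baseline: float, scenario: Scenario) -> float:
    """Monthly cloud bill including the resilience premium."""
    return baseline * (1.0 + scenario.cost_premium)


def outage_credit(baseline: float, scenario: Scenario) -> float:
    """Cloud credits for a 2-day outage.

    Assumption: the compensation share applies to the scenario's monthly bill.
    """
    return monthly_bill(baseline, scenario) * scenario.compensation


if __name__ == "__main__":
    baseline = 10_000.0  # hypothetical unprotected monthly bill
    for s in SCENARIOS:
        bill = monthly_bill(baseline, s)
        credit = outage_credit(baseline, s)
        print(f"{s.name:28s} bill={bill:10.2f} credit={credit:10.2f}")
```

Whatever base the credit is applied to, it is paid in cloud credits only, so it says nothing about the revenue lost during the outage.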
Multi-region and multi-cloud setups are extremely costly
The most complex scenario can also handle the failure of an entire region. It is a so-called active-active replication, where the same IT setup is built a second time in a second region. The same approach is also applicable to multi-cloud scenarios.
It becomes more expensive if the instances are distributed over two regions: Depending on whether the instance in the second region is already running or has yet to be ramped up, the additional costs are between 51 and 111 percent in the active-active scenario. The calculated availability is 99.999999 percent - that is less than one second of failure per year.
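The availability figure can be checked with a few lines of Python. This is a rough sketch: the first function converts an availability percentage into downtime per year, and the second shows one way such a figure can arise, assuming two fully independent regions that are each available 99.99 percent of the time. That per-region value and the independence assumption are illustrative and are not taken from the study.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # non-leap year


def downtime_seconds_per_year(availability_percent: float) -> float:
    """Expected downtime per year for a given availability percentage."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * SECONDS_PER_YEAR


def combined_availability_percent(single_percent: float, replicas: int) -> float:
    """Availability of independent active-active replicas.

    Assumption: failures are independent, so the service is down only
    when every replica is down at the same time.
    """
    p_down = 1.0 - single_percent / 100.0
    return (1.0 - p_down ** replicas) * 100.0


if __name__ == "__main__":
    print(downtime_seconds_per_year(99.999999))
    print(combined_availability_percent(99.99, 2))
```

For 99.999999 percent the first function yields roughly 0.32 seconds per year, which is consistent with the "less than one second" stated above.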
Most cloud architects build resilience architectures for failover scenarios in single regions. Regional failovers are only considered in very critical and highly sensitive applications as those would double the cloud costs.
However, companies directly depend on their cloud vendors and in case of any regional failure the customers have to invoke their backup architecture. Typically this would lead to a downtime of several days. For any online company this risk is catastrophic and could lead to extreme impacts on the business side.
Downtime insurance as a potential alternative
It is always possible to double the IT costs and invest more in resilience and active-active failover architectures. However, only a few businesses have such deep pockets, as cloud costs are already one of the key cost drivers. Even online banks, insurers and other financial organizations are not investing in multi-region or multi-cloud setups unless regulators start demanding it.
Insurance can act as a potential solution by offering guarantees to customers and business partners in case services become unavailable. If these agreements are contractually assured and backed by an insurance company, this creates trust among customers and financial resilience for the business.
Such an insurance solution would pay out a specific, pre-agreed amount to the insured in the event of a third-party cloud outage that impacts their services.
The study by the Uptime Institute is available for download.