Cloud Monitoring System
Riskwolf uses a bespoke monitoring platform and technology to measure the state of cloud services and to identify issues and downtimes.
To do this, the following data categories are used:
- Official statements of ISPs and cloud providers (e.g. Azure, Amazon, Cloudflare, etc.)
- Riskwolf's own installations at cloud operators, which allow us to receive notifications when issues are detected
Riskwolf is able to generate end-to-end insights into the digital customer experience reflecting any downtimes and outages. Details on the possible pricing scenarios can be found in the technical pricing section.
Riskwolf operates a fleet of monitoring agents located at major cloud providers. The following is an example of such an agent network.
Measuring cloud service availability for IaaS
Using the agent network, it is possible to determine the availability status of any cloud region globally. The setup can be calibrated to the insurer's needs.
The level of detail is given by the combination of:
- Cloud provider (e.g. Azure)
- Cloud region (e.g. westeurope)
- Cloud service (e.g. Virtual Machine)
- Deployment setup (Availability zones, redundancy, etc.)
This information has to be provided by the policyholder in order to correctly determine the downtime status and calculate the claim amount.
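The combination above can be pictured as a small record. The following is a minimal sketch only: the type name `MonitoringTarget`, its field names and the `key()` helper are hypothetical and are not part of the Riskwolf platform.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MonitoringTarget:
    """One provider/region/service combination (hypothetical type, for illustration)."""
    provider: str                # e.g. "Azure"
    region: str                  # e.g. "westeurope"
    service: str                 # e.g. "Virtual Machine"
    zones: Tuple[str, ...] = ()  # deployment setup: zones in use, empty if not zonal

    def key(self) -> str:
        """Stable identifier that measurements for this target can be grouped by."""
        return "/".join((self.provider, self.region, self.service)).lower()


# Example values taken from the list above; the zone labels are assumptions.
target = MonitoringTarget("Azure", "westeurope", "Virtual Machine", zones=("1", "2"))
print(target.key())
```

Keeping the deployment setup next to the provider, region and service makes it explicit that the same service can have a different downtime status depending on how the policyholder has deployed it.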
To achieve high availability of the monitoring system, the scan agents in every location are deployed in High Availability mode, with multiple running instances spread across availability zones.
This prevents a monitoring outage if a single zone fails. To further reduce the risk of a monitoring system outage, downtime and unavailability are measured from multiple locations globally towards representative cloud agents running at the cloud provider.
In addition, monitoring is always done from a different cloud provider (e.g. GCP/AWS monitors Azure, Azure/GCP monitors AWS, etc.).
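The cross-provider rule can be expressed as a simple filter. This is a sketch under assumptions: the provider list is taken from the examples in the text, and the function name `monitoring_providers` is hypothetical.

```python
from typing import List

# Providers that host scan agents (illustrative list based on the examples above).
AGENT_PROVIDERS: List[str] = ["AWS", "Azure", "GCP"]


def monitoring_providers(target_provider: str) -> List[str]:
    """Return the providers whose agents may monitor target_provider.

    The monitored provider itself is always excluded, so an outage at the
    target cannot take down the agents observing it.
    """
    return [p for p in AGENT_PROVIDERS if p.lower() != target_provider.lower()]


print(monitoring_providers("Azure"))
```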
The agent is usually deployed as a Docker container, which allows it to run in Docker or Kubernetes, or on-premises if there is a specific need for it. The monitoring system stores the collected data in a highly available, zone-redundant storage solution, from which the data is regularly collected and processed.
The following picture shows the monitoring system architecture.
Example of a monitoring setup using three cloud regions:
- Frankfurt - AWS (eu-central-1)
- Singapore - GCP (asia-southeast1)
- Hong Kong - AWS (ap-east-1)
By default, measurements from at least 3 agents are taken against the target infrastructure of the cloud provider every 15 minutes.
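The example setup and the default schedule could be written down as follows. This is a sketch, not the actual configuration format: all names are hypothetical, the region identifiers follow the providers' own naming, and aligning measurements to quarter-hour boundaries is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List

MEASUREMENT_INTERVAL = timedelta(minutes=15)  # default interval from the text
MIN_AGENTS_PER_MEASUREMENT = 3                # "at least 3 agents"

# The three example agent locations.
AGENTS: List[Dict[str, str]] = [
    {"city": "Frankfurt", "provider": "AWS", "region": "eu-central-1"},
    {"city": "Singapore", "provider": "GCP", "region": "asia-southeast1"},
    {"city": "Hong Kong", "provider": "AWS", "region": "ap-east-1"},
]


def next_measurement(now: datetime) -> datetime:
    """Return the next quarter-hour boundary strictly after `now`.

    Assumption: measurements are aligned to :00, :15, :30 and :45.
    """
    floored = now.replace(minute=now.minute - now.minute % 15, second=0, microsecond=0)
    return floored + MEASUREMENT_INTERVAL


# The year is arbitrary and only used for illustration.
print(next_measurement(datetime(2024, 4, 27, 5, 7, 30, tzinfo=timezone.utc)))
```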
Depending on the service insured and the client's deployment setup, monitoring can be focused on a particular availability zone, e.g. virtual machines running in GCP/Hongkong/Zone asia-east2-c.
Using an aggregated view over all availability zones and regions it is possible to determine the uptime status for a particular cloud provider/region/service combination.
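One possible way to build such an aggregated view is sketched below. The record layout and the function name are hypothetical. The aggregation rule is an assumption chosen for illustration: a provider/region/service combination counts as available in an interval if at least one monitored zone is available. A zone-specific policy would need a stricter rule.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (provider, region, service, zone, interval_index, available)
Record = Tuple[str, str, str, str, int, bool]
TargetKey = Tuple[str, str, str]


def uptime_by_target(records: Iterable[Record]) -> Dict[TargetKey, float]:
    """Share of intervals in which at least one zone of the target was available."""
    per_interval: Dict[TargetKey, Dict[int, bool]] = defaultdict(dict)
    for provider, region, service, _zone, interval, available in records:
        slots = per_interval[(provider, region, service)]
        # Assumed rule: one available zone is enough for the interval to count as up.
        slots[interval] = slots.get(interval, False) or available
    return {key: sum(slots.values()) / len(slots) for key, slots in per_interval.items()}


# Illustrative data: two zones observed over four intervals.
zone_1 = [True, False, True, True]
zone_2 = [True, False, False, True]
records = [("Azure", "westeurope", "VM", "1", i, up) for i, up in enumerate(zone_1)]
records += [("Azure", "westeurope", "VM", "2", i, up) for i, up in enumerate(zone_2)]
print(uptime_by_target(records))
```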
Riskwolf continuously collects and stores the data for audit purposes.
In case of outages, the system automatically generates alerts and claim notifications as described below.
Detecting outages and downtimes
An outage is detected (see insured event) when one cloud service becomes unavailable at all insured cloud regions at the same point in time.
However, the measurement is taken individually at the level of a cloud region and service and will then be interpreted at policy level.
The event begins when the monitoring system reports a downtime of a service and the event ends when at least one defined service is available again.
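The policy-level interpretation can be sketched as follows, assuming one availability value per region and measurement interval. The function name and the input layout are hypothetical, and the second region name is only an illustrative choice.

```python
from typing import Dict, List, Tuple


def policy_events(status_by_region: Dict[str, List[bool]]) -> List[Tuple[int, int]]:
    """Return (start, end) interval indices of events at policy level.

    status_by_region maps each insured region to its per-interval availability
    (True = available). An event starts at the first interval in which no
    insured region is available and ends at the first interval in which at
    least one region is available again.
    """
    series = list(zip(*status_by_region.values()))
    events: List[Tuple[int, int]] = []
    start = None
    for i, slot in enumerate(series):
        all_down = not any(slot)
        if all_down and start is None:
            start = i
        elif not all_down and start is not None:
            events.append((start, i))
            start = None
    if start is not None:
        events.append((start, len(series)))  # event still open at the end of the data
    return events


status = {
    "westeurope":  [True, False, False, False, True],
    "northeurope": [True, True,  False, False, False],
}
print(policy_events(status))
```

In this example each region is down for three intervals, but only the two overlapping intervals form an event at policy level.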
A downtime is recorded when more than half of the scan agents detect that the service is unreachable. The service is then classified as not available. Measurements are taken every 15 minutes, so the event duration is always a multiple of a quarter of an hour (e.g. 1, 1.25, 1.5, 1.75 or 2 hours).
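The majority rule and the resulting event duration can be sketched in a few lines. The function names are hypothetical; the threshold and the 15-minute interval come from the text.

```python
from typing import Sequence

INTERVAL_HOURS = 0.25  # one measurement every 15 minutes


def is_down(agent_unreachable: Sequence[bool]) -> bool:
    """True when more than half of the agents report the service as unreachable."""
    return 2 * sum(agent_unreachable) > len(agent_unreachable)


def event_duration_hours(down_intervals: int) -> float:
    """Duration of an event that spans the given number of consecutive down intervals."""
    return down_intervals * INTERVAL_HOURS


print(is_down([True, True, False]))   # 2 of 3 agents report unreachable
print(is_down([True, False, False]))  # 1 of 3 agents reports unreachable
print(event_duration_hours(5))
```

Note that exactly half of the agents is not enough under this rule: with an even number of agents, a tie leaves the service classified as available.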
Example event for Azure westeurope Zone01 / VM:
On Apr 27, 05:15 UTC, 2/3 agents detected an unavailability error (Receive - Operation timed out after 7000 milliseconds with 0 bytes received) and marked Azure VM, westeurope Zone 01 as offline. On Apr 27, 05:30 UTC, 3/3 agents showed a green state again.
Two small events were detected where 2 of 3 agents reported a short downtime of 15 minutes.
Delays or increased latency do not qualify as an insured event. The service has to be fully unavailable and unusable. Interruptions of the insured's own Internet infrastructure, and interruptions of the ISPs or networks between the insured and the cloud provider, are not included.
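The distinction between a slow service and an unavailable one can be made explicit when a single probe result is classified. This is a sketch with hypothetical names; the 7000 ms timeout is taken from the example event above, and treating only a probe that did not complete as "unavailable" is an assumption about how the rule is applied.

```python
from dataclasses import dataclass
from typing import Optional

PROBE_TIMEOUT_MS = 7000  # timeout seen in the example event


@dataclass
class ProbeResult:
    completed: bool              # False if the request timed out or could not connect
    latency_ms: Optional[float]  # None when the probe did not complete


def counts_as_unavailable(result: ProbeResult) -> bool:
    """Only a probe that did not complete counts; a slow response does not."""
    return not result.completed


slow = ProbeResult(completed=True, latency_ms=6500.0)
timed_out = ProbeResult(completed=False, latency_ms=None)
print(counts_as_unavailable(slow), counts_as_unavailable(timed_out))
```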
Measuring cloud service availability for IaaS (internal)
In addition, Riskwolf leverages the internal monitoring and notification systems of cloud providers to collect further information about root causes and related problems.
There are different types of information that can be captured.
- Resource health checks
- Connectivity checks
- Incident notifications and root cause analysis
This information is additionally used to assess the details of larger incidents and outages. It is also leveraged in the pricing models.