Reliability is the ability of a system to remain operational over time, i.e., to offer the service it was designed for.
Cloud hardware and software fail. In this note, we look at methods to analyze and predict when components fail, and at how to mitigate the problem.
Defining the vocabulary
Reliability and Factors of Influence
Reliability is the probability that a system will perform its intended function without failure over a specified period of time. There are many factors that influence this value:
- Errors: the system is operational, but it produces wrong results, leading to unexpected behavior.
- Faults: some part of the system is not working; as a result, some of its functionality may be compromised.
- Failures: the system is not working and needs to be repaired.
- Performance issues: the system is operational, but it is not performing as expected, e.g. it is too slow.
Detecting and predicting failures
In a typical cloud system, we monitor the system to gather information about its health:
- Data collection: activity, errors, and configuration of each node.
- Storage: we store the data plus some metadata (usually as time series).
- Prediction: we use machine learning tools to predict these kinds of failures, so that we can address them proactively instead of just reactively (a minimal sketch follows this list).
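As a minimal sketch of this pipeline (all names and thresholds here are hypothetical, not from any specific monitoring stack), node metrics could be collected as time series and a very simple predictor could flag nodes that look likely to fail:

```python
import time
from collections import defaultdict, deque

# Hypothetical in-memory time-series store: node -> recent (timestamp, metrics) samples.
store = defaultdict(lambda: deque(maxlen=1000))

def collect(node, cpu_util, disk_errors, temperature):
    """Data collection: append one sample of a node's health metrics."""
    store[node].append((time.time(), {
        "cpu_util": cpu_util,
        "disk_errors": disk_errors,
        "temperature": temperature,
    }))

def likely_to_fail(node, window=10):
    """Toy 'prediction': flag a node whose recent samples show a jump in disk
    errors or sustained high temperature. A real system would feed these time
    series into a trained ML model instead of fixed thresholds."""
    samples = list(store[node])[-window:]
    if not samples:
        return False
    errors = [m["disk_errors"] for _, m in samples]
    temps = [m["temperature"] for _, m in samples]
    return errors[-1] > errors[0] + 5 or sum(temps) / len(temps) > 80

# Usage: proactively drain or inspect nodes that are flagged.
collect("node-42", cpu_util=0.7, disk_errors=2, temperature=65)
collect("node-42", cpu_util=0.9, disk_errors=9, temperature=70)
print(likely_to_fail("node-42"))  # True: disk errors jumped by more than 5
```

A real deployment would store the series in a proper time-series database and replace the threshold rule with a trained model.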
Then, in order to hide failures, every part of the stack should provide high availability and redundancy: not just the software but also:
- Power and cooling
- Network infrastructure
- Scheduling, see Cluster Resource Management.
- Storage.
Availability
$$ \text{Availability} = \frac{\text{Uptime}}{\text{Uptime} + \text{Downtime}} $$
MTTF: Mean Time To Failure
If failures are independent of each other and occur at a constant rate $r$, then
$$ \text{MTTF} = \frac{1}{r} $$
This definition does not include repair time. Equivalently, measured over a long window of operation:
$$ \text{MTTF} = \frac{\text{TIME}}{\text{FAILURES}} $$
MTBF: Mean Time Between Failures
If each failure is followed by a repair taking $x$ on average, then
$$ \text{MTBF} = \text{MTTF} + x $$
The availability is then a little different, since each failure adds a downtime of $x$:
$$ \text{Availability} = \frac{\text{MTTF}}{\text{MTTF} + x} = \frac{\text{MTTF}}{\text{MTBF}} $$
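As a quick worked example (the numbers are invented for illustration): a node with $\text{MTTF} = 1000$ hours and a mean repair time of $x = 2$ hours has $\text{MTBF} = 1002$ hours, so
$$ \text{Availability} = \frac{1000}{1002} \approx 99.8\% $$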
Basic Failure Analysis
In this section, we model the most common ways a system can fail and derive its availability under some simplifying assumptions.
Every component is Necessary
If we assume we have $N$ components, each with a mean time to failure of $\frac{1}{r}$, and every one of them is needed, then the system fails at rate $N \cdot r$ and has $\text{MTTF}_{\text{system}} = \frac{1}{N r}$: the failure rate is multiplied by the number of components. This suggests two remedies (a worked example follows the list):
- Ensure that each component is resilient (i.e., has a lower failure rate $r$), or
- Provide redundancy (a higher $N$, arranged so that not every component is necessary; see the next section).
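As a worked example (hypothetical numbers): a system of $N = 100$ components, each with $\text{MTTF} = \frac{1}{r} = 10$ years, in which every component is necessary, fails at rate $N r = 10$ failures per year, so
$$ \text{MTTF}_{\text{system}} = \frac{1}{N r} = 0.1 \text{ years} \approx 36.5 \text{ days} $$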
Only one is Necessary
We can model the number of working versions using Markov Chains, in a manner akin to what we have done in Queueing Theory for performance analysis.
$$ \text{MTTF}_{\text{system}} = \sum_{i=0}^{N - 1} \frac{1}{(N - i)\,r} $$
The idea is that while all $N$ components are up, failures occur at an aggregate rate of $Nr$, so the first failure takes $\frac{1}{Nr}$ on average; then $N - 1$ components remain, failing at rate $(N - 1)r$, and so on. Summing these expected waiting times gives the mean time until all components are down.
The system is down only when every component is down, so
$$ \text{Availability}_{\text{system}} = 1 - (1 - \text{Availability}_{\text{component}})^N $$
where, if each component fails $r$ times per year and each failure causes $x$ hours of downtime,
$$ \text{Availability}_{\text{component}} = 1 - \frac{r \cdot x}{365 \cdot 24} $$
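A minimal numeric sketch of the two formulas above (all numbers are illustrative assumptions, not values from the course):

```python
N = 3          # hypothetical number of redundant components (only one is needed)
r = 1.0        # hypothetical failure rate: 1 failure per component per year
x = 24.0       # hypothetical downtime per failure, in hours

# Mean time until *all* N components have failed (ignoring repairs), in years:
# 1/(N*r) until the first failure, 1/((N-1)*r) until the second, and so on.
mttf_system = sum(1.0 / ((N - i) * r) for i in range(N))

# Availability of one component: fraction of the year lost to r failures of x hours each.
avail_component = 1 - (r * x) / (365 * 24)
# The system is up as long as at least one component is up.
avail_system = 1 - (1 - avail_component) ** N

print(f"MTTF_system ~ {mttf_system:.2f} years")            # ~ 1.83
print(f"component availability ~ {avail_component:.5f}")   # ~ 0.99726
print(f"system availability    ~ {avail_system:.10f}")     # ~ 0.9999999794
```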
Parallelization and Reliability
If a single component processes data at a rate $X$, processing a dataset of size $M$ takes $M / X$; with $N$ components working in parallel, the parallelizable part takes $\frac{M}{N X}$. Amdahl's law gives the overall speedup when only a fraction $P$ of the work can be parallelized across $N$ machines:
$$ \text{Speedup} = \frac{1}{(1 - P) + \frac{P}{N}} $$
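As a worked example (hypothetical numbers): with a parallelizable fraction $P = 0.9$ and $N = 10$ machines,
$$ \text{Speedup} = \frac{1}{(1 - 0.9) + \frac{0.9}{10}} = \frac{1}{0.19} \approx 5.3 $$
and no number of machines can push the speedup beyond $\frac{1}{1 - P} = 10$.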
Best Effort
In some applications we do not want the service to be down, but it is acceptable for it to be inconsistent (remember Cloud Storage#CAP Theorem). Google search, for example, is a best-effort application: it offers no hard service guarantees, but does its best to answer every request.
Providing Fault Tolerance
We need some form of replication for better reliability and error correction. Sometimes sharding is done (remember Distributed file systems).
Data Striping
The idea is to distribute data among several disks so that it can be accessed in parallel; this is usually implemented at the RAID level (see Devices OS#RAID, level 0). It is a simple way to decrease latency and increase throughput.
Fine and Coarse Grained Striping
Fine-grain and coarse-grain data striping are techniques used in parallel computing and storage systems to distribute data across multiple devices to improve performance and fault tolerance.
Fine-grained:
- Definition: Data is divided into very small chunks and distributed across multiple disks or processing units.
- Advantages:
- Improves load balancing, as all disks contribute to every I/O request.
- Provides high parallelism, making it well-suited for applications with many small, frequent data accesses.
- Disadvantages:
- High overhead due to increased coordination among disks.
- May not be optimal for large sequential reads and writes.
- Example: RAID 3, where data is split at the byte or bit level across disks, requiring all disks to participate in every operation.
Coarse-grained:
- Definition: Data is divided into larger blocks before being distributed across multiple disks or processing units.
- Advantages:
- Reduces overhead by allowing each disk to handle independent requests.
- Better suited for workloads with large sequential data accesses.
- Disadvantages:
- Less effective load balancing for small requests since fewer disks participate per operation.
- May result in uneven disk utilization.
- Example: RAID 5, where data is striped at the block level, allowing parallel access without requiring all disks to be involved in every operation.
| Feature | Fine-Grain Striping | Coarse-Grain Striping |
| --- | --- | --- |
| Chunk Size | Very small (bytes or bits) | Large (blocks or files) |
| Parallelism | High | Moderate |
| Suitability | Frequent small requests | Large sequential reads/writes |
| Overhead | High | Lower |
| Load Balancing | Good | Can be uneven |
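As a minimal sketch of the difference (the stripe-unit sizes are illustrative): the only thing that changes between fine- and coarse-grained striping is the size of the chunk that is round-robined across the disks.

```python
def disk_for_offset(byte_offset, num_disks, stripe_unit):
    """Round-robin striping: which disk holds the byte at `byte_offset`."""
    chunk_index = byte_offset // stripe_unit
    return chunk_index % num_disks

NUM_DISKS = 4

# Fine-grained: tiny stripe unit (here 1 byte), so every request touches all disks.
fine = {off: disk_for_offset(off, NUM_DISKS, stripe_unit=1) for off in range(8)}

# Coarse-grained: large stripe unit (here 64 KiB), so small requests hit a single disk.
coarse = {off: disk_for_offset(off, NUM_DISKS, stripe_unit=64 * 1024)
          for off in (0, 1000, 70_000, 200_000)}

print(fine)    # {0: 0, 1: 1, 2: 2, 3: 3, 4: 0, ...} -> consecutive bytes spread over all disks
print(coarse)  # offsets 0 and 1000 both land on disk 0; 70_000 on disk 1; 200_000 on disk 3
```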
RAID levels
See Devices OS#RAID: you should be able to explain what every RAID level does.
Sharding
We can choose to have big or small shards.
- Big shards: easier to manage, but if one fails we lose a lot of data; a lot of data on the same disk can also create a bottleneck, called a hotspot, when many clients try to read from that shard.
- Small shards: harder to manage, since there are more of them, but they are more fault tolerant and give better load balancing (see the sketch below).
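As a small illustrative sketch (the hash-based assignment below is generic, not any particular system's scheme): with a few big shards, each shard handles a large share of the requests and is a hotspot candidate, while many small shards spread the same load more evenly.

```python
import hashlib
from collections import Counter

def shard_for(key, num_shards):
    """Hash-based shard assignment (illustrative; not any specific system's scheme)."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % num_shards

# Synthetic workload: 10,000 requests spread over 100 distinct keys.
requests = [f"user{i % 100}" for i in range(10_000)]

for num_shards in (4, 64):  # a few big shards vs. many small shards
    load = Counter(shard_for(k, num_shards) for k in requests)
    print(num_shards, "shards -> busiest shard handles", max(load.values()), "requests")
```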

Image from course Slides CCA ETHz 2025
Placing the blocks
This section mainly explores ideas presented in (Cidon et al. 2013).
Random Placing
Random block placement is a good idea if we assume that block failures are uncorrelated with each other. In reality, failures tend to be correlated (see (Cidon et al. 2013)); for example, a whole rack can fail or the network can go down.
Frameworks like HDFS (see Distributed file systems) place one replica on the node that receives the write, one on another node in the same rack, and one on a node elsewhere in the cluster, i.e., they use a deterministic placement policy.
The paper shows that the probability of losing data becomes very high if you use random placement with many nodes (at around 1000 nodes it is close to 100%).
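A back-of-the-envelope estimate in the spirit of this analysis (the block counts and failure sizes below are illustrative assumptions): under random placement, a block is lost when all $R$ of its replicas happen to land on nodes taken down by the same correlated failure.

```python
from math import comb

def p_data_loss(num_nodes, replication, num_blocks, failed_nodes):
    """Probability that at least one block loses all of its replicas when
    `failed_nodes` machines go down at the same time, with each block's
    replicas placed uniformly at random (rough approximation)."""
    p_block_lost = comb(failed_nodes, replication) / comb(num_nodes, replication)
    return 1 - (1 - p_block_lost) ** num_blocks

# Assume R = 3, 10,000 blocks per node, and a correlated event killing 1% of the cluster.
for n in (300, 1000, 10_000):
    p = p_data_loss(n, replication=3, num_blocks=n * 10_000, failed_nodes=n // 100)
    print(f"{n} nodes: P(data loss) ~ {p:.3f}")
```

With these assumptions the probability is already around 50% at 300 nodes and essentially 100% at 1000, consistent with the trend described above.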
Copyset Replication
A copyset (see (Cidon et al. 2013)) is a set of nodes that together hold all the replicas of some block. With plain random replication, practically every combination of $R$ nodes ends up being a copyset once there are many blocks, so losing any $R$ servers (e.g., three, with replication factor 3) makes you lose some data.
Copyset Replication says: if a data block is placed on one node, then the other copies of that block must be placed on nodes of the same copyset, which keeps the number of distinct copysets small.
- Place the first replica on a random node.
- Place the remaining replicas on the other nodes of that node's copyset (a minimal sketch follows this list). This gives better protection against data loss under correlated failures. The drawbacks are that:
- when we do lose data, we lose more of it at once (all the blocks stored on the lost copyset), and
- recovery is slower, since the lost data must be read back from the same small set of servers.
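A minimal sketch of the idea (not the paper's exact algorithm, which also creates overlapping copysets to control scatter width):

```python
import random

def make_copysets(nodes, replication):
    """Partition the nodes into disjoint copysets of size `replication`
    (a simplification: the real algorithm builds overlapping copysets
    to also control the scatter width)."""
    shuffled = nodes[:]
    random.shuffle(shuffled)
    return [shuffled[i:i + replication] for i in range(0, len(shuffled), replication)]

def place_block(block_id, copysets):
    """Pick one copyset at random (equivalent to picking a random node for the
    first replica and using its copyset) and store every replica on its nodes."""
    return {block_id: random.choice(copysets)}

nodes = [f"node-{i}" for i in range(9)]
copysets = make_copysets(nodes, replication=3)
print(copysets)                          # 3 disjoint copysets of 3 nodes each
print(place_block("blk_0", copysets))    # all replicas of blk_0 live in a single copyset
```

Now data is lost only if all the nodes of one of these few copysets fail together, instead of any 3 nodes in the cluster.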
Using an MTTF analysis like the one above, they argue that it is:
- better to lose 1 TB once every 625 years than
- to lose 1 GB every year.
Other Reliability Factors
Load Balancing
Look at Content Delivery Networks for more. Load balancing goes well with sharding; the ability to balance load depends on the nature of the jobs and on the data distribution:
- Micro-service & micro-shard tendencies go together
- the smaller the components, the more we can balance and distribute the load
REST makes load balancing easier:
- Requests are independent; neither the client nor the server keeps state (to some extent).
- Each request can be processed by a different machine:
  - Higher throughput by adding more machines.
  - Higher resilience by adding more machines.
Redundant Execution
This is a nice way to trim tail latency: you send the same request to two machines, and whichever finishes first provides the answer. It mitigates the problem of stragglers, explored in (Dean & Barroso 2013).
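A minimal sketch of such a hedged request (the replica names and latencies are made up; a real system would issue RPCs instead of sleeping):

```python
import asyncio
import random

async def query_replica(name):
    """Stand-in for an RPC to one replica, with variable (simulated) latency."""
    await asyncio.sleep(random.uniform(0.01, 0.2))
    return f"answer from {name}"

async def hedged_request():
    """Send the same request to two replicas, keep the first answer,
    and cancel the straggler."""
    tasks = [asyncio.create_task(query_replica(n)) for n in ("replica-A", "replica-B")]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    return done.pop().result()

print(asyncio.run(hedged_request()))
```

In practice, as discussed in (Dean & Barroso 2013), the second copy of the request is often sent only after the first has been outstanding for longer than, say, the 95th-percentile latency, so that the extra load stays small.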
Monitoring Tools
Many other techniques involved in maintaining availability:
- Canaries: a large-scale parallel job with a bug can trigger a massive failure; run the job first as a “canary” and only run it at scale if it shows no problems. The name comes from the canary in the coal mine: you first send synthetic requests to check that things work. This leads to gradual rollouts of big systems.
- Watchdog timers: regularly check the responsiveness of each machine to identify the ones that may cause trouble (a minimal sketch follows this list).
- Integrity checks: regular integrity checks that complement standard techniques in hardware to make sure that data is not lost or corrupted; run by background jobs
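A minimal sketch of a heartbeat-based watchdog (class and method names are hypothetical):

```python
import threading
import time

class Watchdog:
    """Flags a machine as unresponsive if it has not sent a heartbeat within
    `timeout` seconds (a minimal sketch of a watchdog timer)."""

    def __init__(self, timeout=5.0):
        self.timeout = timeout
        self.last_heartbeat = {}
        self.lock = threading.Lock()

    def heartbeat(self, machine):
        with self.lock:
            self.last_heartbeat[machine] = time.monotonic()

    def unresponsive(self):
        now = time.monotonic()
        with self.lock:
            return [m for m, t in self.last_heartbeat.items() if now - t > self.timeout]

# Usage sketch: machines call heartbeat() periodically; a background job polls
# unresponsive() and schedules flagged machines for inspection or draining.
wd = Watchdog(timeout=1.0)
wd.heartbeat("node-7")
time.sleep(1.2)
print(wd.unresponsive())  # ['node-7']
```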
Need good logging and analysis tools:
- ML is often applied today to better understand problems and help predict them
References
[1] J. Dean and L. A. Barroso, “The Tail at Scale,” Communications of the ACM, vol. 56, no. 2, pp. 74–80, 2013.
[2] A. Cidon et al., “Copysets: Reducing the Frequency of Data Loss in Cloud Storage,” in 2013 USENIX Annual Technical Conference (USENIX ATC 13), 2013.