Reliability is the ability of a system to remain operational over time, i.e., to offer the service it was designed for.
Cloud hardware and software fail. In this note, we look at methods to analyze and predict when components fail, and at how to mitigate the problem.
Defining the vocabulary
Reliability and Factors of Influence
Reliability is the probability that a system will perform its intended function without failure over a specified period of time. There are many factors that influence this value:
- Errors: the system is operational, but it produces wrong results, leading to unexpected behavior.
- Faults: some part of the system is not working; as a result, some of its functionality may be compromised.
- Failures: the system is not working and needs to be repaired.
- Performance issues: the system is operational, but it is not performing as expected, e.g. it is too slow.
Detecting and predicting failures
In a typical cloud system, we monitor the system to gather information about its health:
- Data collection: activity, errors, and configuration of each node.
- Storage: we store the data plus some metadata (usually as time series).
- Prediction: we use machine learning tools to predict these kinds of failures, so that we can address them proactively instead of just reactively (a minimal sketch follows this list).
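As a minimal sketch of this pipeline (all names and thresholds here are hypothetical, not from any specific monitoring stack), node metrics could be collected as time series and a very simple predictor could flag nodes that look likely to fail:

```python
import time
from collections import defaultdict, deque

# Hypothetical in-memory time-series store: node -> recent (timestamp, metrics) samples.
store = defaultdict(lambda: deque(maxlen=1000))

def collect(node, cpu_util, disk_errors, temperature):
    """Data collection: append one sample of a node's health metrics."""
    store[node].append((time.time(), {
        "cpu_util": cpu_util,
        "disk_errors": disk_errors,
        "temperature": temperature,
    }))

def likely_to_fail(node, window=10):
    """Toy 'prediction': flag a node whose recent samples show a jump in disk
    errors or sustained high temperature. A real system would feed these time
    series into a trained ML model instead of fixed thresholds."""
    samples = list(store[node])[-window:]
    if not samples:
        return False
    errors = [m["disk_errors"] for _, m in samples]
    temps = [m["temperature"] for _, m in samples]
    return errors[-1] > errors[0] + 5 or sum(temps) / len(temps) > 80

# Usage: proactively drain or inspect nodes that are flagged.
collect("node-42", cpu_util=0.7, disk_errors=2, temperature=65)
collect("node-42", cpu_util=0.9, disk_errors=9, temperature=70)
print(likely_to_fail("node-42"))  # True: disk errors jumped by more than 5
```

A real deployment would store the series in a proper time-series database and replace the threshold rule with a trained model.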
Then, in order to hide failures, every part of the stack should provide high availability and redundancy: not just the software but also:
- Power and cooling
- Network infrastructure
- Scheduling, see Cluster Resource Management.
- Storage.
Availability
$$ \text{Availability} = \frac{\text{Uptime}}{\text{Uptime} + \text{Downtime}} $$
MTTF: Mean Time To Failure
If failures are independent of each other and occur at a constant rate $r$, then
$$ \text{MTTF} = \frac{1}{r} $$
This definition does not include repair time. Equivalently, measured over a long window of operation:
$$ \text{MTTF} = \frac{\text{TIME}}{\text{FAILURES}} $$
MTBF: Mean Time Between Failures
If each failure is followed by a repair taking $x$ on average, then
$$ \text{MTBF} = \text{MTTF} + x $$
The availability is then a little different, since each failure adds a downtime of $x$:
$$ \text{Availability} = \frac{\text{MTTF}}{\text{MTTF} + x} = \frac{\text{MTTF}}{\text{MTBF}} $$
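As a quick worked example (the numbers are invented for illustration): a node with $\text{MTTF} = 1000$ hours and a mean repair time of $x = 2$ hours has $\text{MTBF} = 1002$ hours, so
$$ \text{Availability} = \frac{1000}{1002} \approx 99.8\% $$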
Basic Failure Analysis
In this section, we model the most common ways a system can fail and derive its availability under some simplifying assumptions.
Every component is Necessary
If we assume we have $N$ components, each with a mean time to failure of $\frac{1}{r}$, and every one of them is needed, then the system fails at rate $N \cdot r$ and has $\text{MTTF}_{\text{system}} = \frac{1}{N r}$: the failure rate is multiplied by the number of components. This suggests two remedies (a worked example follows the list):
- Ensure that each component is resilient (i.e., has a lower failure rate $r$), or
- Provide redundancy (a higher $N$, arranged so that not every component is necessary; see the next section).
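As a worked example (hypothetical numbers): a system of $N = 100$ components, each with $\text{MTTF} = \frac{1}{r} = 10$ years, in which every component is necessary, fails at rate $N r = 10$ failures per year, so
$$ \text{MTTF}_{\text{system}} = \frac{1}{N r} = 0.1 \text{ years} \approx 36.5 \text{ days} $$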
Only one is Necessary
We can model the number of working versions using Markov Chains, in a manner akin to what we have done in Queueing Theory for performance analysis.
$$ \text{MTTF}_{\text{system}} = \sum_{i=0}^{N - 1} \frac{1}{(N - i)\,r} $$
The idea is that while all $N$ components are up, failures occur at an aggregate rate of $Nr$, so the first failure takes $\frac{1}{Nr}$ on average; then $N - 1$ components remain, failing at rate $(N - 1)r$, and so on. Summing these expected waiting times gives the mean time until all components are down.
The system is down only when every component is down, so
$$ \text{Availability}_{\text{system}} = 1 - (1 - \text{Availability}_{\text{component}})^N $$
where, if each component fails $r$ times per year and each failure causes $x$ hours of downtime,
$$ \text{Availability}_{\text{component}} = 1 - \frac{r \cdot x}{365 \cdot 24} $$
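A minimal numeric sketch of the two formulas above (all numbers are illustrative assumptions, not values from the course):

```python
N = 3          # hypothetical number of redundant components (only one is needed)
r = 1.0        # hypothetical failure rate: 1 failure per component per year
x = 24.0       # hypothetical downtime per failure, in hours

# Mean time until *all* N components have failed (ignoring repairs), in years:
# 1/(N*r) until the first failure, 1/((N-1)*r) until the second, and so on.
mttf_system = sum(1.0 / ((N - i) * r) for i in range(N))

# Availability of one component: fraction of the year lost to r failures of x hours each.
avail_component = 1 - (r * x) / (365 * 24)
# The system is up as long as at least one component is up.
avail_system = 1 - (1 - avail_component) ** N

print(f"MTTF_system ~ {mttf_system:.2f} years")            # ~ 1.83
print(f"component availability ~ {avail_component:.5f}")   # ~ 0.99726
print(f"system availability    ~ {avail_system:.10f}")     # ~ 0.9999999794
```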
Parallelization and Reliability
If a single component processes data at a rate $X$, processing a dataset of size $M$ takes $M / X$; with $N$ components working in parallel, the parallelizable part takes $\frac{M}{N X}$. Amdahl's law gives the overall speedup when only a fraction $P$ of the work can be parallelized across $N$ machines:
$$ \text{Speedup} = \frac{1}{(1 - P) + \frac{P}{N}} $$
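As a worked example (hypothetical numbers): with a parallelizable fraction $P = 0.9$ and $N = 10$ machines,
$$ \text{Speedup} = \frac{1}{(1 - 0.9) + \frac{0.9}{10}} = \frac{1}{0.19} \approx 5.3 $$
and no number of machines can push the speedup beyond $\frac{1}{1 - P} = 10$.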
Best Effort
In some applications we do not want the service to be down, but it is acceptable for it to be inconsistent (remember Cloud Storage#CAP Theorem). Google search, for example, is a best-effort application: it offers no hard service guarantees, but does its best to answer every request.
Providing Fault Tolerance
We need some form of replication for better reliability and error correction. Sometimes sharding is done (remember Distributed file systems).
Data Striping
The idea is to distribute data among several disks so that it can be accessed in parallel; this is usually implemented at the RAID level (see Devices OS#RAID, level 0). It is a simple way to decrease latency and increase throughput.
Fine and Coarse Grained Striping
Fine-grain and coarse-grain data striping are techniques used in parallel computing and storage systems to distribute data across multiple devices to improve performance and fault tolerance.
Fine-grained:
- Definition: Data is divided into very small chunks and distributed across multiple disks or processing units.
- Advantages:
- Improves load balancing, as all disks contribute to every I/O request.
- Provides high parallelism, making it well-suited for applications with many small, frequent data accesses.
- Disadvantages:
- High overhead due to increased coordination among disks.
- May not be optimal for large sequential reads and writes.
- Example: RAID 3, where data is split at the byte or bit level across disks, requiring all disks to participate in every operation.
Coarse-grained:
- Definition: Data is divided into larger blocks before being distributed across multiple disks or processing units.
- Advantages:
- Reduces overhead by allowing each disk to handle independent requests.
- Better suited for workloads with large sequential data accesses.
- Disadvantages:
- Less effective load balancing for small requests since fewer disks participate per operation.
- May result in uneven disk utilization.
- Example: RAID 5, where data is striped at the block level, allowing parallel access without requiring all disks to be involved in every operation.
| Feature | Fine-Grain Striping | Coarse-Grain Striping |
| --- | --- | --- |
| Chunk Size | Very small (bytes or bits) | Large (blocks or files) |
| Parallelism | High | Moderate |
| Suitability | Frequent small requests | Large sequential reads/writes |
| Overhead | High | Lower |
| Load Balancing | Good | Can be uneven |
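As a minimal sketch of the difference (the stripe-unit sizes are illustrative): the only thing that changes between fine- and coarse-grained striping is the size of the chunk that is round-robined across the disks.

```python
def disk_for_offset(byte_offset, num_disks, stripe_unit):
    """Round-robin striping: which disk holds the byte at `byte_offset`."""
    chunk_index = byte_offset // stripe_unit
    return chunk_index % num_disks

NUM_DISKS = 4

# Fine-grained: tiny stripe unit (here 1 byte), so every request touches all disks.
fine = {off: disk_for_offset(off, NUM_DISKS, stripe_unit=1) for off in range(8)}

# Coarse-grained: large stripe unit (here 64 KiB), so small requests hit a single disk.
coarse = {off: disk_for_offset(off, NUM_DISKS, stripe_unit=64 * 1024)
          for off in (0, 1000, 70_000, 200_000)}

print(fine)    # {0: 0, 1: 1, 2: 2, 3: 3, 4: 0, ...} -> consecutive bytes spread over all disks
print(coarse)  # offsets 0 and 1000 both land on disk 0; 70_000 on disk 1; 200_000 on disk 3
```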
RAID levels
See Devices OS#RAID: you should be able to explain what every RAID level does.
Sharding
We can choose to have big or small shards.
- Big shards: easier to manage, but if one fails we lose a lot of data; a lot of data on the same disk can also create a bottleneck, called a hotspot, when many clients try to read from that shard.
- Small shards: harder to manage, since there are more of them, but they are more fault tolerant and give better load balancing (see the sketch below).
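As a small illustrative sketch (the hash-based assignment below is generic, not any particular system's scheme): with a few big shards, each shard handles a large share of the requests and is a hotspot candidate, while many small shards spread the same load more evenly.

```python
import hashlib
from collections import Counter

def shard_for(key, num_shards):
    """Hash-based shard assignment (illustrative; not any specific system's scheme)."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % num_shards

# Synthetic workload: 10,000 requests spread over 100 distinct keys.
requests = [f"user{i % 100}" for i in range(10_000)]

for num_shards in (4, 64):  # a few big shards vs. many small shards
    load = Counter(shard_for(k, num_shards) for k in requests)
    print(num_shards, "shards -> busiest shard handles", max(load.values()), "requests")
```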

Image from course Slides CCA ETHz 2025
Placing the blocks
This section mainly explores ideas presented in (Cidon et al. 2013).
Random Placing
Random block placement is a good idea if we assume that block failures are uncorrelated with each other. In reality, failures tend to be correlated (see (Cidon et al. 2013)); for example, a whole rack can fail or the network can go down.
Frameworks like HDFS (see Distributed file systems) place one replica on the node that receives the write, one on another node in the same rack, and one on a node elsewhere in the cluster, i.e., they use a deterministic placement policy.
The paper shows that the probability of losing data becomes very high if you use random placement with many nodes (at around 1000 nodes it is close to 100%).
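A back-of-the-envelope estimate in the spirit of this analysis (the block counts and failure sizes below are illustrative assumptions): under random placement, a block is lost when all $R$ of its replicas happen to land on nodes taken down by the same correlated failure.

```python
from math import comb

def p_data_loss(num_nodes, replication, num_blocks, failed_nodes):
    """Probability that at least one block loses all of its replicas when
    `failed_nodes` machines go down at the same time, with each block's
    replicas placed uniformly at random (rough approximation)."""
    p_block_lost = comb(failed_nodes, replication) / comb(num_nodes, replication)
    return 1 - (1 - p_block_lost) ** num_blocks

# Assume R = 3, 10,000 blocks per node, and a correlated event killing 1% of the cluster.
for n in (300, 1000, 10_000):
    p = p_data_loss(n, replication=3, num_blocks=n * 10_000, failed_nodes=n // 100)
    print(f"{n} nodes: P(data loss) ~ {p:.3f}")
```

With these assumptions the probability is already around 50% at 300 nodes and essentially 100% at 1000, consistent with the trend described above.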
Copyset Replication
A copyset (see (Cidon et al. 2013)) is a set of nodes that together hold all the replicas of some block. With plain random replication, practically every combination of $R$ nodes ends up being a copyset once there are many blocks, so losing any $R$ servers (e.g., three, with replication factor 3) makes you lose some data.
Copyset Replication says: if a data block is placed on one node, then the other copies of that block must be placed on nodes of the same copyset, which keeps the number of distinct copysets small.
- Place the first replica on a random node.
- Place the remaining replicas on the other nodes of that node's copyset (a minimal sketch follows this list). This gives better protection against data loss under correlated failures. The drawbacks are that:
- when we do lose data, we lose more of it at once (all the blocks stored on the lost copyset), and
- recovery is slower, since the lost data must be read back from the same small set of servers.
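A minimal sketch of the idea (not the paper's exact algorithm, which also creates overlapping copysets to control scatter width):

```python
import random

def make_copysets(nodes, replication):
    """Partition the nodes into disjoint copysets of size `replication`
    (a simplification: the real algorithm builds overlapping copysets
    to also control the scatter width)."""
    shuffled = nodes[:]
    random.shuffle(shuffled)
    return [shuffled[i:i + replication] for i in range(0, len(shuffled), replication)]

def place_block(block_id, copysets):
    """Pick one copyset at random (equivalent to picking a random node for the
    first replica and using its copyset) and store every replica on its nodes."""
    return {block_id: random.choice(copysets)}

nodes = [f"node-{i}" for i in range(9)]
copysets = make_copysets(nodes, replication=3)
print(copysets)                          # 3 disjoint copysets of 3 nodes each
print(place_block("blk_0", copysets))    # all replicas of blk_0 live in a single copyset
```

Now data is lost only if all the nodes of one of these few copysets fail together, instead of any 3 nodes in the cluster.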
Using an MTTF analysis like the one above, they argue that it is:
- better to lose 1 TB once every 625 years than
- to lose 1 GB every year.
Other Reliability Factors
Load Balancing
Look at Content Delivery Networks for more. Load balancing goes well with sharding; the ability to balance load depends on the nature of the jobs and on the data distribution:
- Micro-service & micro-shard tendencies go together
- the smaller the components, the more we can balance and distribute the load
REST makes load balancing easier:
- Requests are independent; neither the client nor the server keeps state (to some extent).
- Each request can be processed by a different machine:
  - Higher throughput by adding more machines.
  - Higher resilience by adding more machines.
Redundant Execution
This is a nice way to trim tail latency: you send the same request to two machines, and whichever finishes first provides the answer. It mitigates the problem of stragglers, explored in (Dean & Barroso 2013).
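A minimal sketch of such a hedged request (the replica names and latencies are made up; a real system would issue RPCs instead of sleeping):

```python
import asyncio
import random

async def query_replica(name):
    """Stand-in for an RPC to one replica, with variable (simulated) latency."""
    await asyncio.sleep(random.uniform(0.01, 0.2))
    return f"answer from {name}"

async def hedged_request():
    """Send the same request to two replicas, keep the first answer,
    and cancel the straggler."""
    tasks = [asyncio.create_task(query_replica(n)) for n in ("replica-A", "replica-B")]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    return done.pop().result()

print(asyncio.run(hedged_request()))
```

In practice, as discussed in (Dean & Barroso 2013), the second copy of the request is often sent only after the first has been outstanding for longer than, say, the 95th-percentile latency, so that the extra load stays small.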
Monitoring Tools
Many other techniques involved in maintaining availability:
- Canaries: a large-scale parallel job with a bug can trigger a massive failure; run the job first as a “canary” and only run it at scale if it shows no problems. The name comes from the canary in the coal mine: you first send synthetic requests to check that things work. This leads to gradual rollouts of big systems.
- Watchdog timers: regularly check the responsiveness of each machine to identify the ones that may cause trouble (a minimal sketch follows this list).
- Integrity checks: regular integrity checks that complement standard techniques in hardware to make sure that data is not lost or corrupted; run by background jobs
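A minimal sketch of a heartbeat-based watchdog (class and method names are hypothetical):

```python
import threading
import time

class Watchdog:
    """Flags a machine as unresponsive if it has not sent a heartbeat within
    `timeout` seconds (a minimal sketch of a watchdog timer)."""

    def __init__(self, timeout=5.0):
        self.timeout = timeout
        self.last_heartbeat = {}
        self.lock = threading.Lock()

    def heartbeat(self, machine):
        with self.lock:
            self.last_heartbeat[machine] = time.monotonic()

    def unresponsive(self):
        now = time.monotonic()
        with self.lock:
            return [m for m, t in self.last_heartbeat.items() if now - t > self.timeout]

# Usage sketch: machines call heartbeat() periodically; a background job polls
# unresponsive() and schedules flagged machines for inspection or draining.
wd = Watchdog(timeout=1.0)
wd.heartbeat("node-7")
time.sleep(1.2)
print(wd.unresponsive())  # ['node-7']
```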
Need good logging and analysis tools:
- ML is often applied today to better understand problems and help predict them
References
[1] J. Dean and L. A. Barroso, “The Tail at Scale,” Communications of the ACM, vol. 56, no. 2, pp. 74–80, 2013.
[2] A. Cidon et al., “Copysets: Reducing the Frequency of Data Loss in Cloud Storage,” in 2013 USENIX Annual Technical Conference (USENIX ATC 13), 2013.