Reliability is the ability of a system to remain operational over time, i.e., to offer the service it was designed for.

Cloud hardware and software fail. In this note we look at methods to analyze and predict when components fail, and at how we can mitigate this problem.

Defining the vocabulary

Availability

$$ \text{Availability} = \frac{\text{Uptime}}{\text{Uptime} + \text{Downtime}} $$

MTTF: Mean Time To Failure

$$ \text{MTTF} = \frac{1}{r} $$

Here $r$ is the failure rate. This definition does not include repair time and assumes that failures are independent of each other.

MTBF: Mean Time Between Failures

This also accounts for the repair time. If the mean time to repair is $x$, then the MTBF is just $x + \text{MTTF}$, and therefore:

$$ \text{Availability} = \frac{\text{MTTF}}{x + \text{MTTF}} $$
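As a quick sanity check, here is a minimal Python sketch of these formulas; the MTTF and repair-time values in the example are made up.

```python
# Minimal sketch of the availability formula above.
# Assumption: mttf and mttr (the repair time x) are in the same unit, e.g. hours.

def availability(mttf: float, mttr: float) -> float:
    """Availability = MTTF / (MTTR + MTTF) = MTTF / MTBF."""
    return mttf / (mttf + mttr)

# Example: a server that fails on average every 1000 hours and takes 2 hours to repair.
print(availability(mttf=1000, mttr=2))   # ~0.998
```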

Basic Failure Analysis

Every component is Necessary

If we assume we have $N$ components, each with a mean time to failure of $\frac{1}{r}$, then the system has a mean time to failure of $\frac{1}{N \cdot r}$: the failure rate is multiplied by the number of components. This suggests that we can:

  • Make each component more reliable (lower per-component failure rate $r$)
  • Provide redundancy, so that not every single component is necessary (see the next case)

With $r$ expressed in failures per year and the repair time $x$ in hours, the expected downtime is roughly $N \cdot r \cdot x$ hours per year, so:

$$ \text{Availability}_{\text{system}} = 1 - \frac{N \cdot r \cdot x}{365 \cdot 24} $$
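A small sketch of this series-system approximation, with the same unit assumptions ($r$ in failures per year, $x$ in hours); the numbers in the example are invented.

```python
# Sketch of the "every component is necessary" approximation above.
# Assumptions: r = failures per component per year, x = repair time in hours.

HOURS_PER_YEAR = 365 * 24

def series_availability(n: int, r: float, x: float) -> float:
    """Expected downtime is roughly n * r * x hours per year."""
    return 1 - (n * r * x) / HOURS_PER_YEAR

# Example: 100 disks, each failing twice a year, 4 hours to repair each failure.
print(series_availability(n=100, r=2, x=4))   # ~0.909
```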

Only one is Necessary

We can model the number of working versions using Markov Chains, in a manner akin to what we have done in Queueing Theory for performance analysis.

$$ \text{MTBF}_{\text{system}} = \sum_{i=0}^{N - 1} \frac{1}{(N - i)\,r} $$

where $r$ is the per-component failure rate: in the state where $i$ replicas have already failed, the remaining $N - i$ replicas fail at an aggregate rate of $(N - i)\,r$. The failure rate of the system is the inverse of its MTBF.

$$ \text{Availability}_{\text{system}} = 1 - (1 - \text{Availability}_{\text{component}})^N $$
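A corresponding sketch for the redundant case, assuming $N$ identical replicas with the same failure rate $r$; the values are illustrative.

```python
# Sketch of the "only one is necessary" case above.
# Assumption: N identical replicas, each with failure rate r.

def parallel_mtbf(n: int, r: float) -> float:
    """MTBF_system = sum_{i=0}^{N-1} 1 / ((N - i) * r)."""
    return sum(1.0 / ((n - i) * r) for i in range(n))

def parallel_availability(n: int, a_component: float) -> float:
    """The system is down only if all N replicas are down at the same time."""
    return 1 - (1 - a_component) ** n

print(parallel_mtbf(n=3, r=0.001))                    # ~1833 vs 1000 for a single replica
print(parallel_availability(n=3, a_component=0.99))   # 0.999999
```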

Parallelization and Reliability

If we had a single component that processes data at a rate $X$, then the time to process a dataset of size $M$ is $M / X$.

If we add more components, we can process the data in time $M / (N \cdot X)$, but we have also made the failure rate of the system higher, as the sketch below illustrates.
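A back-of-the-envelope sketch of the trade-off; the data size, processing rate, and failure rate are invented numbers.

```python
# More components: processing time shrinks as M / (N * X),
# but the aggregate failure rate grows as N * r when every component is needed.

M = 1e12        # bytes of data to process
X = 1e8         # bytes per second per component
r = 0.001       # failures per hour per component

for n in (1, 10, 100):
    time_hours = M / (n * X) / 3600
    failure_rate = n * r
    print(f"N={n:3d}: {time_hours:.2f} h to finish, {failure_rate:.3f} failures/hour")
```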

Best Effort

In some applications we do not want the service to be down, but it is acceptable for it to be inconsistent (remember Cloud Storage#CAP Theorem). For example, Google search can be seen as a best-effort application: it does not offer service guarantees, but it does its best to serve every request.

Error, Faults and Failures

Definition of the topic

  • Error: we get a result but it is incorrect (system is “working”, but not as specified)
    • Should be detected
    • Typically corrected by the system
  • Fault: some part of the system is not working and, as a result, some of its functionality might be compromised
    • Should be detected
    • Typically compensated through redundancy
  • Failure: system is not working
    • A range of situations that require different solutions
    • Requires more troubleshooting, since the causes can be multiple (performance problems, network errors, data corruption, etc.).

Detecting and predicting failures

In a typical cloud system we monitor the components to gather information about their health:

  • Data collection: activity, errors, configuration about each node.
  • Storage: We store the data plus some metadata (usually as time series).
  • Prediction: use machine learning tools to predict failures, so that we can handle them proactively instead of just reactively (a minimal sketch follows).
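A minimal, purely illustrative sketch of such a pipeline; the node names, metric fields, and threshold rule are all hypothetical (a real system would use a proper time-series store and an actual ML model).

```python
import time
from collections import defaultdict

# node_id -> list of (timestamp, metrics) samples, i.e. a tiny time-series store
timeseries = defaultdict(list)

def collect(node_id: str, metrics: dict) -> None:
    """Data collection + storage: append a timestamped sample for this node."""
    timeseries[node_id].append((time.time(), metrics))

def likely_to_fail(node_id: str, error_threshold: float = 0.05) -> bool:
    """Stand-in for the prediction step: a trivial threshold rule instead of an ML model."""
    samples = timeseries[node_id]
    if not samples:
        return False
    _, latest = samples[-1]
    return latest.get("error_rate", 0.0) > error_threshold

collect("node-17", {"cpu": 0.42, "disk_errors": 3, "error_rate": 0.08})
print(likely_to_fail("node-17"))   # True -> schedule proactive maintenance / migration
```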

Then, in order to hide failures, every part of the stack, not just the software, should provide high availability and redundancy.

Providing Fault Tolerance

We need some form of replication for better reliability and error correction. Sometimes sharding is done (remember Distributed file systems).

Data Striping

The idea is to distribute data among several disks so that it can be accessed in parallel; this is usually implemented at the RAID level (see Devices OS#RAID, level 0). It is an easy way to decrease latency and increase throughput.

Fine and Coarse Grained Striping

Fine-grain and coarse-grain data striping are techniques used in parallel computing and storage systems to distribute data across multiple devices to improve performance and fault tolerance.

Fine-grained:

  • Definition: Data is divided into very small chunks and distributed across multiple disks or processing units.
  • Advantages:
    • Improves load balancing, as all disks contribute to every I/O request.
    • Provides high parallelism, making it well-suited for applications with many small, frequent data accesses.
  • Disadvantages:
    • High overhead due to increased coordination among disks.
    • May not be optimal for large sequential reads and writes.
  • Example: RAID 3, where data is split at the byte or bit level across disks, requiring all disks to participate in every operation.

Coarse-grained:

  • Definition: Data is divided into larger blocks before being distributed across multiple disks or processing units.
  • Advantages:
    • Reduces overhead by allowing each disk to handle independent requests.
    • Better suited for workloads with large sequential data accesses.
  • Disadvantages:
    • Less effective load balancing for small requests since fewer disks participate per operation.
    • May result in uneven disk utilization.
  • Example: RAID 5, where data is striped at the block level, allowing parallel access without requiring all disks to be involved in every operation.
| Feature | Fine-Grain Striping | Coarse-Grain Striping |
| --- | --- | --- |
| Chunk Size | Very small (bytes or bits) | Large (blocks or files) |
| Parallelism | High | Moderate |
| Suitability | Frequent small requests | Large sequential reads/writes |
| Overhead | High | Lower |
| Load Balancing | Good | Can be uneven |
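The following toy sketch (RAID-0 style, no parity) shows how the same data maps to disks with a fine-grained versus a coarse-grained stripe unit; the disk count and chunk sizes are arbitrary.

```python
def stripe(data: bytes, n_disks: int, chunk_size: int) -> list[list[bytes]]:
    """Round-robin the data across disks in chunks of chunk_size bytes."""
    disks = [[] for _ in range(n_disks)]
    for i in range(0, len(data), chunk_size):
        disks[(i // chunk_size) % n_disks].append(data[i:i + chunk_size])
    return disks

data = bytes(range(64))
fine = stripe(data, n_disks=4, chunk_size=1)      # byte-level: every disk touches every request
coarse = stripe(data, n_disks=4, chunk_size=16)   # block-level: a request may hit a single disk
print(len(fine[0]), len(coarse[0]))               # 16 tiny chunks vs 1 big chunk per disk
```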

RAID levels

See Devices OS#RAID: you should be able to explain what every RAID level does.

Sharding

We can choose to have big or small shards.

  • Big shards: easier to manage, but if one fails we lose a lot of data, and having lots of data on the same disk can create a bottleneck.
  • Small shards: harder to manage, but more fault tolerant and with better load balancing (a back-of-the-envelope sketch follows).
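A back-of-the-envelope illustration of the trade-off; the total data size and shard counts are invented.

```python
# With the same total data, fewer/bigger shards mean more data is affected when one
# shard is lost or becomes a hotspot; more/smaller shards spread risk and load,
# but multiply the amount of metadata and placement work.

total_gb = 10_000
for n_shards in (10, 1_000, 100_000):
    print(f"{n_shards:6d} shards -> {total_gb / n_shards:10.2f} GB per shard")
```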

Placing the blocks

Random Placing

Random block placement is a good idea if we assume block failures are uncorrelated with each other. In reality, however, failures appear to be correlated (see the 2013 paper). Frameworks like Distributed file systems HDFS place one replica on the node that receives the data, one on another node in the same rack, and one on a node elsewhere in the cluster.
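A simplified, hypothetical sketch of such a rack-aware placement rule (this is not the actual HDFS code; node and rack names are invented).

```python
import random

def place_replicas(writer: str, racks: dict[str, list[str]]) -> list[str]:
    """First replica on the writer's node, second on another node in the same rack,
    third on a node elsewhere in the cluster (here: a different rack)."""
    writer_rack = next(r for r, nodes in racks.items() if writer in nodes)
    same_rack = random.choice([n for n in racks[writer_rack] if n != writer])
    other = random.choice([n for r, nodes in racks.items() if r != writer_rack for n in nodes])
    return [writer, same_rack, other]

racks = {"rack-1": ["n1", "n2", "n3"], "rack-2": ["n4", "n5"], "rack-3": ["n6", "n7"]}
print(place_replicas("n2", racks))   # e.g. ['n2', 'n3', 'n5']
```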

Copyset Replication

A copyset is a set of nodes that together hold all the replicas of a chunk of data. With random replication, in a large cluster almost every combination of $R$ nodes ends up being a copyset for some chunk, so losing $R$ servers at the same time (e.g., three, with replication factor 3) almost certainly makes us lose some data.

MinCopysets

Place the remaining replicas deterministically within a fixed copyset after placing the first one randomly; this is also why HDFS uses a fixed rule for the second copy.

  • Every time we do lose data, we lose more of it at once compared to the previous technique (a whole copyset's worth of chunks); the sketch after this list illustrates the trade-off.
  • Another drawback is that recovery is slower, since only the few nodes in the same copyset hold the replicas needed to rebuild the lost data.
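A rough Monte Carlo sketch of this trade-off, assuming we always lose exactly $R = 3$ nodes at once; the cluster size, chunk count, and grouping rule are illustrative, not the exact schemes from the paper.

```python
import random

N, R, CHUNKS, TRIALS = 60, 3, 100_000, 5_000
nodes = list(range(N))

def loss_probability(copysets: set) -> float:
    """Fraction of trials in which the R simultaneously failed nodes form a copyset."""
    lost = sum(frozenset(random.sample(nodes, R)) in copysets for _ in range(TRIALS))
    return lost / TRIALS

# Random replication: every chunk independently picks R nodes -> many distinct copysets.
random_copysets = {frozenset(random.sample(nodes, R)) for _ in range(CHUNKS)}

# MinCopysets-style: nodes are pre-partitioned into fixed groups of R nodes,
# and each chunk is stored entirely inside one group -> only N / R copysets.
groups = [frozenset(nodes[i:i + R]) for i in range(0, N, R)]
grouped_copysets = {random.choice(groups) for _ in range(CHUNKS)}

print("random replication :", loss_probability(random_copysets))    # close to 1
print("fixed copysets     :", loss_probability(grouped_copysets))   # close to 0
```

The flip side, as noted above, is that when a fixed copyset does fail, it takes all of its chunks with it.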