We want to know how to handle systems that have a large number of data. In previous lesson we have discovered how to quickly access and make Scalable systems with huge dimensions, see Cloud Storage. Object storage could store billions of files, we want to handle millions of petabyte files.

Designing DFSs

The Use Case

Remember that the size of the files where heavily limited for Cloud Storage. The physical limitation was due to the limited size of a single hard disk, which was usually in the order of the Terabytes. Here, we would like to easily store petabytes of data in a single file, for example big datasets. Another feature that should be easily supported is highly concurrent access to the filesystem, last but not least being able to set up permissions in the system.

Desiderata of distributed file systems

In this case we have a Filesystem. In 2004 google created his own FS. With hundreds or thousands of machines the systems are practically guaranteed to fail. This leads to a requirement of

fault tolerance
We need to detect these errors so we need monitoring and error detection
we would like the system to be self-regulatory and have the ability of automatic recovery

Efficiency in operations

HDFS is designed primarily for write-once, read-many access patterns typical of big data processing workloads. We will later see that MapReduce works quite well with this methods.

We want to be efficient in

Scanning the file
Appending information to file atomically So that we can have a temporary place for intermediate data or sensors or logs. This needs to be very efficient for appends. Immutability is ok in this case.

Latency should not be the bottleneck (with plenty of small files we are losing a lot of files to lookup) but throughput should be the bottleneck. This is why the Distributed File System that we will be studying is optimized for high-throughput data processing rather than low-latency access to small files.

Hadoop

Primarily:
• Distributed File System (HDFS)
• MapReduce
• Wide column store (HBase)

Starting form 188 in 2006 when the filesystem has been developed, now it should handle more about 100k files, about 600PB of data!

We will start first by looking at the logical model and then to the physical model, following Codd’s data independence idea (Codd 1970).

The logical model

In this case we have a file hierarchy. And we have block storage (also called chunk, split, shard, partition) instead of unstructured object storage we studied in Cloud Storage#Object Stores, which are seen as a single big object. These documents have a far larger block size, usually of 64-128MB. We have 64 for Google and 128 for Hadoop. In Hadoop, blocks are files, so we don’t have fragmentation on the filesystem. This is because on a network cable having many files takes more time than a single bigger file, because of the latency. of file disk seeks. There are also other reasons it might be advantageous having this block size:

Limit metadata on NameNode about every block
Client need to interact less with the NameNode.
Limit the number of TCP connections that the Client would need to open for each data block on the datanodes. Just few persistent TCPs are ok. See Livello di trasporto for TCP.

The architecture

With HDFS we have a centralized architecture where we have a main node, called Coordinator, Primary, Leader, NameNode and then secondary nodes called Worker, Secondary, Follower, DataNodes.

When we have a file, we split it into chunks and store it multiple times (3) into different machines. RAID is redundant in this case, so we won’t use it, see Devices OS#RAID for more information.

The physical model

Structure of the NameNode

Mapping of file -> Block.
Block -> Locations on nodes.
State of the blocks (corrupted updated). and permissions.
ACL information and filesystem information.

Structure of the DataNode

One important fact that the DataNode has to handle are the heartbeats. These are small communications that the DataNode sends to the NameNode that serve mainly two roles:

Liveness: to make known that everything is all right, that the node is still alive and running (peculiar observation is that also human communication frequency has something related to this), If no hearthbeat is receaved in 10 minutes (Shvachko et al. 2010), then the node is considered dead.
Occasionally, contain information about storage of blocks, like the ACKs for successful storage. It can also contain notification of block corruptions, so that the namenode knows it can add it to the list of nodes that contain that block. This is usually done once every 3 seconds but can be configured. Along side this pattern, the DataNode also sends every six hours (configurable) a full report of every block that it contains.

Another responsibility of the DataNode is the replication pipeline. When it receives a file, it should replicate it to the other nodes.

The file systems operations

Read

The client asks the NameNode for the blocks and DataNodes.
The NameNode answers with the list of each block of the file and the DataNodes that store it ordered from distance to the client.
Then the client asks the DataNodes for the files, which is usually a stream of bytes. It downloads every block in turn.

It is informative to think about the interaction between HBase and HDFS. […] Who is the client here? The RegionServer, which does co-habit with a DataNode. […] This makes accessing the KeyValues in future reads by the RegionServer extremely efficient, because the RegionServer can read the data locally without communicating with the NameNode: this is known as short-circuiting in HDFS.

Write

This operations is a little bit more complex and needs many machines to be organized with each other:

The client first locks the file that he’s writing.
The client asks the NameNode for each block a series of DataNodes with which he can initialize connection.
Then client instructs the given DataNodes to create the replication pipeline
Then client sends the block and waits for acks by the DataNodes (only this part is syncronous for the client I think).
This is done for every block of the file, when it’s finished the client tells the NameNode to release the lock.
After a bit of time the NameNode should receive information about correct storage by the DataNodes, and checks if every replica is ok, and gives ACK to the client. and it is finished…
Then if one gets corrupted the NameNode automatically regulates the replicas

Replication strategy

Blocks are replicated with some general guidelines in mind. There is first a notion of distance from nodes that is defined by a physical connection distance:

2: if the two nodes are in the same rack
4: if they are in different racks

There is a general guideline that each node should have at maximum a single replica, and each rack should have at maximum two replicas. Usually the node that processes the request stores a version, then it stores replicas in two nodes in a rack different from the one that is processing the current version. The number of replicas to store is handled by a parameter called replication factor.

Usually the replication is node in this manner:

In the same node that receives the request
On two different nodes on another rack
The fourth, if present, is on a random node.

The reason why two replicas are usually not put on the same node, is that it would become too concentrated.

High Availability

One clear pain point of the current design is the single point of failure of the system: if a NameNode corrupts the data or fails irrecoverably, then the whole system would become unusable. One clearly wants to prevent this from happens. This leads to the idea of logs and namespace files.

Snapshots

If the NameNode crashes, then everything could be lost, the whole cluster would be useless because we don’t have a way to recover the mappings between the block ids and the files. For this reason a snapshot (namespace file), which contains all the relevant information for a NameNode to work, is usually created and stored on some persistent storage or backed using Cloud Storage options.

Logs

We also need to balance the creation of snapshots and the number of updates. It would be quite infeasible to create a whole new snapshot after every update, yet at the same time we need to track every change to avoid to lose any.. This brings up the use of logs for every change, that act as a diff file for every single change that has not been saved in the snapshot (when we need to restore, we reapply every single registered change in order to return back to the same state, it usually takes 30 minutes to restore from a crash).

Checkpointing

When a merge happens this is called a checkpoint, which happens in a periodical fashion. (or could also be manually triggered). The merge is usually done by another node, called phantom node who merges the logs. If the phantom node is configured to take over the main node in case of failure, this is the StandbyNode, explored in a later section.

The snapshot and the logs is usually stored externally in a NAS or something similar (e.g. AWS Glacier, latency is OK for backups)

Special NameNodes

StandBy NameNodes

These NameNodes just keep the namespace file updated and up to date without interfering with the main NameNode. Their existence is for high availability of the service, making the system a little more reliable. They do not process write or reads. They just keep reading the edit log and updating the namespace file and the current state of the mappings.

There are also other proposals for merging the logs to the snapshot, so that it would not take half an hour to restart the NameNode in the case of a crash, these are called standby NameNodes that keep track of the current mappings between block ids and files.

Usually these types of NameNodes also receive heartbeats by the NameNodes.

Observer NameNode

This is very similar to the standby NameNode, but just handles the Read Request. Clients can connect to the Observer NameNode to know what blocks to read but not write. This is the main different with the StandBy NameNodes: they accept requests, but only reads.

Federated HDFS

In these case there are several NameNodes that run at the same time. Each of them has the responsibility for different directories in the filesystem hierarchy. The workload is spread into different nodes. I don’t know what happens when a NameNode fails and you cannot access some directories anymore.

Comparison with GFS

GFS is another distributed filesystem. The main difference is that the terms are different:

NameNode - Master
DataNode - Chunkserver
Block - Chunk
FS Image - Checkpoint image
Edit log - Operation log

Another difference is that the block size is smaller for GFS, in this case 64MB.

References

[1] Codd “A Relational Model of Data for Large Shared Data Banks” Vol. 13(6), pp. 377--387 1970

[2] Shvachko et al. “The Hadoop Distributed File System” 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST) 2010

Designing DFSs#

The Use Case#

Desiderata of distributed file systems#

Efficiency in operations#

Hadoop#

The logical model#

The architecture#

The physical model#

Structure of the NameNode#

Structure of the DataNode#

The file systems operations#

Read#

Write#

Replication strategy#

High Availability#

Snapshots#

Logs#

Checkpointing#

Special NameNodes#

StandBy NameNodes#

Observer NameNode#

Federated HDFS#

Comparison with GFS#

References#