We want to know how to handle systems that have a large number of data. In previous lesson we have discovered how to quickly access and make Scalable systems with huge dimensions, see Cloud Storage. Object storage could store billions of files, we want to handle millions of petabyte files.

Desiderata of distributed file systems 🟩

In this case we have a Filesystem. In 2004 google created his own FS. With hundreds or thousands of machines the systems are practically guaranteed to fail. This leads to a requirement of fault tolerance. We need to detect these errors so we need monitoring and error detection and we would like the system to be self-regulatory and have the ability of automatic recovery

Efficiency in operations 🟨–

HDFS is designed primarily for write-once, read-many access patterns typical of big data processing workloads. We will later see that MapReduce works quite well with this methods.

We want to be efficient in

  • Scanning the file
  • Appending information to file atomically So that we can have a temporary place for intermediate data or sensors or logs. This needs to be very efficient for appends. Immutability is ok in this case.

Latency should not be the bottleneck (with plenty of small files we are losing a lot of files to lookup) but throughput should be the bottleneck. This is why the Distributed File System that we will be studying is optimized for high-throughput data processing rather than low-latency access to small files.

Hadoop

Primarily:
β€’ Distributed File System (HDFS)
β€’ MapReduce
β€’ Wide column store (HBase)

Starting form 188 in 2006 when the filesystem has been developed, now it should handle more about 100k files, about 600PB of data!

We will start first by looking at the logical model and then to the physical model, following Codd’s data independence idea (Codd 1970).

The logical model

In this case we have a file hierarchy. And we have block storage (also called chunk, split, shard, partition) instead of unstructured object storage we studied in Cloud Storage#Object Stores, which are seen as a single big object. These documents have a far larger block size, usually of 64-128MB. We have 64 for Google and 128 for Hadoop. In Hadoop, blocks are files, so we don’t have fragmentation on the filesystem. This is because on a network cable having many files takes more time than a single bigger file, because of the latency. of file disk seeks. There are also other reasons it might be advantageous having this block size:

  • Limit metadata on NameNode about every block
  • Client need to interact less with the NameNode.
  • Limit the number of TCP connections that the Client would need to open for each data block on the datanodes. Just few persistent TCPs are ok. See Livello di trasporto for TCP.

The architecture 🟩

With HDFS we have a centralized architecture where we have a main node, called Coordinator, Primary, Leader, NameNode and then secondary nodes called Worker, Secondary, Follower, DataNodes.

When we have a file, we split it into chunks and store it multiple times (3) into different machines. RAID is redundant in this case, so we won’t use it, see Devices OS#RAID for more information.

The physical model

Structure of the NameNode πŸŸ₯++

  1. Memory
  2. Mapping of file -> Block.
  3. Tracking block locations
  4. State of the blocks (corrupted updated). and permissions.

More in particular:

  • the file namespace, that is, the hierarchy of directory names and file names, as well as any access control (ACL) information similar to Unix-based systems.
  • a mapping from each file to the list of its blocks. Each block, in this list, is represented with a 64-bit identifier; the content of the blocks is not on the NameNode. Blocks are local files.
  • a mapping from each block, represented with its 64-bit identifier, to the locations of its replicas, that is, the list of the DataNodes that store a copy of this block.

The NameNode never initiates connection with the DataNode. It can just answer to DataNodes’ heartbeats.

Structure of the DataNode 🟩–

One important fact that the DataNode has to handle are the heartbeats. These are small communications that the DataNode sends to the NameNode to make known that everything is all right (peculiar observation is that also human communication frequency has something related to this), this heartbeat could sometimes also contain information about storage of blocks, like the ACKs for successful storage. It can also contain notification of block corruptions. This is usually done once every 3 seconds but can be configured. Along side this pattern, the DataNode also sends every six hours (configurable) a full report of every block that it contains.

Another responsibility of the DataNode is the replication pipeline. When it receives a file, it should replicate it to the other nodes.

The file systems operations

Read 🟩
  • The client asks the NameNode for the blocks and DataNodes.
  • The NameNode answers with the list of each block of the file and the DataNodes that store it ordered from distance to the client.
  • Then the client asks the DataNodes for the files, which is usually a stream of bytes. It downloads every block in turn.
Write 🟩

This operations is a little bit more complex and needs many machines to be organized with each other:

  • The client asks the NameNode for each block a series of DataNodes with which he can initialize connection.
  • Then client instructs the given DataNodes to create the replication pipeline
  • Then client sends the block and waits for acks by the DataNodes
  • This is done for every block of the file, when it’s finished the client tells the NameNode to release the lock.
  • After a bit of time the NameNode should receive information about correct storage by the DataNodes, and checks if every replica is ok, and gives ACK to the client.
  • Then if one gets corrupted the NameNode automatically regulates the replicas

Replication strategy 🟩–

Blocks are replicated with some general guidelines in mind. There is first a notion of distance from nodes that is defined by a physical connection distance:

  • 2: if the two nodes are in the same rack
  • 4: if they are in different racks

There is a general guideline that each node should have at maximum a single replica, and each rack should have at maximum two replicas. Usually the node that processes the request stores a version, then it stores replicas in two nodes in a rack different from the one that is processing the current version. The number of replicas to store is handled by a parameter called replication factor.

Snapshot and logs πŸŸ₯++

If the NameNode crashes, then everything could be lost, the whole cluster would be useless because we don’t have a way to recover the mappings between the block ids and the files. For this reason a snapshot (namespace file) is usually created. We also need to balance the creation of snapshots and the number of updates. It would be quite infeasible to create a whole new snapshot after every update, yet at the same time we need to track every change to avoid to lose any. This brings up the use of logs for every change (when we need to restore, we reapply every single registered change in order to return back to the same state, it usually takes 30 minutes to restore from a crash). The snapshot and the logs is usually stored externally in a NAS or something similar (e.g. AWS Glacier, latency is OK for backups)

TODO: add when snapshots are created, when compacted.

Special NameNodes

StandBy NameNodes 🟩–

These NameNodes just keep the namespace file updated. They do not process write or reads. They just keep reading the edit log and updating the namespace file and the current state of the mappings. This is usually a better version for reliability compared to the previous architecture. There are also other proposals for merging the logs to the snapshot, so that it would not take half an hour to restart the NameNode in the case of a crash, these are called standby NameNodes that keep track of the current mappings between block ids and files.

Observer NameNode 🟨++

This is very similar to the standby NameNode, but just handles the Read Request. Clients can connect to the Observer NameNode to know what blocks to read. This is the main different with the StandBy NameNodes.

Federated HDFS 🟩

In these case there are several NameNodes that run at the same time. Each of them has the responsibility for different directories in the filesystem hierarchy. The workload is spread into different nodes. I don’t know what happens when a NameNode fails and you cannot access some directories anymore.

Comparison with GFS 🟨

GFS is another distributed filesystem. The main difference is that the terms are different:

  1. NameNode - Master
  2. DataNode - Chunkserver
  3. Block - Chunk
  4. FS Image - Checkpoint image
  5. Edit log - Operation log

Another difference is that the block size is smaller for GFS, in this case 64MB.

References

[1] Codd β€œA Relational Model of Data for Large Shared Data Banks” Communications of the ACM Vol. 13(6), pp. 377–387 1970