We want to know how to handle systems that store a very large amount of data. In the previous lesson we saw how to quickly access and build scalable systems of huge dimensions, see Cloud Storage. Object storage can hold billions of objects; here we want to handle millions of files, each potentially reaching petabyte size.
Designing DFSs
The Use Case
Remember that file sizes were heavily limited in Cloud Storage. The physical limitation came from the size of a single hard disk, which is usually on the order of terabytes. Here, we would like to easily store petabytes of data in a single file, for example big datasets. Another feature that should be easily supported is highly concurrent access to the filesystem; last but not least, we want to be able to set up permissions in the system.
Desiderata of distributed file systems 🟩
In this case we have a filesystem. In 2004 Google created its own distributed filesystem. With hundreds or thousands of machines, failures are practically guaranteed. This leads to a requirement of
- fault tolerance
- We need to detect these errors, so we need monitoring and error detection
- we would like the system to be self-regulating and capable of automatic recovery
Efficiency in operations 🟨–
HDFS is designed primarily for write-once, read-many access patterns typical of big data processing workloads. We will later see that MapReduce works quite well with this pattern.
We want to be efficient in
- Scanning the file
- Appending information to a file atomically, so that we have a place for intermediate data, sensor readings or logs. Appends need to be very efficient; immutability of already-written data is fine in this case.
Latency should not be the bottleneck; throughput should be (with plenty of small files we would instead lose a lot of time to lookups). This is why the Distributed File System that we will be studying is optimized for high-throughput data processing rather than low-latency access to small files.
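To make the append-heavy, write-once pattern concrete, here is a minimal sketch using Hadoop's Java FileSystem API; the file path is hypothetical and the cluster configuration is assumed to come from the usual core-site.xml/hdfs-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.nio.charset.StandardCharsets;

public class AppendLog {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up the cluster's site configuration
        FileSystem fs = FileSystem.get(conf);          // handle to the configured (distributed) filesystem
        Path log = new Path("/logs/sensors.log");      // hypothetical log file

        // Write-once / append-only: we never rewrite data in place,
        // we only add new records at the end of the file.
        try (FSDataOutputStream out = fs.append(log)) {
            out.write("sensor-42,23.7\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}
```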
Hadoop
Primarily:
- Distributed File System (HDFS)
- MapReduce
- Wide column store (HBase)
Starting from 188 in 2006, when the filesystem was first developed, it should now handle about 100k files and around 600 PB of data!
We will start first by looking at the logical model and then to the physical model, following Codd’s data independence idea (Codd 1970).
The logical model
In this case we have a file hierarchy, and block storage (blocks are also called chunks, splits, shards, partitions) instead of the unstructured object storage we studied in Cloud Storage#Object Stores, where everything is seen as a single big object. Blocks here are far larger than in a local filesystem, usually 64-128MB: 64 for Google and 128 for Hadoop. In Hadoop, blocks are stored as local files, so we don't have fragmentation on the filesystem. The large size pays off because, over a network, transferring many small files takes more time than transferring a single bigger file, due to per-file latency and disk seeks. There are also other reasons why this block size is advantageous:
- It limits the metadata the NameNode has to keep for each block.
- Clients need to interact less with the NameNode.
- It limits the number of TCP connections that the client would need to open towards the DataNodes for the data blocks: a few persistent TCP connections are enough. See Livello di trasporto for TCP.
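A quick back-of-the-envelope computation (plain Java, nothing HDFS-specific) shows why the large block size matters for the NameNode's metadata: the number of block entries it must track for a single huge file drops by several orders of magnitude compared to a classic 4 KiB filesystem block.

```java
// How many block entries must the NameNode keep for a single 1 PiB file?
public class BlockCount {
    public static void main(String[] args) {
        long fileSize = 1L << 50;                                // 1 PiB
        long[] blockSizes = {4L << 10, 64L << 20, 128L << 20};   // 4 KiB, 64 MiB, 128 MiB
        for (long b : blockSizes) {
            long blocks = (fileSize + b - 1) / b;                // ceiling division
            System.out.printf("block size %,12d bytes -> %,15d blocks%n", b, blocks);
        }
    }
}
```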
The architecture 🟩
With HDFS we have a centralized architecture: a main node, generically called Coordinator, Primary or Leader (here the NameNode), and secondary nodes generically called Workers, Secondaries or Followers (here the DataNodes).
When we have a file, we split it into chunks and store each chunk multiple times (typically 3) on different machines. RAID would be redundant in this case, so we don't use it, see Devices OS#RAID for more information.
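As a sketch of how this is exposed to clients, the replication factor can be set cluster-wide through the standard dfs.replication property (default 3) or per file via the Java API; the path below is hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");                 // default replication factor for new files
        FileSystem fs = FileSystem.get(conf);

        // The factor can also be changed per file; the NameNode then
        // schedules the creation (or removal) of replicas asynchronously.
        fs.setReplication(new Path("/datasets/big.csv"), (short) 2);
    }
}
```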
The physical model
Structure of the NameNode 🟨
- Mapping of file -> Block.
- Block -> Locations on nodes.
- State of the blocks (corrupted, up to date) and permissions.
- ACL information and filesystem information.
More in particular:
- the file namespace, that is, the hierarchy of directory names and file names, as well as any access control (ACL) information similar to Unix-based systems.
- a mapping from each file to the list of its blocks. Each block in this list is represented by a 64-bit identifier; the content of the blocks is not on the NameNode (on the DataNodes, blocks are stored as local files).
- a mapping from each block to the locations of its replicas, that is, the list of the DataNodes that store a copy of this block.
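The following is a deliberately simplified in-memory model of these two mappings, not the actual HDFS classes; it is only meant to make the structure of the NameNode's metadata explicit.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NameNodeModel {
    // file namespace: path -> ordered list of 64-bit block identifiers
    private final Map<String, List<Long>> fileToBlocks = new HashMap<>();
    // block id -> DataNodes that currently hold a replica of this block
    private final Map<Long, List<String>> blockToDataNodes = new HashMap<>();

    void addBlock(String path, long blockId, List<String> dataNodes) {
        fileToBlocks.computeIfAbsent(path, p -> new ArrayList<>()).add(blockId);
        blockToDataNodes.put(blockId, new ArrayList<>(dataNodes));
    }

    // Where can the i-th block of a file be read from?
    List<String> locate(String path, int blockIndex) {
        return blockToDataNodes.get(fileToBlocks.get(path).get(blockIndex));
    }
}
```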
The NameNode never initiates a connection with the DataNodes; it only answers the DataNodes' heartbeats.
Structure of the DataNode 🟩–
One important thing that the DataNode has to handle is the heartbeat. Heartbeats are small messages that the DataNode sends to the NameNode and that serve mainly two roles:
- Liveness: to make known that everything is all right, that the node is still alive and running (a peculiar observation is that human communication frequency has something related to this). If no heartbeat is received for 10 minutes (Shvachko et al. 2010), the node is considered dead.
- Occasionally, they contain information about the storage of blocks, such as the ACKs for successful storage, so that the NameNode knows it can add the node to the list of locations of that block. They can also contain notifications of block corruption. Heartbeats are usually sent every 3 seconds, but this is configurable. Alongside this pattern, the DataNode also sends, every six hours (also configurable), a full report of every block that it contains.
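Both frequencies are configuration knobs. To the best of my knowledge these are the relevant hdfs-site.xml properties, shown here set programmatically with their usual defaults, but treat the exact names as an assumption to verify against your Hadoop version.

```java
import org.apache.hadoop.conf.Configuration;

public class HeartbeatConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setLong("dfs.heartbeat.interval", 3);                 // heartbeat every 3 seconds
        conf.setLong("dfs.blockreport.intervalMsec", 21_600_000L); // full block report every 6 hours
    }
}
```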
Another responsibility of the DataNode is the replication pipeline: when it receives a block, it forwards it to the next DataNode in the pipeline so that the other replicas are created.
The file systems operations
Read 🟩
- The client asks the NameNode for the blocks and DataNodes.
- The NameNode answers with, for each block of the file, the list of DataNodes that store it, ordered by distance to the client (closest first).
- Then the client asks the DataNodes for the data, which it receives as a stream of bytes, downloading every block in turn.
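A minimal client-side read with the Java API, on a hypothetical path: the block locations come from the NameNode, the bytes from the DataNodes, and the FSDataInputStream hides the per-block requests.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadFile {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path("/datasets/big.csv");                  // hypothetical file

        // Metadata from the NameNode: one entry per block, with the
        // DataNodes that hold a replica of it.
        FileStatus status = fs.getFileStatus(p);
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println(loc.getOffset() + " -> " + String.join(",", loc.getHosts()));
        }

        // The data itself is streamed from the DataNodes, block by block.
        try (FSDataInputStream in = fs.open(p)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}
```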
It is informative to think about the interaction between HBase and HDFS. […] Who is the client here? The RegionServer, which does co-habit with a DataNode. […] This makes accessing the KeyValues in future reads by the RegionServer extremely efficient, because the RegionServer can read the data locally without communicating with the NameNode: this is known as short-circuiting in HDFS.
Write 🟩
This operation is a little more complex and needs many machines to coordinate with each other:
- The client first locks the file that it is writing.
- The client asks the NameNode, for each block, for a list of DataNodes with which it can open a connection.
- Then the client instructs the given DataNodes to create the replication pipeline.
- Then the client sends the block and waits for ACKs from the DataNodes (only this part is synchronous for the client, I think).
- This is done for every block of the file; when it is finished, the client tells the NameNode to release the lock.
- After some time the NameNode should receive confirmation of correct storage from the DataNodes; it checks that every replica is fine, gives an ACK to the client, and then it is finished.
- Then, if a replica gets corrupted, the NameNode automatically re-adjusts the replicas.
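The corresponding client-side write, again with a hypothetical path: from the application's point of view it is just an output stream, while the client library takes care of the NameNode round-trips, the replication pipeline and the ACKs described above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.nio.charset.StandardCharsets;

public class WriteFile {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // create() obtains the lease (the "lock" above) from the NameNode;
        // close() flushes the last block and releases it.
        try (FSDataOutputStream out = fs.create(new Path("/datasets/new.csv"))) {
            out.write("id,value\n1,42\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}
```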
Replication strategy 🟩–
Blocks are replicated with some general guidelines in mind. First, there is a notion of distance between nodes, defined by the physical connection:
- 2: if the two nodes are in the same rack
- 4: if they are in different racks
There is a general guideline that each node should have at most a single replica, and each rack should have at most two replicas. Usually the node that processes the write request stores one copy, and two more replicas are stored on two nodes of a rack different from the one processing the request. The number of replicas to store is controlled by a parameter called the replication factor.
Usually the replication is done in this manner:
- In the same node that receives the request
- On two different nodes on another rack
- The fourth, if present, is on a random node.
The reason why two replicas are usually not put on the same node is that the data would become too concentrated: a single node failure would take out two replicas at once.
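The sketch below mimics the placement guideline just described; it is a toy model, not the real HDFS placement policy, and it assumes the cluster topology is given as a map from rack name to its nodes (with at least two racks).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;

public class PlacementSketch {
    static List<String> choose(String localNode, String localRack,
                               Map<String, List<String>> nodesByRack,
                               int replicationFactor, Random rnd) {
        Set<String> chosen = new LinkedHashSet<>();
        chosen.add(localNode);                                    // 1st replica: the writer's own node

        // Pick one rack different from the writer's rack...
        List<String> otherRacks = new ArrayList<>(nodesByRack.keySet());
        otherRacks.remove(localRack);
        String remoteRack = otherRacks.get(rnd.nextInt(otherRacks.size()));

        // ...and put the 2nd and 3rd replicas on two distinct nodes of that rack.
        List<String> remoteNodes = new ArrayList<>(nodesByRack.get(remoteRack));
        Collections.shuffle(remoteNodes, rnd);
        for (String n : remoteNodes) {
            if (chosen.size() >= Math.min(3, replicationFactor)) break;
            chosen.add(n);
        }

        // Any further replica (e.g. the 4th) goes on a random node not chosen yet.
        List<String> all = new ArrayList<>();
        nodesByRack.values().forEach(all::addAll);
        Collections.shuffle(all, rnd);
        for (String n : all) {
            if (chosen.size() >= replicationFactor) break;
            chosen.add(n);
        }
        return new ArrayList<>(chosen);
    }
}
```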
High Availability
One clear pain point of the current design is the single point of failure of the system: if the NameNode corrupts its data or fails irrecoverably, the whole system becomes unusable. One clearly wants to prevent this from happening. This leads to the idea of logs and namespace files.
Snapshots 🟩
If the NameNode crashes, then everything could be lost: the whole cluster would be useless because we would have no way to recover the mappings between the block ids and the files. For this reason a snapshot (the namespace file), which contains all the relevant information for a NameNode to work, is periodically created and stored on some persistent storage, or backed up using Cloud Storage options.
Logs 🟩
We also need to balance the creation of snapshots against the number of updates. It would be quite infeasible to create a whole new snapshot after every update, yet at the same time we need to track every change so as not to lose any. This brings up the use of logs: a log acts as a diff file recording every single change that has not yet been saved in the snapshot (when we need to restore, we reapply every registered change in order to return to the same state; it usually takes about 30 minutes to restore from a crash).
Checkpointing 🟩
The merge of the logs into the snapshot is called a checkpoint, and it happens periodically (it can also be manually triggered). The merge is usually done by another node, called a phantom node, which merges the logs. If the phantom node is also configured to take over from the main node in case of failure, it is the Standby NameNode, explored in a later section.
The snapshot and the logs are usually stored externally on a NAS or something similar (e.g. AWS Glacier; the latency is OK for backups).
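To see why checkpointing pays off, here is a simplified model (not HDFS code) of crash recovery: load the last snapshot of the namespace, then replay every edit-log entry recorded since then, in order. A checkpoint simply persists the merged result so that the replay stays short.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CheckpointSketch {
    // Snapshot: the full namespace state (file -> block ids) at some point in time.
    record Snapshot(Map<String, List<Long>> fileToBlocks) {}
    // Edit-log entry: one mutation that happened after the snapshot was taken.
    record Edit(String path, long blockId) {}

    static Map<String, List<Long>> recover(Snapshot snapshot, List<Edit> editLog) {
        Map<String, List<Long>> state = new HashMap<>();
        snapshot.fileToBlocks().forEach((f, blocks) -> state.put(f, new ArrayList<>(blocks)));
        for (Edit e : editLog) {                                   // replay every change, in order
            state.computeIfAbsent(e.path(), p -> new ArrayList<>()).add(e.blockId());
        }
        return state;                                              // checkpoint = persist this merged state
    }
}
```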
Special NameNodes
StandBy NameNodes 🟩–
These NameNodes keep the namespace file up to date without interfering with the main NameNode: they continuously read the edit log and update the namespace file and the current state of the mappings. They exist for high availability of the service, making the system a little more reliable; they do not process writes or reads.
Because a Standby NameNode already holds the merged, current mapping between block ids and files, it avoids the half-hour log replay that would otherwise be needed to restart the NameNode after a crash.
Usually these NameNodes also receive the heartbeats from the DataNodes.
Observer NameNode 🟨++
This is very similar to the Standby NameNode, but it also handles read requests: clients can connect to the Observer NameNode to learn which blocks to read, but not to write. This is the main difference from the Standby NameNodes: Observers accept requests, but only reads.
Federated HDFS 🟥–
In this case several NameNodes run at the same time, each responsible for different directories of the filesystem hierarchy, so the workload is spread across different nodes. I don't know what happens when a NameNode fails and you cannot access some directories anymore.
Comparison with GFS 🟩
GFS is Google's distributed filesystem, mentioned earlier. The main difference is terminology:
- NameNode - Master
- DataNode - Chunkserver
- Block - Chunk
- FS Image - Checkpoint image
- Edit log - Operation log
Another difference is that the block size is smaller for GFS, in this case 64MB.
References
[1] Shvachko et al., "The Hadoop Distributed File System", 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), 2010.
[2] Codd, "A Relational Model of Data for Large Shared Data Banks", Communications of the ACM, Vol. 13(6), pp. 377–387, 1970.