
The inner workings of HDFS

In Chapter 1, Introduction, we gave a very high-level overview of HDFS; we will now explore it in a little more detail. As mentioned in that chapter, HDFS can be viewed as a filesystem, though one with very specific performance characteristics and semantics. It's implemented with two main types of server process, the NameNode and the DataNodes, configured in a master/slave setup. If you view the NameNode as holding all the filesystem metadata and the DataNodes as holding the actual filesystem data (the blocks), this is a good starting point. Every file placed onto HDFS is split into one or more blocks that might reside on numerous DataNodes, and it's the NameNode that knows how these blocks are combined to reconstruct the files.
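If you want to see this mapping in action, the Hadoop FileSystem Java API can ask the NameNode which hosts hold each block of a given file. The following is a minimal sketch rather than production code; it assumes a hypothetical file at /data/example.txt and that the cluster configuration files are on the classpath:

    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlockLocations {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Ask the NameNode for the file's metadata...
            FileStatus status = fs.getFileStatus(new Path("/data/example.txt"));

            // ...and for the DataNodes that hold a replica of each of its blocks.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset=" + block.getOffset()
                        + " length=" + block.getLength()
                        + " hosts=" + Arrays.toString(block.getHosts()));
            }
            fs.close();
        }
    }

Each line of output corresponds to one block of the file, together with the DataNodes that currently hold a replica of it; this is precisely the mapping the NameNode maintains on behalf of the cluster.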

Cluster startup

Let's explore the responsibilities of these nodes and the communication between them by assuming we have an HDFS cluster that was previously shut down and then examining what happens when it starts up.

NameNode startup

We'll first consider the startup of the NameNode (there is no actual ordering requirement for this; we do so purely for narrative reasons). The NameNode stores two types of data about the filesystem:

  • The structure of the filesystem, that is, directory names, filenames, locations, and attributes
  • The blocks that comprise each file on the filesystem

This data is stored in files that the NameNode reads at startup. Note that the NameNode does not persistently store the mapping of the blocks that are stored on particular DataNodes; we'll see how that information is communicated shortly.
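The first category, the filesystem structure, is what a client sees when it lists a directory; the NameNode answers such requests entirely from the metadata it holds in memory, without contacting any DataNode. As a brief sketch (the directory /data is a hypothetical example):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListMetadata {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // Each FileStatus below is answered from the NameNode's metadata alone.
            for (FileStatus status : fs.listStatus(new Path("/data"))) {
                System.out.println(status.getPath()
                        + " owner=" + status.getOwner()
                        + " perms=" + status.getPermission()
                        + " size=" + status.getLen()
                        + " replication=" + status.getReplication());
            }
            fs.close();
        }
    }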

Because the NameNode relies on this in-memory representation of the filesystem, it tends to have quite different hardware requirements from the DataNodes. We'll explore hardware selection in more detail in Chapter 10, Running a Hadoop Cluster; for now, just remember that the NameNode tends to be quite memory hungry. This is particularly true on very large clusters with many (millions or more) files, especially if these files have very long names. This scaling limitation on the NameNode has also led to an additional Hadoop 2 feature that we will not explore in much detail: NameNode federation, whereby multiple NameNodes (or NameNode HA pairs) work collaboratively to provide the overall metadata for the full filesystem.

The main file written by the NameNode is called fsimage; this is the single most important piece of data in the entire cluster, as without it, the knowledge of how to reconstruct all the data blocks into the usable filesystem is lost. This file is read into memory and all future modifications to the filesystem are applied to this in-memory representation of the filesystem. The NameNode does not write out new versions of fsimage as changes are applied while it is running; instead, it writes another file called edits, a list of the changes made since the last version of fsimage was written.

At startup, the NameNode first reads the fsimage file, then reads the edits file and applies each change stored in it to the in-memory copy of fsimage. It then writes a new, up-to-date version of the fsimage file to disk and is ready to receive client requests.
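The locations of fsimage and edits on the NameNode's local disk are driven by configuration. The property names below are the Hadoop 2 names; reading them is only a sketch and assumes hdfs-site.xml is available on the classpath:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hdfs.HdfsConfiguration;

    public class ShowMetadataDirs {
        public static void main(String[] args) {
            // HdfsConfiguration loads hdfs-default.xml and hdfs-site.xml.
            Configuration conf = new HdfsConfiguration();

            // Directory (or comma-separated directories) holding fsimage.
            System.out.println("dfs.namenode.name.dir = "
                    + conf.get("dfs.namenode.name.dir"));

            // Directory holding the edits files; by default the same as above.
            System.out.println("dfs.namenode.edits.dir = "
                    + conf.get("dfs.namenode.edits.dir"));
        }
    }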

DataNode startup

When the DataNodes start up, they first catalog the blocks for which they hold copies. Typically, these blocks are written simply as files on the local DataNode filesystem. The DataNode will perform some block consistency checking and then report to the NameNode the list of blocks for which it has valid copies. This is how the NameNode constructs the final mapping it requires: by learning which blocks are stored on which DataNodes. Once a DataNode has registered itself with the NameNode, it sends ongoing heartbeat requests to the NameNode, which allow the NameNode to detect DataNodes that have shut down, become unreachable, or newly joined the cluster.
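Once DataNodes have registered and are heartbeating, the NameNode can report on the cluster it sees. As a sketch, if the underlying filesystem is HDFS, this information is available through the DistributedFileSystem class (it is the same data that backs the hdfs dfsadmin -report command):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

    public class ListDataNodes {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            if (fs instanceof DistributedFileSystem) {
                DistributedFileSystem dfs = (DistributedFileSystem) fs;

                // Each entry is a DataNode currently registered with the NameNode.
                for (DatanodeInfo node : dfs.getDataNodeStats()) {
                    System.out.println(node.getHostName()
                            + " capacity=" + node.getCapacity()
                            + " remaining=" + node.getRemaining());
                }
            }
            fs.close();
        }
    }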

Block replication

HDFS replicates each block onto multiple DataNodes; the default replication factor is 3, but this is configurable on a per-file level. HDFS can also be configured to determine whether given DataNodes are located in the same physical hardware rack or not. Given this knowledge of the cluster topology, the default block placement policy writes the first replica on the node where the client is running (if it is a DataNode in the cluster; otherwise, a node is chosen at random), the second replica on a node in a different rack, and the third on a different node in the same rack as the second. In this way, the system can survive the failure of a full rack of equipment and still have at least one live replica for each block. As we'll see in Chapter 3, Processing – MapReduce and Beyond, knowledge of block placement also allows Hadoop to schedule processing as near as possible to a replica of each block, which can greatly improve performance.
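The per-file nature of the replication factor is visible through the same client API. A brief sketch, again using the hypothetical path /data/example.txt:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ChangeReplication {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/data/example.txt");

            System.out.println("Current replication: "
                    + fs.getFileStatus(file).getReplication());

            // Ask the NameNode to keep only two replicas of this file's blocks;
            // the extra replicas are removed asynchronously by the cluster.
            fs.setReplication(file, (short) 2);
            fs.close();
        }
    }

The cluster-wide default of 3 comes from the dfs.replication property, which a client can also override at the time a file is created.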

Remember that replication is a strategy for resilience, not a backup mechanism; if critical data is mastered in HDFS, you need to consider backups or other approaches that protect against errors such as accidentally deleted files, against which replication offers no defense.

When the NameNode starts up and is receiving the block reports from the DataNodes, it will remain in safe mode until a configurable proportion of blocks (the default is 99.9 percent) has been reported as available. While in safe mode, clients cannot make any modifications to the filesystem.
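The reporting threshold is controlled by the dfs.namenode.safemode.threshold-pct property, and a client can ask whether the NameNode is still in safe mode. A minimal sketch, assuming the default filesystem is HDFS:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.HdfsConstants.SafeModeAction;

    public class CheckSafeMode {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            if (fs instanceof DistributedFileSystem) {
                DistributedFileSystem dfs = (DistributedFileSystem) fs;

                // SAFEMODE_GET only queries the current state; it does not change it.
                boolean inSafeMode = dfs.setSafeMode(SafeModeAction.SAFEMODE_GET);
                System.out.println("NameNode in safe mode: " + inSafeMode);
            }
            fs.close();
        }
    }

The same check is available from the command line with hdfs dfsadmin -safemode get.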