Thursday 6 April 2017

Reading and writing data in HDFS

Basic file operations in HDFS


Users can add files by copying (‘put’) them into HDFS. Once a file has been copied into HDFS it cannot be modified or altered in place; HDFS implements a single-writer, multiple-reader model. If you really want to alter a file, you have to copy it to your local filesystem with the ‘get’ command, change it there, and re-upload it to HDFS. That is why HDFS is also called a write-once, read-many filesystem. It is a bit like uploading photos to the cloud or to Facebook: if you really want to change a photo, you have to delete the post and re-upload the edited copy from your local machine.
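As a rough illustration of that put / get / re-upload cycle, here is a minimal Java sketch using the Hadoop FileSystem API (the namenode address and the file paths are made up for the example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutGetExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");   // assumed cluster address
        FileSystem fs = FileSystem.get(conf);

        // "put": copy a local file into HDFS
        fs.copyFromLocalFile(new Path("/tmp/report.csv"),
                             new Path("/user/demo/report.csv"));

        // "get": bring it back to the local filesystem so it can be edited,
        // because the copy inside HDFS cannot be modified in place
        fs.copyToLocalFile(new Path("/user/demo/report.csv"),
                           new Path("/tmp/report-edit.csv"));

        // re-upload the edited file (here under a new name)
        fs.copyFromLocalFile(new Path("/tmp/report-edit.csv"),
                             new Path("/user/demo/report-v2.csv"));
        fs.close();
    }
}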

An HDFS file consists of blocks. When a new block is needed, the NameNode allocates a block with a unique block ID and determines a list of DataNodes to host its replicas. The DataNodes form a pipeline, ordered to minimize the total network distance from the client to the last DataNode. Bytes are pushed to the pipeline as a sequence of packets. The bytes that an application writes are first buffered on the client side. After a packet buffer is filled (typically 64 KB), the data is pushed into the pipeline. The next packet can be pushed before the acknowledgements for previous packets have been received; the number of outstanding packets is limited by the client's outstanding-packet window size.
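From the client side this pipelining is invisible; the sketch below (with a hypothetical path) simply writes through an output stream obtained from create(), and the comments mark where the buffering described above happens. The roughly 64 KB packet size is internal to the client, not something the application sets directly.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PipelineWriteSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // create() asks the NameNode to allocate the first block and pick a
        // pipeline of DataNodes; the returned stream writes into that pipeline.
        try (FSDataOutputStream out = fs.create(new Path("/user/demo/pipeline.txt"))) {
            byte[] chunk = new byte[8 * 1024];
            for (int i = 0; i < 1024; i++) {
                // bytes are buffered on the client; once roughly a packet's worth
                // (~64 KB) has accumulated, it is pushed down the pipeline
                out.write(chunk);
            }
            // hflush() pushes any buffered packets out so readers can see the data
            out.hflush();
        }
        fs.close();
    }
}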

When we use an HDFS client to store a file in an HDFS cluster, it almost feels like writing the file to a local file system. Behind the scenes, however, the file is split into equal-sized blocks, which are then stored on different machines. The default size of these blocks is 64 MB. One obvious advantage of this behavior is the ability to store files bigger than any single disk on a server.
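A small sketch of how a client can see this splitting, using the standard FileStatus API (the path is hypothetical); it reports the block size the file was written with and how many blocks it occupies:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeCheck {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/demo/report.csv"));

        long blockSize = status.getBlockSize();               // e.g. 64 MB by default
        long blocks = (status.getLen() + blockSize - 1) / blockSize;

        System.out.println("File length : " + status.getLen() + " bytes");
        System.out.println("Block size  : " + blockSize + " bytes");
        System.out.println("Block count : " + blocks);
        fs.close();
    }
}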

The nature of HDFS is as follows.

Immutable Files: There is no way to change the content of an existing file; only new content can be appended at the end of a file (illustrated in the sketch after this list).

One writer at a time: Only a single client can write to a given file at any time; it is not possible for multiple clients to write to the same file simultaneously.

Data Coherency: During a write operation, the written file content may not be immediately visible to other clients. Only after the writing client finishes the write operation is it guaranteed that other clients see the full content of the file.

Access Rights: As on any Linux system, there are three access-right levels for each file and directory: read (r), write (w) and execute (x). In contrast to Linux, the execute permission does not matter for files, because files cannot be executed on HDFS.
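The sketch below touches the last three points: appending to an existing file (assuming the cluster allows appends), flushing so other readers can see the new data, and setting Linux-style permissions. The path is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class AppendAndPermissions {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/events.log");

        // Immutable files: existing bytes cannot be changed, but new content
        // can be appended at the end (append support must be enabled on the cluster).
        try (FSDataOutputStream out = fs.append(file)) {
            out.write("new event line\n".getBytes("UTF-8"));
            out.hflush();   // data coherency: make the appended bytes visible to readers
        }

        // Access rights: rwx bits as on Linux; the execute bit is irrelevant for files.
        fs.setPermission(file, new FsPermission((short) 0640));   // rw-r-----
        fs.close();
    }
}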

HDFS Architecture:


NameNode:


The NameNode maintains the namespace tree and the mapping of blocks to DataNodes. The current design has a single NameNode per cluster. A cluster can have thousands of DataNodes and tens of thousands of HDFS clients, since each DataNode may execute multiple application tasks concurrently.
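Both responsibilities are visible through the client API: listing a directory is answered from the namespace tree, and asking for block locations is answered from the block-to-DataNode map. A minimal sketch, with hypothetical paths:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceLookup {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Namespace tree: a directory listing is served by the NameNode
        for (FileStatus st : fs.listStatus(new Path("/user/demo"))) {
            System.out.println(st.getPath() + "  " + st.getLen() + " bytes");
        }

        // Block map: which DataNodes host each block of a file
        FileStatus file = fs.getFileStatus(new Path("/user/demo/report.csv"));
        for (BlockLocation loc : fs.getFileBlockLocations(file, 0, file.getLen())) {
            System.out.println("offset " + loc.getOffset() + " -> "
                    + String.join(",", loc.getHosts()));
        }
        fs.close();
    }
}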


Replication Management:


The NameNode ensures that each block always has the required number of replicas. When data is copied into HDFS it is split into blocks based on the configured block size, each block is stored on a DataNode somewhere in the cluster, and the metadata describing which DataNodes hold which blocks is maintained by the NameNode. When the data is read, the request first goes to the NameNode for the locations of the blocks, and the client then fetches them from those DataNodes. Now imagine a failure of a DataNode: any block stored only on that node cannot be reached, and every read or other operation on that data fails.
To overcome this, Hadoop has a special characteristic called replication management. Each block is replicated, so the same block is also available on other DataNodes. Whenever a DataNode fails or a block cannot be accessed, the request goes to another DataNode where a replica of the block is available. This feature is what makes Hadoop reliable and fault tolerant.
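A small sketch of how the replication factor can be controlled from a client using the standard setReplication() call (the path is hypothetical; the cluster-wide default normally comes from dfs.replication):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 3);   // replication applied to newly written files
        FileSystem fs = FileSystem.get(conf);

        // Change the replication factor of an existing file; the NameNode schedules
        // the extra copies (or removals) on other DataNodes in the background.
        Path file = new Path("/user/demo/report.csv");
        fs.setReplication(file, (short) 3);
        System.out.println("Replication now: "
                + fs.getFileStatus(file).getReplication());
        fs.close();
    }
}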


Managing Metadata:



The namenode stores its filesystem metadata on local filesystem disks in a few different files, the two most important of which are fsimage and edits. Just as a database would, fsimage contains a complete snapshot of the filesystem metadata, whereas edits contains only the incremental modifications made to the metadata. A common practice for high-throughput data stores, the use of a write-ahead log (WAL) such as the edits file reduces disk I/O to sequential, append-only operations (in the context of the namenode, which serves requests directly from RAM), which avoids costly seek operations and yields better overall performance. Upon namenode startup, the fsimage file is loaded into RAM and any changes in the edits file are replayed on top of it, bringing the in-memory view of the filesystem up to date.
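To make the division of labor between fsimage and edits concrete, here is a toy illustration in plain Java (deliberately not NameNode code): a complete snapshot file, an append-only log, replay at startup, and a checkpoint that folds the log back into the snapshot, which is exactly the job handed to the secondary namenode below.

import java.io.IOException;
import java.nio.file.*;
import java.util.*;

public class ToyMetadataStore {
    private final Set<String> namespace = new TreeSet<>();       // the in-memory view
    private final Path fsimage = Paths.get("fsimage");           // full snapshot
    private final Path edits   = Paths.get("edits");             // append-only log

    // Startup: load the last snapshot, then replay the incremental edits on top of it.
    void load() throws IOException {
        if (Files.exists(fsimage)) namespace.addAll(Files.readAllLines(fsimage));
        if (Files.exists(edits)) {
            for (String op : Files.readAllLines(edits)) {
                if (op.startsWith("ADD ")) namespace.add(op.substring(4));
                if (op.startsWith("DEL ")) namespace.remove(op.substring(4));
            }
        }
    }

    // Every change is a cheap, sequential append to the log; reads are served from RAM.
    void create(String path) throws IOException {
        Files.write(edits, Collections.singletonList("ADD " + path),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        namespace.add(path);
    }

    // Checkpoint: write a fresh snapshot and truncate the log so replay stays short.
    void checkpoint() throws IOException {
        Files.write(fsimage, namespace);
        Files.write(edits, Collections.<String>emptyList(),
                StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        ToyMetadataStore store = new ToyMetadataStore();
        store.load();
        store.create("/user/demo/a.txt");
        store.checkpoint();
    }
}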

That being said, you should never make direct changes to the metadata files unless you really know what you are doing. Recall from earlier that the namenode writes changes only to its write-ahead log, edits. Over time the edits file grows, and as with any log-based system, it would take a long time to replay in the event of a server failure. Similar to a relational database, the edits file therefore needs to be periodically applied to the fsimage file. The problem is that the namenode may not have the available resources (CPU or RAM) to do this while continuing to provide service to the cluster. This is where the secondary namenode comes into the picture.


1. The secondary namenode asks the namenode to roll its edits file and begin writing new changes to edits.new.
2. The secondary namenode copies the fsimage and edits files from the namenode to its local checkpoint directory.
3. The secondary namenode loads the copied fsimage, replays edits on top of it, and writes a new, compacted fsimage file to its local disk.
4. The secondary namenode sends the newly created fsimage file to the namenode, which adopts it.
5. The namenode renames edits.new to edits.

This process is slightly different for Apache Hadoop 2.x.


This process occurs every hour or whenever the namenode's edits file reaches 64 MB, whichever comes first. These are the default values, but they can be tuned in hdfs-site.xml.
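If those triggers need tuning, they are ordinary Hadoop configuration properties. The sketch below sets the classic Hadoop 1.x names programmatically purely for illustration; in practice they go into the XML configuration files, and Hadoop 2.x renames them to dfs.namenode.checkpoint.period and dfs.namenode.checkpoint.txns.

import org.apache.hadoop.conf.Configuration;

public class CheckpointTuning {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setLong("fs.checkpoint.period", 3600);       // seconds between checkpoints (1 hour)
        conf.setLong("fs.checkpoint.size", 67108864);     // edits size in bytes that forces one (64 MB)
        System.out.println("period = " + conf.get("fs.checkpoint.period") + " s");
    }
}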


NameNode High Availability


For administrators responsible for the health and service of large-scale systems, the namenode as a single point of failure makes the framework look fragile. Namenode high availability is deployed as an active/passive pair of namenodes. The edits write-ahead log needs to be available to both namenodes, so it is stored on a shared storage device. NFS filers, RAID and a few other techniques have been used to keep the namenode highly available, and nowadays the Hadoop contributors are making namenode HA easier to configure.

This high-availability pair of namenodes can be configured for manual or automatic failover. In manual failover mode, an administrator command takes effect to switch the active role from one namenode to the other. With automatic failover, each namenode runs a failover controller that monitors the health of its process. Just as in other HA systems, there are two primary types of failover: graceful failover, which is initiated by an administrator, and non-graceful failover, which is the result of a detected fault in the active process.
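For a feel of what an HA deployment involves, here is a hedged sketch of the main configuration knobs, written as Java purely for illustration (the host names and the nameservice name are made up; in a real cluster these settings live in hdfs-site.xml and core-site.xml):

import org.apache.hadoop.conf.Configuration;

public class HaConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // One logical nameservice backed by an active/standby pair of namenodes
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2:8020");
        // The shared edits log, visible to both namenodes (quorum journal nodes here)
        conf.set("dfs.namenode.shared.edits.dir",
                 "qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster");
        // Clients fail over between the two namenodes through this proxy provider
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                 "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        // Automatic failover is coordinated by ZooKeeper-based failover controllers
        conf.setBoolean("dfs.ha.automatic-failover.enabled", true);
        conf.set("ha.zookeeper.quorum", "zk1:2181,zk2:2181,zk3:2181");
        // Clients then address the cluster by its nameservice, e.g. hdfs://mycluster
        System.out.println(conf.get("dfs.nameservices"));
    }
}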

Cluster Balance


When copying data into HDFS, it is important to consider cluster balance. File blocks are spread across the cluster, and ideally the blocks should be distributed evenly among all the DataNodes. For example, if you run distcp with -m 1, a single map does the whole copy, which, apart from being slow and not using the cluster resources efficiently, means that the first replica of each block resides on the node running the map (until its disks fill up). The second and third replicas are spread across the cluster, but that one node ends up unbalanced. By using more maps than there are nodes in the cluster, the problem is avoided; for this reason, it's best to start by running distcp with its default of 20 maps per node. However, it's not always possible to prevent a cluster from becoming unbalanced. Perhaps you want to limit the number of maps so that some of the nodes can be used by other jobs.
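One way to see whether a copy has left the cluster lopsided is to count how many block replicas of the imported files landed on each DataNode; a heavily skewed count points at the imbalance described above. A sketch, assuming a hypothetical import directory:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.util.Map;
import java.util.TreeMap;

public class BalanceCheck {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Map<String, Integer> replicasPerHost = new TreeMap<>();

        // Tally block replicas per DataNode for everything under the import directory
        for (FileStatus st : fs.listStatus(new Path("/user/demo/imported"))) {
            if (st.isDirectory()) continue;
            for (BlockLocation loc : fs.getFileBlockLocations(st, 0, st.getLen())) {
                for (String host : loc.getHosts()) {
                    replicasPerHost.merge(host, 1, Integer::sum);
                }
            }
        }
        replicasPerHost.forEach((host, n) -> System.out.println(host + " : " + n));
        fs.close();
    }
}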
