The default big data storage layer for Apache Hadoop
is HDFS, the Hadoop Distributed File System. It is designed to run on commodity machines,
i.e. low-cost hardware servers, and it stores data distributed across the cluster.
HDFS is a highly fault-tolerant system and provides high-throughput access to
applications that require big data.
HDFS comprises three important components: the NameNode, the
DataNode, and the Secondary NameNode. As already mentioned, HDFS operates on a
master-slave architecture model, where the master is the NameNode and the slaves
are the DataNodes. The NameNode controls access to the data by its clients,
while the DataNodes manage the storage on the nodes they run on. Like
other file systems, Hadoop splits each file into one or more
blocks; these blocks are stored on the DataNodes, and each data
block is replicated to 3 different DataNodes to provide high availability to
the Hadoop ecosystem. This is how Hadoop is fault tolerant in nature. Also,
the block replication factor is configurable: HDFS creates
several copies of each data block and distributes them across different nodes in
the cluster for reliable and quick data access.
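For example, the cluster-wide replication factor is the standard dfs.replication property (default 3), and it can also be changed per file with the setrep shell command. A minimal sketch using the stock HDFS shell (the file path is hypothetical):
[root@sandbox ~]# hdfs getconf -confKey dfs.replication
3
[root@sandbox ~]# hdfs dfs -setrep -w 2 /user/root/sample.txt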
Daemons of HDFS:
NameNode:
The NameNode is the centerpiece of the Hadoop distributed file
system. It keeps track of the metadata and the location of the
data blocks on the DataNodes. The file system metadata is stored persistently
on the local hard disk of the NameNode in the form of a namespace image, known as
the fsimage, and an edit log file.
The fsimage is a snapshot of the file system, and recent changes
such as creations, deletions, and updates are recorded in the edit
logs. Periodically these edit logs are merged into a new snapshot, that is, a new fsimage.
Note, however, that the NameNode does not store the block-to-DataNode mapping
persistently; it recreates this mapping from the block reports sent by the DataNodes
when it is restarted. This makes the NameNode a single point of failure: if it
crashes, the entire Hadoop system goes down.
[root@sandbox ~]# ls -lrt /hadoop/hdfs/namenode/current/
-rw-r--r-- 1 hdfs hadoop 21505030 Mar 28 10:39 edits_0000000000020911480-0000000000021041539
-rw-r--r-- 1 hdfs hadoop  7340032 Mar 28 11:26 edits_0000000000021041540-0000000000021078684
-rw-r--r-- 1 hdfs hadoop  5639982 Mar 29 04:52 fsimage_0000000000021078684
-rw-r--r-- 1 hdfs hadoop       62 Mar 29 04:52 fsimage_0000000000021078684.md5
-rw-r--r-- 1 hdfs hadoop      203 Mar 29 04:52 VERSION
-rw-r--r-- 1 hdfs hadoop    73087 Mar 29 04:59 edits_0000000000021078685-0000000000021079182
-rw-r--r-- 1 hdfs hadoop        9 Mar 29 04:59 seen_txid
-rw-r--r-- 1 hdfs hadoop  5603567 Mar 29 05:00 fsimage_0000000000021079182
-rw-r--r-- 1 hdfs hadoop       62 Mar 29 05:00 fsimage_0000000000021079182.md5
-rw-r--r-- 1 hdfs hadoop  5242880 Mar 29 05:21 edits_inprogress_0000000000021079183
[root@sandbox ~]#
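The directory listed above is wherever the dfs.namenode.name.dir property points. As a quick sanity check, the stock hdfs getconf and Offline Image Viewer (oiv) commands can be used to locate and inspect these files; a sketch, reusing the fsimage name from the listing above:
[root@sandbox ~]# hdfs getconf -confKey dfs.namenode.name.dir
[root@sandbox ~]# hdfs oiv -p XML -i /hadoop/hdfs/namenode/current/fsimage_0000000000021079182 -o /tmp/fsimage.xml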
Secondary NameNode:
The responsibility of the Secondary NameNode is to periodically
copy and merge the namespace image and edit logs from the NameNode. In other words,
the Secondary NameNode helps the NameNode by building a fresh fsimage from the edit logs at
regular intervals, a process known as checkpointing. Also, if the NameNode crashes, the namespace
image stored on the Secondary NameNode can be used to restart the NameNode. Generally,
the NameNode service and the Secondary NameNode service should run on
different nodes.
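How often this checkpoint happens is governed by two standard properties: dfs.namenode.checkpoint.period (seconds between checkpoints) and dfs.namenode.checkpoint.txns (the number of uncheckpointed transactions that forces one). The values shown below are the stock defaults:
[root@sandbox ~]# hdfs getconf -confKey dfs.namenode.checkpoint.period
3600
[root@sandbox ~]# hdfs getconf -confKey dfs.namenode.checkpoint.txns
1000000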
[root@sandbox ~]# ls -lrt /hadoop/hdfs/namesecondary/current/
-rw-r--r-- 1 hdfs hadoop    31794 Mar 28 04:39 edits_0000000000020911255-0000000000020911479
-rw-r--r-- 1 hdfs hadoop 21505030 Mar 28 10:39 edits_0000000000020911480-0000000000021041539
-rw-r--r-- 1 hdfs hadoop  5639982 Mar 29 05:00 fsimage_0000000000021078684
-rw-r--r-- 1 hdfs hadoop       62 Mar 29 05:00 fsimage_0000000000021078684.md5
-rw-r--r-- 1 hdfs hadoop    73087 Mar 29 05:00 edits_0000000000021078685-0000000000021079182
-rw-r--r-- 1 hdfs hadoop  5603567 Mar 29 05:00 fsimage_0000000000021079182
-rw-r--r-- 1 hdfs hadoop       62 Mar 29 05:00 fsimage_0000000000021079182.md5
-rw-r--r-- 1 hdfs hadoop      203 Mar 29 05:00 VERSION
[root@sandbox ~]#
DataNode:
The DataNode is where the
actual data is stored, as chunks or blocks. When data is stored in Hadoop,
each file is split according to the configured block size and the blocks are stored across the cluster. The
default block size is 64MB (128MB in Hadoop 2.x and later), and it can be changed as needed.
Let's say that we are going
to store 128MB of data in Hadoop. Since the default block size is 64MB, the
128MB of data will be split into 2 blocks and stored across the cluster. Take
another example, storing 65MB of data: with a 64MB block size,
the 65MB file will also be split into 2 blocks, one of 64MB and one of 1MB, and stored in the cluster. We will be
discussing more about the DataNodes and the replication factors in another
blog.
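To make this concrete, here is a hedged sketch using standard HDFS shell commands: dfs.blocksize holds the configured block size, a per-write override can be passed with -D, and fsck shows how a file was actually split into blocks (the file name is hypothetical; 67108864 bytes is 64MB):
[root@sandbox ~]# hdfs getconf -confKey dfs.blocksize
[root@sandbox ~]# hdfs dfs -D dfs.blocksize=67108864 -put large.log /user/root/
[root@sandbox ~]# hdfs fsck /user/root/large.log -files -blocks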
Journal Node:
Because the NameNode is a
single point of failure, High Availability (HA) was introduced for HDFS, and it
relies on a set of JournalNodes. In an HA setup, the active NameNode writes its
edit log to a quorum of JournalNodes, and the standby NameNode reads from them to
stay in sync. A JournalNode will not have fsimage files or
seen_txid; it does, however, contain several other files relevant to the HA
implementation. These files help prevent a split-brain scenario, in which
multiple NameNodes could think they are active and all try to write edits.
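For reference, the quorum of JournalNodes is configured through the standard dfs.namenode.shared.edits.dir property, whose value is a qjournal:// URI listing the JournalNode hosts; the hostnames and nameservice shown below are purely illustrative:
[root@sandbox ~]# hdfs getconf -confKey dfs.namenode.shared.edits.dir
qjournal://node1:8485;node2:8485;node3:8485/mycluster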