Wednesday 29 March 2017

Main contributors to Hadoop

There are a number of distributions of Hadoop. A comprehensive list can be found on the Hadoop wiki page. We will be examining three of them:

• Cloudera Distribution of Hadoop (CDH)
• Hortonworks Data Platform (HDP)
• MapR

Cloudera


Cloudera was founded by big data geniuses from Facebook, Google, Oracle and Yahoo in 2008. It was the first company to develop and distribute Apache Hadoop-based software, and it has the largest user base with the most clients. Although the core of the distribution is based on the Apache Hadoop open source project, it also provides a proprietary Cloudera Management Suite to automate the installation and configuration process and to provide other services that make life easier for users, such as reducing deployment time and displaying a real-time count of nodes. CDH is in its fifth major version right now and is considered a mature Hadoop distribution. The paid version of CDH comes with proprietary management software, Cloudera Manager.

Hortonworks Data Platform (HDP)


Hortonworks, founded in 2011, has quickly emerged as one of the leading vendors of Hadoop. The distribution provides an open source platform based on Apache Hadoop for analyzing, storing and managing big data. Hortonworks is the only commercial vendor to distribute complete open source Apache Hadoop without additional proprietary software. The Hortonworks distribution HDP 2.5 can be downloaded directly from their website free of cost and is easy to install. Hortonworks engineers are behind most of Hadoop's recent innovations, including YARN, which improves on MapReduce in the sense that it enables additional data processing frameworks to run on Hadoop. HDP is in its second major version currently and is considered the rising star among Hadoop distributions. It comes with free and open source management software called Ambari.

MapR


In its standard, open source edition, Apache Hadoop software comes with a number of restrictions. Vendor distributions aim to overcome the issues that users typically encounter with the standard edition. Under the free Apache license, all three distributions provide users with updates to the core Hadoop software. But when it comes to picking one of them, one should look at the additional value it provides to customers in terms of improving the reliability of the system (detecting and fixing bugs, etc.), providing technical assistance and expanding functionality.


MapR comes with its own management console. The different grades of the product are named M3, M5 and M7. M5 is the standard commercial distribution from the company, M3 is a free version without high availability, and M7 is a paid version with a rewritten HBase API.

Tuesday 28 March 2017

Hadoop Distributed File System (HDFS)

HDFS is the default big data storage layer for Apache Hadoop, and it is designed to run on commodity machines, i.e. low-cost hardware servers. The distributed data is stored in the HDFS file system. HDFS is a highly fault-tolerant system and provides high-throughput access for applications that require big data.

HDFS comprises three important components: the NameNode, the DataNode and the Secondary NameNode. As already mentioned, HDFS operates on a master-slave architecture model, where the master is the NameNode and the slaves are the DataNodes. The NameNode controls access to the data by its clients, and the DataNodes manage the storage of data on the nodes they are running on. Like other filesystems, Hadoop splits each file into one or more blocks; these blocks are stored on the DataNodes, and each data block is replicated to three different DataNodes by default to provide high availability to the Hadoop ecosystem. This is what makes Hadoop fault tolerant in nature. The block replication factor is configurable. HDFS creates several copies of each data block and distributes them across different nodes in the cluster for reliable and quick data access.
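
As a quick, hedged sketch of how to inspect this on a running cluster (the path /user/hadoop/sample.txt is a hypothetical example file), the standard HDFS command-line tools can show the configured replication factor and the blocks behind a file:

hdfs getconf -confKey dfs.replication                           # cluster-wide default replication factor
hdfs dfs -ls /user/hadoop/sample.txt                            # the second column of the listing is the file's replication
hdfs dfs -setrep -w 2 /user/hadoop/sample.txt                   # change this file's replication to 2 and wait for it to apply
hdfs fsck /user/hadoop/sample.txt -files -blocks -locations     # list each block and the DataNodes holding its replicas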

Daemons of HDFS:


NameNode


The NameNode is the centerpiece of the Hadoop distributed file system. The NameNode keeps track of the metadata and the locations of the data blocks on the DataNodes. This metadata is stored persistently on the local hard disk of the NameNode in the form of a namespace image, known as the fsimage, and an edit log file.

The fsimage is a snapshot of the file system, and current changes such as creations, deletions and updates go into the edit logs. Periodically, these edit logs are merged into a new snapshot, i.e. a new fsimage. However, the NameNode does not store the block-to-DataNode mapping persistently; it rebuilds that mapping from DataNode block reports when it is restarted. So, if the NameNode crashes, the entire Hadoop system goes down. The listing below shows the fsimage and edit log files in the NameNode's current directory:

[root@sandbox ~]# ls -lrt /hadoop/hdfs/namenode/current/
-rw-r--r-- 1 hdfs hadoop 21505030 Mar 28 10:39 edits_0000000000020911480-0000000000021041539
-rw-r--r-- 1 hdfs hadoop  7340032 Mar 28 11:26 edits_0000000000021041540-0000000000021078684
-rw-r--r-- 1 hdfs hadoop  5639982 Mar 29 04:52 fsimage_0000000000021078684
-rw-r--r-- 1 hdfs hadoop       62 Mar 29 04:52 fsimage_0000000000021078684.md5
-rw-r--r-- 1 hdfs hadoop      203 Mar 29 04:52 VERSION
-rw-r--r-- 1 hdfs hadoop    73087 Mar 29 04:59 edits_0000000000021078685-0000000000021079182
-rw-r--r-- 1 hdfs hadoop        9 Mar 29 04:59 seen_txid
-rw-r--r-- 1 hdfs hadoop  5603567 Mar 29 05:00 fsimage_0000000000021079182
-rw-r--r-- 1 hdfs hadoop       62 Mar 29 05:00 fsimage_0000000000021079182.md5
-rw-r--r-- 1 hdfs hadoop  5242880 Mar 29 05:21 edits_inprogress_0000000000021079183
[root@sandbox ~]# 
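
The directory shown above is wherever dfs.namenode.name.dir points. If you want to look inside these files, Hadoop ships offline viewers for both; a small sketch, reusing file names from the listing above (adjust them to whatever actually exists on your node):

hdfs getconf -confKey dfs.namenode.name.dir     # confirm where the NameNode keeps fsimage and edit logs
# dump a checkpointed image to XML with the Offline Image Viewer
hdfs oiv -p XML -i /hadoop/hdfs/namenode/current/fsimage_0000000000021079182 -o /tmp/fsimage.xml
# dump an edit log segment to XML with the Offline Edits Viewer
hdfs oev -i /hadoop/hdfs/namenode/current/edits_0000000000021078685-0000000000021079182 -o /tmp/edits.xml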

Secondary NameNode


The responsibility of the Secondary NameNode is to periodically copy and merge the namespace image and edit log from the NameNode. In other words, the Secondary NameNode helps the NameNode build a new fsimage from the edit logs at regular intervals (checkpointing); and if the NameNode crashes, the namespace image stored on the Secondary NameNode can be used to restart it. Generally, the NameNode service and the Secondary NameNode service should run on different nodes. The listing below shows the checkpoint files in the Secondary NameNode's current directory:

[root@sandbox ~]# ls -lrt /hadoop/hdfs/namesecondary/current/
-rw-r--r-- 1 hdfs hadoop    31794 Mar 28 04:39 edits_0000000000020911255-0000000000020911479
-rw-r--r-- 1 hdfs hadoop 21505030 Mar 28 10:39 edits_0000000000020911480-0000000000021041539
-rw-r--r-- 1 hdfs hadoop  5639982 Mar 29 05:00 fsimage_0000000000021078684
-rw-r--r-- 1 hdfs hadoop       62 Mar 29 05:00 fsimage_0000000000021078684.md5
-rw-r--r-- 1 hdfs hadoop    73087 Mar 29 05:00 edits_0000000000021078685-0000000000021079182
-rw-r--r-- 1 hdfs hadoop  5603567 Mar 29 05:00 fsimage_0000000000021079182
-rw-r--r-- 1 hdfs hadoop       62 Mar 29 05:00 fsimage_0000000000021079182.md5
-rw-r--r-- 1 hdfs hadoop      203 Mar 29 05:00 VERSION
[root@sandbox ~]#
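
The checkpoint cycle that produces these files is driven by a couple of standard HDFS settings; a minimal sketch of how to check them on your own cluster (the values returned are whatever your configuration holds):

hdfs getconf -confKey dfs.namenode.checkpoint.period   # seconds between checkpoints (3600 by default)
hdfs getconf -confKey dfs.namenode.checkpoint.txns     # or checkpoint once this many transactions have accumulated
hdfs getconf -confKey dfs.namenode.checkpoint.dir      # where the Secondary NameNode keeps its merged fsimage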

DataNode


The DataNode is where the actual data is stored, in chunks called blocks. When data is stored in Hadoop, it is split according to the configured block size and stored across the cluster. The default block size is 64 MB in older releases (128 MB in Hadoop 2.x), and we can change it as per our needs.

Let's say that we are going to store 128 MB of data in Hadoop. With a 64 MB block size, the 128 MB of data will be split into 2 blocks and stored across the cluster. As another example, consider storing 65 MB of data: with a 64 MB block size, the 65 MB file will also be split into 2 blocks (one of 64 MB and one of 1 MB) and stored in the cluster. We will be discussing more about DataNodes and replication factors in another post.
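
A small sketch of how to verify this on a running cluster; the file name sample_65mb.dat is hypothetical, and your configured block size may well be 128 MB rather than 64 MB:

hdfs getconf -confKey dfs.blocksize              # configured block size, in bytes
hdfs dfs -put sample_65mb.dat /tmp/sample_65mb.dat
hdfs fsck /tmp/sample_65mb.dat -files -blocks    # fsck lists the blocks the file was split into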


Journal Node:


Since the NameNode is a single point of failure, a High Availability (HA) mechanism was introduced, and the JournalNode is part of it. The JournalNode does not have fsimage files or a seen_txid file; instead, it contains several other files relevant to the HA implementation. These files help prevent a split-brain scenario, in which multiple NameNodes could think they are active and all try to write edits.
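
On an HA-enabled cluster, the shared edits location and the active/standby state of the NameNodes can be checked with the standard tools; a sketch under the assumption that HA is configured and that the NameNode IDs are nn1 and nn2 (both hypothetical here):

hdfs getconf -confKey dfs.nameservices                 # logical name of the HA nameservice
hdfs getconf -confKey dfs.namenode.shared.edits.dir    # qjournal:// URI listing the JournalNodes
hdfs haadmin -getServiceState nn1                      # prints "active" or "standby" for that NameNode ID
hdfs haadmin -getServiceState nn2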

What is Apache Hadoop?

Apache Hadoop is an open source software platform for distributed storage and distributed/parallel processing of very large data sets on computer clusters built from commodity hardware. Hadoop services provide data storage, data access, data processing, data governance, security, and operations.

How is Hadoop so important for data processing?


Before diving into Hadoop, let's start with a couple of basic questions:

What is the basic functionality of a filesystem?
How do basic reads and writes happen?

We write/store data in a filesystem in the form of chunks or blocks. Windows machines use NTFS, Linux uses ext3 and ext4, AIX uses JFS, and other operating systems use their own mechanisms to store and retrieve data. In the same way, Hadoop uses HDFS as its filesystem and MapReduce applications to retrieve the data. There are other techniques available in the Hadoop ecosystem to retrieve the data, which we will be covering in coming chapters, but the filesystem remains the same.
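
To make the analogy concrete, here is a minimal sketch of everyday file operations against HDFS (the /user/demo path and the access.log file are made up for the example):

hdfs dfs -mkdir -p /user/demo                 # create a directory in HDFS
hdfs dfs -put access.log /user/demo/          # copy a local file into HDFS
hdfs dfs -ls /user/demo                       # list it, just like ls on a local filesystem
hdfs dfs -cat /user/demo/access.log | head    # read it back
hdfs dfs -rm /user/demo/access.log            # remove it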

So do you think Hadoop is installed the same way as an operating system? No; Hadoop is installed on top of the operating system like any other software, and other applications are then installed on top of Hadoop. We will be discussing Hadoop installation and configuration in a coming chapter, but this should give a fair idea of Hadoop before we get into the details of the other applications.
So what is the real requirement? Storing massive amounts of data and retrieving it for various analytics and operational purposes. The requirement, then, is how quickly we can store and retrieve the data; for modest volumes, existing database systems should be sufficient. But can an RDBMS still maintain that speed when processing a large volume of data? It looks possible, doesn't it, by adding more resources and increasing memory and CPU performance? But remember the cost involved in adding resources to a proprietary, license-based system: that cost is huge. And what about a truly huge volume of data? Keep adding resources? No, adding resources alone is not going to solve the problem here.
So let's say a single system takes 60 minutes to complete job A. If I add one more system, B, to share the job, the job completes in about 30 minutes. If I use 6 systems to do the same job, it will finish in about 10 minutes, and if I add 60 systems to share the job, it will finish in roughly a minute. That's the power of parallel computing and parallel processing. Hold on, though: what is the cost involved? Hadoop can be installed on commodity hardware and open source software, so it is much cheaper than a system running on proprietary hardware and license-based software. We will see how effectively we can exploit Hadoop in the coming chapters.

Hadoop has the ability to store and process huge volumes of data of any structure, quickly, even as data volumes and varieties constantly increase, especially data from social media and the Internet of Things (IoT).

Computing power


Because Hadoop computes in parallel, its distributed computing model processes large volumes of data fast. The basic principle of Hadoop is: the more computing nodes you use, the more processing power you have.

Scalable. 


Hadoop is highly scalable storage by nature, because it can store and distribute very large data sets across hundreds of inexpensive servers operating in parallel. When there is a need to scale the cluster, instead of adding resources to one specific machine as in other systems, we can simply scale out by adding more machines to the cluster. Traditional relational database systems cannot scale to process very large amounts of data, but Hadoop enables businesses to run applications on thousands of nodes involving petabytes of data.
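
One way to see the effect of scaling out is the cluster report, which lists every live DataNode along with its capacity; a quick sketch (the output obviously depends on your cluster):

hdfs dfsadmin -report          # configured capacity, DFS used/remaining, and one section per live DataNode
hdfs dfsadmin -printTopology   # rack-by-rack view of the DataNodes currently in the cluster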

Cost effective. 


The main problem with traditional relational database management systems is that it is extremely cost-prohibitive to scale them enough to process massive amounts of data. Since Hadoop runs on commodity hardware and open source software, it gives businesses with exploding data sets a cost-effective storage solution.

Flexible. 


Hadoop is capable of processing any kind of data (structured, semi-structured and unstructured) from any kind of source, which is what makes it easy for businesses to adopt. Hadoop can be used for a wide variety of purposes, such as log processing, data warehousing, marketing campaign analysis, recommendation engines and fraud detection. In this way businesses can use Hadoop to derive valuable business insights from data sources such as social media, email conversations and more.

Fast. 


Hadoop's unique storage mechanism is based on a distributed file system that essentially 'maps' the data wherever it is located on the cluster. The tools for data processing are frequently on the same servers where the data is located, resulting in much faster data processing. When dealing with large volumes of variously structured data, Hadoop is capable of processing terabytes of data in minutes and petabytes in hours.


Resilient to failure.


Another main advantage of using Hadoop is its fault tolerance. When so many computers are grouped together, especially on commodity hardware, hardware failures are to be expected, but Hadoop is built to handle hardware failure. When data is sent to an individual node in the cluster, that data is also replicated to other nodes in the same cluster, which means that on a hardware failure, or even an entire node failure, Hadoop picks up another copy available on some other node in the cluster.

Comparing with other systems

Before Hadoop arrived on the scene, several database techniques and server mechanisms were used to solve these problems, such as existing RDBMSs, grid computing and a few others.

A typical RDBMS, or relational database management system, is the master of real-time queries on structured data, which makes it ideal for real-time online transaction processing, or OLTP. Businesses of all stripes rely on this kind of functionality to transact important business. RDBMSs are supported by the biggest companies in the software industry and are used by every mid-to-large-sized IT organization in the world. But analyzing unstructured data in an RDBMS is like mixing oil into water, and huge volumes of unstructured data are hard for it to process; that is where Hadoop comes into the picture. Hadoop is purpose-built to handle enormous volumes of unstructured data. RDBMSs usually run on costly proprietary servers, whereas Hadoop runs on commodity hardware servers and splits data queries across various nodes, making it relatively fault-tolerant.

 

What is RDBMS in depth?


RDBMS stands for relational database management system. A database management system (DBMS) stores data in the form of tables, which consist of columns and rows. The Structured Query Language (SQL) is used to extract the necessary data stored in these tables. An RDBMS also stores the relationships between these tables in various forms, for example when the entries of one column of a table serve as a reference for another table. These column values are known as primary keys and foreign keys. The keys are used to reference other tables so that the appropriate data can be related and retrieved by joining the different tables with SQL queries as needed.

The most important attribute of a relational database system is that a single database generally has several tables, with relationships between those tables, so that the information is classified into tables of independent entities. The tables are stored independently in a normalized, or simplified, way, and the relationships between them are maintained using primary/foreign key constraints. This is different from a flat file or simple data structure. The data in a database may be stored in a single data file or in multiple data files. The data files grow, or new data files are added, as new records arrive and the size of the database increases. All of these files are shared by the database server; in high-availability systems the data files are shared so that each node has access to the same files. Generally, all popular database systems are relational database management systems. To give quick and easy navigation to related data, logical views are created from the actual tables. Every table has a physical existence in the database, whereas a view is a virtual table: it does not exist physically but is a logical creation from existing physical tables. IBM DB2, Microsoft SQL Server, Sybase, Oracle, MySQL and PostgreSQL are some examples of RDBMSs.

However, an RDBMS only works well when the entity-relationship model (ER model) is defined properly; otherwise the database schema or structure can grow unmanaged. In short, an RDBMS works well with structured data.


How is Hadoop better than an RDBMS?



RDBMS technology is proven, consistent, mature and well supported by the best companies. It works best only when the data has clear definitions, such as data types, relationships among the data, constraints and so on, which is why it is more appropriate for real-time processing. As already mentioned, an RDBMS works well only when the ER model is defined properly; otherwise the database schema or structure can grow unmanaged. In particular, where the data size is too large for complex processing, or where it is not easy to define the relationships between the data, it becomes very difficult to save the extracted information in an RDBMS with coherent relationships. When it comes to large, unstructured data, Hadoop is the right choice.

The Hadoop framework fits any kind of data, whether structured, semi-structured or unstructured, and it supports a variety of data formats such as XML, JSON and text-based flat files. Consider, for example, analyzing the Internet data published by various websites: among the existing millions of websites, each website has different types of content, and the relationships between them are not uniform. In such cases, Hadoop is the right choice for the analysis. As exposure to these capabilities increases, companies today are choosing Hadoop not only to help handle the historically grown, huge amounts of data, but also to meet the high-performance needs of their applications. For example, analyzing a customer's monthly energy usage by comparing it with previous months, with their neighbors or even with their friends may bring some awareness, but running such a complex comparison over a large data set takes many hours of processing time; the introduction of Hadoop can improve computing performance by 10 to 100 times or even more.

In late 1999, eBay scaled out across a cluster by logically partitioning its databases for user data, item data and purchase data. However, this SQL option did not scale well enough for eBay, and they have since moved their items catalogue to HBase. Facebook paired complex sharding and caching with MySQL: it split its MySQL database into roughly 4,000 shards and ran about 9,000 instances of memcached in order to handle the site's massive data volume. This became very difficult to maintain and scale, and Facebook has since moved its messaging to HBase. There are several videos on the internet about this. We will discuss HBase in a coming chapter.


                                 MapReduce                    RDBMS
1.   Size of data                Petabytes                    Gigabytes
2.   Integrity of data           Low                          High
3.   Data schema                 Dynamic                      Static
4.   Access method               Batch                        Interactive and batch
5.   Scaling                     Linear                       Nonlinear
6.   Data structure              Unstructured                 Structured
7.   Normalization of data       Not required                 Required


Grid computing:


Most people are familiar with the concept of a power grid, where various sources of electricity are linked together to supply power to a certain geographical location. The concept of grid computing is very similar, where computers are linked together in a grid to provide a greater computational resource.
Grid computing is an arrangement of computers, connected by a network, where unused processing power on all the machines is harnessed to complete tasks more efficiently. Tasks are distributed amongst the machines, and the results are collected to form a conclusion. The advantage of grid computing is that it reduces the time taken to complete tasks, without increasing costs.
Computers on a grid are not necessarily in the same geographical location, and can be spread out over multiple countries and organizations, or even belong to individuals. 

What is Grid Computing in depth?


These days computers have great processing power, even the lowliest of machines. During an average working day, most of this computational potential lies unused. In a grid computing environment, computers are linked together so that a task on one machine can utilize the unused processing power of another machine to execute faster. This arrangement minimizes wasted resources and increases efficiency considerably, as a task split over multiple machines takes significantly less time to compute.


Serial Computing vs. Parallel Computing


Each processor uses a queue system to execute tasks. Many scheduling algorithms are implemented in such systems, but in essence there is a task queue from which tasks are performed. A single processor can handle only one task at a time and, as a result, software has traditionally been written to execute each task sequentially. For example, if task ABC needs to be executed before task XYZ, the programmer has to ensure that order is maintained in the program; this is known as serial computing.
Even though sequential techniques play an important role in computing, certain tasks are independent of each other and can be performed simultaneously. For example, if two tasks can be performed independently of each other, they can be assigned to two different machines. Each machine then performs its assigned task independently, producing the results substantially faster than one machine performing both tasks one after the other. This is known as parallel computing.
Hadoop is also inspired by this parallel programming concept and is a mechanism for processing large amounts of data from various sources, for example web clickstream data, social network logs and so on. This data is so large that it must be distributed across multiple machines in the cluster in order to be processed within a reasonable time. That distribution across the cluster implies parallel computing over the different parts of the dataset that sit on the different machines. The parts do not depend on each other during execution, since each part is distributed across the machines in the cluster and each machine starts processing its own part in parallel. Here, MapReduce is the abstraction that allows engineers to perform simple computations while hiding the details of parallelization, data distribution, load balancing and fault tolerance.
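
As a hedged sketch of that abstraction, a word count can be expressed as a plain shell mapper and reducer and handed to Hadoop Streaming, which takes care of splitting the input, scheduling the pieces across the cluster and merging the results. The jar path below is typical for an HDP sandbox and may differ on your installation, and /user/demo/input is a hypothetical input directory:

# mapper.sh: emit "word<TAB>1" for every word on stdin
cat > mapper.sh <<'EOF'
#!/bin/bash
tr -s '[:space:]' '\n' | grep -v '^$' | awk '{print $1 "\t1"}'
EOF

# reducer.sh: sum the counts per word (input arrives grouped by key)
cat > reducer.sh <<'EOF'
#!/bin/bash
awk -F'\t' '{count[$1]+=$2} END {for (w in count) print w "\t" count[w]}'
EOF
chmod +x mapper.sh reducer.sh

hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
    -files mapper.sh,reducer.sh \
    -input /user/demo/input \
    -output /user/demo/wordcount_out \
    -mapper mapper.sh \
    -reducer reducer.sh

hdfs dfs -cat /user/demo/wordcount_out/part-00000 | head   # inspect a few results

The point is that the mapper and reducer know nothing about how many machines run them or which part of the data they see; Hadoop handles that distribution for us.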