Tuesday 28 March 2017

What is Apache Hadoop?

Apache Hadoop is an open source software platform for distributed storage and distributed/parallel processing of very large data sets on computer clusters built from commodity hardware. Hadoop services provide data storage, data access, data processing, data governance, security, and operations.

Why is Hadoop so important for data processing?


Before diving into Hadoop, let's answer a few simple questions.

What is the basic functionality of a filesystem?
How do the basic reads and writes happen?

We write and store data in a filesystem in the form of chunks or blocks. Windows machines use NTFS, Linux uses ext3 or ext4, AIX uses JFS, and other operating systems use their own mechanisms to store and retrieve data. In the same way, Hadoop uses HDFS as its filesystem and MapReduce applications to process and retrieve the data. There are other techniques in the Hadoop ecosystem for retrieving data, which we will be covering in coming chapters, but the filesystem remains the same.
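
To make the filesystem analogy concrete, here is a minimal sketch of reading a file from HDFS through Hadoop's Java FileSystem API. The NameNode URI and the file path are placeholder assumptions, and we will cover this API properly in a later chapter; the point is simply that HDFS exposes the familiar open/read/close idea even though the blocks live on many machines.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        // Point the client at the cluster; this URI is a placeholder.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        // Open and read the file line by line, just like a local filesystem,
        // even though HDFS stores its blocks across the cluster.
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(fs.open(new Path("/user/demo/sample.txt"))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}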

So do you think Hadoop is installed the same way as an operating system? No. Hadoop is installed on top of the operating system like any other software, and other applications are then installed on top of Hadoop. We will discuss Hadoop installation and configuration in a coming chapter; this is just to give a fair idea of Hadoop before we get into the details of those other applications.
So what is the real requirement? Storing massive amounts of data and retrieving it for various analytics and operational purposes. The requirement, then, is how quickly we can store the data and retrieve it. Shouldn't the existing database systems be sufficient for this? Can an RDBMS still maintain its speed when processing large volumes of data? It looks possible, doesn't it, by adding more resources and increasing memory and CPU performance? But remember the cost involved in adding resources to a proprietary, license-based system; that cost can be huge. And what about truly huge volumes of data? Keep adding resources? No, adding resources alone is not going to solve the problem here.
So let's say a single system takes 60 minutes to complete job A. If I add one more system, B, to share the job, the job completes in about 30 minutes. If I use 6 systems for the same job, it finishes in about 10 minutes. And if I add 60 systems to share the job, how quickly does it finish? In roughly a minute. That's the power of parallel computing and parallel processing. But hold on, what is the cost involved? Hadoop runs on commodity hardware and open source software, so it is much cheaper than a system running on proprietary hardware with license-based software. We will see how effectively we can exploit Hadoop in the coming chapters.
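
As a rough back-of-the-envelope model (an idealization that assumes the work splits perfectly and ignores coordination overhead), the runtime with N machines is simply the single-machine time divided by N:

T_N = \frac{T_1}{N}, \qquad T_{60} = \frac{60\ \text{minutes}}{60} = 1\ \text{minute}

Real clusters fall short of this ideal because of scheduling, data movement and slow nodes, but the trend is what matters here.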

Hadoop has the ability to store and process huge volumes of data of any structure, quickly, even as data volumes and varieties constantly increase, especially from social media and the Internet of Things (IoT).

Computing power.


Because Hadoop computes in parallel, its distributed computing model processes large volumes of data fast. The basic principle of Hadoop is: the more computing nodes you use, the more processing power you have.

Scalable. 


Hadoop is highly scalable storage by nature, because it can store and distribute very large data sets across hundreds of inexpensive servers operating in parallel. When there is a need to scale the cluster, instead of only adding resources to a specific machine as other systems do, we can also scale out by adding more machines to the cluster. Traditional relational database systems cannot scale to process such large amounts of data, but Hadoop enables businesses to run applications on thousands of nodes involving petabytes of data.

Cost effective. 


The main problem with traditional relational database management systems is that it is extremely cost prohibitive to scale them enough to process massive amounts of data. Since Hadoop runs on commodity hardware and open source software, it gives businesses with exploding data sets a cost-effective storage solution.

Flexible. 


Hadoop is capable of processing any kind of data (structured, semi-structured and unstructured) from any kind of source, which is how it enables businesses to use their data easily. Hadoop can be used for a wide variety of purposes, such as log processing, data warehousing, market campaign analysis, recommendation engines and fraud detection. This is how businesses can use Hadoop to derive valuable business insights from data sources such as social media, email conversations and more.

Fast. 


Hadoop's unique storage mechanism is based on a distributed file system that essentially 'maps' data wherever it is located on the cluster. The tools for data processing frequently run on the same servers where the data is located, resulting in much faster data processing. Even when dealing with large volumes of differently structured data, Hadoop can process terabytes of data in minutes and petabytes in hours.
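
To give a feel for how processing runs next to the data, here is a minimal sketch of the classic word-count job written against Hadoop's MapReduce Java API. The input and output paths are placeholders, and the full programming model is covered in a later chapter; map tasks run on the nodes holding the input blocks, and reduce tasks combine their results.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map tasks run where the input blocks live and emit (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce tasks sum the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Placeholder HDFS paths; replace with real input and output locations.
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}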


Resilient to failure.


Another main advantage of using Hadoop is its fault tolerance. When so many computers are grouped together, and on commodity hardware at that, hardware failures are to be expected, but Hadoop is built to handle them. When data is written to an individual node in the cluster, it is also replicated to other nodes in the same cluster, which means that on a hardware failure, or even an entire node failure, Hadoop will pick up another copy available on some other node in the cluster.
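
As a small illustration of that replication (the path is a placeholder and the factor of 3 is just an example, though 3 is the usual HDFS default), a client can inspect or change a file's replication factor through the same Java FileSystem API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/sample.txt"); // placeholder path

            // How many copies of each block does HDFS currently keep for this file?
            short current = fs.getFileStatus(file).getReplication();
            System.out.println("Current replication factor: " + current);

            // Ask the NameNode to keep 3 copies of every block of this file.
            fs.setReplication(file, (short) 3);
        }
    }
}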
