Apache Hadoop is an open source
software platform for distributed storage and distributed/parallel processing
of very large data sets on computer clusters built from commodity hardware.
Hadoop services cover data storage, data access, data processing, data
governance, security, and operations.
Why is Hadoop so important for data processing?
Before diving into Hadoop, let's warm up with a few basic
questions.
What is the basic functionality of the system?
How do basic reads and writes happen?
We write and store data in the
filesystem in the form of chunks or blocks. Windows machines use
NTFS, Linux uses ext3 or ext4, AIX uses JFS, and other operating systems use
their own mechanisms to store and retrieve data. In the same way, Hadoop
uses HDFS as its filesystem and MapReduce applications to process and retrieve the data.
There are other techniques in the Hadoop ecosystem to retrieve
data, which we will be covering in the coming chapters, but the filesystem remains the
same.
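To make this concrete, here is a minimal Java sketch of writing a file to HDFS and reading it back through the FileSystem API. It assumes a running cluster whose address (fs.defaultFS) is available from core-site.xml on the classpath; the path /user/demo/hello.txt is just an illustrative example.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsHello {
        public static void main(String[] args) throws Exception {
            // Reads fs.defaultFS (the cluster address) from core-site.xml.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical example path; any HDFS path you can write to will do.
            Path file = new Path("/user/demo/hello.txt");

            // Write: behind the scenes HDFS splits the file into blocks
            // and spreads them across the cluster's data nodes.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeBytes("Hello, HDFS!\n");
            }

            // Read the file back and copy it to standard output.
            try (FSDataInputStream in = fs.open(file)) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }

Notice that the client code never deals with blocks or node addresses; HDFS handles placement behind the scenes, just as NTFS or ext4 hides disk layout from a desktop program.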
So, is Hadoop installed the same way as an
operating system? No. Hadoop is installed on top of the operating system
like any other software, and other applications are installed on top of
Hadoop. We will discuss Hadoop installation and configuration
in a coming chapter; this is just to give a fair idea of Hadoop before getting into the
details of the other applications.
So what is the real requirement? Storing massive amounts of
data and retrieving it for various analytics and operational purposes. The
requirement, then, is how quickly we can store the data and retrieve it; for
modest volumes, existing database systems should be sufficient. But can an
RDBMS still maintain that speed for large-volume data processing? It looks
possible, doesn't it, by adding more resources and increasing memory and CPU
capacity? But remember the cost involved in adding resources to a
proprietary, license-based system: it must be huge. And what about a truly huge volume
of data? Still keep adding resources? No, adding resources alone is
not going to solve the problem here.
So let's say a single system takes 60 minutes to complete
job A. If I add one more system, B, to share the job, the job
will complete in 30 minutes. If I use 6 systems for the same job, it will
finish in 10 minutes, won't it? And if I add 60 systems to share the
job, how quickly will it finish? That's the power of parallel computing and
parallel processing. Hold on, what is the cost involved? Hadoop can be
installed on commodity hardware and is open source software, so it is much
cheaper than a system running on proprietary hardware and license-based
software. We will see how effectively we can exploit Hadoop in the coming
chapters.
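To put numbers on that, here is a tiny Java sketch of the ideal, perfectly linear speedup. Real jobs lose some time to coordination and data movement, so treat these figures as an upper bound rather than a guarantee.

    public class IdealSpeedup {
        public static void main(String[] args) {
            double singleMachineMinutes = 60.0; // job A on one system
            int[] clusterSizes = {1, 2, 6, 60};
            for (int n : clusterSizes) {
                // Ideal linear speedup: n machines share the work equally.
                System.out.printf("%2d machine(s): %.1f minutes%n",
                        n, singleMachineMinutes / n);
            }
        }
    }

With 60 machines the same job would, ideally, finish in a single minute.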
Hadoop has the ability to store
and process huge volumes of data of any structure, quickly. With data
volumes and varieties constantly increasing, especially from social
media and the Internet of Things (IoT), that is an important consideration.
Computing power.
Because
Hadoop computes in parallel, its distributed computing model
processes large volumes of data quickly. The basic principle of Hadoop is:
the more computing nodes you use, the more processing power you have.
Scalable.
Hadoop is
highly scalable storage by nature, because it can store and distribute
very large data sets across hundreds of inexpensive servers that operate in
parallel. When the cluster needs to grow, instead of only
increasing the resources of a specific machine as in other systems, we can scale out
by adding more machines to the cluster. Traditional relational database
systems cannot scale to process such large amounts of data, but Hadoop enables
businesses to run applications on thousands of nodes involving petabytes of
data.
Cost effective.
The main problem with traditional
relational database management systems is that they are extremely cost-prohibitive
to scale enough to process massive amounts of data. Since Hadoop
runs on commodity hardware and open source software, it gives
businesses with exploding data sets a cost-effective storage solution.
Flexible.
Hadoop is capable of processing any kind of data
(structured, semi-structured, and unstructured) from any kind of source, which
makes things easy for businesses. Hadoop can be used for a
wide variety of purposes, such as log processing, data warehousing, marketing
campaign analysis, recommendation engines, and fraud detection. This is how
businesses can use Hadoop to derive valuable business insights from various data
sources such as social media, email conversations, and more.
Fast.
Hadoop's unique storage mechanism is based on a distributed file
system that essentially 'maps' the data wherever it is located on the cluster. The
tools for data processing frequently run on the same servers where the data is
located, resulting in much faster data processing. When dealing with
large volumes of variously structured data, Hadoop can process
terabytes of data in just minutes and petabytes in just hours.
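That "processing lives next to the data" idea is easiest to see in the classic word-count MapReduce job, sketched below in Java: the framework schedules each map task on (or close to) the node that stores the input block it reads, and only the small intermediate counts travel across the network. Input and output paths are taken from the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Each map task runs on (or near) the node storing its input block.
        public static class TokenMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reducers combine the partial counts produced across the cluster.
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(Text key, Iterable<IntWritable> values,
                    Context context) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

This is essentially the standard introductory example from the Hadoop documentation; we will look at MapReduce in detail in a later chapter.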
Resilient to failure.
Another main advantage of using Hadoop
is its fault tolerance. When so many computers are grouped together, and on
commodity hardware at that, hardware failures are to be expected, but
Hadoop is built to handle hardware failure. When data is written to
an individual node in the cluster, that data is also replicated to other nodes
in the same cluster, which means that on a hardware failure, even of an entire
node, Hadoop will pick up another copy available on some other
node in the cluster.
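Here is a small Java sketch of inspecting and overriding that replication through the FileSystem API. The file path is hypothetical, and the cluster-wide default (normally 3 copies) comes from the dfs.replication setting.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationCheck {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // Hypothetical example file on HDFS.
            Path file = new Path("/user/demo/hello.txt");

            // How many copies of each block is HDFS keeping for this file?
            FileStatus status = fs.getFileStatus(file);
            System.out.println("Replication factor: " + status.getReplication());

            // Per-file override: ask HDFS to keep 5 copies of each block instead.
            fs.setReplication(file, (short) 5);
        }
    }

Losing a node therefore does not lose the data: HDFS simply serves readers from one of the surviving replicas and re-creates the missing copies elsewhere in the cluster.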