Before Hadoop came into the picture, several database techniques and server mechanisms were used to attack these problems, most notably existing RDBMS technology and grid computing.
A typical RDBMS, or relational database management system, excels at real-time queries over structured data, which makes it ideal for real-time online transaction processing (OLTP). Businesses of all stripes rely on this kind of functionality to carry out important transactions. RDBMS products are backed by the biggest companies in the software industry and are used by nearly every mid- to large-sized IT organization in the world. But analyzing unstructured data in an RDBMS is like mixing oil into water, and when huge volumes of unstructured data have to be processed, that is where Hadoop comes into the picture. Hadoop is purpose-built to handle enormous volumes of unstructured data. RDBMS deployments usually run on costly proprietary servers, whereas Hadoop runs on commodity hardware and splits every job across many nodes, making it relatively fault-tolerant.
What is an RDBMS in depth?
RDBMS stands for relational database management system. A database management system (DBMS) stores data in the form of tables, which consist of rows and columns. Structured Query Language (SQL) is used to extract the data stored in these tables. An RDBMS also stores the relationships between these tables; for example, the entries of one column in a table can serve as a reference for another table. These column values are known as primary keys and foreign keys. The keys are used to reference other tables so that related data can be retrieved by joining the appropriate tables with SQL queries as needed, and the tables and their relationships can be manipulated through those same queries.
The most important attribute of a relational database system is that a single database generally has several tables and relationships between those tables, so that the information is classified into tables of independent entities. The entities are stored independently in a normalized, simplified form, and the relationships between the tables are maintained through primary/foreign key constraints. This is different from a flat file or a simple data structure. The data in a database may be kept in a single data file or in multiple data files; the files grow, or new files are added, as records are inserted and the database increases in size. All of these files are shared through the database server, and in high-availability systems they are shared so that each node has access to the same data files.
Generally, all popular database systems are relational database management systems. To give quick and easy access to related data, logical views can be created from the actual tables. Every table has a physical existence in the database, whereas a view is a virtual table: it does not exist physically but is a logical creation over existing physical tables. IBM DB2, Microsoft SQL Server, Sybase, Oracle, MySQL and PostgreSQL are some examples of RDBMS products. The sketch below ties these ideas together with a pair of normalized tables, a primary/foreign key, a join and a view.
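The following is a minimal Java/JDBC sketch rather than anything taken from a particular vendor's manual: the customers and orders tables, the connection URL and the credentials are hypothetical placeholders you would replace with your own. It creates two normalized tables linked by a primary/foreign key, defines a view over their join, and retrieves the related data with an SQL query.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class RdbmsJoinExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection string; substitute your own RDBMS URL, user and password.
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/shop", "user", "password");
        try (Statement stmt = conn.createStatement()) {
            // Two normalized tables: customer_id in orders is a foreign key
            // referencing the primary key of customers.
            stmt.executeUpdate("CREATE TABLE customers ("
                    + "customer_id INT PRIMARY KEY, "
                    + "name VARCHAR(100))");
            stmt.executeUpdate("CREATE TABLE orders ("
                    + "order_id INT PRIMARY KEY, "
                    + "customer_id INT, "
                    + "amount DECIMAL(10,2), "
                    + "FOREIGN KEY (customer_id) REFERENCES customers(customer_id))");

            stmt.executeUpdate("INSERT INTO customers VALUES (1, 'Alice')");
            stmt.executeUpdate("INSERT INTO orders VALUES (100, 1, 25.50)");

            // A view is a virtual table: it stores no data of its own,
            // only the join logic over the physical tables.
            stmt.executeUpdate("CREATE VIEW customer_orders AS "
                    + "SELECT c.name, o.order_id, o.amount "
                    + "FROM customers c JOIN orders o ON c.customer_id = o.customer_id");

            // Related data is retrieved by querying the view (or the join directly).
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT name, order_id, amount FROM customer_orders")) {
                while (rs.next()) {
                    System.out.printf("%s ordered #%d for %s%n",
                            rs.getString("name"), rs.getInt("order_id"),
                            rs.getBigDecimal("amount"));
                }
            }
        } finally {
            conn.close();
        }
    }
}

Any JDBC-compliant RDBMS should run a sketch like this; only the connection string and the driver on the classpath would change.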
However, an RDBMS works well only when the entity-relationship model (ER model) is defined properly; otherwise, the database schema or structure can grow unmanaged. In short, an RDBMS works best with structured data.
How is Hadoop better than an RDBMS?
RDBMS technology is proven, consistent, mature and well supported by the best companies. It works well only when the data has clear definitions such as data types, relationships among the data and constraints, which is why it is more appropriate for real-time processing. As already mentioned, an RDBMS works well only when the ER model is defined properly; otherwise, the database schema or structure can grow unmanaged. In particular, when the data size is too large for complex processing, or when it is not easy to define relationships between the data, it becomes very difficult to store the extracted information in an RDBMS with any coherent relationship. When it comes to large, unstructured data, Hadoop is the right choice.
The Hadoop framework fits any kind of data, whether structured, semi-structured or unstructured, and it supports a variety of data formats such as XML, JSON and text-based flat files. Consider, for example, analyzing the Internet data published by various websites. Among the vast number of existing websites, each one has different types of content, and the relationships between them are not uniform. In such cases, Hadoop is the right choice for the analysis. As exposure to these capabilities grows, companies today are choosing Hadoop not only to handle the huge amounts of data they have accumulated over time, but also to meet the high-performance needs of their applications.
For example, consider analyzing a customer's monthly energy usage by comparing it against previous months, against their neighbors or even against their friends. Such comparisons may raise awareness, but running this kind of complex analysis over a large data set can take many hours of processing time; introducing Hadoop can improve computing performance by a factor of 10 to 100 or even more.
In late 1999, eBay scaled out across a cluster by logically partitioning its databases into user data, item data and purchase data. However, this SQL approach did not scale far enough for eBay, and they have since moved their item catalogue to HBase. Facebook paired complex sharding and caching with MySQL, splitting its MySQL database into roughly 4,000 shards and running about 9,000 memcached instances to handle the site's massive data volume. This became very difficult to maintain and scale, and Facebook has now moved its messaging platform to HBase. There are several videos on the Internet about this. We will discuss HBase in more detail in a coming chapter.
                              MapReduce       RDBMS
1. Size of data               Petabytes       Gigabytes
2. Integrity of data          Low             High
3. Data schema                Dynamic         Static
4. Access method              Batch           Interactive and batch
5. Scaling                    Linear          Nonlinear
6. Data structure             Unstructured    Structured
7. Normalization of data      Not required    Required
Grid computing:
Most people are familiar with the concept of a power grid,
where various sources of electricity are linked together to supply power to a
certain geographical location. The concept of grid computing is very similar,
where computers are linked together in a grid to provide a greater
computational resource.
Grid computing is an arrangement of computers, connected by
a network, where unused processing power on all the machines is harnessed to
complete tasks more efficiently. Tasks are distributed amongst the machines,
and the results are collected to form a conclusion. The advantage of grid
computing is that it reduces the time taken to complete tasks, without
increasing costs.
Computers on a grid are not necessarily in the same
geographical location, and can be spread out over multiple countries and
organizations, or even belong to individuals.
What is Grid Computing in depth?
These days, computers have great processing power, even the lowliest of machines. During an average working day, most of this computational potential lies unused. In a grid computing environment, computers are linked together so that a task on one machine can draw on the unused processing power of another machine and execute faster. This arrangement minimizes wasted resources and increases efficiency considerably, as a task split over multiple machines takes significantly less time to compute.
Serial Computing vs. Parallel Computing
Each processor uses a queue to schedule its work; many scheduling algorithms exist, but in essence there is a task queue from which tasks are executed. A single processor can handle only one task at a time, and as a result software has traditionally been written to execute each task sequentially. For example, if task ABC needs to be executed before task XYZ, the programmer has to ensure that this order is maintained in the program. This is known as serial computing.
Even though sequential techniques play an important role in computing, certain tasks are independent of one another and can be performed simultaneously. For example, if two tasks can be performed independently of each other, they can be assigned to two different machines. Each machine then performs its assigned task independently, producing the results substantially faster than one machine performing both tasks one after the other. This processing is known as parallel computing, as the sketch below illustrates.
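As a rough illustration (not tied to any particular grid product), the Java sketch below runs the same two independent tasks first one after the other and then in parallel on a two-thread pool; the worker threads stand in for the separate machines of a grid, and the number ranges being summed are made-up values.

import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelTasksExample {

    // A stand-in for some independent unit of work, e.g. summing a chunk of numbers.
    static long sumRange(long from, long to) {
        long sum = 0;
        for (long i = from; i < to; i++) {
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) throws Exception {
        // Serial computing: the two independent tasks run one after the other.
        long serial = sumRange(0, 500_000_000L) + sumRange(500_000_000L, 1_000_000_000L);

        // Parallel computing: the same two independent tasks are handed to
        // two worker threads (on a grid, these would be two machines).
        ExecutorService pool = Executors.newFixedThreadPool(2);
        List<Callable<Long>> tasks = List.of(
                () -> sumRange(0, 500_000_000L),
                () -> sumRange(500_000_000L, 1_000_000_000L));
        long parallel = 0;
        for (Future<Long> f : pool.invokeAll(tasks)) {
            parallel += f.get(); // collect the partial results to form a conclusion
        }
        pool.shutdown();

        System.out.println("serial result   = " + serial);
        System.out.println("parallel result = " + parallel);
    }
}

Both runs produce the same total, but on a multi-core machine the parallel run finishes in roughly half the time, which is exactly the effect a grid achieves across separate machines.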
Hadoop is also inspired by this parallel programming concept and is a mechanism for processing large amounts of data from various sources, for example web clickstream data, social network logs, and so on. This data is so large that it must be distributed across multiple machines in the cluster in order to be processed within a reasonable time. This distribution across the cluster implies parallel computing over the different pieces of the dataset held on the different machines. Those pieces do not depend on each other during execution, since the dataset is spread across the machines in the cluster and each machine starts processing its own portion in parallel. Here, MapReduce is an abstraction that allows engineers to express simple computations while hiding the details of parallelization, data distribution, load balancing and fault tolerance.
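The classic illustration of this abstraction is the word-count job, sketched below against Hadoop's Java MapReduce API; the input and output paths come from the command line and would normally point at HDFS directories. The map and reduce methods contain only the simple computation, while the framework takes care of splitting the input, scheduling the pieces across the cluster, shuffling intermediate results and re-running failed tasks.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // The map step runs in parallel on whichever nodes hold the input splits:
    // it turns each line of text into (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // The reduce step receives all counts for one word and sums them;
    // shuffling, distribution and fault tolerance are handled by the framework.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. an HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, it would be launched with something like hadoop jar wordcount.jar WordCount /input /output (the paths here are only examples), and it emits one line per distinct word with its total count.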