Tuesday 28 March 2017

Comparing with other systems

Before Hadoop came into the picture, several database techniques and server mechanisms were used to solve these problems, such as the existing RDBMS systems, grid computing and a few others.

A typical RDBMS, or relational database management system, is the master of real-time queries over structured data, which makes it ideal for online transaction processing (OLTP). Businesses of all stripes rely on this kind of functionality to transact important business. RDBMS products are backed by the biggest companies in the software industry and are used by nearly every mid- to large-sized IT organization in the world. But analyzing unstructured data in an RDBMS is like mixing oil into water, and when huge volumes of unstructured data have to be processed, that is where Hadoop comes into the picture. Hadoop is purpose-built to handle enormous volumes of unstructured data. RDBMS products usually run on costly proprietary servers, whereas Hadoop runs on commodity hardware and splits each job across many nodes, making it relatively fault-tolerant.

 

What is an RDBMS in depth?


RDBMS stands for relational database management system. A database management system (DBMS) stores data in the form of tables, which are made up of columns and rows. Structured Query Language (SQL) is used to extract the necessary data stored in these tables. An RDBMS also stores the relationships between these tables; for example, the entries in one column of a table can serve as a reference to another table. These column values are known as primary keys and foreign keys. The keys are used to reference other tables so that related data can be retrieved by joining the different tables with SQL queries as needed. The tables and their relationships can be manipulated by joining the appropriate tables through SQL queries.
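As a minimal sketch of this idea, the JDBC snippet below joins two hypothetical tables, customers (primary key customer_id) and orders (foreign key customer_id). The table names, columns, connection URL and credentials are illustrative assumptions, not part of any real schema.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class JoinExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection URL and credentials -- replace with your own RDBMS.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/shop", "user", "password")) {

            // Join the two tables on the primary key / foreign key relationship.
            String sql = "SELECT c.name, o.order_date, o.amount "
                       + "FROM customers c "
                       + "JOIN orders o ON o.customer_id = c.customer_id "
                       + "WHERE c.customer_id = ?";

            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setInt(1, 42);                      // look up a single customer
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("name") + " | "
                                + rs.getDate("order_date") + " | "
                                + rs.getBigDecimal("amount"));
                    }
                }
            }
        }
    }
}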

The most important attribute of a relational database system is that a single database generally has several tables, with relationships between them, so that the information is classified into tables of independent entities. These tables are stored independently in a normalized or simplified way, and the relationships between them are maintained using primary/foreign key constraints. This is different from a flat file or flat data structure. The data in a database can be stored in a single data file or in multiple data files. The data files grow, or new data files are added, as new records are inserted and the size of the database increases. All of these files are commonly shared by the database server; in high-availability systems the data files are shared so that each node has access to the same data. Generally, all popular database systems are relational database management systems. In order to give quick and easy navigation to related data, logical views can be created from the actual tables. Every table has a physical existence in the database, whereas a view is a virtual table: it does not exist physically but is a logical creation over existing physical tables. IBM DB2, Microsoft SQL Server, Sybase, Oracle, MySQL and PostgreSQL are some examples of RDBMS products.
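To make the table-versus-view distinction concrete, here is a small hedged sketch: the monthly_sales view below is a hypothetical example built on the orders table assumed earlier, and the statements use MySQL-flavoured syntax purely for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ViewExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/shop", "user", "password");
             Statement stmt = conn.createStatement()) {

            // The view is purely logical: no new data is stored, only its definition.
            stmt.execute("CREATE OR REPLACE VIEW monthly_sales AS "
                       + "SELECT customer_id, DATE_FORMAT(order_date, '%Y-%m') AS month, "
                       + "       SUM(amount) AS total "
                       + "FROM orders GROUP BY customer_id, month");

            // The view is then queried exactly like a physical table.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT * FROM monthly_sales WHERE customer_id = 42")) {
                while (rs.next()) {
                    System.out.println(rs.getString("month") + " -> "
                            + rs.getBigDecimal("total"));
                }
            }
        }
    }
}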

However, an RDBMS only works well when the entity relationship model (ER model) is defined properly; otherwise the database schema or structure can grow unmanaged. In short, an RDBMS works well with structured data.


How is Hadoop better than RDBMS?



RDBMS technology is proven, consistent, mature and well supported by the best companies. It works best only when the data has clear definitions such as data types, relationships among the data, constraints and so on, which is why it is more appropriate for real-time processing. As already mentioned, an RDBMS works well only when the ER model is defined properly; otherwise the database schema or structure can grow unmanaged. In particular, where the data size is too large for complex processing, or where it is not easy to define relationships between the data, it becomes very difficult to store the extracted information in an RDBMS with any coherent relationship. When it comes to large, unstructured data, Hadoop is the right choice.

The Hadoop framework fits any kind of data, whether structured, semi-structured or unstructured, and it supports a variety of data formats such as XML, JSON and text-based flat files. Consider, for example, analyzing the Internet data published by various websites: among the millions of existing websites, each one has different types of content, and the relationships between them are not uniform. In such cases Hadoop is the right choice for the analysis. As exposure to these capabilities increases, companies today are choosing Hadoop not only to handle the huge amounts of historically grown data, but also to meet the high-performance needs of their applications. For example, consider analyzing a customer's monthly energy usage by comparing it against previous months, against their neighbors or even against their friends. This can bring useful insight, but running such a complex comparison over a large data set can take many hours of processing time; introducing Hadoop can improve the computing performance by a factor of 10 to 100 or even more.
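As an illustrative sketch only (not a production job), the mapper and reducer below total up monthly energy usage per customer from hypothetical comma-separated text records of the form customerId,month,kWh. The record layout and class names are assumptions made for this example.

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: one input line "customerId,month,kWh" -> key "customerId:month", value kWh.
public class EnergyUsageMapper
        extends Mapper<LongWritable, Text, Text, DoubleWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        if (fields.length == 3) {                      // skip malformed lines
            String customerAndMonth = fields[0] + ":" + fields[1];
            double kwh = Double.parseDouble(fields[2]);
            context.write(new Text(customerAndMonth), new DoubleWritable(kwh));
        }
    }
}

// Reducer: sums all readings for a customer/month pair.
// Declared package-private here only to keep the sketch in a single file.
class EnergyUsageReducer
        extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {

    @Override
    protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {
        double total = 0.0;
        for (DoubleWritable v : values) {
            total += v.get();
        }
        context.write(key, new DoubleWritable(total));
    }
}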

In late 1999, eBay scaled out across a cluster by logically partitioning its databases for user data, item data and purchase data. However, this SQL option did not scale well enough for eBay, and they have since moved their item catalogue to HBase. Facebook paired complex sharding and caching with MySQL: it split its MySQL database into roughly 4,000 shards and ran about 9,000 memcached instances in order to handle the site's massive data volume. This became very difficult to maintain and scale, and Facebook has now moved its messaging to HBase. There are several talks about this on the internet. We will discuss HBase in more detail in a coming chapter.


                                          MapReduce                    RDBMS
1.  Size of data                          Petabytes                    Gigabytes
2.  Integrity of data                     Low                          High
3.  Data schema                           Dynamic                      Static
4.  Access method                         Batch                        Interactive and batch
5.  Scaling                               Linear                       Nonlinear
6.  Data structure                        Unstructured                 Structured
7.  Normalization of data                 Not required                 Required


Grid computing:


Most people are familiar with the concept of a power grid, where various sources of electricity are linked together to supply power to a certain geographical location. The concept of grid computing is very similar, where computers are linked together in a grid to provide a greater computational resource.
Grid computing is an arrangement of computers, connected by a network, where unused processing power on all the machines is harnessed to complete tasks more efficiently. Tasks are distributed amongst the machines, and the results are collected to form a conclusion. The advantage of grid computing is that it reduces the time taken to complete tasks, without increasing costs.
Computers on a grid are not necessarily in the same geographical location, and can be spread out over multiple countries and organizations, or even belong to individuals. 

What is Grid Computing in depth?


These days computers have great processing power, even the lowliest of machines. During an average working day, most of this computational potential lies unused. In a grid computing environment, computers are linked together so that a task on one machine can utilize the unused processing power of another machine and execute faster. This arrangement minimizes wasted resources and increases efficiency considerably, as a task split over multiple machines takes significantly less time to compute.


Serial Computing vs. Parallel Computing


Each processor uses a queue system to execute tasks. Many scheduling algorithms exist, but in essence there is a task queue from which tasks are performed. A single processor can handle only one task at a time, and as a result software has traditionally been written to execute each task sequentially. For example, if task ABC needs to be executed before task XYZ, the programmer has to ensure that this order is maintained in the program; this is known as serial computing.
Even though sequential techniques play an important role in computing, certain tasks are independent of each other and can be performed simultaneously. For example, if two tasks can be performed independently, they can be assigned to two different machines. Each machine then performs its assigned task independently, producing the results substantially faster than one machine performing both tasks one after the other. This processing is known as parallel computing.
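As a small, hedged illustration of the difference (on a single machine rather than a grid), the sketch below runs two independent placeholder tasks first one after the other and then in parallel using Java's ExecutorService; the task bodies stand in for real work.

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SerialVsParallel {

    // Two independent placeholder tasks that just burn some CPU time.
    static long taskA() { long s = 0; for (int i = 0; i < 200_000_000; i++) s += i; return s; }
    static long taskB() { long s = 0; for (int i = 0; i < 200_000_000; i++) s += i * 2L; return s; }

    public static void main(String[] args) throws Exception {
        // Serial computing: taskA must finish before taskB starts.
        long start = System.currentTimeMillis();
        taskA();
        taskB();
        System.out.println("Serial:   " + (System.currentTimeMillis() - start) + " ms");

        // Parallel computing: the two independent tasks run at the same time.
        ExecutorService pool = Executors.newFixedThreadPool(2);
        start = System.currentTimeMillis();
        Future<Long> fa = pool.submit((Callable<Long>) SerialVsParallel::taskA);
        Future<Long> fb = pool.submit((Callable<Long>) SerialVsParallel::taskB);
        fa.get();  // wait for both results
        fb.get();
        System.out.println("Parallel: " + (System.currentTimeMillis() - start) + " ms");
        pool.shutdown();
    }
}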
Hadoop is also inspired by this parallel programming concept and is a mechanism for processing large amounts of data from various sources, for example web clickstream data, social network logs and so on. This data is so large that it must be distributed across multiple machines in the cluster in order to be processed within a reasonable time. This distribution across the cluster implies parallel computing over the different portions of the dataset that reside on the different machines. The portions do not depend on each other during execution, so each machine can start processing its own portion in parallel. Here, MapReduce is an abstraction that allows engineers to express simple computations while hiding the details of parallelization, data distribution, load balancing and fault tolerance.
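To show how little of that plumbing the engineer has to write, here is a hedged sketch of a job driver that wires together the EnergyUsageMapper and EnergyUsageReducer classes assumed in the earlier sketch; splitting the input, shipping tasks to nodes, shuffling intermediate data and retrying failures are all left to the framework. The class and path names are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class EnergyUsageJob {
    public static void main(String[] args) throws Exception {
        // Only the map and reduce logic is ours; Hadoop handles the distribution.
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "monthly energy usage");
        job.setJarByClass(EnergyUsageJob.class);

        job.setMapperClass(EnergyUsageMapper.class);
        job.setReducerClass(EnergyUsageReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. an HDFS output directory

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}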
