Friday 21 April 2017

Hive joins

The Hive Query Language is a query language for Hive to process, analyze the structured data in a Hadoop Eco system. We have seen creation of database and table operation of Hive, also we have seen creation of table from the text file in HDFS. This blog will cover some basic data manipulation operations.

Select clauses:


SELECT statement is used to retrieve the data from the hive table.

Syntax:
select * from table_name;

hive> select * from hadoopd_tbl limit 5;
OK
88-124          2017-02-25      69.0
13-92           2017-02-25      9.0
13-85           2017-02-25      35.0
08-76           2017-02-25      3.0
01322           2017-02-25      10.0
Time taken: 0.356 seconds, Fetched: 5 row(s)

Syntax:
Select column_name from table_name;

hive> select item_id from hadoopd_tbl limit 5;
OK
88-124
13-92
13-85
08-76
01322
Time taken: 0.324 seconds, Fetched: 5 row(s)
hive>

Entered Limit clause just to see only 5 rows in the screen.

To see the Row count from the table.

Syntax:
Select count(*) from Table_Name;
Select count(column_Name) from Table_Name;

Count(*) will be taking us to a Map Reduce job. Also, when applying any function the query will be run as a MapReduce job to get the advantage of distributed computing.

hive> select count(*) from hadoopd_tbl ;
Query ID = hive_20170413065454_39ce59cd-7e32-4781-b5a0-efa68f1d4c75
Total jobs = 1
Launching Job 1 out of 1
Tez session was closed. Reopening...
Session re-established.


Status: Running (Executing on YARN cluster with App id application_1492062882009_0003)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 23.87 s
--------------------------------------------------------------------------------
OK
19
Time taken: 53.579 seconds, Fetched: 1 row(s)
hive>

Where clauses:


WHERE clause works like a condition. It filters the data using the condition and gives you a finite result. The built-in operators and functions generate an expression, which satisfies this condition.

Syntax:
Select [column1,column2, … ] from [Table_name] where [condition] Group by [column1,column2, …] order by [column1,column2, …] ;

hive> select item_id,date,count(*) from hadoopd_tbl where date > 2017-02-24 group by item_id,date order by date;
Query ID = hive_20170413093434_ea787a5f-326b-4503-bbb8-fd5b5c5e651c
Total jobs = 1
Launching Job 1 out of 1


Status: Running (Executing on YARN cluster with App id application_1492062882009_0005)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
Reducer 3 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 03/03  [==========================>>] 100%  ELAPSED TIME: 18.72 s
--------------------------------------------------------------------------------
OK
88-124          2017-02-25      1
7-50T           2017-02-25      1
641             2017-02-25      2
13-85           2017-02-25      2
08-76           2017-02-25      1
0474            2017-02-25      1
01322           2017-02-25      2
Time taken: 20.475 seconds, Fetched: 7 row(s)
hive>

Joins:


Join is the function used to combine 2 tables values to a single table.

Lets join the 2 tables here.

EMPLOYEE_part:

ID
Name
Salary
Country
16
john
4000
USA
17
robert
2000
USA
18
andrew
4000
USA
19
katty
2000
USA
27
edward
4000
UK
29
alan
3000
UK
31
kerry
4000
UK
34
tom
3000
UK
35
zack
2000
UK

Employee:

ID
Name
16
john
17
robert
18
andrew
19
katty
21
tom
22
tim
23
james
24
paul
27
edward
29
alan
31
kerry
34
terri

Join


JOIN operation is used to combine and retrieve the records from multiple tables.

Example:
SELECT b.ID,a.Name,a.Salary,a.country from EMPLOYEE_part as a JOIN EMPLOYEE as b ON (a.ID = b.ID);

ID
Name
Salary
Country
16
john
4000
USA
17
robert
2000
USA
18
andrew
4000
USA
19
katty
2000
USA
27
edward
4000
UK
29
alan
3000
UK
31
kerry
4000
UK
34
tom
3000
UK


Left outer join



LEFT OUTER JOIN will return all the rows from the left table, even though there are no matches in the right-side table. The ON condition matches no record in the right-side table, this JOIN will return a row in the result, but with NULL value in the columns of the right table.

Example:
SELECT b.ID,a.Name,a.Salary,a.country from EMPLOYEE_part as a LEFT OUTER JOIN EMPLOYEE as b ON (a.ID = b.ID);

ID
Name
Salary
Country
16
john
4000
USA
17
robert
2000
USA
18
andrew
4000
USA
19
katty
2000
USA
27
edward
4000
UK
29
alan
3000
UK
31
kerry
4000
UK
34
tom
3000
UK
NULL
zack
2000
UK


Right outer join


RIGHT OUTER JOIN will return all the rows from the RIGHT table, even though there are no matches in the left table. The ON condition matches no record in the left-side table, this JOIN will return a row in the result, but with NULL value in the columns of the left table.

Example:
SELECT b.ID,a.Name,a.Salary,a.country from EMPLOYEE_part as a RIGHT OUTER JOIN EMPLOYEE as b ON (a.ID = b.ID);

ID
Name
Salary
Country
16
john
4000
USA
17
robert
2000
USA
18
andrew
4000
USA
19
katty
2000
USA
21
NULL
NULL
NULL
22
NULL
NULL
NULL
23
NULL
NULL
NULL
24
NULL
NULL
NULL
27
edward
4000
UK
29
alan
3000
UK
31
kerry
4000
UK
34
tom
3000
UK


Full outer join


FULL OUTER JOIN combines all the records of both the left and the right tables.  The joined table contains all the records from both the tables and fills in NULL values for missing matched columns on both the side.

Example:
SELECT b.ID,a.Name,a.Salary,a.country from EMPLOYEE_part as a FULL OUTER JOIN EMPLOYEE as b ON (a.ID = b.ID);


ID
Name
Salary
Country
16
john
4000
USA
17
robert
2000
USA
18
andrew
4000
USA
19
katty
2000
USA
21
NULL
NULL
NULL
22
NULL
NULL
NULL
23
NULL
NULL
NULL
24
NULL
NULL
NULL
27
edward
4000
UK
29
alan
3000
UK
31
kerry
4000
UK
34
tom
3000
UK
NULL
zack
2000
UK