Monday 10 April 2017

Hive - Basics, Create and update tables in Hive, Data types available in Hive

Hive provides a SQL-like query language for querying and analyzing data on the Hadoop Distributed File System (HDFS). Hadoop originally had a bottleneck in that it could be used only by people who knew Java, so Facebook came up with a SQL-like language that lets database engineers also get the benefit of Hadoop.

So Hive, originally developed at Facebook, is a data warehouse built on top of Hadoop that provides a SQL-like query language called HiveQL for querying, summarizing and analyzing data from the Hadoop ecosystem. Hive also makes querying faster through indexing, which helps database engineers work with the Hadoop Distributed File System.


To put it more simply, Hive is a data warehouse application built on top of the Hadoop Distributed File System, and it can interact with many parts of HDFS. Because Hive offers a SQL-like language it is easy for analysts and data engineers to use, while internally each query is converted into a MapReduce program, so it gets the major advantages of distributed computing and parallel processing.

Getting started with Hive:


Log in to the Hive shell


[hive@sandbox ~]$ hive

Logging initialized using configuration in file:/etc/hive/conf/hive-log4j.properties
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/2.2.4.2-2/hadoop/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.2.4.2-2/hive/lib/hive-jdbc-0.14.0.2.2.4.2-2-standalone.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
hive>

Show databases:


hive> show databases;
OK
default
ipl
xademo
Time taken: 0.306 seconds, Fetched: 3 row(s)
hive>


Create a database:


hive> create database hadoopd_demo;
OK
Time taken: 0.658 seconds
hive>

Use a database:


hive> use hadoopd_demo;
OK
Time taken: 0.429 seconds
hive>

Show tables:


hive> show databases;
OK
default
hadoopd_demo
ipl
xademo
Time taken: 0.163 seconds, Fetched: 4 row(s)

hive> use hadoopd_demo;
OK
Time taken: 0.432 seconds
hive> show tables;
OK
Time taken: 0.582 seconds

hive> use default;
OK
Time taken: 0.366 seconds

hive> show tables;
OK
hadoopd_tbl
sample_07
sample_08
Time taken: 0.196 seconds, Fetched: 3 row(s)
hive>


Create tables

The table creation syntax is shown below. Each query must be terminated with a semicolon (;).

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name1 data_type [COMMENT col_comment],
col_name2 data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[ROW FORMAT row_format]
[STORED AS file_format];


Copy the data to HDFS and create the table based on that data.


[hive@sandbox tmp]$ head online.txt
88-124          2017-02-25      69.0
13-92           2017-02-25      9.0
13-85           2017-02-25      35.0
08-76           2017-02-25      3.0
01322           2017-02-25      10.0
10539           2017-02-25      050.0
0474            2017-02-25      78.0
L1              2017-02-25      30.0
7-50T           2017-02-25      57.0
641             2017-02-25      00.0
[hive@sandbox tmp]$ hdfs dfs -put online.txt /tmp/hadoopd/

Creating the Hive table:


hive> CREATE EXTERNAL TABLE IF NOT EXISTS hadoopd_tbl (
    > item_id STRING,
    > date date,
    > clicks float
    > )
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    > LOCATION '/tmp/hadoopd/';
OK
Time taken: 0.488 seconds
hive>

Run SHOW TABLES after the table creation to make sure the table was created:


hive> show tables;
OK
hadoopd_tbl
sample_07
sample_08
Time taken: 0.192 seconds, Fetched: 3 row(s)
hive>
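
Once the external table points at the HDFS directory holding the data, it can be queried with ordinary HiveQL, and Hive converts the query into a MapReduce job behind the scenes. A minimal sketch, assuming online.txt from above was placed in the table's location:

SELECT item_id, clicks
FROM hadoopd_tbl
WHERE clicks > 50.0;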


Data types:


There are numerous data types available to define columns based on the data. Hive keywords and data type names are case insensitive.

INT:

Integer data can be specified using INT. When the values exceed the INT range, use BIGINT; if the range is small, use SMALLINT. TINYINT is also available for data that is even smaller than SMALLINT.

Syntax: Column_Name  INT
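
For example, a hypothetical table (the table and column names below are only for illustration) that uses each of the integer types:

CREATE TABLE int_types_demo (
  tiny_val   TINYINT,   -- 1-byte integer
  small_val  SMALLINT,  -- 2-byte integer
  normal_val INT,       -- 4-byte integer
  big_val    BIGINT     -- 8-byte integer
);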

STRING:

String literals can be specified with single quotes (' ') or double quotes (" "). Alongside STRING, Hive also provides the related VARCHAR and CHAR types.

Syntax: Column_Name  STRING
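
A small sketch with a hypothetical table showing STRING alongside the related fixed- and variable-length types (available in recent Hive versions):

CREATE TABLE string_types_demo (
  product_code CHAR(10),      -- fixed length, padded to 10 characters
  product_name VARCHAR(100),  -- variable length, maximum 100 characters
  description  STRING         -- unbounded variable-length text
);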

Timestamp:


It supports the traditional UNIX timestamp with optional nanosecond precision, as well as the java.sql.Timestamp format "yyyy-mm-dd hh:mm:ss.fffffffff".

Syntax: Column_Name  TIMESTAMP
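
For instance, a hypothetical table with a TIMESTAMP column and a filter on it (the names here are assumptions, not taken from the example above):

CREATE TABLE click_events (
  item_id    STRING,
  event_time TIMESTAMP
);

SELECT item_id
FROM click_events
WHERE event_time >= '2017-02-25 00:00:00';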

Dates:

DATE values are described in the form year-month-day (YYYY-MM-DD).

Syntax: Column_Name  DATE

Decimals:

The DECIMAL type is based on Java's BigDecimal and is used for representing immutable arbitrary-precision decimal numbers.
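
When declaring a DECIMAL column, a precision and scale can also be given (Hive 0.13 and later); a hypothetical example:

CREATE TABLE payments_demo (
  order_id STRING,
  amount   DECIMAL(10,2)  -- up to 10 digits in total, 2 of them after the decimal point
);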

Floating Point Types:

Floating point types hold numbers with decimal point values. FLOAT is a single-precision type, and DOUBLE is a double-precision type with a wider range.

Syntax: Column_Name  FLOAT

Decimal Type:

Decimal type data holds exact numeric values with higher precision than the DOUBLE data type; in Hive the precision can go up to 38 digits.

Null Value:

Missing values are represented as the special value NULL.

Arrays:

Arrays in Hive are similar to arrays in Java.

Syntax: ARRAY  <data_type>
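
A minimal sketch with a hypothetical table, showing how an ARRAY column can be declared and how its items are delimited in the underlying text file:

CREATE TABLE array_demo (
  user_id       STRING,
  visited_pages ARRAY<STRING>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ',';

Individual elements are accessed by index, for example visited_pages[0].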

Maps:

Maps in Hive are similar to Java Maps.

Syntax: MAP  <primitive_type, data_type>
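
A similar hypothetical sketch for a MAP column, where the key-value pairs in the data file are separated with ':':

CREATE TABLE map_demo (
  user_id STRING,
  scores  MAP<STRING, INT>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':';

A value is looked up by its key, for example scores['math'].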

Structs:

Structs group a set of named fields of possibly different types into one complex value, and each field can carry an optional comment.

Syntax: STRUCT <col_name : data_type [COMMENT col_comment], ...>
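
A hypothetical sketch showing a STRUCT column; its fields are addressed with dot notation, for example address.city:

CREATE TABLE struct_demo (
  name    STRING,
  address STRUCT<street:STRING, city:STRING, zip:INT>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ',';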

Alter the table:

The ALTER TABLE statement is used to modify an existing table in Hive.

Rename the table Name:

Renaming a table in Hive.

Syntax:
ALTER TABLE name RENAME TO new_name;

Example:
hive> ALTER TABLE hadoopd_tbl RENAME to hadoopd_demo;
OK
Time taken: 0.884 seconds
hive>

Add the column to the table:

Adding additional columns to the table.

Syntax:
ALTER TABLE name ADD COLUMNS (col_spec[, col_spec ...])

Example:
hive> alter table hadoopd_demo add columns ( Customer_ID STRING );
OK
Time taken: 0.757 seconds
hive> describe hadoopd_demo;
OK
item_id                 string
date                    date
clicks                  float
customer_id             string
Time taken: 0.935 seconds, Fetched: 4 row(s)
hive>

Alter the data type:

Changing the data type of a column.

Syntax:
ALTER TABLE name CHANGE column_name new_name new_type;

Example:
hive> describe hadoopd_demo;
OK
item_id                 string
date                    date
clicks                  float
customer_id             string
Time taken: 0.857 seconds, Fetched: 4 row(s)

hive> ALTER TABLE hadoopd_demo CHANGE customer_id customer_id INT AFTER clicks;
OK
Time taken: 0.683 seconds

hive> describe hadoopd_demo;
OK
item_id                 string
date                    date
clicks                  float
customer_id             int
Time taken: 0.758 seconds, Fetched: 4 row(s)
hive>

Alter the table with columns and its datatypes:

Replacing the table's columns and their data types.

Syntax:
ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])

Example:
hive> ALTER TABLE hadoopd_demo REPLACE COLUMNS( item_id STRING, date date, clicks float);
OK
Time taken: 0.632 seconds

hive> describe hadoopd_demo;
OK
item_id                 string
date                    date
clicks                  float
Time taken: 0.806 seconds, Fetched: 3 row(s)
hive>

Drop the table and database


Drop table:

Dropping a table from the database.

Syntax:
drop table table_Name;

Example:
hive> drop table hadoopd_demo;
OK
Time taken: 0.817 seconds
hive>

Drop database:

Dropping a database from Hive.

Syntax:
Drop database database_Name;

Example:
hive> show databases;
OK
default
hadoopd_demo
ipl
xademo
Time taken: 0.091 seconds, Fetched: 4 row(s)

hive> drop database hadoopd_demo;
OK
Time taken: 0.606 seconds

hive> show databases;
OK
default
ipl
xademo
Time taken: 0.058 seconds, Fetched: 3 row(s)
hive>
