Saturday, March 14, 2020

Hive Partitions, Types of Hive Partitioning with Examples

1. Objective

This tutorial explains Hive partitions. It will help you answer what Hive partitioning is, why partitioning is needed, and how it improves performance. Partitioning is an optimization technique in Hive that can improve query performance significantly. Apache Hive is a data warehouse built on top of Hadoop that enables ad-hoc analysis over structured and semi-structured data. Let’s discuss Apache Hive partitioning in detail.
So, let’s start the Hive Partitions tutorial.
2. What are Hive Partitions?
Apache Hive organizes tables into partitions. Partitioning is a way of dividing a table into related parts based on the values of particular columns such as date, city, or department. Each table in Hive can have one or more partition keys that identify a particular partition. Using partitions, it is easy to run queries on slices of the data.
3. Why is Partitioning Important?
Today, huge amounts of data, in the range of petabytes, are stored in HDFS. Because of this sheer volume, it becomes very difficult for Hadoop users to query the data efficiently.
Hive was introduced to lower this burden of data querying. Apache Hive converts SQL queries into MapReduce jobs and submits them to the Hadoop cluster. Without partitioning, Hive reads the entire data-set for a query, so running MapReduce jobs over a large table becomes inefficient. This is resolved by creating partitions in tables. Apache Hive makes implementing partitions easy: you declare the partition scheme at table creation time and Hive maintains the partitions for you.
In the partitioning method, the table data is divided into multiple partitions. Each partition corresponds to specific value(s) of the partition column(s) and is kept as a sub-directory inside the table’s directory in HDFS. When such a table is queried with a filter on the partition column, only the partitions containing the requested value are read. This decreases the I/O required by the query and therefore increases performance.
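To picture this, here is a minimal sketch of how the partition sub-directories of a table partitioned by a year column might be laid out in HDFS, assuming the default warehouse path (the paths are illustrative only, not taken from a real cluster):
  /user/hive/warehouse/tab1/year=2009/
  /user/hive/warehouse/tab1/year=2010/
  /user/hive/warehouse/tab1/year=2011/
Each sub-directory follows the column=value naming convention, and a query filtering on year only reads the matching directory.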

4. How to Create Partitions in Hive?

To create a partitioned table in Hive, the following command is used -
CREATE TABLE table_name (column1 data_type, column2 data_type) PARTITIONED BY (partition1 data_type, partition2 data_type,….);
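For instance, a hypothetical logs table partitioned by both date and country (the table and column names here are made up for illustration, not taken from the examples below) could be declared as:
  CREATE TABLE logs (msg STRING, severity STRING) PARTITIONED BY (dt STRING, country STRING);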

5. Hive Data Partitioning Example

Now let’s understand data partitioning in Hive with an example. Consider a table named Tab1 that contains client details such as id, name, dept, and yoj (year of joining). Suppose we need to retrieve the details of all the clients who joined in 2010. Without partitioning, the query scans the whole table for the required information. But if we partition the client data by year and store it in separate files, the query processing time is reduced. The example below shows how the data is partitioned -
A file, say file1, contains the client data table:
  tab1/clientdata/file1
  id, name, dept, yoj
  1, sunny, SC, 2009
  2, animesh, HR, 2009
  3, sumeer, SC, 2010
  4, sarthak, TP, 2010

Now, let us partition the above data into two files based on year:
  tab1/clientdata/2009/file2
  1, sunny, SC, 2009
  2, animesh, HR, 2009

  tab1/clientdata/2010/file3
  3, sumeer, SC, 2010
  4, sarthak, TP, 2010

Now, when we retrieve data from the table, only the data of the specified partition is queried. Creating and loading the partitioned table looks as follows:
  CREATE TABLE table_tab1 (id INT, name STRING, dept STRING, yoj INT) PARTITIONED BY (year STRING);
  LOAD DATA LOCAL INPATH 'tab1/clientdata/2009/file2' OVERWRITE INTO TABLE table_tab1 PARTITION (year='2009');
  LOAD DATA LOCAL INPATH 'tab1/clientdata/2010/file3' OVERWRITE INTO TABLE table_tab1 PARTITION (year='2010');
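With the table loaded as above, a query that filters on the partition column reads only the matching sub-directory instead of the whole table. As a quick check on the same table_tab1, the following should touch only the year=2010 partition:
  SELECT * FROM table_tab1 WHERE year = '2010';
  -- List the partitions Hive knows about for this table.
  SHOW PARTITIONS table_tab1;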

6. Types of Hive Partitioning

So far we have covered the introduction to Hive partitions and how to create them. Now let's look at the types of data partitioning in Hive. There are two types of partitioning in Apache Hive-
  • Static Partitioning
  • Dynamic Partitioning
Let’s discuss these types of Hive Partitioning one by one-

i. Hive Static Partitioning

  • Inserting input data files individually into the partitions of a table is static partitioning.
  • Static partitions are usually preferred when loading big files into Hive tables.
  • Static partitioning saves time in loading data compared to dynamic partitioning.
  • You “statically” add a partition to the table and move the file into that partition of the table.
  • We can alter partitions when using static partitioning.
  • You can derive the partition column value from the file name, the day of the date, etc. without reading the whole big file.
  • If you want to use static partitioning in Hive, you should set the property hive.mapred.mode = strict; this property can be set in hive-site.xml.
  • Static partitioning works in strict mode.
  • In strict mode, a query on a statically partitioned table should use a WHERE clause that limits it to specific partitions.
  • You can perform static partitioning on both Hive managed tables and external tables (see the sketch after this list).
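A minimal sketch of a static-partition load, assuming a hypothetical sales table and a local file path that are not part of the original post:
  CREATE TABLE sales (id INT, amount DOUBLE) PARTITIONED BY (country STRING);
  -- The partition value 'IN' is supplied explicitly ("statically") by the user.
  LOAD DATA LOCAL INPATH '/data/sales_in.txt' INTO TABLE sales PARTITION (country='IN');
  -- A static partition can also be added or dropped explicitly.
  ALTER TABLE sales ADD PARTITION (country='US');
  ALTER TABLE sales DROP PARTITION (country='US');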

ii. Hive Dynamic Partitioning

  • A single INSERT that routes rows into partitions of a table based on the data is known as dynamic partitioning.
  • Usually, a dynamic-partition insert loads data from a non-partitioned table.
  • Dynamic partitioning takes more time to load data compared to static partitioning.
  • Dynamic partitioning is suitable when you have a large amount of data stored in a table.
  • Dynamic partitioning is also suitable when you want to partition on columns whose values are not known in advance.
  • With dynamic partitioning, no WHERE clause is required to limit the partitions.
  • We can’t perform ALTER on a dynamic partition.
  • You can perform dynamic partitioning on both Hive external and managed tables.
  • If you want to use dynamic partitioning in Hive, the partition mode must be non-strict.
  • The Hive dynamic partition properties you should enable are hive.exec.dynamic.partition=true and hive.exec.dynamic.partition.mode=nonstrict (see the sketch after this list).
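A minimal sketch of a dynamic-partition insert, reusing the sales table sketched above together with a hypothetical non-partitioned staging table raw_sales (not part of the original post):
  SET hive.exec.dynamic.partition=true;
  SET hive.exec.dynamic.partition.mode=nonstrict;
  -- The country value for each row is taken from the data itself;
  -- the dynamic partition column must come last in the SELECT list.
  INSERT INTO TABLE sales PARTITION (country)
  SELECT id, amount, country FROM raw_sales;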

7. Hive Partitioning – Advantages and Disadvantages

Let’s discuss some benefits and limitations of Apache Hive Partitioning-
a) Hive Partitioning Advantages
  • Partitioning in Hive distributes execution load horizontally.
  • Queries over a partition holding a low volume of data execute faster. For example, searching for the population of Vatican City returns very quickly compared to searching the entire world population.
  • There is no need to scan the entire table for a single record when the query filters on the partition column.
b) Hive Partitioning Disadvantages
  • There is the possibility of creating too many small partitions, i.e. too many directories.
  • Partitioning is effective for low-volume data, but some queries, such as a GROUP BY over a high volume of data, still take a long time to execute. For example, grouping the population of China will take much longer than grouping the population of Vatican City.
So, this was all about Hive partitions. We hope you liked the explanation.

8. Conclusion – Hive Partitions

Hopefully this blog helps you understand what exactly a partition in Hive is, what static partitioning is, and what dynamic partitioning is. We have also covered the various advantages and disadvantages of Hive partitioning. If you have any query related to Hive partitions, please leave a comment. We will be glad to help.

Friday, February 28, 2020

Hive Installation and Launch!

Hive installation on Windows 10





Hive Introduction


In my previous posts, we already went through an outline of Hadoop and HBase as well as their installation in a Windows environment. We learned that Hadoop can perform only batch processing, and data is accessed only in a sequential manner, which means the entire data-set has to be searched even for the simplest of jobs.

In such a scenario, processing a huge data-set produces another huge data-set, which must also be processed sequentially. At this point, a new solution is needed to access any point of the data in a single unit of time (random access). HBase can store massive amounts of data, from terabytes to petabytes, and allows fast random reads and writes that cannot be handled by Hadoop alone.

HBase is an open-source, non-relational (NoSQL), distributed, column-oriented database that runs on top of HDFS and provides real-time read/write access to those large data-sets. It is modeled after Google's Bigtable, is primarily written in Java, and is designed to provide quick random access to huge amounts of data.

Next in this series, we will walk through Apache Hive. Hive is a data warehouse infrastructure that works on the Hadoop Distributed File System and MapReduce to encapsulate Big Data and make querying and analyzing it stress-free. In effect, it is an ETL tool for the Hadoop ecosystem that enables developers to write Hive Query Language (HQL) statements very similar to SQL statements.

Hive Installation


In brief, Hive is a data warehouse software project built on top of Hadoop that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Before moving ahead, it is essential to install Hadoop first. I am assuming Hadoop is already installed; if not, go to my previous post on how to install Hadoop in a Windows environment.

I went through Hive (2.1.0) installation on top of Derby Metastore (10.12.1.1), though you can use any stable version.

Download Hive 2.1.0
  • https://archive.apache.org/dist/hive/hive-2.1.0/


Download Derby Metastore 10.12.1.1
  • https://archive.apache.org/dist/db/derby/db-derby-10.12.1.1/


Download hive-site.xml
  • https://mindtreeonline-my.sharepoint.com/:u:/g/personal/m1045767_mindtree_com1/EbsE-U5qCIhIpo8AmwuOzwUBstJ9odc6_QA733OId5qWOg?e=2X9cfX
  • https://drive.google.com/file/d/1qqAo7RQfr5Q6O-GTom6Rji3TdufP81zd/view?usp=sharing


STEP - 1: Extract the Hive file


Extract the file apache-hive-2.1.0-bin.tar.gz and place it under "D:\Hive"; you can use any preferred location – 

[1] You will get another tar file after the first extraction – 

Hive folder

[2] Go inside the apache-hive-2.1.0-bin.tar folder and extract it again – 

Extracted folder

[3] Copy the leaf folder “apache-hive-2.1.0-bin”, move it to the root folder "D:\Hive", and remove all other files and folders – 

Hive folder details

STEP - 2: Extract the Derby file


Similar to Hive, extract the file db-derby-10.12.1.1-bin.tar.gz and place it under "D:\Derby"; you can use any preferred location –

Derby folder
Derby folder details

STEP - 3: Moving hive-site.xml file


Drop the downloaded file “hive-site.xml” into the Hive configuration location “D:\Hive\apache-hive-2.1.0-bin\conf”. 

Hive-site.xml

STEP - 4: Moving Derby libraries


Next, we need to copy all the Derby libraries to the Hive library location – 
[1] Go to the lib folder under the Derby location, D:\Derby\db-derby-10.12.1.1-bin\lib.

Derby libraries

[2] Select all and copy all libraries.

[3] Go to the lib folder under the Hive location, D:\Hive\apache-hive-2.1.0-bin\lib.

Moved libraries

[4] Drop all selected libraries here.

Libraries



STEP - 5: Configure Environment variables


Set the following environment variables (user variables) on Windows 10 – 
  • HIVE_HOME - D:\Hive\apache-hive-2.1.0-bin
  • HIVE_BIN - D:\Hive\apache-hive-2.1.0-bin\bin
  • HIVE_LIB - D:\Hive\apache-hive-2.1.0-bin\lib
  • DERBY_HOME - D:\Derby\db-derby-10.12.1.1-bin
  • HADOOP_USER_CLASSPATH_FIRST - true


This PC - > Right Click - > Properties - > Advanced System Settings - > Advanced - > Environment Variables 

HIVE_HOME


HIVE_BIN

HIVE_LIB

DERBY_HOME

HADOOP_USER

STEP - 6: Configure System variables


Next, we need to set the system variables as well, including the Hive and Derby bin directory paths – 

HADOOP_USER_CLASSPATH_FIRST - true
Variable: Path 
Value: 
  1. D:\Hive\apache-hive-2.1.0-bin\bin
  2. D:\Derby\db-derby-10.12.1.1-bin\bin


System Variables

STEP - 7: Working with hive-site.xml


Now we need to cross-check the Hive configuration file for the Derby details – 
  • hive-site.xml


[1] Edit the file D:/Hive/apache-hive-2.1.0-bin/conf/hive-site.xml, paste the XML below into it, and save the file.

<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby://localhost:1527/metastore_db;create=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.ClientDriver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>hive.server2.enable.impersonation</name>
    <description>Enable user impersonation for HiveServer2</description>
    <value>true</value>
  </property>
  <property>
    <name>hive.server2.authentication</name>
    <value>NONE</value>
    <description>Client authentication types. NONE: no authentication check; LDAP: LDAP/AD based authentication; KERBEROS: Kerberos/GSSAPI authentication; CUSTOM: Custom authentication provider (use with property hive.server2.custom.authentication.class)</description>
  </property>
  <property>
    <name>datanucleus.autoCreateTables</name>
    <value>True</value>
  </property>
</configuration>

STEP - 8: Start the Hadoop


First, we need to start Hadoop -

Open a command prompt, change directory to “D:\Hadoop\hadoop-2.8.0\sbin", and type "start-all.cmd" to start Hadoop.

Start Hadoop



It will open four cmd instances for the following daemons – 
  • Hadoop Datanode
  • Hadoop Namenode
  • Yarn Nodemanager
  • Yarn Resourcemanager


 Hadoop Started



It can also be verified via the browser – 
  • Namenode (hdfs) - http://localhost:50070 
  • Datanode - http://localhost:50075
  • All Applications (cluster) - http://localhost:8088 etc.


Hadoop In Browser

Since the ‘start-all.cmd’ command has been deprecated, you can use the commands below instead, in this order - 
  • “start-dfs.cmd” and 
  • “start-yarn.cmd”


STEP - 9: Start Derby server


After Hadoop has started successfully, change directory to “D:\Derby\db-derby-10.12.1.1-bin\bin” and type “startNetworkServer -h 0.0.0.0” to start the Derby server.

Start Derby

Derby Started

STEP - 10: Start the Hive


The Derby server has been started and is ready to accept connections, so open a new command prompt with administrator privileges and move to the Hive directory “D:\Hive\apache-hive-2.1.0-bin\bin” – 

[1] Type “jps -m” to check that NetworkServerControl is running.

Validate server

[2] Type “hive” to start the Hive shell.

Start Hive

Hive Started

Congratulations, Hive is installed!! 😊

STEP - 11: Some hands-on activities


[1] Create Database in Hive - 
CREATE DATABASE IF NOT EXISTS TRAINING;

Create database

[2] Show Database - 
SHOW DATABASES;

Show Databases

[3] Creating Hive Tables - 
CREATE TABLE IF NOT EXISTS testhive(col1 char(10), col2 char(20));

Create table

Create table

[4] DESCRIBE Table Command in Hive - 
Describe Students

describe students

[5] Usage of the LOAD Command for Inserting Data Into Hive Tables - 
Create a sample text file using the ‘|’ delimiter; a sketch of the corresponding commands is given after the screenshots – 

text file

Load data
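Since the exact statements are only visible in the screenshots, here is a hedged sketch of what this step might look like, assuming a students table with made-up columns and a local file D:/students.txt that are not taken from the post:

CREATE TABLE IF NOT EXISTS students (id INT, name STRING, marks INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;

-- Load the '|'-delimited file from the local file system into the table.
LOAD DATA LOCAL INPATH 'D:/students.txt' INTO TABLE students;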

[6] Hive Select Data from Table - 
SELECT * FROM STUDENTS;

Select


Stay in touch for more posts