Data Modeling in Hive
Apache Hive is an open-source data warehouse infrastructure tool built on top of the Hadoop Distributed File System (HDFS). Hive provides data summarization, querying, and ad-hoc analysis of large datasets stored in HDFS. It processes structured and semi-structured data in the Hadoop system, and it is fast, scalable and highly extensible.
Earlier, we had to write complex MapReduce jobs; with Hive, we submit only SQL queries. Hive is mainly aimed at people who are comfortable writing SQL. Hive provides an SQL-like interface for querying, known as HiveQL or HQL. HiveQL internally converts SQL queries into MapReduce jobs.
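As a sketch of what this looks like in practice, the HiveQL below is plain SQL-style syntax that Hive compiles into MapReduce jobs behind the scenes (the employee_details table and its columns are taken from the example later in this article):

```sql
-- A simple aggregation; Hive turns this into one or more MapReduce jobs.
SELECT department, COUNT(*) AS emp_count
FROM employee_details
GROUP BY department;
```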
Hive Architecture
Fig. Hive Architecture
Hive usually runs on our workstation and translates a SQL query into a sequence of jobs that execute on a Hadoop cluster. Apache Hive organizes data into tables, giving structure to the data stored in HDFS. Data in Hive is categorized into:
- Tables
- Partitions
- Buckets or Clusters
Fig. Hive Data Model
Tables:
A Hive table is logically made up of the data being stored. When a Hive table is created, the metadata about the table is stored in the Hive Metastore, while the table's data is stored as files in HDFS. Some of the operations performed on Hive tables are:
- filter
- project
- join
- union
Hive tables are of two types:
(1) Internal Table (Managed Table): if an internal table is dropped, both the metadata (the Hive table) and the data are deleted.
(2) External Table: if an external table is dropped, only the metadata (the Hive table) is deleted. The table's backing directory, with its data files, remains in HDFS.
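The two table types can be sketched in HiveQL as follows. This is a minimal illustration; the column layout follows the employee_details example used later, and the external table's name and LOCATION path are hypothetical:

```sql
-- Managed (internal) table: DROP TABLE deletes both metadata and data.
CREATE TABLE employee_details (
  employee_id INT,
  name STRING,
  salary DOUBLE,
  department STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- External table: DROP TABLE deletes only the metadata; the files
-- under the LOCATION path stay in HDFS (path here is illustrative).
CREATE EXTERNAL TABLE employee_details_ext (
  employee_id INT,
  name STRING,
  salary DOUBLE,
  department STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/external/employee_details';
```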
Partitions:
Apache Hive organizes a table into partitions, grouping the same type of data together based on a column or partition key. A Hive table can have one or more partition keys that identify a particular partition. Partitions make queries over portions of the data run faster.
Hive Partitioning Example:
Fig1. Hive Partition and Bucketing
For example, consider a table employee_details holding employee information such as employee_id, name, salary, department, etc. If we use the department column as the partition key, the information for all employees belonging to a particular department is stored together in that partition. Physically, a partition in Hive is just a sub-directory under the table directory.
For instance, suppose our employee_details table has data for three departments – Hr, Finance and Marketing. There are then three partitions in total, one per department, as shown in figure 1. For each department, all the data for that department resides in a separate sub-directory under the table directory.
In this example, all the employee data of the Hr department is stored in user/hive/warehouse/employee_details/dept=Hr. Queries about Hr employees therefore only have to look through the data present in the Hr partition. Partitioning thus decreases query latency by scanning only the relevant partitioned data instead of the whole dataset.
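A partitioned variant of the employee_details table can be sketched as below. Note that the partition column (dept) is declared in the PARTITIONED BY clause rather than in the column list:

```sql
-- Each dept value gets its own sub-directory under the table
-- directory, e.g. .../employee_details/dept=Hr.
CREATE TABLE employee_details (
  employee_id INT,
  name STRING,
  salary DOUBLE
)
PARTITIONED BY (dept STRING);

-- A query filtering on the partition key scans only that
-- partition's sub-directory, not the whole table.
SELECT name, salary
FROM employee_details
WHERE dept = 'Hr';
```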
Buckets:
Hive tables are divided into partitions, and partitions are further sub-divided into buckets (clusters) based on the hash of a column in the table. Bucketing gives additional structure to the data, which helps queries run more efficiently.
Hive Bucketing Example:
As shown in figure 1, each partition is again divided into two buckets. Therefore each partition, say Hr, has two files, each storing part of the Hr employees' data.
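The two-buckets-per-partition layout from figure 1 can be expressed with a CLUSTERED BY clause. The choice of employee_id as the bucketing column is an assumption for illustration; any column with well-distributed values works:

```sql
-- Within each dept partition, rows are hashed on employee_id
-- into 2 buckets, i.e. 2 files per partition directory.
CREATE TABLE employee_details (
  employee_id INT,
  name STRING,
  salary DOUBLE
)
PARTITIONED BY (dept STRING)
CLUSTERED BY (employee_id) INTO 2 BUCKETS;
```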