Hive on Tez vs Hive on MR: A Comparative Study

Introduction

Apache Hive is a data warehousing tool that provides an SQL-like interface (HiveQL) for querying and analyzing large datasets stored in the Hadoop Distributed File System (HDFS). Hive originally executed every query by translating it into one or more MapReduce jobs. MapReduce, however, adds considerable latency: each job pays its own startup cost, and queries that need several jobs pass their intermediate results through HDFS. To address this, Hive added support for other execution engines, notably Apache Tez (available since Hive 0.13). This article compares Hive on Tez with Hive on MR, discusses their differences and benefits, and provides code examples.

Hive on MR

MapReduce (MR) is Hive's original execution engine. With it, Hive translates each HiveQL query into a series of MapReduce jobs. The execution flow involves the following steps:

  1. HiveQL Query: Write a HiveQL query to perform data analysis. For example, let's find the total sales per product category in a table called sales.

    SELECT category, SUM(sales) as total_sales
    FROM sales
    GROUP BY category;
    
  2. Query Compilation: Hive compiles the HiveQL query into a plan of stages, arranged as a Directed Acyclic Graph (DAG), in which each MapReduce stage runs as a separate job.

  3. Query Execution: Hive submits the generated MapReduce jobs to the cluster, starting each job once the jobs it depends on have finished. Each job carries out part of the plan, such as filtering, joining, or aggregating the data.

  4. Data Shuffle and Sort: Within each job, the map output is shuffled and sorted by key before it reaches the reducers. Joins and GROUP BY rely on this step to bring rows with the same key together.

  5. Intermediate Data: When a query needs more than one job, the output of each job is written to HDFS and read back by the next job.

  6. Final Results: The output of the last job is returned to the user.

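Hive on MR has two main limitations: every job incurs its own submission and startup overhead, and writing intermediate data to HDFS between jobs adds disk and network I/O. Both increase latency, especially for queries that compile to several jobs.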
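
The stages that Hive generates can be inspected with EXPLAIN, which prints the plan without running the query. The following is a minimal sketch, assuming the sales table used throughout this article already exists; the exact output format varies between Hive versions.

```sql
-- Select the MapReduce engine for this session.
SET hive.execution.engine=mr;

-- Print the execution plan instead of running the query.
EXPLAIN
SELECT category, SUM(sales) AS total_sales
FROM sales
GROUP BY category;
```

The output typically contains a STAGE DEPENDENCIES section followed by STAGE PLANS, where each MapReduce stage is shown with a Map Operator Tree and a Reduce Operator Tree. Counting the Map Reduce stages gives a rough idea of how many jobs the query will launch.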

Hive on Tez

Apache Tez is a DAG-based execution engine that Hive can use in place of MapReduce. Instead of a chain of separate MapReduce jobs, Hive on Tez runs the whole query as a single Tez DAG. The execution flow involves the following steps:

  1. HiveQL Query: Write a HiveQL query, similar to Hive on MR.

  2. Query Compilation: Hive compiles the HiveQL query into a Tez DAG, which represents the entire query execution plan.

  3. Query Execution: Hive submits the Tez DAG to the cluster as a single unit of work. The plan itself is optimized by Hive during compilation; Tez can then adjust some aspects of it at run time, for example the number of reducer tasks.

  4. Vertex Execution: The Tez DAG consists of vertices connected by edges. Each vertex represents a processing step, such as filtering or joining, and runs as a set of parallel tasks; the edges describe how data flows between vertices. Vertices that do not depend on each other can run at the same time.

  5. Data Movement Optimization: Tez passes intermediate data from one vertex directly to the next instead of writing it to HDFS between steps. This removes much of the disk I/O and replication cost that a chain of MapReduce jobs pays.

  6. Final Results: The output of the Tez DAG is generated and returned to the user.

Compared with Hive on MR, Hive on Tez typically offers lower latency and better overall performance, because the whole query runs as one DAG with less job startup overhead and less intermediate I/O.
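
One way to see the single DAG is to ask Hive for the plan under Tez. The sketch below assumes that Tez is installed and configured on the cluster and that the sales table used throughout this article exists. The ORDER BY clause is an addition made here for illustration only, so that the query needs more than one shuffle.

```sql
-- Select the Tez engine for this session (assumes Tez is set up on the cluster).
SET hive.execution.engine=tez;

-- Aggregate per category, then sort by the aggregated value.
-- The sort needs a second shuffle after the GROUP BY.
EXPLAIN
SELECT category, SUM(sales) AS total_sales
FROM sales
GROUP BY category
ORDER BY total_sales DESC;
```

Under Tez, the plan typically shows a single Tez stage that lists its vertices (for example Map 1, Reducer 2, and Reducer 3) and the edges between them. Running the same EXPLAIN with the engine set to mr typically shows more than one MapReduce stage for this query. Vertex names and output layout depend on the Hive version.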

Code Examples

Hive on MR

To run a Hive query using the MapReduce execution engine, follow these steps:

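  1. Launch the Hive shell, selecting the MapReduce engine explicitly because the default engine depends on how your installation is configured:

    $ hive --hiveconf hive.execution.engine=mr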
    
  2. Create a table and load data:

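    -- One row per sale; the table uses Hive's default text storage format.
    CREATE TABLE sales (
      id INT,
      category STRING,
      sales FLOAT
    );
    
    -- LOAD DATA INPATH moves the files at this HDFS path into the table's directory.
    -- The files must match the table's format (by default, Ctrl-A separated fields).
    LOAD DATA INPATH '/path/to/data' INTO TABLE sales;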
    
  3. Execute the HiveQL query:

    SELECT category, SUM(sales) as total_sales
    FROM sales
    GROUP BY category;
    

Hive on Tez

To run the same Hive query using the Tez execution engine, follow these steps:

  1. Launch the Hive shell:

    $ hive --hiveconf hive.execution.engine=tez
    
  2. Create the table and load data (same as Hive on MR).

  3. Execute the HiveQL query (same as Hive on MR).
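
Because the engine is an ordinary session property, it can also be switched inside a running Hive shell, which makes a side-by-side comparison convenient. This is a minimal sketch, assuming both engines are available on the cluster and the sales table has already been created and loaded.

```sql
-- Show the engine currently in effect for this session.
SET hive.execution.engine;

-- Run the query on MapReduce.
SET hive.execution.engine=mr;
SELECT category, SUM(sales) AS total_sales
FROM sales
GROUP BY category;

-- Switch to Tez and run the same query again.
SET hive.execution.engine=tez;
SELECT category, SUM(sales) AS total_sales
FROM sales
GROUP BY category;
```

When comparing timings, note that the first Tez query in a session may include the time needed to start the Tez session, so a second run usually gives a fairer number.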

Conclusion

Hive on Tez is generally more efficient than Hive on MR. By replacing a chain of MapReduce jobs with a single Tez DAG, it reduces job submission overhead and avoids writing intermediate data to HDFS between steps, which lowers latency and improves performance. As a result, Hive on Tez is the recommended choice for running Hive queries on large datasets.