Hive job count and the Hive execution plan

Reposted

数据侠客行 2023-07-14 23:22:22


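Hive converts a HiveQL query into MapReduce jobs and submits them to Hadoop for execution. A single query may be turned into several steps, each called a Stage; the Stages have an execution order and dependencies between them, and together they form a DAG. A Stage is not necessarily a MapReduce job: it can also be a metadata operation or a file system operation.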

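Prefixing a HiveQL statement with explain shows the Stages that the statement compiles to (the execution plan), along with some other information. Note that the query itself is not executed. Some figures in the plan, such as the number of input rows, the number of output rows, and the data size, may be estimates made by Hive. That topic is left for a later discussion of Hive metadata; for now it is enough to know that the numbers in an execution plan are not necessarily accurate.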
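As a rough illustration of how to read those figures, the sketch below parses one Statistics line of the kind that explain prints and flags it when the stats are not fully COMPLETE. The helper name parse_statistics and the "likely_estimated" rule of thumb are assumptions made for this example; they are not part of Hive.

```python
import re

# Pattern for the "Statistics:" lines that explain prints for each operator.
STATS_RE = re.compile(
    r"Statistics: Num rows: (\d+) Data size: (\d+) "
    r"Basic stats: (\w+) Column stats: (\w+)"
)


def parse_statistics(line):
    """Return the fields of one Statistics line as a dict, or None if no match."""
    m = STATS_RE.search(line)
    if m is None:
        return None
    rows, size, basic, column = m.groups()
    return {
        "num_rows": int(rows),
        "data_size": int(size),
        "basic_stats": basic,
        "column_stats": column,
        # Assumed rule of thumb: trust the numbers less unless both are COMPLETE.
        "likely_estimated": basic != "COMPLETE" or column != "COMPLETE",
    }


# A Statistics line taken from the plan discussed in this article.
sample = (
    "Statistics: Num rows: 1513882 Data size: 3565192110 "
    "Basic stats: COMPLETE Column stats: NONE"
)
info = parse_statistics(sample)
print(info)
```

Here the column stats are NONE, so the sketch marks the row count and data size as figures to treat with caution.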

The explain output consists of the following two parts:

The dependencies between the different stages of the plan (how many Stages the SQL statement is split into, and how those Stages depend on each other)
The description of each of the stages (a detailed description of what happens inside each Stage)

For example, use explain to view the execution plan of the following SQL statement:

explain
select *
from (
    select id
    from test_table_aaa
    where dt = '20210202'
      and id is not null
    group by id
) A
join (
    select id, min(event_time) as min_event_time
    from test_table_bbb
    where id is not null
    group by id
) B
on (A.id = B.id)
limit 100

This produces the following description of the dependencies between Stages:

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-6 depends on stages: Stage-1, Stage-3 , consists of Stage-7, Stage-8, Stage-2
  Stage-7 has a backup stage: Stage-2
  Stage-4 depends on stages: Stage-7
  Stage-8 has a backup stage: Stage-2
  Stage-5 depends on stages: Stage-8
  Stage-2
  Stage-3 is a root stage
  Stage-0 depends on stages: Stage-4, Stage-5, Stage-2

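Here Stage-1 and Stage-3 are root stages, the starting points of the DAG. By default Hive executes only one Stage at a time, but when parallel execution is enabled (the hive.exec.parallel setting), Stages that do not depend on each other can run at the same time, which is one way to improve query performance.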
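To make the idea of independent Stages concrete, the sketch below groups Stages into "waves": every Stage in a wave has all of its dependencies satisfied by earlier waves, so the Stages within one wave could in principle run side by side. The dependency dictionary is a hand-simplified version of this plan (an assumption: only Stage-1, Stage-3, Stage-2 and the final Stage-0 are kept, with Stage-2 made to depend directly on both root stages), and execution_waves is a hypothetical helper, not how Hive's own scheduler is implemented.

```python
def execution_waves(depends_on):
    """Group stages into waves; a stage is ready once all its dependencies are done."""
    remaining = {stage: set(deps) for stage, deps in depends_on.items()}
    done = set()
    waves = []
    while remaining:
        # Stages whose dependencies have all completed in earlier waves.
        ready = sorted(s for s, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("dependency cycle detected")
        waves.append(ready)
        done.update(ready)
        for stage in ready:
            del remaining[stage]
    return waves


# Simplified, hand-written dependencies (an assumption for illustration).
simplified_plan = {
    "Stage-1": [],
    "Stage-3": [],
    "Stage-2": ["Stage-1", "Stage-3"],
    "Stage-0": ["Stage-2"],
}

waves = execution_waves(simplified_plan)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {wave}")
```

In this simplified graph the two root stages land in the same wave, which is exactly the case where parallel execution can help.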

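Stage-6 depends on stages: Stage-1, Stage-3 describes the execution order: Stage-6 can only start after Stage-1 and Stage-3 have finished. consists of means that Stage-6 is made up of several parts, here Stage-7, Stage-8, and Stage-2. In plans like this one, such a Stage is typically a conditional step that picks one of the listed Stages at run time.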

Stage-7 has a backup stage: Stage-2 is something I do not fully understand yet. My reading is that if Stage-7 cannot be executed, the backup Stage-2 is run instead.
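The four kinds of lines discussed above (root stage, depends on, consists of, backup stage) can be pulled apart mechanically. The following is a minimal sketch that parses the STAGE DEPENDENCIES text of this plan into a dictionary; parse_stage_dependencies and the field names are made up for this example, and the parser only handles the line shapes that appear in this particular output.

```python
import re


def parse_stage_dependencies(text):
    """Parse the lines under STAGE DEPENDENCIES into a dict keyed by stage name."""
    stages = {}
    for raw in text.splitlines():
        m = re.match(r"(Stage-\d+)\s*(.*)", raw.strip())
        if m is None:
            continue  # skips the header line and blank lines
        name, rest = m.groups()
        info = stages.setdefault(
            name,
            {"root": False, "depends_on": [], "consists_of": [], "backup": None},
        )
        if rest == "is a root stage":
            info["root"] = True
        elif rest.startswith("has a backup stage:"):
            info["backup"] = re.findall(r"Stage-\d+", rest)[0]
        elif rest.startswith("depends on stages:"):
            # Everything after "consists of" lists the parts of the stage.
            deps_part, _, consists_part = rest.partition("consists of")
            info["depends_on"] = re.findall(r"Stage-\d+", deps_part)
            info["consists_of"] = re.findall(r"Stage-\d+", consists_part)
    return stages


plan_text = """
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-6 depends on stages: Stage-1, Stage-3 , consists of Stage-7, Stage-8, Stage-2
  Stage-7 has a backup stage: Stage-2
  Stage-4 depends on stages: Stage-7
  Stage-8 has a backup stage: Stage-2
  Stage-5 depends on stages: Stage-8
  Stage-2
  Stage-3 is a root stage
  Stage-0 depends on stages: Stage-4, Stage-5, Stage-2
"""

stages = parse_stage_dependencies(plan_text)
roots = [name for name, info in stages.items() if info["root"]]
print("root stages:", roots)
print("Stage-6:", stages["Stage-6"])
```

Once the text is in this form, questions such as "which Stages are roots" or "what is the backup of Stage-7" become simple dictionary lookups.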

YARN shows that the jobs for this SQL statement actually ran in the order Stage-1 -> Stage-3 -> Stage-2.

[Figure: job list as seen in YARN]

Walkthrough of the contents of each Stage:

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
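      // This is the Map phase of the job
      Map Operator Tree:
          // Scan of the Hive table test_table_aaa
          TableScan
            alias: test_table_aaa
            // Row count and data size statistics for this operator (if the row count is not in the metastore, Hive estimates it, so it is not necessarily accurate)
            Statistics: Num rows: 1513882 Data size: 3565192110 Basic stats: COMPLETE Column stats: NONE
            // Filters the data set; corresponds to the where clause
            Filter Operator
              // The predicate used for filtering (the column names in this plan, such as uaid, differ from the sample SQL above, which was presumably simplified)
              predicate: uaid is not null (type: boolean)
              Statistics: Num rows: 756941 Data size: 1782596055 Basic stats: COMPLETE Column stats: NONE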
              // Group-by aggregation over the filtered result set
              Group By Operator
                // The aggregate function applied, here min()
                aggregations: min(install_time_selected_timezone)
                // The grouping is done on the uaid column
                keys: uaid (type: string)
                mode: hash
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 756941 Data size: 1782596055 Basic stats: COMPLETE Column stats: NONE
                // Output of the Map side
                Reduce Output Operator
                  key expressions: _col0 (type: string)
                  // Whether and how the output is sorted: + means ascending, - means descending, one symbol per column
                  sort order: +
                  // The columns used to partition the Map output across reducers
                  Map-reduce partition columns: _col0 (type: string)
                  Statistics: Num rows: 756941 Data size: 1782596055 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col1 (type: string)
      // This is the Reduce phase of the job (not every SQL statement has a Reduce phase)
      Reduce Operator Tree:
        Group By Operator
          aggregations: min(VALUE._col0)
          keys: KEY._col0 (type: string)
          // Final merge of the partial results emitted by the Map side
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Statistics: Num rows: 378470 Data size: 891296850 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            // The output files are compressed
            compressed: true
            // Input and output file formats, and the SerDe used to serialize and deserialize the data
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

Stage: Stage-2
  Map Reduce
    Map Operator Tree:
        TableScan
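          // This is no longer a scan of a table; the input is the output of the previous MapReduce job
          Reduce Output Operator
            // Both the Map phase and the Reduce phase emit key-value pairs; key expressions and value expressions name the columns used for the key and the value of the Map output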
            key expressions: _col0 (type: string)
            sort order: +
            Map-reduce partition columns: _col0 (type: string)
            Statistics: Num rows: 378470 Data size: 891296850 Basic stats: COMPLETE Column stats: NONE
            value expressions: _col1 (type: string)
        TableScan
          Reduce Output Operator
            key expressions: _col0 (type: string)
            sort order: +
            Map-reduce partition columns: _col0 (type: string)
            Statistics: Num rows: 135055067 Data size: 178049618343 Basic stats: COMPLETE Column stats: NONE
            value expressions: _col1 (type: string)
    Reduce Operator Tree:
      // The join operation
      Join Operator
        // 0 and 1 denote the two data sets being joined; the join type is inner join
        condition map:
             Inner Join 0 to 1
        // The join columns of the two data sets
        keys:
          0 _col0 (type: string)
          1 _col0 (type: string)
        outputColumnNames: _col0, _col1, _col2, _col3
        Statistics: Num rows: 148560576 Data size: 195854584422 Basic stats: COMPLETE Column stats: NONE
        // Corresponds to limit 100 in the SQL
        Limit
          Number of rows: 100
          Statistics: Num rows: 100 Data size: 131800 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: true
            Statistics: Num rows: 100 Data size: 131800 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
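The Stage-1 plan above aggregates in two steps: a partial min() on the Map side (mode: hash), then a final merge on the Reduce side (mode: mergepartial), with rows routed to reducers by the partition column. The sketch below imitates that flow in plain Python on made-up data. The function names, the sample rows, and the use of Python's built-in hash() for partitioning are all assumptions for illustration; Hive's real operators and partitioner work differently in detail.

```python
from collections import defaultdict


def map_side_hash_aggregate(rows):
    """Partial aggregation inside one map task (mode: hash): min value per key."""
    partial = {}
    for key, value in rows:
        if key is None:
            continue  # mirrors the Filter Operator: key is not null
        if key not in partial or value < partial[key]:
            partial[key] = value
    return partial


def shuffle(partials, num_reducers):
    """Route each (key, partial min) pair to a reducer chosen by hashing the key."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for partial in partials:
        for key, value in partial.items():
            buckets[hash(key) % num_reducers][key].append(value)
    return buckets


def reduce_side_merge(bucket):
    """Final merge on one reducer (mode: mergepartial), in ascending key order."""
    return [(key, min(values)) for key, values in sorted(bucket.items())]


# Two hypothetical input splits of (uaid, install_time) rows.
split_a = [("u1", "2021-02-02"), ("u2", "2021-01-15"), (None, "2021-01-01"), ("u1", "2021-01-20")]
split_b = [("u2", "2021-03-01"), ("u3", "2021-02-10"), ("u1", "2021-01-05")]

partials = [map_side_hash_aggregate(split) for split in (split_a, split_b)]
buckets = shuffle(partials, num_reducers=2)
result = sorted(pair for bucket in buckets for pair in reduce_side_merge(bucket))
print(result)
```

Because all partial results for a given key reach the same reducer, merging the partial minimums gives the same answer as computing min() over all rows for that key, while far less data crosses the network.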

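Being able to read the execution plan makes it much easier to understand how a HiveQL query is executed. Haha, still learning, one step at a time.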
