Hive is implemented on top of MapReduce, and the execution plan lets us examine the execution process in detail, which is a great help in understanding what happens underneath.

Syntax and Structure

The official documentation for Hive EXPLAIN is highly recommended:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Explain

First, the EXPLAIN syntax:

EXPLAIN [EXTENDED|DEPENDENCY|AUTHORIZATION] query

As the grammar shows, a statement is the EXPLAIN keyword, one of three optional modifiers, and the query itself. I encourage you to try all three; the output of the last two (DEPENDENCY and AUTHORIZATION) is quite simple, so this article focuses on the first, EXTENDED, which additionally prints the abstract syntax tree of the HiveQL statement.
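As a quick sketch of all three, run against the emp table used below (statements only; output omitted here):

-- plan plus the abstract syntax tree and file/partition-level detail
EXPLAIN EXTENDED select * from emp;
-- the tables and partitions the query reads, printed as JSON
EXPLAIN DEPENDENCY select * from emp;
-- the query's inputs and outputs, the current user, and the required privileges
EXPLAIN AUTHORIZATION select * from emp;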

Next, an execution plan consists of three parts:

  • the abstract syntax tree of the statement
  • the dependencies between the stages of the plan
  • a detailed description of each stage

A Simple Example

Let's start with a simple example on a very common table: select * from emp;

  • The data
hive> select * from emp;
OK
7369    SMITH   CLERK   7902    1980-12-17 00:00:00 800.0   NULL    20
7499    ALLEN   SALESMAN    7698    1981-02-20 00:00:00 1600.0  300.0   30
7521    WARD    SALESMAN    7698    1981-02-22 00:00:00 1250.0  500.0   30
7566    JONES   MANAGER 7839    1981-04-02 00:00:00 2975.0  NULL    20
7654    MARTIN  SALESMAN    7698    1981-09-28 00:00:00 1250.0  1400.0  30
7698    BLAKE   MANAGER 7839    1981-05-01 00:00:00 2850.0  NULL    30
7782    CLARK   MANAGER 7839    1981-06-09 00:00:00 2450.0  NULL    10
7788    SCOTT   ANALYST 7566    1982-12-09 00:00:00 3000.0  NULL    20
7839    KING    PRESIDENT   NULL    1981-11-17 00:00:00 5000.0  NULL    10
7844    TURNER  SALESMAN    7698    1981-09-08 00:00:00 1500.0  0.0 30
7876    ADAMS   CLERK   7788    1983-01-12 00:00:00 1100.0  NULL    20
7900    JAMES   CLERK   7698    1981-12-03 00:00:00 950.0   NULL    30
7902    FORD    ANALYST 7566    1981-12-03 00:00:00 3000.0  NULL    20
7934    MILLER  CLERK   7782    1982-01-23 00:00:00 1300.0  NULL    10
Time taken: 1.305 seconds, Fetched: 14 row(s)
  • The execution plan
hive> explain extended select * from emp;
OK
ABSTRACT SYNTAX TREE:

TOK_QUERY
   TOK_FROM
      TOK_TABREF
         TOK_TABNAME
            emp
   TOK_INSERT
      TOK_DESTINATION
         TOK_DIR
            TOK_TMP_FILE
      TOK_SELECT
         TOK_SELEXPR
            TOK_ALLCOLREF


STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: emp
          Statistics: Num rows: 2 Data size: 820 Basic stats: COMPLETE Column stats: NONE
          GatherStats: false
          Select Operator
            expressions: empno (type: int), ename (type: string), job (type: string), mgr (type: int), hiredate (type: string), sal (type: double), comm (type: double), deptno (type: int)
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
            Statistics: Num rows: 2 Data size: 820 Basic stats: COMPLETE Column stats: NONE
            ListSink

Time taken: 1.844 seconds, Fetched: 34 row(s)
  • Part one: the abstract syntax tree
    The source table comes first; here it is emp. Next is whether the result is inserted into another table: since this example is a plain query, the insert destination is only a temporary file (TOK_TMP_FILE). Last come the selected columns; because we wrote "*", the tree shows TOK_ALLCOLREF, i.e. all columns.
  • Part two: stage dependencies
    This part shows how the stages depend on one another. Our query is so simple that no MapReduce job is launched at all; there is only one stage, so no dependencies appear. (A sketch after the field notes below shows how to force a MapReduce stage.)
  • Part three: stage plans
    A detailed reading of Stage-0, field by field:

TableScan: the table scan operator

alias: emp: the table (alias) being scanned

Statistics: Num rows: 2 Data size: 820 Basic stats: COMPLETE Column stats: NONE: basic statistics for the table

expressions: empno (type: int), ename (type: string), job (type: string), mgr (type: int), hiredate (type: string), sal (type: double), comm (type: double), deptno (type: int)
: the columns to be output, with their types

outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
: the internal numbers assigned to the output columns
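The reason this plan contains only a Fetch stage and no MapReduce job is Hive's fetch-task optimization, controlled by the hive.fetch.task.conversion property. As a minimal sketch (assuming Hive 0.14+, where the value none is accepted), disabling the conversion forces even this trivial scan through a MapReduce stage:

-- disable fetch-task conversion: simple scans now compile to MapReduce
set hive.fetch.task.conversion=none;
explain select * from emp;
-- restore the usual default before moving on
set hive.fetch.task.conversion=more;

Re-running the EXPLAIN after the first set should show a MapReduce root stage under STAGE DEPENDENCIES instead of the bare Fetch Operator above.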

A More Complex Example

  • The data
hive> select * from dept;
OK
10  ACCOUNTING  NEW YORK
20  RESEARCH    DALLAS
30  SALES   CHICAGO
40  OPERATIONS

Now let's look at a more complex join:
select e.empno,e.ename,e.deptno, d.dname
from emp e join dept d on e.deptno=d.deptno;

  • The execution plan
hive> explain extended select e.empno,e.ename,e.deptno, d.dname
    > from emp e join dept d on e.deptno=d.deptno;
OK
ABSTRACT SYNTAX TREE:

TOK_QUERY
   TOK_FROM
      TOK_JOIN
         TOK_TABREF
            TOK_TABNAME
               emp
            e
         TOK_TABREF
            TOK_TABNAME
               dept
            d
         =
            .
               TOK_TABLE_OR_COL
                  e
               deptno
            .
               TOK_TABLE_OR_COL
                  d
               deptno
   TOK_INSERT
      TOK_DESTINATION
         TOK_DIR
            TOK_TMP_FILE
      TOK_SELECT
         TOK_SELEXPR
            .
               TOK_TABLE_OR_COL
                  e
               empno
         TOK_SELEXPR
            .
               TOK_TABLE_OR_COL
                  e
               ename
         TOK_SELEXPR
            .
               TOK_TABLE_OR_COL
                  e
               deptno
         TOK_SELEXPR
            .
               TOK_TABLE_OR_COL
                  d
               dname


STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: e
            Statistics: Num rows: 7 Data size: 820 Basic stats: COMPLETE Column stats: NONE
            GatherStats: false
            Filter Operator
              isSamplingPred: false
              predicate: deptno is not null (type: boolean)
              Statistics: Num rows: 4 Data size: 468 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: deptno (type: int)
                sort order: +
                Map-reduce partition columns: deptno (type: int)
                Statistics: Num rows: 4 Data size: 468 Basic stats: COMPLETE Column stats: NONE
                tag: 0
                value expressions: empno (type: int), ename (type: string)
                auto parallelism: false
          TableScan
            alias: d
            Statistics: Num rows: 1 Data size: 80 Basic stats: COMPLETE Column stats: NONE
            GatherStats: false
            Filter Operator
              isSamplingPred: false
              predicate: deptno is not null (type: boolean)
              Statistics: Num rows: 1 Data size: 80 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: deptno (type: int)
                sort order: +
                Map-reduce partition columns: deptno (type: int)
                Statistics: Num rows: 1 Data size: 80 Basic stats: COMPLETE Column stats: NONE
                tag: 1
                value expressions: dname (type: string)
                auto parallelism: false
      Path -> Alias:
        hdfs://hadoop:8020/user/hive/warehouse/dept [d]
        hdfs://hadoop:8020/user/hive/warehouse/emp [e]
      Path -> Partition:
        hdfs://hadoop:8020/user/hive/warehouse/dept 
          Partition
            base file name: dept
            input format: org.apache.hadoop.mapred.TextInputFormat
            output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
            properties:
              COLUMN_STATS_ACCURATE true
              bucket_count -1
              columns deptno,dname,loc
              columns.comments 
              columns.types int:string:string
              field.delim   
              file.inputformat org.apache.hadoop.mapred.TextInputFormat
              file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              location hdfs://hadoop:8020/user/hive/warehouse/dept
              name default.dept
              numFiles 1
              numRows 0
              rawDataSize 0
              serialization.ddl struct dept { i32 deptno, string dname, string loc}
              serialization.format  
              serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              totalSize 80
              transient_lastDdlTime 1515473970
            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              properties:
                COLUMN_STATS_ACCURATE true
                bucket_count -1
                columns deptno,dname,loc
                columns.comments 
                columns.types int:string:string
                field.delim     
                file.inputformat org.apache.hadoop.mapred.TextInputFormat
                file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                location hdfs://hadoop:8020/user/hive/warehouse/dept
                name default.dept
                numFiles 1
                numRows 0
                rawDataSize 0
                serialization.ddl struct dept { i32 deptno, string dname, string loc}
                serialization.format    
                serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                totalSize 80
                transient_lastDdlTime 1515473970
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: default.dept
            name: default.dept
        hdfs://hadoop:8020/user/hive/warehouse/emp 
          Partition
            base file name: emp
            input format: org.apache.hadoop.mapred.TextInputFormat
            output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
            properties:
              COLUMN_STATS_ACCURATE true
              bucket_count -1
              columns empno,ename,job,mgr,hiredate,sal,comm,deptno
              columns.comments 
              columns.types int:string:string:int:string:double:double:int
              field.delim   
              file.inputformat org.apache.hadoop.mapred.TextInputFormat
              file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              location hdfs://hadoop:8020/user/hive/warehouse/emp
              name default.emp
              numFiles 1
              numRows 0
              rawDataSize 0
              serialization.ddl struct emp { i32 empno, string ename, string job, i32 mgr, string hiredate, double sal, double comm, i32 deptno}
              serialization.format  
              serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              totalSize 820
              transient_lastDdlTime 1515411106
            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              properties:
                COLUMN_STATS_ACCURATE true
                bucket_count -1
                columns empno,ename,job,mgr,hiredate,sal,comm,deptno
                columns.comments 
                columns.types int:string:string:int:string:double:double:int
                field.delim     
                file.inputformat org.apache.hadoop.mapred.TextInputFormat
                file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                location hdfs://hadoop:8020/user/hive/warehouse/emp
                name default.emp
                numFiles 1
                numRows 0
                rawDataSize 0
                serialization.ddl struct emp { i32 empno, string ename, string job, i32 mgr, string hiredate, double sal, double comm, i32 deptno}
                serialization.format    
                serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                totalSize 820
                transient_lastDdlTime 1515411106
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: default.emp
            name: default.emp
      Truncated Path -> Alias:
        /dept [d]
        /emp [e]
      Needs Tagging: true
      Reduce Operator Tree:
        Join Operator
          condition map:
               Inner Join 0 to 1
          keys:
            0 deptno (type: int)
            1 deptno (type: int)
          outputColumnNames: _col0, _col1, _col7, _col12
          Statistics: Num rows: 4 Data size: 514 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: int), _col1 (type: string), _col7 (type: int), _col12 (type: string)
            outputColumnNames: _col0, _col1, _col2, _col3
            Statistics: Num rows: 4 Data size: 514 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              GlobalTableId: 0
              directory: hdfs://hadoop:8020/tmp/hive/hadoop/ad410998-9c98-481e-9cb7-dd4262272f21/hive_2018-01-29_16-58-28_302_3010477556100147411-1/-mr-10000/.hive-staging_hive_2018-01-29_16-58-28_302_3010477556100147411-1/-ext-10001
              NumFilesPerFileSink: 1
              Statistics: Num rows: 4 Data size: 514 Basic stats: COMPLETE Column stats: NONE
              Stats Publishing Key Prefix: hdfs://hadoop:8020/tmp/hive/hadoop/ad410998-9c98-481e-9cb7-dd4262272f21/hive_2018-01-29_16-58-28_302_3010477556100147411-1/-mr-10000/.hive-staging_hive_2018-01-29_16-58-28_302_3010477556100147411-1/-ext-10001/
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  properties:
                    columns _col0,_col1,_col2,_col3
                    columns.types int:string:int:string
                    escape.delim \
                    hive.serialization.extend.additional.nesting.levels true
                    serialization.format 1
                    serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              TotalFiles: 1
              GatherStats: false
              MultiFileSpray: false

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Time taken: 0.248 seconds, Fetched: 238 row(s)
  • Part one: the abstract syntax tree
  • Under TOK_FROM we can see the two tables and the condition that joins them.
  • Under TOK_INSERT, since we did not direct the result into another table, the destination is again just a temporary file.
  • Under TOK_SELECT, the columns we query are listed one by one.
  • Part two: stage dependencies
    The dependency is now explicit: Stage-0 depends on Stage-1.
  • Part three: stage plans
    Each stage is described in detail.
  • Stage-1:
    1. Look at the two tables separately, starting with emp.
    2. A TableScan reads the table and reports its basic statistics.
    3. Filter Operator
    isSamplingPred: false
    predicate: deptno is not null (type: boolean)
    : a filter that drops every row whose deptno is NULL; note that we never wrote this predicate ourselves (see the sketch after this list).
    4. Reduce Output Operator
    key expressions: deptno (type: int)
    sort order: +
    Map-reduce partition columns: deptno (type: int)
    Statistics: Num rows: 4 Data size: 468 Basic stats: COMPLETE Column stats: NONE
    tag: 0
    value expressions: empno (type: int), ename (type: string)
    auto parallelism: false
    : the columns handed to the reducers: deptno as the sort/partition key, empno and ename as values; tag 0 marks these rows as coming from emp.
    5. The other table is handled in much the same way; only the columns passed to the reducers differ: deptno as the key and dname as the value, under tag 1.
    6. The reduce phase begins.
    7. Join Operator
    condition map:
    Inner Join 0 to 1
    : the join type, matching rows tagged 0 (emp) with rows tagged 1 (dept).
    8. keys:
    0 deptno (type: int)
    1 deptno (type: int)
    : the join columns on each side.
    9. outputColumnNames: _col0, _col1, _col7, _col12
    Statistics: Num rows: 4 Data size: 514 Basic stats: COMPLETE Column stats: NONE
    Select Operator
    expressions: _col0 (type: int), _col1 (type: string), _col7 (type: int), _col12 (type: string)
    outputColumnNames: _col0, _col1, _col2, _col3
    : the output columns, with their internal numbers and types.
    10. input format: org.apache.hadoop.mapred.TextInputFormat
    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
    : the formats used to write the result.
  • Stage-0: mainly wrap-up work; it fetches the final result.
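One detail from step 3 deserves emphasis: the query never mentions deptno is not null; the optimizer adds that Filter Operator on its own, because rows with a NULL join key can never match in an inner join. As a small sketch, writing the predicates out explicitly should produce an essentially identical plan:

explain select e.empno, e.ename, e.deptno, d.dname
from emp e join dept d on e.deptno = d.deptno
where e.deptno is not null and d.deptno is not null;

Comparing the two plans is a handy way to verify which predicates the optimizer pushes down into the map-side Filter Operators.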