Test table and test data

+----------------------------------------------------+
|                   createtab_stmt                   |
+----------------------------------------------------+
| CREATE TABLE `datacube_salary_org`(                |
|   `company_name` string COMMENT '????',            |
|   `dep_name` string COMMENT '????',                |
|   `user_id` bigint COMMENT '??id',                 |
|   `user_name` string COMMENT '????',               |
|   `salary` decimal(10,2) COMMENT '??',             |
|   `create_time` date COMMENT '????',               |
|   `update_time` date COMMENT '????')               |
| PARTITIONED BY (                                   |
|   `pt` string COMMENT '????')                      |
| ROW FORMAT SERDE                                   |
|   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'  |
| WITH SERDEPROPERTIES (                             |
|   'field.delim'=',',                               |
|   'serialization.format'=',')                      |
| STORED AS INPUTFORMAT                              |
|   'org.apache.hadoop.mapred.TextInputFormat'       |
| OUTPUTFORMAT                                       |
|   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
| LOCATION                                           |
|   'hdfs://cdh-manager:8020/user/hive/warehouse/data_warehouse_test.db/datacube_salary_org' |
| TBLPROPERTIES (                                    |
|   'transient_lastDdlTime'='1586310488')            |
+----------------------------------------------------+
+-----------------------------------+-------------------------------+------------------------------+--------------------------------+-----------------------------+----------------------------------+----------------------------------+-------------------------+
| datacube_salary_org.company_name  | datacube_salary_org.dep_name  | datacube_salary_org.user_id  | datacube_salary_org.user_name  | datacube_salary_org.salary  | datacube_salary_org.create_time  | datacube_salary_org.update_time  | datacube_salary_org.pt  |
+-----------------------------------+-------------------------------+------------------------------+--------------------------------+-----------------------------+----------------------------------+----------------------------------+-------------------------+
| s.zh                              | engineer                      | 1                            | szh                            | 28000.00                    | 2020-04-07                       | 2020-04-07                       | 20200405                |
| s.zh                              | engineer                      | 2                            | zyq                            | 26000.00                    | 2020-04-03                       | 2020-04-03                       | 20200405                |
| s.zh                              | tester                        | 3                            | gkm                            | 20000.00                    | 2020-04-07                       | 2020-04-07                       | 20200405                |
| x.qx                              | finance                       | 4                            | pip                            | 13400.00                    | 2020-04-07                       | 2020-04-07                       | 20200405                |
| x.qx                              | finance                       | 5                            | kip                            | 24500.00                    | 2020-04-07                       | 2020-04-07                       | 20200405                |
| x.qx                              | finance                       | 6                            | zxxc                           | 13000.00                    | 2020-04-07                       | 2020-04-07                       | 20200405                |
| x.qx                              | kiccp                         | 7                            | xsz                            | 8600.00                     | 2020-04-07                       | 2020-04-07                       | 20200405                |
| s.zh                              | engineer                      | 1                            | szh                            | 28000.00                    | 2020-04-07                       | 2020-04-07                       | 20200406                |
| s.zh                              | engineer                      | 2                            | zyq                            | 26000.00                    | 2020-04-03                       | 2020-04-03                       | 20200406                |
| s.zh                              | tester                        | 3                            | gkm                            | 20000.00                    | 2020-04-07                       | 2020-04-07                       | 20200406                |
| x.qx                              | finance                       | 4                            | pip                            | 13400.00                    | 2020-04-07                       | 2020-04-07                       | 20200406                |
| x.qx                              | finance                       | 5                            | kip                            | 24500.00                    | 2020-04-07                       | 2020-04-07                       | 20200406                |
| x.qx                              | finance                       | 6                            | zxxc                           | 13000.00                    | 2020-04-07                       | 2020-04-07                       | 20200406                |
| x.qx                              | kiccp                         | 7                            | xsz                            | 8600.00                     | 2020-04-07                       | 2020-04-07                       | 20200406                |
| s.zh                              | enginer                       | 1                            | szh                            | 28000.00                    | 2020-04-07                       | 2020-04-07                       | 20200407                |
| s.zh                              | enginer                       | 2                            | zyq                            | 26000.00                    | 2020-04-03                       | 2020-04-03                       | 20200407                |
| s.zh                              | tester                        | 3                            | gkm                            | 20000.00                    | 2020-04-07                       | 2020-04-07                       | 20200407                |
| x.qx                              | finance                       | 4                            | pip                            | 13400.00                    | 2020-04-07                       | 2020-04-07                       | 20200407                |
| x.qx                              | finance                       | 5                            | kip                            | 24500.00                    | 2020-04-07                       | 2020-04-07                       | 20200407                |
| x.qx                              | finance                       | 6                            | zxxc                           | 13000.00                    | 2020-04-07                       | 2020-04-07                       | 20200407                |
| x.qx                              | kiccp                         | 7                            | xsz                            | 8600.00                     | 2020-04-07                       | 2020-04-07                       | 20200407                |
+-----------------------------------+-------------------------------+------------------------------+--------------------------------+-----------------------------+----------------------------------+----------------------------------+-------------------------+

 

 

Scenario 1: Deduplication

1) UNION vs. UNION ALL: what differs, and how to choose

2) GROUP BY as a replacement for DISTINCT

 

1) UNION vs. UNION ALL: what differs, and how to choose

Note that in SQL, UNION ALL and UNION are not the same:

UNION ALL does not deduplicate the merged rows.

UNION deduplicates the merged rows.

Example:

EXPLAIN
SELECT 
 company_name
 ,dep_name
 ,user_id
 ,user_name
FROM datacube_salary_org
WHERE pt = '20200405'
UNION ALL -- swap in UNION for the second run to compare the two plans
SELECT
  company_name
 ,dep_name
 ,user_id
 ,user_name
FROM datacube_salary_org
WHERE pt = '20200406'
;
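
Before looking at the plans, a quick sanity check (my own sketch, not part of the original runs): since the 20200405 and 20200406 partitions hold identical rows, counting the merged output should return 14 with UNION ALL and only 7 with UNION.

SELECT COUNT(1) AS cnt
FROM
(
SELECT
 company_name
 ,dep_name
 ,user_id
 ,user_name
FROM datacube_salary_org
WHERE pt = '20200405'
UNION ALL -- yields 14 rows here; swap in UNION and the duplicates collapse to 7
SELECT
 company_name
 ,dep_name
 ,user_id
 ,user_name
FROM datacube_salary_org
WHERE pt = '20200406'
) tmp
;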

 

EXPLAIN output for UNION ALL

INFO  : Starting task [Stage-3:EXPLAIN] in serial mode
INFO  : Completed executing command(queryId=hive_20200409232517_c76f15cf-20cf-415d-8086-123953fffc75); Time taken: 0.006 seconds
INFO  : OK
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-0 depends on stages: Stage-1               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: datacube_salary_org             |
|             filterExpr: (pt = '20200405') (type: boolean) |
|             Statistics: Num rows: 1 Data size: 342 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               expressions: company_name (type: string), dep_name (type: string), user_id (type: bigint), user_name (type: string) |
|               outputColumnNames: _col0, _col1, _col2, _col3 |
|               Statistics: Num rows: 1 Data size: 342 Basic stats: COMPLETE Column stats: NONE |
|               Union                                |
|                 Statistics: Num rows: 2 Data size: 754 Basic stats: COMPLETE Column stats: NONE |
|                 File Output Operator               |
|                   compressed: false                |
|                   Statistics: Num rows: 2 Data size: 754 Basic stats: COMPLETE Column stats: NONE |
|                   table:                           |
|                       input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                       output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                       serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|           TableScan                                |
|             alias: datacube_salary_org             |
|             filterExpr: (pt = '20200406') (type: boolean) |
|             Statistics: Num rows: 1 Data size: 412 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               expressions: company_name (type: string), dep_name (type: string), user_id (type: bigint), user_name (type: string) |
|               outputColumnNames: _col0, _col1, _col2, _col3 |
|               Statistics: Num rows: 1 Data size: 412 Basic stats: COMPLETE Column stats: NONE |
|               Union                                |
|                 Statistics: Num rows: 2 Data size: 754 Basic stats: COMPLETE Column stats: NONE |
|                 File Output Operator               |
|                   compressed: false                |
|                   Statistics: Num rows: 2 Data size: 754 Basic stats: COMPLETE Column stats: NONE |
|                   table:                           |
|                       input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                       output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                       serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+

 

EXPLAIN output for UNION

INFO  : Starting task [Stage-3:EXPLAIN] in serial mode
INFO  : Completed executing command(queryId=hive_20200409232436_8c1754b6-36ef-4846-a6db-719211b6b6a8); Time taken: 0.022 seconds
INFO  : OK
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-0 depends on stages: Stage-1               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: datacube_salary_org             |
|             filterExpr: (pt = '20200405') (type: boolean) |
|             Statistics: Num rows: 1 Data size: 342 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               expressions: company_name (type: string), dep_name (type: string), user_id (type: bigint), user_name (type: string) |
|               outputColumnNames: _col0, _col1, _col2, _col3 |
|               Statistics: Num rows: 1 Data size: 342 Basic stats: COMPLETE Column stats: NONE |
|               Union                                |
|                 Statistics: Num rows: 2 Data size: 754 Basic stats: COMPLETE Column stats: NONE |
|                 Group By Operator                  |
|                   keys: _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col3 (type: string) |
|                   mode: hash                       |
|                   outputColumnNames: _col0, _col1, _col2, _col3 |
|                   Statistics: Num rows: 2 Data size: 754 Basic stats: COMPLETE Column stats: NONE |
|                   Reduce Output Operator           |
|                     key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col3 (type: string) |
|                     sort order: ++++               |
|                     Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col3 (type: string) |
|                     Statistics: Num rows: 2 Data size: 754 Basic stats: COMPLETE Column stats: NONE |
|           TableScan                                |
|             alias: datacube_salary_org             |
|             filterExpr: (pt = '20200406') (type: boolean) |
|             Statistics: Num rows: 1 Data size: 412 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               expressions: company_name (type: string), dep_name (type: string), user_id (type: bigint), user_name (type: string) |
|               outputColumnNames: _col0, _col1, _col2, _col3 |
|               Statistics: Num rows: 1 Data size: 412 Basic stats: COMPLETE Column stats: NONE |
|               Union                                |
|                 Statistics: Num rows: 2 Data size: 754 Basic stats: COMPLETE Column stats: NONE |
|                 Group By Operator                  |
|                   keys: _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col3 (type: string) |
|                   mode: hash                       |
|                   outputColumnNames: _col0, _col1, _col2, _col3 |
|                   Statistics: Num rows: 2 Data size: 754 Basic stats: COMPLETE Column stats: NONE |
|                   Reduce Output Operator           |
|                     key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col3 (type: string) |
|                     sort order: ++++               |
|                     Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col3 (type: string) |
|                     Statistics: Num rows: 2 Data size: 754 Basic stats: COMPLETE Column stats: NONE |
|       Reduce Operator Tree:                        |
|         Group By Operator                          |
|           keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: bigint), KEY._col3 (type: string) |
|           mode: mergepartial                       |
|           outputColumnNames: _col0, _col1, _col2, _col3 |
|           Statistics: Num rows: 1 Data size: 377 Basic stats: COMPLETE Column stats: NONE |
|           File Output Operator                     |
|             compressed: false                      |
|             Statistics: Num rows: 1 Data size: 377 Basic stats: COMPLETE Column stats: NONE |
|             table:                                 |
|                 input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+

     Comparing the two EXPLAIN results, it is easy to see that UNION introduces an extra reduce phase (the Reduce Operator Tree with its Group By Operator). It is then easy to understand why, when deduplication is not required, you should use UNION ALL rather than UNION.
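
A practical corollary: when the two branches cannot overlap in the first place, UNION ALL already returns exactly what UNION would, minus the extra shuffle. A hypothetical sketch: selecting the partition column pt makes the branches disjoint by construction, so the deduplicating reduce phase of UNION would be wasted work.

SELECT
 user_id
 ,salary
 ,pt
FROM datacube_salary_org
WHERE pt = '20200405'
UNION ALL -- safe: every row carries its pt value, so the branches cannot collide
SELECT
 user_id
 ,salary
 ,pt
FROM datacube_salary_org
WHERE pt = '20200406'
;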

 

It is also sometimes claimed that doing UNION ALL first and then deduplicating with GROUP BY is more efficient than UNION. That is, rewriting:

SELECT 
 company_name
 ,dep_name
 ,user_id
 ,user_name
FROM datacube_salary_org
WHERE pt = '20200405'
UNION 
SELECT
  company_name
 ,dep_name
 ,user_id
 ,user_name
FROM datacube_salary_org
WHERE pt = '20200406'
;

as:

SELECT
 company_name
 ,dep_name
 ,user_id
 ,user_name
FROM 
(
SELECT 
 company_name
 ,dep_name
 ,user_id
 ,user_name
FROM datacube_salary_org
WHERE pt = '20200405'
UNION ALL
SELECT
  company_name
 ,dep_name
 ,user_id
 ,user_name
FROM datacube_salary_org
WHERE pt = '20200406'
) tmp
GROUP BY 
 company_name
 ,dep_name
 ,user_id
 ,user_name
;

I believe the efficiency is identical. Here is the EXPLAIN output of the rewritten query:

INFO  : Starting task [Stage-3:EXPLAIN] in serial mode
INFO  : Completed executing command(queryId=hive_20200410020255_57b936d7-ffde-41a6-af6e-3d0dc0d3a007); Time taken: 0.015 seconds
INFO  : OK
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-0 depends on stages: Stage-1               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: datacube_salary_org             |
|             filterExpr: (pt = '20200405') (type: boolean) |
|             Statistics: Num rows: 1 Data size: 342 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               expressions: company_name (type: string), dep_name (type: string), user_id (type: bigint), user_name (type: string) |
|               outputColumnNames: _col0, _col1, _col2, _col3 |
|               Statistics: Num rows: 1 Data size: 342 Basic stats: COMPLETE Column stats: NONE |
|               Union                                |
|                 Statistics: Num rows: 2 Data size: 754 Basic stats: COMPLETE Column stats: NONE |
|                 Group By Operator                  |
|                   keys: _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col3 (type: string) |
|                   mode: hash                       |
|                   outputColumnNames: _col0, _col1, _col2, _col3 |
|                   Statistics: Num rows: 2 Data size: 754 Basic stats: COMPLETE Column stats: NONE |
|                   Reduce Output Operator           |
|                     key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col3 (type: string) |
|                     sort order: ++++               |
|                     Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col3 (type: string) |
|                     Statistics: Num rows: 2 Data size: 754 Basic stats: COMPLETE Column stats: NONE |
|           TableScan                                |
|             alias: datacube_salary_org             |
|             filterExpr: (pt = '20200406') (type: boolean) |
|             Statistics: Num rows: 1 Data size: 412 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               expressions: company_name (type: string), dep_name (type: string), user_id (type: bigint), user_name (type: string) |
|               outputColumnNames: _col0, _col1, _col2, _col3 |
|               Statistics: Num rows: 1 Data size: 412 Basic stats: COMPLETE Column stats: NONE |
|               Union                                |
|                 Statistics: Num rows: 2 Data size: 754 Basic stats: COMPLETE Column stats: NONE |
|                 Group By Operator                  |
|                   keys: _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col3 (type: string) |
|                   mode: hash                       |
|                   outputColumnNames: _col0, _col1, _col2, _col3 |
|                   Statistics: Num rows: 2 Data size: 754 Basic stats: COMPLETE Column stats: NONE |
|                   Reduce Output Operator           |
|                     key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col3 (type: string) |
|                     sort order: ++++               |
|                     Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col3 (type: string) |
|                     Statistics: Num rows: 2 Data size: 754 Basic stats: COMPLETE Column stats: NONE |
|       Reduce Operator Tree:                        |
|         Group By Operator                          |
|           keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: bigint), KEY._col3 (type: string) |
|           mode: mergepartial                       |
|           outputColumnNames: _col0, _col1, _col2, _col3 |
|           Statistics: Num rows: 1 Data size: 377 Basic stats: COMPLETE Column stats: NONE |
|           File Output Operator                     |
|             compressed: false                      |
|             Statistics: Num rows: 1 Data size: 377 Basic stats: COMPLETE Column stats: NONE |
|             table:                                 |
|                 input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+

The EXPLAIN output of the two approaches is identical, so the rewrite brings no optimization.

Comparing runtimes (small data set):

UNION ALL followed by GROUP BY

CPU time: 5.2 s

INFO  : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
INFO  : 2020-04-10 02:06:37,784 Stage-1 map = 0%,  reduce = 0%
INFO  : 2020-04-10 02:06:44,970 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 1.67 sec
INFO  : 2020-04-10 02:06:49,094 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.23 sec
INFO  : 2020-04-10 02:06:55,291 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 5.2 sec
INFO  : MapReduce Total cumulative CPU time: 5 seconds 200 msec
INFO  : Ended Job = job_1586423165261_0005
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 5.2 sec   HDFS Read: 21685 HDFS Write: 304 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 5 seconds 200 msec
INFO  : Completed executing command(queryId=hive_20200410020629_c216e339-181a-4b52-8a59-ac527963e32b); Time taken: 28.112 seconds
INFO  : OK
+---------------+-----------+----------+------------+
| company_name  | dep_name  | user_id  | user_name  |
+---------------+-----------+----------+------------+
| s.zh          | engineer  | 1        | szh        |
| s.zh          | engineer  | 2        | zyq        |
| s.zh          | tester    | 3        | gkm        |
| x.qx          | finance   | 4        | pip        |
| x.qx          | finance   | 5        | kip        |
| x.qx          | finance   | 6        | zxxc       |
| x.qx          | kiccp     | 7        | xsz        |
+---------------+-----------+----------+------------+
7 rows selected (28.31 seconds)

UNION

CPU time: 5.04 s

INFO  : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
INFO  : 2020-04-10 02:09:24,102 Stage-1 map = 0%,  reduce = 0%
INFO  : 2020-04-10 02:09:31,308 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 1.78 sec
INFO  : 2020-04-10 02:09:35,427 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.39 sec
INFO  : 2020-04-10 02:09:41,582 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 5.04 sec
INFO  : MapReduce Total cumulative CPU time: 5 seconds 40 msec
INFO  : Ended Job = job_1586423165261_0006
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 5.04 sec   HDFS Read: 21813 HDFS Write: 304 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 5 seconds 40 msec
INFO  : Completed executing command(queryId=hive_20200410020915_477574a0-4763-4717-8f9c-25d9f4b04706); Time taken: 27.033 seconds
INFO  : OK
+-------------------+---------------+--------------+----------------+
| _u2.company_name  | _u2.dep_name  | _u2.user_id  | _u2.user_name  |
+-------------------+---------------+--------------+----------------+
| s.zh              | engineer      | 1            | szh            |
| s.zh              | engineer      | 2            | zyq            |
| s.zh              | tester        | 3            | gkm            |
| x.qx              | finance       | 4            | pip            |
| x.qx              | finance       | 5            | kip            |
| x.qx              | finance       | 6            | zxxc           |
| x.qx              | kiccp         | 7            | xsz            |
+-------------------+---------------+--------------+----------------+

From the comparison above, the two approaches can be considered equivalent.

 

2) GROUP BY as a replacement for DISTINCT

In everyday deduplication work, we tend to reach for DISTINCT.

On large data sets, however, GROUP BY is usually the more efficient choice. Let's run an experiment.

 

First, the less efficient COUNT(DISTINCT) approach.

SQL

SELECT 
 COUNT(DISTINCT company_name, dep_name, user_id)
FROM datacube_salary_org
;

EXPLAIN output

INFO  : Starting task [Stage-2:EXPLAIN] in serial mode
INFO  : Completed executing command(queryId=hive_20200410023914_3ed9bbfc-9b01-4351-b559-a797b8ae2c85); Time taken: 0.007 seconds
INFO  : OK
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-0 depends on stages: Stage-1               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: datacube_salary_org             |
|             Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               expressions: company_name (type: string), dep_name (type: string), user_id (type: bigint) |
|               outputColumnNames: company_name, dep_name, user_id |
|               Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|               Group By Operator                    |
|                 aggregations: count(DISTINCT company_name, dep_name, user_id) |
|                 keys: company_name (type: string), dep_name (type: string), user_id (type: bigint) |
|                 mode: hash                         |
|                 outputColumnNames: _col0, _col1, _col2, _col3 |
|                 Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|                 Reduce Output Operator             |
|                   key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: bigint) |
|                   sort order: +++                  |
|                   Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|       Reduce Operator Tree:                        |
|         Group By Operator                          |
|           aggregations: count(DISTINCT KEY._col0:0._col0, KEY._col0:0._col1, KEY._col0:0._col2) |
|           mode: mergepartial                       |
|           outputColumnNames: _col0                 |
|           Statistics: Num rows: 1 Data size: 16 Basic stats: COMPLETE Column stats: NONE |
|           File Output Operator                     |
|             compressed: false                      |
|             Statistics: Num rows: 1 Data size: 16 Basic stats: COMPLETE Column stats: NONE |
|             table:                                 |
|                 input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+

Runtime on the small data set

INFO  : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
INFO  : 2020-04-10 03:06:39,390 Stage-1 map = 0%,  reduce = 0%
INFO  : 2020-04-10 03:06:46,735 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.94 sec
INFO  : 2020-04-10 03:06:52,969 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 4.72 sec
INFO  : MapReduce Total cumulative CPU time: 4 seconds 720 msec
INFO  : Ended Job = job_1586423165261_0010
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 4.72 sec   HDFS Read: 12863 HDFS Write: 101 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 4 seconds 720 msec
INFO  : Completed executing command(queryId=hive_20200410030629_7b6df91e-a78a-4bc1-b558-abbb8d506596); Time taken: 24.023 seconds
INFO  : OK
+------+
| _c0  |
+------+
| 9    |
+------+

 

 

 


 

 

Next, the more efficient GROUP BY approach.

SQL

SELECT COUNT(1)
FROM (
SELECT 
 company_name
 ,dep_name
 ,user_id
FROM datacube_salary_org
GROUP BY
 company_name
 ,dep_name
 ,user_id
) AS tmp
;

EXPLAIN output

INFO  : Starting task [Stage-3:EXPLAIN] in serial mode
INFO  : Completed executing command(queryId=hive_20200410024128_fc60e84d-be8d-4b4d-aad8-a53466fa1559); Time taken: 0.005 seconds
INFO  : OK
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-2 depends on stages: Stage-1               |
|   Stage-0 depends on stages: Stage-2               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: datacube_salary_org             |
|             Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               expressions: company_name (type: string), dep_name (type: string), user_id (type: bigint) |
|               outputColumnNames: company_name, dep_name, user_id |
|               Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|               Group By Operator                    |
|                 keys: company_name (type: string), dep_name (type: string), user_id (type: bigint) |
|                 mode: hash                         |
|                 outputColumnNames: _col0, _col1, _col2 |
|                 Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|                 Reduce Output Operator             |
|                   key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: bigint) |
|                   sort order: +++                  |
|                   Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: bigint) |
|                   Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE |
|       Reduce Operator Tree:                        |
|         Group By Operator                          |
|           keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: bigint) |
|           mode: mergepartial                       |
|           outputColumnNames: _col0, _col1, _col2   |
|           Statistics: Num rows: 3 Data size: 145 Basic stats: COMPLETE Column stats: NONE |
|           Select Operator                          |
|             Statistics: Num rows: 3 Data size: 145 Basic stats: COMPLETE Column stats: NONE |
|             Group By Operator                      |
|               aggregations: count(1)               |
|               mode: hash                           |
|               outputColumnNames: _col0             |
|               Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
|               File Output Operator                 |
|                 compressed: false                  |
|                 table:                             |
|                     input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                     output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                     serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
|                                                    |
|   Stage: Stage-2                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             Reduce Output Operator                 |
|               sort order:                          |
|               Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
|               value expressions: _col0 (type: bigint) |
|       Reduce Operator Tree:                        |
|         Group By Operator                          |
|           aggregations: count(VALUE._col0)         |
|           mode: mergepartial                       |
|           outputColumnNames: _col0                 |
|           Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
|           File Output Operator                     |
|             compressed: false                      |
|             Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
|             table:                                 |
|                 input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
|                 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
|                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
|                                                    |
+----------------------------------------------------+

Runtime on the small data set

INFO  : Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
INFO  : 2020-04-10 03:09:34,476 Stage-2 map = 0%,  reduce = 0%
INFO  : 2020-04-10 03:09:40,662 Stage-2 map = 100%,  reduce = 0%
INFO  : 2020-04-10 03:09:47,850 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 4.3 sec
INFO  : MapReduce Total cumulative CPU time: 4 seconds 300 msec
INFO  : Ended Job = job_1586423165261_0014
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 4.11 sec   HDFS Read: 11827 HDFS Write: 114 SUCCESS
INFO  : Stage-Stage-2: Map: 1  Reduce: 1   Cumulative CPU: 4.3 sec   HDFS Read: 5111 HDFS Write: 101 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 8 seconds 410 msec
INFO  : Completed executing command(queryId=hive_20200410030859_f89c708b-e76a-44fc-9e99-a6f9a404200f); Time taken: 49.78 seconds
INFO  : OK
+------+
| _c0  |
+------+
| 9    |
+------+

 

Why this optimization works

First, why on large data sets GROUP BY followed by COUNT outperforms a direct COUNT(DISTINCT ...).

COUNT(DISTINCT ...) packs the relevant columns into a single key and sends everything to the reducer, i.e. count(DISTINCT KEY._col0:0._col0, KEY._col0:0._col1, KEY._col0:0._col2) in the plan above. A single reducer then has to perform the full sort and deduplication.

With GROUP BY followed by COUNT, the GROUP BY distributes the different keys across multiple reducers and completes the deduplication within the GROUP BY phase. Deduplication no longer funnels all data into one reducer, so it exploits the cluster's parallelism and is far more efficient. The subsequent COUNT stage then simply tallies the keys that the GROUP BY already deduplicated.

This is why, on large data sets, GROUP BY followed by COUNT is more efficient than COUNT(DISTINCT).
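
A minimal sketch of how to lean on this, assuming the standard Hadoop setting mapreduce.job.reduces (mapred.reduce.tasks on older versions) applies in your environment: raising reducer parallelism only benefits the GROUP BY variant, since the COUNT(DISTINCT ...) plan above funnels every key through a single reduce-side aggregation (note its Reduce Output Operator has no Map-reduce partition columns).

-- Sketch: spread the deduplicating GROUP BY over several reducers.
SET mapreduce.job.reduces=8;

SELECT COUNT(1)
FROM (
SELECT
 company_name
 ,dep_name
 ,user_id
FROM datacube_salary_org
GROUP BY
 company_name
 ,dep_name
 ,user_id
) AS tmp -- dedup runs in parallel; the final count is a tiny follow-up stage
;

Some newer Hive versions can apply a similar rewrite automatically when the cost-based optimizer is enabled (see hive.optimize.distinct.rewrite, if your version has it); checking with EXPLAIN confirms whether it kicked in.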

 

 

Now let's compare the results of the runs above.

In the EXPLAIN output: COUNT(DISTINCT) has fewer stages than GROUP BY followed by COUNT, because the GROUP BY is one MapReduce stage and the COUNT is a separate one.

In runtime: the two are close in CPU time, and the total elapsed time of COUNT(DISTINCT) (about 24 s) is actually lower than that of GROUP BY followed by COUNT (about 50 s). Launching a stage is not free: resources must be requested and containers spun up. On a small data set, the extra time of GROUP BY followed by COUNT is therefore spent mostly on resource allocation and container creation for the second stage.

The underlying reason is, again, data-set size: it is a trade-off between the cost of a single reducer performing a global sort and the cost of requesting resources for additional job stages.

Therefore, choose between the two approaches based on the actual data volume.