0 ClickHouse 语法优化规则

ClickHouse 的 SQL 优化规则是基于 RBO(Rule Based Optimization),下面是一些优化规则

1 准备测试用表

1)上传官方的数据集
将 visits_v1.tar 和 hits_v1.tar 上传到虚拟机,解压到 clickhouse 数据路径下
// 解压到 clickhouse 数据路径
sudo tar -xvf hits_v1.tar -C /var/lib/clickhouse
sudo tar -xvf visits_v1.tar -C /var/lib/clickhouse
//修改所属用户
sudo chown -R clickhouse:clickhouse /var/lib/clickhouse/data/datasets
sudo chown -R clickhouse:clickhouse /var/lib/clickhouse/metadata/datasets
2)重启 clickhouse-server
sudo clickhouse restart
3)执行查询
clickhouse-client --query "SELECT COUNT(*) FROM datasets.hits_v1"
clickhouse-client --query "SELECT COUNT(*) FROM datasets.visits_v1"
注意:官方的 tar 包,包含了建库、建表语句、数据内容,这种方式不需要手动建库、建表,最方便。
hits_v1 表有 130 多个字段,880 多万条数据
visits_v1 表有 180 多个字段,160 多万条数据
 

2 COUNT 优化

在调用 count 函数时,如果使用的是 count() 或者 count(*),且没有 where 条件,则
会直接使用 system.tables 的 total_rows,例如:
EXPLAIN SELECT count()FROM datasets.hits_v1;
Union
 Expression (Projection)
     Expression (Before ORDER BY and SELECT)
         MergingAggregated
             ReadNothing (Optimized trivial count)    
注意 Optimized trivial count ,这是对 count 的优化。
如果 count 具体的列字段,则不会使用此项优化:
EXPLAIN SELECT count(CounterID) FROM datasets.hits_v1;
Union
 Expression (Projection)
     Expression (Before ORDER BY and SELECT)
         Aggregating
             Expression (Before GROUP BY)
                 ReadFromStorage (Read from MergeTree)

3 消除子查询重复字段

下面语句子查询中有两个重复的 id 字段,会被去重:
EXPLAIN SYNTAX SELECT 
     a.UserID,
     b.VisitID,
     a.URL,
     b.UserID
     FROM
     hits_v1 AS a 
     LEFT JOIN ( 
     SELECT 
         UserID, 
         UserID as HaHa, 
         VisitID 
 FROM visits_v1) AS b 
 USING (UserID)
 limit 3;
//返回优化语句:
SELECT 
     UserID,
     VisitID,
     URL,
     b.UserID
FROM hits_v1 AS a
ALL LEFT JOIN 
(
 SELECT 
 UserID,
 VisitID
 FROM visits_v1
) AS b USING (UserID)
LIMIT 3

4 谓词下推

当 group by 有 having 子句,但是没有 with cube、with rollup 或者 with totals 修饰的时候,having 过滤会下推到 where 提前过滤。例如下面的查询,HAVING name 变成了 WHERE name,在 group by 之前过滤:
EXPLAIN SYNTAX SELECT UserID FROM hits_v1 GROUP BY UserID HAVING UserID = 
'8585742290196126178';
//返回优化语句
SELECT UserID
FROM hits_v1
WHERE UserID = \'8585742290196126178\'
GROUP BY UserID
子查询也支持谓词下推:
EXPLAIN SYNTAX
SELECT *
FROM 
(
 SELECT UserID
 FROM visits_v1
)
WHERE UserID = '8585742290196126178'
//返回优化后的语句
SELECT UserID
FROM 
(
 SELECT UserID
 FROM visits_v1
 WHERE UserID = \'8585742290196126178\' 
)
WHERE UserID = \'8585742290196126178\'
再来一个复杂例子:
EXPLAIN SYNTAX 
SELECT * FROM (
 SELECT 
 * 
 FROM 
 (
 SELECT 
 UserID 
 FROM visits_v1) 
 UNION ALL 
 SELECT 
 *
 FROM 
 (
 SELECT 
 UserID 
 FROM visits_v1)
)
WHERE UserID = '8585742290196126178'
//返回优化后的语句
SELECT UserID
FROM 
(
 SELECT UserID
 FROM 
 (
 SELECT UserID
 FROM visits_v1
 WHERE UserID = \'8585742290196126178\'
 )
 WHERE UserID = \'8585742290196126178\'
 UNION ALL
 SELECT UserID
 FROM 
 (
 SELECT UserID
 FROM visits_v1
 WHERE UserID = \'8585742290196126178\'
 )
 WHERE UserID = \'8585742290196126178\' )
WHERE UserID = \'8585742290196126178\'

5 聚合计算外推

聚合函数内的计算,会外推,例如:
EXPLAIN SYNTAX
SELECT sum(UserID * 2)
FROM visits_v1
//返回优化后的语句
SELECT sum(UserID) * 2
FROM visits_v1

6 聚合函数消除

如果对聚合键,也就是 group by key 使用 min、max、any 聚合函数,则将函数消除,
例如:
EXPLAIN SYNTAX
SELECT
     sum(UserID * 2),
     max(VisitID),
     max(UserID)
FROM visits_v1
GROUP BY UserID
//返回优化后的语句
SELECT 
     sum(UserID) * 2,
     max(VisitID),
     UserID
FROM visits_v1
GROUP BY UserID

7 删除重复的 order by key

例如下面的语句,重复的聚合键 id 字段会被去重:
EXPLAIN SYNTAX
SELECT *
FROM visits_v1
ORDER BY
     UserID ASC,
     UserID ASC,
     VisitID ASC,
     VisitID ASC
//返回优化后的语句:
select
    ……
FROM visits_v1
ORDER BY 
    UserID ASC,
    VisitID ASC

8 删除重复的 limit by key

例如下面的语句,重复声明的 name 字段会被去重:
EXPLAIN SYNTAX
SELECT *
FROM visits_v1
LIMIT 3 BY
     VisitID,
     VisitID
LIMIT 10
//返回优化后的语句:
select
    ……
FROM visits_v1
LIMIT 3 BY VisitID
LIMIT 10

9 删除重复的 USING Key

例如下面的语句,重复的关联键 id 字段会被去重:
EXPLAIN SYNTAX
SELECT
     a.UserID,
     a.UserID,
     b.VisitID,
     a.URL,
     b.UserID
FROM hits_v1 AS a
LEFT JOIN visits_v1 AS b USING (UserID, UserID)


//返回优化后的语句:
SELECT 
     UserID,
     UserID,
     VisitID,
     URL,
     b.UserID
FROM hits_v1 AS a
ALL LEFT JOIN visits_v1 AS b USING (UserID)

10 标量替换

如果子查询只返回一行数据,在被引用的时候用标量替换,例如下面语句中的total_disk_usage 字段:
EXPLAIN SYNTAX
WITH 
 (
 SELECT sum(bytes)
 FROM system.parts
 WHERE active
 ) AS total_disk_usage
SELECT
 (sum(bytes) / total_disk_usage) * 100 AS table_disk_usage,
 table
FROM system.parts
GROUP BY table
ORDER BY table_disk_usage DESC
LIMIT 10;

//返回优化后的语句: WITH CAST(0, \'UInt64\') AS total_disk_usage SELECT (sum(bytes) / total_disk_usage) * 100 AS table_disk_usage, table FROM system.parts GROUP BY table ORDER BY table_disk_usage DESC LIMIT 10

11 三元运算优化

如果开启了 optimize_if_chain_to_multiif 参数,三元运算符会被替换成 multiIf 函数,
例如:
EXPLAIN SYNTAX 
SELECT number = 1 ? 'hello' : (number = 2 ? 'world' : 'atguigu') 
FROM numbers(10) 
settings optimize_if_chain_to_multiif = 1;

//返回优化后的语句:
SELECT multiIf(number = 1, \'hello\', number = 2, \'world\', \'atguigu\')
FROM numbers(10)
SETTINGS optimize_if_chain_to_multiif = 1