presto count distinct

转载

feiry 2024-09-14 14:41:17

文章标签 presto/trino except rollup with子查询 over开窗函数 文章分类 云原生云计算

1. select查询

语法如下：

[ WITH [ RECURSIVE ] with_query [, ...] ]
SELECT [ ALL | DISTINCT ] select_expression [, ...]
[ FROM from_item [, ...] ]
[ WHERE condition ]
[ GROUP BY [ ALL | DISTINCT ] grouping_element [, ...] ]
[ HAVING condition]
[ WINDOW window_definition_list]
[ { UNION | INTERSECT | EXCEPT } [ ALL | DISTINCT ] select ]
[ ORDER BY expression [ ASC | DESC ] [, ...] ]
[ OFFSET count [ ROW | ROWS ] ]
[ LIMIT { count | ALL } ]
[ FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } { ONLY | WITH TIES } ]

having子句在group by之后计算，作用于分组聚合后的行
join默认是inner join，还是left outer join、right outer join、full outer join、cross join(对于from a, b where a.key = b.key，前半部分是cronss join)

基础的select例子如下：

trino>
trino> select upper(b.name) customer_name, round(sum(totalprice), 1) total_price 
    -> from tpch.sf1.orders a join tpch.sf1.customer b on a.custkey = b.custkey 
    -> where extract(year from a.orderdate) = 1997 or b.nationkey = 1 
    -> group by b.name 
    -> having round(sum(totalprice), 1) > 10000 
    -> order by total_price desc limit 3;
   customer_name    | total_price 
--------------------+-------------
 CUSTOMER#000111280 |   5776853.7 
 CUSTOMER#000126199 |   5682036.3 
 CUSTOMER#000015751 |   5594552.2 
(3 rows)

Query 20220109_222708_00100_jiymz, FINISHED, 2 nodes
Splits: 14 total, 14 done (100.00%)
4.40 [1.95M rows, 13.2MB] [443K rows/s, 2.99MB/s]

trino>

1.1 union、intersect、except

union all：合并所有的行，不去除重复行
union：合并所有的行，去除重复行
intersect：返回两个查询中都有的行，但去除重复行
except：返回只有第一个查询中有的行，但去除重复行

trino>
trino> select * from (values (1, 2), (1, 2), (2, 3)) except select * from (values (2, 3), (3, 4));
 _col0 | _col1 
-------+-------
     1 |     2 
(1 row)

Query 20220109_223715_00107_jiymz, FINISHED, 2 nodes
Splits: 9 total, 9 done (100.00%)
0.98 [0 rows, 0B] [0 rows/s, 0B/s]

trino>

1.2 grouping set、rollup、cube

分别对每个分组组合进行聚合计算，结果是所有计算结果的联合，没有值的列，则为NULL

例如对于聚合列(a, b, c)：

rollup(a, b, c)生成基于层次结构的分组组合(a, b, c), (a, b), (a), ()
cube(a, b, c)生成所有可能的分组组合(a, b, c), (a, b), (a, c), (b, c), (a), (b), ©, ()
grouping sets可以自定义分组组合，如grouping sets((a, b, c), ())

trino>
trino> select orderstatus, orderpriority, shippriority, sum(totalprice) totalprice, grouping(orderstatus, orderpriority, shippriority) as group_id 
    -> from tpch.sf1.orders 
    -> group by grouping sets((orderstatus, orderpriority, shippriority), ()) 
    -> order by group_id;
 orderstatus |  orderpriority  | shippriority |      totalprice       | group_id 
-------------+-----------------+--------------+-----------------------+----------
 O           | 5-LOW           |            0 | 2.2075989903380028E10 |        0 
 O           | 1-URGENT        |            0 |  2.202843787702009E10 |        0 
 F           | 5-LOW           |            0 | 2.1997577898639805E10 |        0 
 F           | 4-NOT SPECIFIED |            0 |  2.194042389442989E10 |        0 
 O           | 2-HIGH          |            0 | 2.2042604367650177E10 |        0 
 F           | 3-MEDIUM        |            0 | 2.1777505918090096E10 |        0 
 P           | 4-NOT SPECIFIED |            0 |  1.4212495106299987E9 |        0 
 P           | 3-MEDIUM        |            0 |  1.4197204557000003E9 |        0 
 P           | 5-LOW           |            0 |   1.427590893769999E9 |        0 
 O           | 3-MEDIUM        |            0 | 2.1956381714669975E10 |        0 
 O           | 4-NOT SPECIFIED |            0 | 2.1914360578039955E10 |        0 
 P           | 1-URGENT        |            0 |   1.402424471659998E9 |        0 
 F           | 2-HIGH          |            0 | 2.1999039814129982E10 |        0 
 F           | 1-URGENT        |            0 | 2.1987867088400032E10 |        0 
 P           | 2-HIGH          |            0 |  1.4381320612500029E9 |        0 
 NULL        | NULL            |         NULL | 2.2682930644745822E11 |        7 
(16 rows)

Query 20220109_230647_00123_jiymz, FINISHED, 2 nodes
Splits: 11 total, 11 done (100.00%)
1:16 [1.5M rows, 0B] [19.9K rows/s, 0B/s]

trino>

grouping(col1, col2, col3)函数用于标识一行属于哪个分组，计算出来的group id可能不是连续的

1.3 with内联视图

用于在一个查询中定义一个内联视图，供多次查询访问

trino>
trino> with 
    -> total as (
    -> select mktsegment,
    -> sum(acctbal) as total_per_mktsegment
    -> from tpch.sf1.customer
    -> group by 1
    -> ),
    -> average as (
    -> select avg(total_per_mktsegment) as mktsegment_average
    -> from total
    -> )
    -> select mktsegment,
    -> total_per_mktsegment,
    -> mktsegment_average
    -> from total,
    -> average
    -> where total_per_mktsegment > mktsegment_average;
 mktsegment | total_per_mktsegment | mktsegment_average 
------------+----------------------+--------------------
 HOUSEHOLD  |  1.358733411699999E8 |    1.34865369948E8 
 BUILDING   | 1.3588862194000012E8 |    1.34865369948E8 
(2 rows)

Query 20220109_231949_00129_jiymz, FINISHED, 2 nodes
Splits: 19 total, 19 done (100.00%)
3.74 [300K rows, 0B] [80.1K rows/s, 0B/s]

trino>

average内联视图引用了total内联视图，所以total内联视图被引用了两次。
Presto/Trino会基于代价的计算，判断是否将total内联视图的结果进行保存缓存

1.4 子查询

1. 标量子查询

子查询返回单个值(一行一列), 子查询只计算一次

trino>
trino> select name from tpch.sf1.nation 
    -> where regionkey = (
    -> select regionkey from tpch.sf1.region where name = 'AMERICA'
    -> );
     name      
---------------
 ARGENTINA     
 BRAZIL        
 CANADA        
 PERU          
 UNITED STATES 
(5 rows)

Query 20220109_233108_00136_jiymz, FINISHED, 2 nodes
Splits: 11 total, 11 done (100.00%)
1.15 [32 rows, 116B] [27 rows/s, 101B/s]

trino>

2. exists子查询

in子查询：如果一个列的值存在exists子查询的结果中，则为true
not in子查询：如果一个列的值不存在exists子查询的结果中，则为true

trino>
trino> select name from tpch.sf1.nation 
    -> where regionkey not in (
    -> select regionkey from tpch.sf1.region
    -> );
 name 
------
(0 rows)

Query 20220110_001101_00000_jiymz, FINISHED, 2 nodes
Splits: 9 total, 9 done (100.00%)
0.31 [40 rows, 244B] [128 rows/s, 785B/s]

trino>

3. 集合比较子查询

较的运算符除了=，还有<、>、<=、>=、=、<>

集合比较子查询的查询方式有两种：

any：表示只需满足子查询中的一个值， = any等价于in子查询

trino> 
trino> select name from tpch.sf1.nation where regionkey = any (select regionkey from tpch.sf1.region);
      name      
----------------
 ALGERIA        
 ARGENTINA      
 ......省略部分......
 UNITED KINGDOM 
 UNITED STATES  
(25 rows)

Query 20220110_001814_00005_jiymz, FINISHED, 2 nodes
Splits: 13 total, 13 done (100.00%)
0.37 [40 rows, 328B] [107 rows/s, 879B/s]

trino> 
trino> select name from tpch.sf1.nation where regionkey in (select regionkey from tpch.sf1.region);
      name      
----------------
 ALGERIA        
 ARGENTINA      
 ......省略部分......     
 UNITED KINGDOM 
 UNITED STATES  
(25 rows)

Query 20220110_001821_00006_jiymz, FINISHED, 2 nodes
Splits: 13 total, 13 done (100.00%)
0.36 [40 rows, 328B] [111 rows/s, 916B/s]

trino>

对于any, 如果子查询中的一个值为NULL，则对应的集合比较子查询的结果是NULL(不是true和false)。合并结果时一个false和NULL进行合并, 则计算结果为NULL

all：表示必须满足子查询的所有值

对于all，如果子查询为空，则集合比较子查询的结果为true。如果子查询中的一个值为NULL，则对应的集合比较子查询的结果是NULL(不是true和false)。合并结果时一个true和NULL进行合并, 则计算结果为NULL

2. SQL函数

2.1 over (partition by col order by col)

生成测试数据

trino> create schema memory.test_db;
CREATE SCHEMA
trino> 
trino> 
trino> create table memory.test_db.test_tb(
    -> user_id bigint,
    -> buy_year int,
    -> buy_amount int
    -> );
CREATE TABLE
trino> 
trino> 
trino> insert into memory.test_db.test_tb(user_id, buy_year, buy_amount) 
    -> values(1, 2012, 30), 
    -> (1, 2010, 20), 
    -> (1, 2009, 10), 
    -> (2, 2022, 300), 
    -> (2, 2020, 200), 
    -> (2, 2019, 100);
INSERT: 6 rows

Query 20220210_001404_00009_jx84g, FINISHED, 2 nodes
Splits: 5 total, 5 done (100.00%)
0.24 [0 rows, 0B] [0 rows/s, 0B/s]

trino>

1. 开窗求迭代累加和

trino> 
trino> select user_id, buy_year, buy_amount, 
    -> sum(buy_amount) over (partition by user_id order by buy_year) as user_iteration_sum_buy_amount 
    -> from memory.test_db.test_tb;
 user_id | buy_year | buy_amount | user_iteration_sum_buy_amount 
---------+----------+------------+-------------------------------
       1 |     2009 |         10 |                            10 
       1 |     2010 |         20 |                            30 
       1 |     2012 |         30 |                            60 
       2 |     2019 |        100 |                           100 
       2 |     2020 |        200 |                           300 
       2 |     2022 |        300 |                           600 
(6 rows)

Query 20220210_001707_00010_jx84g, FINISHED, 1 node
Splits: 4 total, 4 done (100.00%)
0.24 [6 rows, 114B] [24 rows/s, 469B/s]

trino>

2. 开窗获取前N行的值

lag(col, offset)中的offset值为0或正整数，如果offset为0，表示当前行；offset为2，表示前面第二行

trino> 
trino> select user_id, buy_year, buy_amount, 
    -> lag(buy_amount, 1) over (partition by user_id order by buy_year) as previous_buy_amount 
    -> from memory.test_db.test_tb;
 user_id | buy_year | buy_amount | previous_buy_amount 
---------+----------+------------+---------------------
       1 |     2009 |         10 |                NULL 
       1 |     2010 |         20 |                  10 
       1 |     2012 |         30 |                  20 
       2 |     2019 |        100 |                NULL 
       2 |     2020 |        200 |                 100 
       2 |     2022 |        300 |                 200 
(6 rows)

Query 20220210_001915_00011_jx84g, FINISHED, 1 node
Splits: 4 total, 4 done (100.00%)
0.27 [6 rows, 114B] [21 rows/s, 418B/s]

trino>

本文章为转载内容，我们尊重原作者对文章享有的著作权。如有内容错误或侵权问题，欢迎原作者联系我们进行内容更正或删除文章。

上一篇：mysql每个部门内的最高工资

下一篇：presto rlike语法

提问和评论都可以，用心的回复会被更多人看到评论

发布评论

相关文章

官方博客	全部文章	热门标签	班级博客
了解我们	网站地图	意见反馈

鸿蒙开发者社区	51CTO学堂
51CTO	软考资讯