
Tags

PostgreSQL, deduplication, single-column dedup, multi-column dedup, row dedup, mixed multi-column dedup, variadic parameters, array sorting, array element reordering

Background

Deduplication is a common requirement, and it comes in many variants. For example:

1. Single-column dedup: straightforward, remove records that are duplicated on one column. A retention rule decides which copy survives (e.g. keep the newest, the oldest, or the row with the largest value in some other column).

2. Multi-column dedup: remove records that are duplicated across several columns, with the same kind of retention rule.

3. Row dedup: remove records where the entire row is duplicated, again with a retention rule.

4. Mixed multi-column dedup: remove records that are duplicates under a mixed-column match (ROW1: col1=1, col2=2 and ROW2: col1=2, col2=1 count as duplicates of each other), with the same kind of retention rule; see the sketch after this list.
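To make case 4 concrete: one way to detect mixed duplicates is to sort each row's column values into an array, so that (1,2) and (2,1) normalize to the same key. A minimal sketch, assuming a hypothetical table t(col1 int, col2 int) for illustration only:

-- Sorting each row's values into an array yields a canonical key,
-- so (1,2) and (2,1) fall into the same group. For exactly two
-- columns, least(col1,col2) / greatest(col1,col2) would also work.
select norm_key, count(*)
from (
  select (select array_agg(x order by x)
            from unnest(array[col1, col2]) u(x)) as norm_key
  from t
) s
group by norm_key;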

The examples below work through these cases with several dedup methods each; choose whichever is most efficient for your data.

Single-column dedup

Test data

create table test1(id int primary key, c1 int, c2 timestamp);
insert into test1 select generate_series(1,1000000), random()*1000, clock_timestamp();

create index idx_test1 on test1(c1,id);
-- This index speeds things up. Without it, of the three methods below, method 2 is the fastest, followed by method 3.
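A quick sanity check on how many duplicates the random load produced (not part of the original benchmark):

-- random()*1000 rounds to ~1001 distinct c1 values over 1,000,000 rows,
-- so roughly 998999 rows are duplicates, matching the DELETE counts below.
select count(*) - count(distinct c1) as duplicate_rows from test1;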

Requirement: remove rows with duplicate c1, keeping the row with the largest id. (Judging by the identical DELETE counts, each method appears to have been run against a freshly loaded copy of the table.)


Method 1: aggregate + NOT IN


postgres=# explain delete from test1 where id not in (select max(id) from test1 group by c1);
                                                     QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Delete on test1  (cost=35115.63..53023.01 rows=500055 width=6)
   ->  Seq Scan on test1  (cost=35115.63..53023.01 rows=500055 width=6)
         Filter: (NOT (hashed SubPlan 1))
         SubPlan 1
           ->  GroupAggregate  (cost=0.42..35113.13 rows=1001 width=8)
                 Group Key: test1_1.c1
                 ->  Index Only Scan using idx_test1 on test1 test1_1  (cost=0.42..30102.57 rows=1000110 width=8)
(7 rows)

Time: 0.564 ms


postgres=# delete from test1 where id not in (select max(id) from test1 group by c1);
DELETE 998999
Time: 1126.504 ms (00:01.127)

Method 2: window query + IN

Note that with order by id in the window, rn = 1 is the smallest id, so as written this keeps the smallest id per c1 group; use order by id desc to keep the largest, as the requirement asks.


postgres=# explain select id from (select row_number() over(partition by c1 order by id) as rn, id from test1) t where t.rn<>1;
                                            QUERY PLAN
--------------------------------------------------------------------------------------------------
 Subquery Scan on t  (cost=0.42..60075.54 rows=995109 width=4)
   Filter: (t.rn <> 1)
   ->  WindowAgg  (cost=0.42..47574.17 rows=1000110 width=16)
         ->  Index Only Scan using idx_test1 on test1  (cost=0.42..30072.24 rows=1000110 width=8)
(4 rows)

Time: 0.512 ms


postgres=# delete from test1 where id in (select id from (select row_number() over(partition by c1 order by id) as rn, id from test1) t where t.rn<>1);
DELETE 998999
Time: 2430.276 ms (00:02.430)

Method 3: PL/pgSQL, using a sort plus a cursor internally

Each row is compared once against the previous row of the sorted stream, so the total cost is one sort plus one comparison per row. (As written, order by c1,id keeps the smallest id per group; order by c1, id desc would keep the largest.)


do language plpgsql $$
declare
  v_rec record;
  v_c1 int;
  -- scan the table in (c1, id) order, locking rows so they can be
  -- deleted through the cursor with WHERE CURRENT OF
  cur1 cursor for select c1,id from test1 order by c1,id for update;
begin
  for v_rec in cur1 loop
    -- same c1 as the previous row: this is a duplicate, delete it
    if v_rec.c1 = v_c1 then
      delete from test1 where current of cur1;
    end if;
    v_c1 := v_rec.c1;
  end loop;
end;
$$;


DO
Time: 7345.773 ms (00:07.346)


postgres=# select count(*) from test1;
 count
-------
  1001
(1 row)

Time: 61.672 ms

postgres=# select * from test1 limit 10;
 id | c1  |             c2
----+-----+----------------------------
  1 | 582 | 2017-06-02 10:21:10.60918
  2 | 278 | 2017-06-02 10:21:10.609331
  3 | 659 | 2017-06-02 10:21:10.609338
  4 | 372 | 2017-06-02 10:21:10.609341
  5 | 184 | 2017-06-02 10:21:10.609343
  6 | 121 | 2017-06-02 10:21:10.609345
  7 | 132 | 2017-06-02 10:21:10.609347
  8 | 290 | 2017-06-02 10:21:10.609348
  9 | 980 | 2017-06-02 10:21:10.60935
 10 | 305 | 2017-06-02 10:21:10.609352
(10 rows)

A PostgreSQL 10 trick

Even when an index covers only a prefix of the requested sort keys, it can still be used to drive the sort.

For example, an index on (c1) can be used for order by c1, id.

《PostgreSQL 10.0 preview 优化器改进 - 不完整索引支持复合排序》
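To check whether your server can use such a prefix index for a composite sort, inspect the plan; a sketch (the index name is illustrative, and whether the sort is absorbed by the index depends on the server version and the optimizer improvement described in the linked article):

create index idx_test1_c1 on test1(c1);
-- With the improvement, the index on (c1) can supply the leading sort key
-- for "order by c1, id" instead of forcing a full sort.
explain select * from test1 order by c1, id;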


Multi-column dedup


Test data


create table test1(id int primary key, c1 int, c2 int, c3 timestamp);
insert into test1 select generate_series(1,1000000), random()*1000, random()*1000, clock_timestamp();

create index idx_test1 on test1(c1,c2,id);
-- This index speeds things up.

Requirement: remove rows with duplicate (c1, c2), keeping the row with the largest id.


Method 1: aggregate + NOT IN, same approach as the single-column case.


postgres=# explain (analyze,verbose,timing,costs,buffers) delete from test1 where id not in (select max(id) from test1 group by c1,c2);
                                                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Delete on public.test1  (cost=40820.36..59690.36 rows=500000 width=6) (actual time=1634.960..1634.960 rows=0 loops=1)
   Buffers: shared hit=1378788
   ->  Seq Scan on public.test1  (cost=40820.36..59690.36 rows=500000 width=6) (actual time=1090.956..1446.374 rows=367618 loops=1)
         Output: test1.ctid
         Filter: (NOT (hashed SubPlan 1))
         Rows Removed by Filter: 632382
         Buffers: shared hit=1011170
         SubPlan 1
           ->  GroupAggregate  (cost=0.42..40570.36 rows=100000 width=12) (actual time=0.035..842.497 rows=632382 loops=1)
                 Output: max(test1_1.id), test1_1.c1, test1_1.c2
                 Group Key: test1_1.c1, test1_1.c2
                 Buffers: shared hit=1004800
                 ->  Index Only Scan using idx_test1 on public.test1 test1_1  (cost=0.42..32070.36 rows=1000000 width=12) (actual time=0.027..587.506 rows=1000000 loops=1)
                       Output: test1_1.c1, test1_1.c2, test1_1.id
                       Heap Fetches: 1000000
                       Buffers: shared hit=1004800
 Planning time: 0.211 ms
 Execution time: 1641.679 ms
(18 rows)

Method 2: window query + IN, following the single-column pattern.
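A sketch of the statement (not the author's captured run): the window partitions by both columns, and order by id desc makes rn = 1 the largest id, so the largest is kept, per the requirement.

delete from test1
where id in (
  select id from (
    select row_number() over (partition by c1, c2 order by id desc) as rn, id
    from test1
  ) t
  where t.rn <> 1
);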