POSTGRESQL的CTID和ORACLE的ROWID 类似.
CTID字段存在于POSTGRESSQL表中,它对于表中的每条记录始终是唯一的。 它表示数据的物理位置。
VACUUM FULL操作使得数据被重新分布,CTID的值被RESET
结论
delete from t_test_del a where a.ctid <> (select min(t.ctid) from t_test_del t where a.object_id=t.object_id);
delete from t_test_del2 a where a.ctid not in (select min(ctid) from t_test_del2 group by object_id);
两种去重方式,MOGDB 5.0.1 版本实验 和 POSTGRESQL 12.2 效率正好相反
MOGDB 5.0.1 版本实验
drop table t_test_del;
drop table t_test_del2;
create table t_test_del(id serial, col1 text, ctime timestamp);
insert into t_test_del select generate_series(1,5000),random(),now();
insert into t_test_del select * from t_test_del;
insert into t_test_del select * from t_test_del;
insert into t_test_del select * from t_test_del;
create table t_test_del2 as select * from t_test_del;
select count(*) from t_test_del;
select count(*) from t_test_del2;
select count(distinct id) from t_test_del;
explain (analyse, buffers, costs,timing )
delete from t_test_del a where a.ctid <> (select min(t.ctid) from t_test_del t where a.id=t.id);
SQL1 耗时 164.582 ms
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Delete on t_test_del a (cost=2094.00..2775.25 rows=39999 width=16) (actual time=35.214..163.844 rows=35000 loops=1)
(Buffers: shared hit=35595 dirtied=260)
-> Hash Join (cost=2094.00..2775.25 rows=39999 width=16) (actual time=35.068..53.410 rows=35000 loops=1)
Hash Cond: (t.id = a.id)
Join Filter: (a.ctid <> (min(t.ctid)))
Rows Removed by Join Filter: 5000
(Buffers: shared hit=594)
-> HashAggregate (cost=897.00..947.00 rows=5000 width=18) (actual time=16.737..20.033 rows=5000 loops=1)
Group By Key: t.id
(Buffers: shared hit=297)
-> Seq Scan on t_test_del t (cost=0.00..697.00 rows=40000 width=10) (actual time=0.008..6.390 rows=40000 loops=1)
(Buffers: shared hit=297)
-> Hash (cost=697.00..697.00 rows=40000 width=10) (actual time=18.095..18.095 rows=40000 loops=1)
Buckets: 65536 Batches: 1 Memory Usage: 2231kB
(Buffers: shared hit=297)
-> Seq Scan on t_test_del a (cost=0.00..697.00 rows=40000 width=10) (actual time=0.014..7.554 rows=40000 loops=1)
(Buffers: shared hit=297)
Total runtime: 164.582 ms
(18 rows)
explain (analyse, buffers, costs,timing )
delete from t_test_del2 a where a.ctid not in (select min(ctid) from t_test_del2 group by id);
SQL2 耗时 34344.257 ms -- group by
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
Delete on t_test_del2 a (cost=897.00..2617570.60 rows=34825 width=12) (actual time=2013.050..34343.828 rows=35000 loops=1)
(Buffers: shared hit=35595 dirtied=383)
-> Nested Loop Anti Join (cost=897.00..2617570.60 rows=34825 width=12) (actual time=2012.963..32828.016 rows=35000 loops=1)
Join Filter: ((a.ctid = "ANY_subquery".min) OR ("ANY_subquery".min IS NULL))
Rows Removed by Join Filter: 187497500
-> Seq Scan on t_test_del2 a (cost=0.00..697.00 rows=40000 width=6) (actual time=0.010..68.794 rows=40000 loops=1)
(Buffers: shared hit=297)
-> Materialize (cost=897.00..1022.00 rows=5000 width=6) (actual time=26.435..9741.051 rows=187502500 loops=40000)
(Buffers: shared hit=297)
-> Subquery Scan on "ANY_subquery" (cost=897.00..997.00 rows=5000 width=6) (actual time=16.987..19.366 rows=5000 loops=1)
-> HashAggregate (cost=897.00..947.00 rows=5000 width=18) (actual time=16.985..18.702 rows=5000 loops=1)
Group By Key: t_test_del2.id
(Buffers: shared hit=297)
-> Seq Scan on t_test_del2 (cost=0.00..697.00 rows=40000 width=10) (actual time=0.007..6.288 rows=40000 loops=1)
(Buffers: shared hit=297)
Total runtime: 34344.257 ms
(16 rows)
PG 12.2 版本实验
SQL1 耗时 103650.191ms
explain (analyse, buffers, costs,timing )
delete from t_test_del a where a.ctid <> (select min(t.ctid) from t_test_del t where a.id=t.id);
业务表中有大量重复数据, 需要对重复数据删除。重复的数据保留一条
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Delete on t_test_del a (cost=0.00..31881997.00 rows=39800 width=6) (actual time=103649.312..103649.312 rows=0 loops=1)
Buffers: shared hit=11915297 dirtied=64
-> Seq Scan on t_test_del a (cost=0.00..31881997.00 rows=39800 width=6) (actual time=12281.459..103377.365 rows=35000 loops=1)
Filter: (ctid <> (SubPlan 1))
Rows Removed by Filter: 5000
Buffers: shared hit=11880297
SubPlan 1
-> Aggregate (cost=797.02..797.03 rows=1 width=6) (actual time=2.581..2.581 rows=1 loops=40000)
Buffers: shared hit=11880000
-> Seq Scan on t_test_del t (cost=0.00..797.00 rows=8 width=6) (actual time=0.157..2.572 rows=8 loops=40000)
Filter: (a.id = id)
Rows Removed by Filter: 39992
Buffers: shared hit=11880000
Planning Time: 0.416 ms
Execution Time: 103650.191 ms
(15 rows)
SQL2 耗时 84.926 ms
explain (analyse, buffers, costs,timing )
delete from t_test_del2 a where a.ctid not in (select min(ctid) from t_test_del2 group by id);
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Delete on t_test_del2 a (cost=959.50..1756.50 rows=20000 width=6) (actual time=84.670..84.670 rows=0 loops=1)
Buffers: shared hit=35594 dirtied=260
-> Seq Scan on t_test_del2 a (cost=959.50..1756.50 rows=20000 width=6) (actual time=17.085..25.682 rows=35000 loops=1)
Filter: (NOT (hashed SubPlan 1))
Rows Removed by Filter: 5000
Buffers: shared hit=594
SubPlan 1
-> HashAggregate (cost=897.00..947.00 rows=5000 width=10) (actual time=13.358..14.542 rows=5000 loops=1)
Group Key: t_test_del2.id
Buffers: shared hit=297
-> Seq Scan on t_test_del2 (cost=0.00..697.00 rows=40000 width=10) (actual time=0.008..5.064 rows=40000 loops=1)
Buffers: shared hit=297
Planning Time: 0.140 ms
Execution Time: 84.926 ms
(14 rows)