删除重复数据-PG

原创

食品安全辛吉飞 2023-11-09 13:50:49 博主文章分类：PostgreSQL ©著作权

文章标签 重复数据 postgresql 文章分类 开源 指尖人生

©著作权归作者所有：来自51CTO博客作者食品安全辛吉飞的原创作品，请联系作者获取转载授权，否则将追究法律责任

POSTGRESQL的CTID和ORACLE的ROWID 类似.

CTID字段存在于POSTGRESSQL表中，它对于表中的每条记录始终是唯一的。它表示数据的物理位置。

VACUUM FULL操作使得数据被重新分布，CTID的值被RESET

结论

delete from t_test_del a where a.ctid <> (select min(t.ctid) from t_test_del t where a.object_id=t.object_id);

delete from t_test_del2 a where a.ctid not in (select min(ctid) from t_test_del2 group by object_id);

两种去重方式，MOGDB 5.0.1 版本实验和 POSTGRESQL 12.2 效率正好相反

MOGDB 5.0.1 版本实验

drop table t_test_del;
drop table t_test_del2;

create table t_test_del(id serial, col1 text, ctime timestamp);
insert into t_test_del select generate_series(1,5000),random(),now();

insert into t_test_del select * from t_test_del;
insert into t_test_del select * from t_test_del;
insert into t_test_del select * from t_test_del;
create table t_test_del2 as  select * from t_test_del;
select count(*) from t_test_del;
select count(*) from t_test_del2;
select count(distinct id) from t_test_del;

explain (analyse, buffers, costs,timing )
delete from t_test_del a where a.ctid <> (select min(t.ctid) from t_test_del t where a.id=t.id);

SQL1 耗时 164.582 ms

                                                           QUERY PLAN                                                             
-----------------------------------------------------------------------------------------------------------------------------------
 Delete on t_test_del a  (cost=2094.00..2775.25 rows=39999 width=16) (actual time=35.214..163.844 rows=35000 loops=1)
   (Buffers: shared hit=35595 dirtied=260)
   ->  Hash Join  (cost=2094.00..2775.25 rows=39999 width=16) (actual time=35.068..53.410 rows=35000 loops=1)
         Hash Cond: (t.id = a.id)
         Join Filter: (a.ctid <> (min(t.ctid)))
         Rows Removed by Join Filter: 5000
         (Buffers: shared hit=594)
         ->  HashAggregate  (cost=897.00..947.00 rows=5000 width=18) (actual time=16.737..20.033 rows=5000 loops=1)
               Group By Key: t.id
               (Buffers: shared hit=297)
               ->  Seq Scan on t_test_del t  (cost=0.00..697.00 rows=40000 width=10) (actual time=0.008..6.390 rows=40000 loops=1)
                     (Buffers: shared hit=297)
         ->  Hash  (cost=697.00..697.00 rows=40000 width=10) (actual time=18.095..18.095 rows=40000 loops=1)
                Buckets: 65536  Batches: 1  Memory Usage: 2231kB
               (Buffers: shared hit=297)
               ->  Seq Scan on t_test_del a  (cost=0.00..697.00 rows=40000 width=10) (actual time=0.014..7.554 rows=40000 loops=1)
                     (Buffers: shared hit=297)
 Total runtime: 164.582 ms
(18 rows)

explain (analyse, buffers, costs,timing )
delete from t_test_del2 a where a.ctid not in (select min(ctid) from t_test_del2 group by id);

SQL2 耗时 34344.257 ms -- group by

                                                                QUERY PLAN                                                                  
----------------------------------------------------------------------------------------------------------------------------------------------
 Delete on t_test_del2 a  (cost=897.00..2617570.60 rows=34825 width=12) (actual time=2013.050..34343.828 rows=35000 loops=1)
   (Buffers: shared hit=35595 dirtied=383)
   ->  Nested Loop Anti Join  (cost=897.00..2617570.60 rows=34825 width=12) (actual time=2012.963..32828.016 rows=35000 loops=1)
         Join Filter: ((a.ctid = "ANY_subquery".min) OR ("ANY_subquery".min IS NULL))
         Rows Removed by Join Filter: 187497500
         ->  Seq Scan on t_test_del2 a  (cost=0.00..697.00 rows=40000 width=6) (actual time=0.010..68.794 rows=40000 loops=1)
               (Buffers: shared hit=297)
         ->  Materialize  (cost=897.00..1022.00 rows=5000 width=6) (actual time=26.435..9741.051 rows=187502500 loops=40000)
               (Buffers: shared hit=297)
               ->  Subquery Scan on "ANY_subquery"  (cost=897.00..997.00 rows=5000 width=6) (actual time=16.987..19.366 rows=5000 loops=1)
                     ->  HashAggregate  (cost=897.00..947.00 rows=5000 width=18) (actual time=16.985..18.702 rows=5000 loops=1)
                           Group By Key: t_test_del2.id
                           (Buffers: shared hit=297)
                           ->  Seq Scan on t_test_del2  (cost=0.00..697.00 rows=40000 width=10) (actual time=0.007..6.288 rows=40000 loops=1)
                                 (Buffers: shared hit=297)
 Total runtime: 34344.257 ms
(16 rows)

PG 12.2 版本实验

SQL1 耗时 103650.191ms

explain (analyse, buffers, costs,timing )
delete from t_test_del a where a.ctid <> (select min(t.ctid) from t_test_del t where a.id=t.id);

业务表中有大量重复数据，需要对重复数据删除。重复的数据保留一条

                                                             QUERY PLAN                                                             
------------------------------------------------------------------------------------------------------------------------------------
 Delete on t_test_del a  (cost=0.00..31881997.00 rows=39800 width=6) (actual time=103649.312..103649.312 rows=0 loops=1)
   Buffers: shared hit=11915297 dirtied=64
   ->  Seq Scan on t_test_del a  (cost=0.00..31881997.00 rows=39800 width=6) (actual time=12281.459..103377.365 rows=35000 loops=1)
         Filter: (ctid <> (SubPlan 1))
         Rows Removed by Filter: 5000
         Buffers: shared hit=11880297
         SubPlan 1
           ->  Aggregate  (cost=797.02..797.03 rows=1 width=6) (actual time=2.581..2.581 rows=1 loops=40000)
                 Buffers: shared hit=11880000
                 ->  Seq Scan on t_test_del t  (cost=0.00..797.00 rows=8 width=6) (actual time=0.157..2.572 rows=8 loops=40000)
                       Filter: (a.id = id)
                       Rows Removed by Filter: 39992
                       Buffers: shared hit=11880000
 Planning Time: 0.416 ms
 Execution Time: 103650.191 ms
(15 rows)

SQL2 耗时 84.926 ms

explain (analyse, buffers, costs,timing )
delete from t_test_del2 a where a.ctid not in (select min(ctid) from t_test_del2 group by id);

                                                             QUERY PLAN                                                             
------------------------------------------------------------------------------------------------------------------------------------
 Delete on t_test_del2 a  (cost=959.50..1756.50 rows=20000 width=6) (actual time=84.670..84.670 rows=0 loops=1)
   Buffers: shared hit=35594 dirtied=260
   ->  Seq Scan on t_test_del2 a  (cost=959.50..1756.50 rows=20000 width=6) (actual time=17.085..25.682 rows=35000 loops=1)
         Filter: (NOT (hashed SubPlan 1))
         Rows Removed by Filter: 5000
         Buffers: shared hit=594
         SubPlan 1
           ->  HashAggregate  (cost=897.00..947.00 rows=5000 width=10) (actual time=13.358..14.542 rows=5000 loops=1)
                 Group Key: t_test_del2.id
                 Buffers: shared hit=297
                 ->  Seq Scan on t_test_del2  (cost=0.00..697.00 rows=40000 width=10) (actual time=0.008..5.064 rows=40000 loops=1)
                       Buffers: shared hit=297
 Planning Time: 0.140 ms
 Execution Time: 84.926 ms
(14 rows)