一直以来DISTINCT的用法都为DBA所知,就是用来滤重。我们也没有必要质疑DISTINCT的滤重准确性。
但今天突然不知道怎么想的,就想质疑一下。DISTINCT真的能保证过滤的对象没有重复记录吗?
SQL> create table test_distinct as select rownum id from all_objects where rownum < 50000; insert into test_distinct select rownum id from all_objects where rownum < 50000;
这表中有重复的数据,数据插入顺序1~4999,然后再重复一次
SQL> select count(id) from (select distinct id from test_distinct) group by id having count(id) > 1; no rows selected
可以看到,通过这条语句发现确实没有重复的行。那么进一步思考如果不用distinct如何实现这个效果呢。我想起了ROWID这个东西,以下两句的效果相同。
select count(distinct phoneno) from CUSTPHONE select count(phoneno) from CUSTPHONE where rowid in( select min(rowid) from CUSTPHONE group by phoneno)
我想了解一下DISTINCT语句实际在Oracle中是如何操作的,通过10046事件和tkprof工具获取跟踪的信息。
SQL ID: 8vtyapcbqkbwf Plan Hash: 2372476266 select distinct id from test_distinct where rownum < 100 call count cpu elapsed disk query current rows ------- ------ -------- ---------- ---------- ---------- ---------- ---------- Parse 1 0.02 0.02 0 138 0 0 Execute 1 0.00 0.00 0 0 0 0 Fetch 8 0.00 0.00 0 4 0 99 ------- ------ -------- ---------- ---------- ---------- ---------- ---------- total 10 0.02 0.02 0 142 0 99 Misses in library cache during parse: 1 Optimizer mode: ALL_ROWS Parsing user id: SYS Rows Row Source Operation ------- --------------------------------------------------- 99 HASH UNIQUE (cr=4 pr=0 pw=0 time=0 us cost=528 size=1287 card=99) 99 COUNT STOPKEY (cr=4 pr=0 pw=0 time=196 us) 99 TABLE ACCESS FULL TEST_DISTINCT (cr=4 pr=0 pw=0 time=98 us cost=44 size=1318174 card=101398) Elapsed times include waiting on following events: Event waited on Times Max. Wait Total Waited ---------------------------------------- Waited ---------- ------------ SQL*Net message to client 8 0.00 0.00 asynch descriptor resize 2 0.00 0.00 SQL*Net message from client 8 59.43 59.43
可以看到执行计划中DISINCT是通过HASH UNIQUE算法来实现的。同时ROWNUM虚列使用的是COUNT算法,STOPKEY说明我给ROWNUM虚列加了限定条件100,当到达这个限定条件时,该语句查询结束。
那么到这我该怎么理解HASH UNIQUE算法的目的呢?我在网上查看了相关信息,发现真有人做了实验实验帮助我们加上对该算法的印象。在10G2R以前,Oracle对DISTINCT使用的是sort unique这种操作方式因为涉及到排序,是非常影响语句的执行效率的。因此10G2R之后的版本,Oracle改进了算法。
SQL> select distinct id from test_distinct where rownum < 100; 99 rows selected. Execution Plan ---------------------------------------------------------- Plan hash value: 2372476266 --------------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | --------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 99 | 1287 | | 528 (1)| 00:00:07 | | 1 | HASH UNIQUE | | 99 | 1287 | 2000K| 528 (1)| 00:00:07 | |* 2 | COUNT STOPKEY | | | | | | | | 3 | TABLE ACCESS FULL| TEST_DISTINCT | 101K| 1287K| | 44 (3)| 00:00:01 | --------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 2 - filter(ROWNUM<100) Note ----- - dynamic sampling used for this statement (level=2) Statistics ---------------------------------------------------------- 0 recursive calls 0 db block gets 4 consistent gets 8 physical reads 0 redo size 2134 bytes sent via SQL*Net to client 589 bytes received via SQL*Net from client 8 SQL*Net roundtrips to/from client 0 sorts (memory) 0 sorts (disk) 99 rows processed SQL> alter system flush buffer_cache SQL> select id from test_distinct where rownum < 100 group by id; 99 rows selected. Execution Plan ---------------------------------------------------------- Plan hash value: 521476922 ------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 99 | 1287 | 47 (9)| 00:00:01 | | 1 | HASH GROUP BY | | 99 | 1287 | 47 (9)| 00:00:01 | |* 2 | COUNT STOPKEY | | | | | | | 3 | TABLE ACCESS FULL| TEST_DISTINCT | 101K| 1287K| 44 (3)| 00:00:01 | ------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 2 - filter(ROWNUM<100) Note ----- - dynamic sampling used for this statement (level=2) Statistics ---------------------------------------------------------- 0 recursive calls 0 db block gets 4 consistent gets 8 physical reads 0 redo size 2134 bytes sent via SQL*Net to client 589 bytes received via SQL*Net from client 8 SQL*Net roundtrips to/from client 0 sorts (memory) 0 sorts (disk) 99 rows processed
通过网上这组实验可以看到DISTINCT和GROUP BY分别使用了HASH UNIQUE和HASH GROUP BY算法。而两者执行效果和结果都相同。基于规则的DISTINCT和GROUP BY的查询的执行计划如下
SQL> select /*+ rule*/ distinct id from test_distinct where rownum < 100; 99 rows selected. Execution Plan ---------------------------------------------------------- Plan hash value: 3449293992 --------------------------------------------- | Id | Operation | Name | --------------------------------------------- | 0 | SELECT STATEMENT | | | 1 | SORT UNIQUE | | |* 2 | COUNT STOPKEY | | | 3 | TABLE ACCESS FULL| TEST_DISTINCT | --------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 2 - filter(ROWNUM<100) Note ----- - rule based optimizer used (consider using cbo) Statistics ---------------------------------------------------------- 1 recursive calls 0 db block gets 4 consistent gets 8 physical reads 0 redo size 2134 bytes sent via SQL*Net to client 589 bytes received via SQL*Net from client 8 SQL*Net roundtrips to/from client 1 sorts (memory) 0 sorts (disk) 99 rows processed SQL> select /*+ rule*/ id from test_distinct where rownum < 100 group by id; 99 rows selected. Execution Plan ---------------------------------------------------------- Plan hash value: 351786816 --------------------------------------------- | Id | Operation | Name | --------------------------------------------- | 0 | SELECT STATEMENT | | | 1 | SORT GROUP BY | | |* 2 | COUNT STOPKEY | | | 3 | TABLE ACCESS FULL| TEST_DISTINCT | --------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 2 - filter(ROWNUM<100) Note ----- - rule based optimizer used (consider using cbo) Statistics ---------------------------------------------------------- 0 recursive calls 0 db block gets 4 consistent gets 8 physical reads 0 redo size 2134 bytes sent via SQL*Net to client 589 bytes received via SQL*Net from client 8 SQL*Net roundtrips to/from client 1 sorts (memory) 0 sorts (disk) 99 rows processed
因为表的数据量教啥,基于rule的算法和通过hash的算法区别并不明显。但重点在于了解Oracle的不同算法的可能。
另外,ORACLE除了提供DISTINCT以外,还提供了UNIQUE来过滤重复的数据。官方文档给出的解释。
http://docs.oracle.com/cd/E11882_01/server.112/e41084/statements_10002.htm#SQLRF55272
Specify DISTINCT
or UNIQUE
if you want the database to return only one copy of each set of duplicate rows selected. These two keywords are synonymous. Duplicate rows are those with matching values for each expression in the select list.
到这里我已经理解了DISTINCT的作用,同时还学习了其它新的知识。非常棒!