Question

In Oracle, what are cardinality and selectivity?

Answer

 

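Cardinality is the number of rows Oracle estimates an operation will return, that is, the estimated number of records in the result of one specific execution step of the target SQL statement. When applied to the whole statement, cardinality is the estimated number of records in the statement's final result. For example, suppose table T has 1000 rows, and column COL1 has no histogram, no NULL values, and 500 distinct values. When the table is accessed with the condition "WHERE COL1=<VALUE>", the optimizer assumes the data is uniformly distributed and estimates that 1000/500=2 rows will be selected; 2 is the cardinality of this step. In general, the more accurate the cardinality estimate, the more efficient the generated execution plan is likely to be.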

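Selectivity is the ratio of the number of records returned after a given predicate is applied to the number of records in the original result set with no predicate applied. Selectivity therefore ranges from 0 to 1. The smaller the value, the more selective the predicate; a selectivity of 1 is the worst case. The CBO uses selectivity to estimate the cardinality of the corresponding result set, and the two are related as follows: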

cardinality = NUM_ROWS * selectivity

Here NUM_ROWS is the total number of rows in the table.
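As a minimal sketch of this relationship, the Python snippet below reproduces the arithmetic for the table T example (1000 rows, 500 distinct values, no NULLs, no histogram). The function names are illustrative and are not Oracle APIs; the real optimizer applies further adjustments.

```python
def equality_selectivity(num_distinct: int) -> float:
    """Selectivity of COL = <value>, assuming uniform data, no NULLs, no histogram."""
    if num_distinct <= 0:
        raise ValueError("num_distinct must be positive")
    return 1.0 / num_distinct


def cardinality(num_rows: int, selectivity: float) -> float:
    """cardinality = NUM_ROWS * selectivity."""
    return num_rows * selectivity


# Table T from the text: 1000 rows, 500 distinct values in COL1.
sel = equality_selectivity(500)
est_rows = cardinality(1000, sel)
print(f"selectivity={sel:.4f}, estimated rows={est_rows:.1f}")
```

The estimate of about 2 rows agrees with the 1000/500 calculation in the text.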

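By default, Oracle assumes that the columns appearing in a SQL statement's WHERE clause are independent of one another. So when the per-column conditions are combined with AND, the combined selectivity of the whole WHERE clause equals the product of the selectivities of the individual column conditions. Oracle then estimates the cardinality of the statement's result set by multiplying the table's total row count (NUM_ROWS) by this combined selectivity. The independence assumption does not always hold, however; in real applications, correlated columns are not rare. In that case, a combined selectivity computed this way can yield an estimate that deviates considerably from the actual row count, which may lead the CBO to choose a poor execution plan. For this reason Oracle introduced dynamic sampling and multi-column statistics.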
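The effect of the independence assumption can be sketched in a few lines of Python. The table, the column names CITY and ZIP_CODE, and all of the numbers are hypothetical and chosen only to make the deviation visible; they do not come from the example in this post.

```python
def combined_selectivity_and(*selectivities: float) -> float:
    """Combined selectivity of AND-ed predicates, assuming independent columns."""
    result = 1.0
    for s in selectivities:
        result *= s
    return result


# Hypothetical table: 10,000 rows, CITY and ZIP_CODE each have 100 distinct values.
num_rows = 10_000
sel_city = 1 / 100
sel_zip = 1 / 100

# Estimate under the independence assumption: about 1 row.
estimated = num_rows * combined_selectivity_and(sel_city, sel_zip)

# Assumption for illustration: ZIP_CODE fully determines CITY, so the CITY
# predicate filters nothing further and about 100 rows really match.
actual_if_correlated = num_rows * sel_zip

print(f"estimated={estimated:.1f}, actual={actual_if_correlated:.1f}")
```

In this made-up case the estimate is off by a factor of about 100, which is the kind of gap that dynamic sampling and multi-column statistics are meant to close.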

The following table lists some common selectivity formulas:

[Table: common selectivity formulas (embedded as two images in the original post; contents not recoverable here)]

An example follows:

DROP TABLE T_ROWS_20170605_LHR;
CREATE TABLE T_ROWS_20170605_LHR AS SELECT ROWNUM ID,'NAME1' SAL FROM DUAL CONNECT BY LEVEL<=10000;
-- Oracle treats '' as NULL, so this sets ID to NULL for the first 100 rows
UPDATE T_ROWS_20170605_LHR T SET T.ID='' WHERE T.ID<=100;
-- SIZE 1 means no histogram is gathered on any column
EXEC DBMS_STATS.GATHER_TABLE_STATS(USER,'T_ROWS_20170605_LHR',CASCADE=>TRUE,METHOD_OPT=>'FOR ALL COLUMNS SIZE 1',estimate_percent => 100);
LHR@orclasm > COL LOW_VALUE FORMAT A20
LHR@orclasm > COL HIGH_VALUE FORMAT A20
LHR@orclasm > SELECT D.LOW_VALUE,D.HIGH_VALUE,UTL_RAW.CAST_TO_NUMBER(D.LOW_VALUE) LOW_VALUE2,UTL_RAW.CAST_TO_NUMBER(D.HIGH_VALUE) HIGH_VALUE2, D.NUM_DISTINCT,D.NUM_NULLS FROM DBA_TAB_COL_STATISTICS D WHERE D.TABLE_NAME = 'T_ROWS_20170605_LHR' AND D.OWNER = 'LHR' AND D.COLUMN_NAME='ID';

LOW_VALUE            HIGH_VALUE           LOW_VALUE2 HIGH_VALUE2 NUM_DISTINCT  NUM_NULLS
-------------------- -------------------- ---------- ----------- ------------ ----------
C20202               C302                        101       10000         9900        100

LHR@orclasm > SELECT MIN(T.ID),DUMP(MIN(T.ID),16) LOW_VALUE,MAX(T.ID),DUMP(MAX(T.ID),16) HIGH_VALUE FROM  T_ROWS_20170605_LHR T;
 MIN(T.ID) LOW_VALUE             MAX(T.ID) HIGH_VALUE
---------- -------------------- ---------- --------------------
       101 Typ=2 Len=3: c2,2,2       10000 Typ=2 Len=2: c3,2

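Next, run the following four SQL statements and obtain their execution plans (the trailing comment on each is the actual row count):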

SELECT COUNT(1) FROM T_ROWS_20170605_LHR T WHERE T.ID=1000;--1
SELECT COUNT(1) FROM T_ROWS_20170605_LHR T WHERE T.ID>1000; --9000
SELECT COUNT(1) FROM T_ROWS_20170605_LHR T WHERE T.ID>=1000; --9001
SELECT COUNT(1) FROM T_ROWS_20170605_LHR T WHERE T.ID BETWEEN 1000 AND 1100; --101


LHR@orclasm > set autot on exp
LHR@orclasm > SELECT COUNT(1) FROM T_ROWS_20170605_LHR T WHERE T.ID=1000;
  COUNT(1)
----------
         1
Execution Plan
----------------------------------------------------------
Plan hash value: 612708570
------------------------------------------------------------------------------------------
| Id  | Operation          | Name                | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                     |     1 |     4 |     9   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |                     |     1 |     4 |            |          |
|*  2 |   TABLE ACCESS FULL| T_ROWS_20170605_LHR |     1 |     4 |     9   (0)| 00:00:01 |
------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("T"."ID"=1000)

-- ROUND(NUM_ROWS*(1/NUM_DISTINCT)*((NUM_ROWS-NUM_NULLS)/NUM_ROWS))
LHR@orclasm > SELECT ROUND(10000*1/9900*((10000-100)/10000)) VALUE FROM DUAL;

     VALUE
----------
         1

LHR@orclasm > SELECT COUNT(1) FROM T_ROWS_20170605_LHR T WHERE T.ID>1000;
  COUNT(1)
----------
      9000
Execution Plan
----------------------------------------------------------
Plan hash value: 612708570
------------------------------------------------------------------------------------------
| Id  | Operation          | Name                | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                     |     1 |     4 |     9   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |                     |     1 |     4 |            |          |
|*  2 |   TABLE ACCESS FULL| T_ROWS_20170605_LHR |  9001 | 36004 |     9   (0)| 00:00:01 |
------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("T"."ID">1000)

--ROUND(NUM_ROWS*((HIGH_VALUE-VAL)/(HIGH_VALUE-LOW_VALUE))*((NUM_ROWS-NUM_NULLS)/NUM_ROWS))
LHR@orclasm > SELECT ROUND(10000*((10000-1000)/(10000-101))*((10000-100)/10000)) VALUE FROM DUAL;

     VALUE
----------
      9001


LHR@orclasm > SELECT COUNT(1) FROM T_ROWS_20170605_LHR T WHERE T.ID>=1000;
  COUNT(1)
----------
      9001
Execution Plan
----------------------------------------------------------
Plan hash value: 612708570
------------------------------------------------------------------------------------------
| Id  | Operation          | Name                | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                     |     1 |     4 |     9   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |                     |     1 |     4 |            |          |
|*  2 |   TABLE ACCESS FULL| T_ROWS_20170605_LHR |  9002 | 36008 |     9   (0)| 00:00:01 |
------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("T"."ID">=1000)

--ROUND(NUM_ROWS*((HIGH_VALUE-VAL)/(HIGH_VALUE-LOW_VALUE)+1/NUM_DISTINCT)*((NUM_ROWS-NUM_NULLS)/NUM_ROWS))
LHR@orclasm > SELECT ROUND(10000*((10000-1000)/(10000-101)+1/9900)*((10000-100)/10000)) VALUE FROM DUAL;

     VALUE
----------
      9002


LHR@orclasm > SELECT COUNT(1) FROM T_ROWS_20170605_LHR T WHERE T.ID BETWEEN 1000 AND 1100;
  COUNT(1)
----------
       101
Execution Plan
----------------------------------------------------------
Plan hash value: 612708570
------------------------------------------------------------------------------------------
| Id  | Operation          | Name                | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                     |     1 |     4 |     9   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |                     |     1 |     4 |            |          |
|*  2 |   TABLE ACCESS FULL| T_ROWS_20170605_LHR |   102 |   408 |     9   (0)| 00:00:01 |
------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("T"."ID"<=1100 AND "T"."ID">=1000)

--ROUND(NUM_ROWS*(((VAL2-VAL1)/(HIGH_VALUE-LOW_VALUE)+2/NUM_DISTINCT)*((NUM_ROWS-NUM_NULLS)/NUM_ROWS)))
LHR@orclasm > SELECT ROUND(10000*(((1100-1000)/(10000-101)+2/9900)*((10000-100)/10000))) VALUE FROM DUAL;

     VALUE
----------
       102

As shown, the estimated row counts match the results computed from the formulas.
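For readers who prefer to check the arithmetic outside the database, here is a minimal Python sketch of the four formulas quoted in the session comments, using the statistics gathered above (NUM_ROWS=10000, NUM_NULLS=100, NUM_DISTINCT=9900, LOW_VALUE=101, HIGH_VALUE=10000). The function names are illustrative only, and the sketch assumes the predicate values lie between LOW_VALUE and HIGH_VALUE; it is not a model of everything the optimizer does.

```python
import math

NUM_ROWS, NUM_NULLS, NUM_DISTINCT = 10000, 100, 9900
LOW_VALUE, HIGH_VALUE = 101, 10000
NOT_NULL_FRACTION = (NUM_ROWS - NUM_NULLS) / NUM_ROWS
VALUE_RANGE = HIGH_VALUE - LOW_VALUE


def round_half_up(x: float) -> int:
    """Round non-negative x like Oracle's ROUND (Python's round() rounds halves to even)."""
    return math.floor(x + 0.5)


def card_eq() -> int:
    """ID = val"""
    return round_half_up(NUM_ROWS * (1 / NUM_DISTINCT) * NOT_NULL_FRACTION)


def card_gt(val: float) -> int:
    """ID > val"""
    sel = (HIGH_VALUE - val) / VALUE_RANGE
    return round_half_up(NUM_ROWS * sel * NOT_NULL_FRACTION)


def card_ge(val: float) -> int:
    """ID >= val: the open-range selectivity plus one distinct value."""
    sel = (HIGH_VALUE - val) / VALUE_RANGE + 1 / NUM_DISTINCT
    return round_half_up(NUM_ROWS * sel * NOT_NULL_FRACTION)


def card_between(val1: float, val2: float) -> int:
    """ID BETWEEN val1 AND val2: both endpoints are inclusive."""
    sel = (val2 - val1) / VALUE_RANGE + 2 / NUM_DISTINCT
    return round_half_up(NUM_ROWS * sel * NOT_NULL_FRACTION)


print(card_eq(), card_gt(1000), card_ge(1000), card_between(1000, 1100))
```

The four values printed correspond to the Rows column of the TABLE ACCESS FULL step in the four plans: 1, 9001, 9002 and 102.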

Next, look at how cardinality is computed when the column has a frequency histogram.

DROP TABLE T_ROWS_20170605_LHR;
CREATE TABLE T_ROWS_20170605_LHR AS SELECT ROWNUM ID,'NAME1' SAL FROM DUAL CONNECT BY LEVEL<=10000;
UPDATE T_ROWS_20170605_LHR T SET T.ID='' WHERE T.ID<=100;
UPDATE T_ROWS_20170605_LHR SET ID=2 WHERE ID BETWEEN 101 AND 200;
UPDATE T_ROWS_20170605_LHR SET ID=3 WHERE ID BETWEEN 200 AND 3000; 
UPDATE T_ROWS_20170605_LHR SET ID=9 WHERE ID BETWEEN 3000 AND 9999;
SELECT T.ID,COUNT(*) FROM T_ROWS_20170605_LHR T GROUP BY T.ID;

Check the data distribution:

LHR@orclasm > SELECT T.ID,COUNT(*) FROM T_ROWS_20170605_LHR T GROUP BY T.ID;

        ID   COUNT(*)
---------- ----------
                  100
     10000          1
         2        100
         3       2800
         9       6999

Gather a frequency histogram:

LHR@orclasm > EXEC DBMS_STATS.GATHER_TABLE_STATS(USER,'T_ROWS_20170605_LHR',CASCADE=>TRUE,METHOD_OPT=>'FOR COLUMNS ID SIZE 6',estimate_percent => 100);

PL/SQL procedure successfully completed.

LHR@orclasm > SELECT D.COLUMN_NAME,D.NUM_DISTINCT,D.NUM_NULLS,D.NUM_BUCKETS,D.HISTOGRAM,D.DENSITY FROM DBA_TAB_COLUMNS D WHERE D.TABLE_NAME = 'T_ROWS_20170605_LHR' AND D.COLUMN_NAME='ID';

COLUMN_NAME                    NUM_DISTINCT  NUM_NULLS NUM_BUCKETS HISTOGRAM          DENSITY
------------------------------ ------------ ---------- ----------- --------------- ----------
ID                                        4        100           4 FREQUENCY       .000050505

LHR@orclasm > COL COLUMN_NAME FORMAT A6
LHR@orclasm > SELECT TABLE_NAME,COLUMN_NAME,ENDPOINT_NUMBER,ENDPOINT_VALUE,NVL((ENDPOINT_NUMBER-(LAG(ENDPOINT_NUMBER) OVER (ORDER BY ENDPOINT_VALUE))),ENDPOINT_NUMBER) COUNTS FROM DBA_TAB_HISTOGRAMS WHERE TABLE_NAME='T_ROWS_20170605_LHR' AND COLUMN_NAME='ID';

TABLE_NAME                     COLUMN ENDPOINT_NUMBER ENDPOINT_VALUE     COUNTS
------------------------------ ------ --------------- -------------- ----------
T_ROWS_20170605_LHR            ID                 100              2        100
T_ROWS_20170605_LHR            ID                2900              3       2800
T_ROWS_20170605_LHR            ID                9899              9       6999
T_ROWS_20170605_LHR            ID                9900          10000          1

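When the target column has a frequency histogram and an equality predicate is applied to it, if the predicate's input value equals the ENDPOINT_VALUE of one of the column's buckets, then cardinality=Current_ENDPOINT_NUMBER-Previous_ENDPOINT_NUMBER: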

LHR@orclasm > SET AUTOT ON
LHR@orclasm > SELECT COUNT(1) FROM T_ROWS_20170605_LHR T WHERE T.ID=3;

  COUNT(1)
----------
      2800

Execution Plan
----------------------------------------------------------
Plan hash value: 612708570

------------------------------------------------------------------------------------------
| Id  | Operation          | Name                | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                     |     1 |     3 |     9   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |                     |     1 |     3 |            |          |
|*  2 |   TABLE ACCESS FULL| T_ROWS_20170605_LHR |  2800 |  8400 |     9   (0)| 00:00:01 |
------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("T"."ID"=3)


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
         23  consistent gets
          0  physical reads
          0  redo size
        526  bytes sent via SQL*Net to client
        519  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

As shown, the estimated row count is 2800, which matches the value stored in the histogram (2900-100).
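The subtraction works because ENDPOINT_NUMBER in a frequency histogram is cumulative. As a rough illustration, the Python sketch below turns the four (ENDPOINT_VALUE, ENDPOINT_NUMBER) pairs from the DBA_TAB_HISTOGRAMS output into per-value row counts, mirroring what the LAG expression does in SQL. The function name is made up for this example.

```python
# (ENDPOINT_VALUE, ENDPOINT_NUMBER) pairs copied from the DBA_TAB_HISTOGRAMS output.
endpoints = [(2, 100), (3, 2900), (9, 9899), (10000, 9900)]


def bucket_counts(pairs):
    """Convert cumulative ENDPOINT_NUMBERs into a {value: row count} mapping."""
    counts = {}
    previous = 0  # the first bucket has no predecessor, so its count is its own endpoint number
    for value, endpoint_number in sorted(pairs):
        counts[value] = endpoint_number - previous
        previous = endpoint_number
    return counts


counts = bucket_counts(endpoints)
print(counts[3])  # 2800, the estimate shown in the plan for ID=3
```

The resulting counts (100, 2800, 6999, 1) are the same as the COUNTS column returned by the histogram query.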

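When the target column has a frequency histogram and an equality predicate is applied to it, if the predicate's input value does not equal the ENDPOINT_VALUE of any bucket, then cardinality=MIN(Current_ENDPOINT_NUMBER-Previous_ENDPOINT_NUMBER)/2: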

LHR@orclasm > SELECT COUNT(1) FROM T_ROWS_20170605_LHR T WHERE T.ID=4;

  COUNT(1)
----------
         0

Execution Plan
----------------------------------------------------------
Plan hash value: 612708570

------------------------------------------------------------------------------------------
| Id  | Operation          | Name                | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                     |     1 |     3 |     9   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |                     |     1 |     3 |            |          |
|*  2 |   TABLE ACCESS FULL| T_ROWS_20170605_LHR |     1 |     3 |     9   (0)| 00:00:01 |
------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("T"."ID"=4)


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
         23  consistent gets
          0  physical reads
          0  redo size
        525  bytes sent via SQL*Net to client
        519  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed
LHR@orclasm > select round(1/2) from dual;

ROUND(1/2)
----------
         1

In this histogram, MIN(Current_ENDPOINT_NUMBER-Previous_ENDPOINT_NUMBER)=1, so ROUND(1/2)=1, which matches the estimated row count in the execution plan.
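Both equality cases can be put together in one short Python sketch. It encodes only the two rules stated in this post, with the per-value counts taken from the histogram above; the function name is hypothetical, and the behaviour of a real optimizer may differ by version or for values outside the column's range. Note that Python's built-in round() rounds halves to even, so the sketch rounds half up explicitly to mimic Oracle's ROUND.

```python
import math

# Per-value row counts derived from the frequency histogram on ID.
BUCKET_COUNTS = {2: 100, 3: 2800, 9: 6999, 10000: 1}


def estimate_equality(value, bucket_counts=BUCKET_COUNTS):
    """Estimated rows for ID = value under the two rules described in the text."""
    if value in bucket_counts:
        # Hit: the value is an ENDPOINT_VALUE, so use that bucket's row count.
        return bucket_counts[value]
    # Miss: half of the smallest bucket, rounded half up like Oracle's ROUND.
    half_smallest = min(bucket_counts.values()) / 2
    return math.floor(half_smallest + 0.5)


print(estimate_equality(3))  # 2800
print(estimate_equality(4))  # 1
print(round(1 / 2))          # 0, because Python rounds halves to even
```

The first two results agree with the Rows values in the plans for ID=3 and ID=4.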

Author: 小麦苗

 

 
