问题描述:研发反馈测试环境跑批量任务时总是出现死锁,而可以确认的是相同ACCT_NO不会出现在不同进程,因此就不存在多个进程update相同数据的可能性.
数据库:oracle 11.2.0.4 rac
后台日志:
2023-11-17T08:43:33.626029+08:00
Global Enqueue Services Deadlock detected (DID = 3_0_48). More information in file
 /u01/app/oracle/diag/rdbms/orclupg/hisdb1/trace/hisdb1_lmd0_17772.trc.
2023-11-17T08:43:50.715420+08:00
Global Enqueue Services Deadlock detected (DID = 3_0_49). More information in file
 /u01/app/oracle/diag/rdbms/orclupg/hisdb1/trace/hisdb1_lmd0_17772.trc.

1、问题分析
--根据gv$active_session_history查询到该时间段存在两个会话相互阻塞的相关信息.
alter session set nls_date_format='yyyy-mm-dd hh24:mi:ss';
select sql_id,
      sql_exec_id,
	  sql_exec_start,
      to_char(sample_time, 'yyyymmdd hh24:mi:ss') st,
      inst_id,
      session_id,
      sql_id,
      machine,
      event,
      blocking_inst_id,
      blocking_session
 from gv$active_session_history
where sample_time > to_date('20231117 08:43', 'yyyymmdd hh24:mi')
  and sample_time < to_date('20231117 08:44', 'yyyymmdd hh24:mi')
order by st;

sql_id           sql_exec_id sql_exec_start      st                inst_id session_id  machine      event                         blocking_inst_id   blocking_session
---------------  ----------- ------------------- ----------------- ------- ----------- ------------ ----------------------------  -----------------  ----------------------------
82gz6jhs91s4k    17928968    2023-11-17 08:42:55 20231117 08:43:33 1       2182        orclxgea03  enq: TX - allocate ITL entry  1                  3437
82gz6jhs91s4k    17928938    2023-11-17 08:42:51 20231117 08:43:33 1       3437        orclxgea04  enq: TX - allocate ITL entry  1                  2182
......																                                                                                 
82gz6jhs91s4k    17924183    2023-11-17 08:42:08 20231117 08:43:50 1       4231        orclxgea01  enq: TX - allocate ITL entry  2                  1056
82gz6jhs91s4k    49682925    2023-11-17 08:42:51 20231117 08:43:50 2       1056        orclxgea09  enq: TX - allocate ITL entry  1                  4231

--查询SQL语句
select sql_text from v$sqltext where sql_id='82gz6jhs91s4k';

--查询绑定变量
SQL> select b.name,b.position,b.datatype_string,b.value_string,b.last_captured from v$sql_bind_capture b where b.sql_id='82gz6jhs91s4k';

NAME              POSITION DATATYPE_STRING VALUE_STRING         LAST_CAPTURED
--------------- ---------- --------------- -------------------- -------------------
:1                       1 VARCHAR2(32)
:2                       2 VARCHAR2(32)
:3                       3 NUMBER
:4                       4 VARCHAR2(32)
:5                       5 NUMBER
:6                       6 VARCHAR2(32)
:7                       7 VARCHAR2(32)
:8                       8 VARCHAR2(32)
:9                       9 NUMBER
:10                     10 VARCHAR2(32)
:11                     11 VARCHAR2(32)

NAME              POSITION DATATYPE_STRING VALUE_STRING         LAST_CAPTURED
--------------- ---------- --------------- -------------------- -------------------
:12                     12 VARCHAR2(32)
:13                     13 VARCHAR2(32)
:14                     14 VARCHAR2(128)   191034056069361769   2023-11-17 08:40:58
:15                     15 VARCHAR2(128)   Pint           2023-11-17 08:40:58
:16                     16 VARCHAR2(32)    9999                 2023-11-17 08:40:58

--查历史绑定变量
select b.name,b.datatype_string,b.value_string,b.last_captured from dba_hist_sqlbind b where b.sql_id='82gz6jhs91s4k' order by 4;

NAME            DATATYPE_STRING VALUE_STRING         LAST_CAPTURED
--------------- --------------- -------------------- -------------------
:16             VARCHAR2(32)    9999                 2023-11-17 08:40:58
:15             VARCHAR2(128)   Pint                 2023-11-17 08:40:58
:14             VARCHAR2(128)   191034056069361769   2023-11-17 08:40:58
:16             VARCHAR2(32)    9999                 2023-11-17 08:41:01
:15             VARCHAR2(128)   Pint                 2023-11-17 08:41:01
:14             VARCHAR2(128)   181034000069361706   2023-11-17 08:41:01
                                
2、分析结果
造成此异常的直接原因为事务一直没有提交,数据库一直处于ITL卡槽分配等待状态,最终事务超时.
系统为什么会产生enq: TX - allocate ITL entry等待事件?
经过分析确认批量业务处理时会按数据量使用多线程处理,程序按照业务的任务数量分配线程,当所有线程处理结束后统一提交事务。
1) 业务程序处理大量数据,主线程等待所有线程处理完成后才会被唤醒执行提交事务;
2) 子线程中执行程序中的UPDATE时多个事务让ORACLE分配ITL事务卡槽,而discountInstance数据块的卡槽已全部分配,此时便会等其他使用卡槽的事务释放,而其他事务也在等待主线程通知commit,最终造成系统程序的主线程与Oracle事务中ITL卡槽分配互锁.

3、解决方案
--通过增加INITRANS或者PCTFREE来解决该异常,以下为具体实施步骤.
3.1、查表AT_ACT_BIL信息
--生产环境
SQL> select table_name,tablespace_name,ini_trans,max_trans,OWNER,PCT_FREE,PCT_USED from all_tables where table_name='AT_ACT_BIL';

TABLE_NAME                     TABLESPACE_NAME                 INI_TRANS  MAX_TRANS OWNER             PCT_FREE   PCT_USED
------------------------------ ------------------------------ ---------- ---------- --------------- ---------- ----------
AT_ACT_BIL                     TBS_orcl_DATA                          16        255 orcl                   20

--测试环境
SQL> select table_name,tablespace_name,ini_trans,max_trans,OWNER,PCT_FREE,PCT_USED from all_tables where table_name='AT_ACT_BIL';

TABLE_NAME                     TABLESPACE_NAME                 INI_TRANS  MAX_TRANS OWNER             PCT_FREE   PCT_USED
------------------------------ ------------------------------ ---------- ---------- --------------- ---------- ----------
AT_ACT_BIL                     TBS_orcl_DATA                           1        255 orcl                   10

3.2、修改表AT_ACT_BIL参数
--测试环境修改表AT_ACT_BIL参数
SQL> alter table TIMER.AT_ACT_BIL initrans 16;

Table altered.

SQL> alter table TIMER.AT_ACT_BIL pctfree 20;

Table altered.

SQL> select table_name,tablespace_name,ini_trans,max_trans,OWNER,PCT_FREE,PCT_USED from all_tables where table_name='AT_ACT_BIL';

TABLE_NAME      TABLESPACE_NAME                 INI_TRANS  MAX_TRANS OWNER                  PCT_FREE   PCT_USED
--------------- ------------------------------ ---------- ---------- -------------------- ---------- ----------
AT_ACT_BIL      TBS_orcl_DATA                          16        255 TIMER                       20

3.3、re-organize table
SQL> alter table TIMER.AT_ACT_BIL move;

Table altered.

后台日志:
2023-11-17T14:55:13.656224+08:00
Thread 1 advanced to log sequence 3270 (LGWR switch),  current SCN: 16570803317
  Current log# 2 seq# 3270 mem# 0: +DATA/orclUPG/ONLINELOG/group_2.837.1151426175
2023-11-17T14:55:15.535886+08:00
ARC0 (PID:18213): Archived Log entry 8237 added for T-1.S-3269 ID 0x8b1b2026 LAD:1
2023-11-17T14:55:22.598016+08:00
Some indexes or index [sub]partitions of table TIMER.AT_ACT_BIL have been marked unusable
2023-11-17T14:55:51.651350+08:00

3.4、rebuild index
SQL> select INDEX_NAME,TABLE_OWNER,TABLE_NAME,TABLESPACE_NAME from dba_indexes  where TABLE_NAME='AT_ACT_BIL'

INDEX_NAME                TABLE_OWNER               TABLE_NAME      TABLESPACE_NAME
------------------------- ------------------------- --------------- ------------------------------
IDX_AT_ACT_BIL_AL         TIMER                     AT_ACT_BIL      TBS_orcl_IDX
IDX_UK_AT_ACT_BIL_ABL     TIMER                     AT_ACT_BIL      TBS_orcl_IDX

SQL> alter index TIMER.IDX_AT_ACT_BIL_AL rebuild pctfree 20 initrans 16;
SQL> alter index TIMER.IDX_UK_AT_ACT_BIL_ABL rebuild pctfree 20 initrans 16;

说明:此后恢复正常.