集团某在线系统使用OGG做了同步复制用于二期业务生产使用。有同事过来说复制进程有点异常

-bash-3.2$ ogg

Oracle GoldenGate Command Interpreter for Oracle

Version 11.2.1.0.6 16211226 OGGCORE_11.2.1.0.6_PLATFORMS_130418.1829_FBO

Solaris, sparc, 64bit (optimized), Oracle 11g on Apr 22 2013 15:23:39

 

Copyright (C) 1995, 2013, Oracle and/or its affiliates. All rights reserved.

 

 

 

GGSCI (ODSDB) 1> info all

 

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

 

MANAGER     RUNNING                                           

JAGENT      STOPPED                                           

REPLICAT    RUNNING     RP10        00:00:00      00:00:03    

REPLICAT    RUNNING     RP7         00:00:06      00:00:03    

REPLICAT    RUNNING     RPS1        42:41:14      00:00:01    

REPLICAT    RUNNING     RPS2        00:00:00      00:00:10    

REPLICAT    RUNNING     RPS2A       00:00:00      00:00:00    

REPLICAT    RUNNING     RPS2B       00:00:00      00:00:00    

REPLICAT    RUNNING     RPS3        00:00:00      00:00:02    

REPLICAT    RUNNING     RPS3A       00:00:00      00:00:03    

REPLICAT    RUNNING     RPS3B       00:00:00      00:00:02    

REPLICAT    RUNNING     RPS4        00:00:00      00:00:07    

REPLICAT    RUNNING     RPS4A       00:00:00      00:00:07    

REPLICAT    RUNNING     RPS4B       00:00:00      00:00:07    

REPLICAT    RUNNING     RPS4C       00:00:00      00:00:07    

REPLICAT    RUNNING     RPS4D       00:00:00      00:00:08    

REPLICAT    RUNNING     RPS4E       00:00:00      00:00:07    

REPLICAT    RUNNING     RPS4F       00:00:00      00:00:07    

REPLICAT    RUNNING     RPS4G       00:00:00      00:00:07    

REPLICAT    RUNNING     RPS4H       00:00:00      00:00:07    

REPLICAT    RUNNING     RPS4I       00:00:00      00:00:08    

REPLICAT    RUNNING     RPS4J       00:00:00      00:00:07    

跟他去了终端旁边,发现确实有个rps1进程延迟有点高。看着进程状态是RUNNING,但是根据我维护六年ogg来看,有可能是假象,那么怎么判断呢?

1,查看ggserr.log

2,查看dirrpt下面该进程的dsc文件

3,info 进程名多次,看看RBA是否有变化

 

经过以上几步rps1进程状态是正常的,延迟高问题出在哪里?view params rps1看到该进程只有一张表。那我们看看这个进程到底卡在哪里了

 

SQL> !ps -ef |grep rps1 |grep -v grep

  orao 12169 11012   0   Jun 07 ?         614:58 /odogg/baseogg/replicat PARAMFILE /odsogg/baseogg/dirprm/rps1.prm REPORTFILE /

 

SQL> !ps -ef |grep 12169 |grep -v grep

  orao 12170 12169   1   Jun 07 ?        2315:30 oracleodb (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))

  orao 12169 11012   0   Jun 07 ?         614:58 /odogg/baseogg/replicat PARAMFILE /odsogg/baseogg/dirprm/rps1.prm REPORTFILE /

 

SQL> select s.sid,s.SERIAL#,sql_id from v$process p,v$session s where p.addr=s.paddr and p.spid=12170;

 

       SID    SERIAL# SQL_ID

---------- ---------- -------------

      5146        113 5099dwmx3s4mf

 

SQL> /

 

       SID    SERIAL# SQL_ID

---------- ---------- -------------

      5146        113 dzubcf8jm69yx

 

SQL> /

 

       SID    SERIAL# SQL_ID

---------- ---------- -------------

      5146        113 dzubcf8jm69yx

 

SQL> /

 

       SID    SERIAL# SQL_ID

---------- ---------- -------------

      5146        113 dzubcf8jm69yx

询问了一下维护过该系统的同事,说是不是统计信息太旧。顺着思路往下走走

SQL> select OWNER,TABLE_NAME,LAST_ANALYZED from dba_tables where table_name='TF_F_USER_REVGRPTAG';

 

OWNER             TABLE_NAME        LAST_ANALYZED

-------------------------------------------------------------------------------

TTT              TF_F_USER_REXXX     2017-06-26 00:20:56

 

 

SQL>  select table_name,num_rows from dba_tables where owner='TTT' and table_name='TF_F_USER_REXXX';

 

TABLE_NAME              NUM_ROWS

------------------------------ ----------

TF_F_USER_REXXX          409269832

 

SQL> select count(*) from TTT.TF_F_USER_REXXX;

 

  COUNT(*)

----------

 409955510

接下来我们看看该sql执行计划是不是走错了?

wKiom1lbPI-gPRAFAABQCaknQ-0519.jpg

两条执行计划,很明显137632316运行了5400次但是执行时间却远远高于第二个执行计划。

wKioL1lbPKKhgkhLAAHDW7ez5Xk109.jpg

wKiom1lbPKORHZUwAAGw01z_87w545.jpg



sql profile绑定一下执行计划

SQL> @coe_xfr_sql_profile.sql

 

Parameter 1:

SQL_ID (required)

 

Enter value for 1: dzubcf8jm69yx

 

 

PLAN_HASH_VALUE AVG_ET_SECS

--------------- -----------

     2320424968        .004

      137632316      21.119

 

Parameter 2:

PLAN_HASH_VALUE (required)

 

Enter value for 2: 2320424968

 

Values passed to coe_xfr_sql_profile:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

SQL_ID         : "dzubcf8jm69yx"

PLAN_HASH_VALUE: "2320424968"

查看是否绑定成功

SQL>select name from dba_sql_profiles;

 

NAME

------------------------------

coe_dzubcf8jm69yx_2320424968

SYS_SQLPROF_015d075ccdd40000

coe_a5hgtqfq09tcu_229988255

 

在将内存中的该sql语句清除出去

 

SQL>select address,hash_value,executions,parse_calls from v$sql where sql_TEXT like 'UPDATE "TTT"."TF_F_USER_RExxxx" SET "PARTITION_ID" = :a27,"USER_ID" = :a28%';

 

ADDRESS          HASH_VALUE EXECUTIONS PARSE_CALLS

---------------- ---------- ---------- -----------

00000017673EA5E8 4198240878        100           0

00000017CF3E0000  897328139        673           0

00000017982D40D0 1588244011          8           0

000000049D04B1A8  590555101         10           0

00000017AA525230 4271852353          2           0

0000001777FC6A20 3666847391        735           0

00000017B9154A58 1180290247          1           0

 

7 rows selected.

 

SQL>alter session set events '5614566 trace name context forever';

 

Session altered.

 

SQL>exec dbms_shared_pool.purge('00000017673EA5E8,4198240878','C');

 

PL/SQL procedure successfully completed.

 

SQL>exec dbms_shared_pool.purge('00000017CF3E0000,897328139','C');

exec db

PL/SQL procedure successfully completed.

 

SQL>ms_shared_pool.purge('00000017982D40D0,1588244011','C');

 

PL/SQL procedure successfully completed.

 

SQL>exec dbms_shared_pool.purge('000000049D04B1A8,590555101','C');

 

PL/SQL procedure successfully completed.

 

SQL>exec dbms_shared_pool.purge('00000017AA525230,4271852353','C');

 

PL/SQL procedure successfully completed.

 

SQL>exec dbms_shared_pool.purge('0000001777FC6A20,3666847391','C');

 

PL/SQL procedure successfully completed.

 

SQL>exec dbms_shared_pool.purge('00000017B9154A58,1180290247','C');

 

PL/SQL procedure successfully completed.

 

观察ogg复制进程延迟已经开始减小。该隐患顺利解决!

GGSCI (ODSDB) 8> info all

 

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

 

MANAGER     RUNNING                                           

JAGENT      STOPPED                                           

REPLICAT    RUNNING     RP10        00:00:00      00:00:01    

REPLICAT    RUNNING     RP7         00:00:00      00:00:07    

REPLICAT    RUNNING     RPS1        31:10:29      00:00:50    

REPLICAT    RUNNING     RPS2        00:00:00      00:00:00    

REPLICAT    RUNNING     RPS2A       00:00:00      00:00:00    

REPLICAT    RUNNING     RPS2B       00:00:00      00:00:00    

REPLICAT    RUNNING     RPS3        00:00:00      00:00:01    

REPLICAT    RUNNING     RPS3A       00:00:00      00:00:02    

REPLICAT    RUNNING     RPS3B       00:00:00      00:00:01    

REPLICAT    RUNNING     RPS4        00:00:00      00:00:05    

REPLICAT    RUNNING     RPS4A       00:00:00      00:00:04    

REPLICAT    RUNNING     RPS4B       00:00:00      00:00:04    

REPLICAT    RUNNING     RPS4C       00:00:00      00:00:04    

REPLICAT    RUNNING     RPS4D       00:00:00      00:00:04    

REPLICAT    RUNNING     RPS4E       00:00:00      00:00:04    

REPLICAT    RUNNING     RPS4F       00:00:00      00:00:04    

REPLICAT    RUNNING     RPS4G       00:00:00      00:00:05    

REPLICAT    RUNNING     RPS4H       00:00:00      00:00:04    

REPLICAT    RUNNING     RPS4I       00:00:00      00:00:04    

REPLICAT    RUNNING     RPS4J       00:00:00      00:00:04