利用systemstate dump trace文件寻找dg terminated原因

来源:这里教程网 时间:2026-03-03 12:54:20 作者:

上午收到告警,一台 dataguard 宕机了。软件版本为 11.2.0.4. 登录上来首先查看 alert 文件 .

Fri Jan 25   11:00:39 2019 System   State dumped to trace file /u01/app/oracle/diag/rdbms/eracpdg01/ERAC/trace/ERAC_ora_42298.trc Fri Jan 25   11:00:54 2019 Errors in file   /u01/app/oracle/diag/rdbms/eracpdg01/ERAC/trace/ERAC_ora_42298.trc: Fri Jan 25   11:00:58 2019 Errors in file   /u01/app/oracle/diag/rdbms/eracpdg01/ERAC/trace/ERAC_pmon_56104.trc: ORA-03170:   deadlocked on readable physical standby (undo segment 65535) Fri Jan 25   11:01:19 2019 Errors in file   /u01/app/oracle/diag/rdbms/eracpdg01/ERAC/trace/ERAC_pmon_56104.trc: ORA-03170: deadlocked   on readable physical standby (undo segment 65535) Fri Jan 25   11:01:39 2019 Errors in file   /u01/app/oracle/diag/rdbms/eracpdg01/ERAC/trace/ERAC_pmon_56104.trc: ORA-03170:   deadlocked on readable physical standby (undo segment 65535) Fri Jan 25   11:01:59 2019 Errors in file   /u01/app/oracle/diag/rdbms/eracpdg01/ERAC/trace/ERAC_pmon_56104.trc: ORA-03170:   deadlocked on readable physical standby (undo segment 65535) Fri Jan 25   11:02:19 2019 Errors in file /u01/app/oracle/diag/rdbms/eracpdg01/ERAC/trace/ERAC_pmon_56104.trc: ORA-03170:   deadlocked on readable physical standby (undo segment 65535) Fri Jan 25   11:02:40 2019 Errors in file   /u01/app/oracle/diag/rdbms/eracpdg01/ERAC/trace/ERAC_pmon_56104.trc: ORA-03170:   deadlocked on readable physical standby (undo segment 65535) Fri Jan 25   11:03:00 2019 Errors in file   /u01/app/oracle/diag/rdbms/eracpdg01/ERAC/trace/ERAC_pmon_56104.trc: ORA-03170:   deadlocked on readable physical standby (undo segment 65535) Fri Jan 25   11:03:21 2019 Errors in file   /u01/app/oracle/diag/rdbms/eracpdg01/ERAC/trace/ERAC_pmon_56104.trc: ORA-03170:   deadlocked on readable physical standby (undo segment 65535) Fri Jan 25   11:03:41 2019 Errors in file   /u01/app/oracle/diag/rdbms/eracpdg01/ERAC/trace/ERAC_pmon_56104.trc: ORA-03170:   deadlocked on readable physical standby (undo segment 65535) Fri Jan 25   11:04:01 2019 Errors in file   /u01/app/oracle/diag/rdbms/eracpdg01/ERAC/trace/ERAC_pmon_56104.trc: ORA-03170:   deadlocked on readable physical standby (undo segment 65535) Fri Jan 25   11:04:21 2019 Errors in file   /u01/app/oracle/diag/rdbms/eracpdg01/ERAC/trace/ERAC_pmon_56104.trc: ORA-03170:   deadlocked on readable physical standby (undo segment 65535) Fri Jan 25   11:04:41 2019 Errors in file   /u01/app/oracle/diag/rdbms/eracpdg01/ERAC/trace/ERAC_pmon_56104.trc: ORA-03170:   deadlocked on readable physical standby (undo segment 65535) Fri Jan 25   11:05:01 2019 Errors in file   /u01/app/oracle/diag/rdbms/eracpdg01/ERAC/trace/ERAC_pmon_56104.trc: ORA-03170:   deadlocked on readable physical standby (undo segment 65535) Fri Jan 25   11:05:21 2019 Errors in file   /u01/app/oracle/diag/rdbms/eracpdg01/ERAC/trace/ERAC_pmon_56104.trc: ORA-03170:   deadlocked on readable physical standby (undo segment 65535) Fri Jan 25   11:05:27 2019 RFS[3]: Selected   log 9 for thread 1 sequence 85314 dbid -1036715693 branch 899829907 Fri Jan 25   11:05:27 2019 Media Recovery   Waiting for thread 1 sequence 85314 (in transit) Recovery of   Online Redo Log: Thread 1 Group 9 Seq 85314 Reading mem 0   Mem# 0:   /san/data/oradata/EIN1/standby_redo03.log Fri Jan 25   11:05:27 2019 Archived Log   entry 165577 added for thread 1 sequence 85313 ID 0xc2352453 dest 1: Fri Jan 25 11:05:41   2019 Errors in file   /u01/app/oracle/diag/rdbms/eracpdg01/ERAC/trace/ERAC_pmon_56104.trc: ORA-03170:   deadlocked on readable physical standby (undo segment 65535) Fri Jan 25   11:06:01 2019 Errors in file   /u01/app/oracle/diag/rdbms/eracpdg01/ERAC/trace/ERAC_pmon_56104.trc: ORA-03170:   deadlocked on readable physical standby (undo segment 65535) Fri Jan 25   11:06:21 2019 Errors in file   /u01/app/oracle/diag/rdbms/eracpdg01/ERAC/trace/ERAC_pmon_56104.trc: ORA-03170:   deadlocked on readable physical standby (undo segment 65535) Fri Jan 25   11:06:41 2019 Errors in file   /u01/app/oracle/diag/rdbms/eracpdg01/ERAC/trace/ERAC_pmon_56104.trc: ORA-03170:   deadlocked on readable physical standby (undo segment 65535) Fri Jan 25   11:06:45 2019 Errors in file   /u01/app/oracle/diag/rdbms/eracpdg01/ERAC/trace/ERAC_lgwr_56177.trc: ORA-04020:   deadlock detected while trying to lock object SYS.ERAC LGWR (ospid:   56177): terminating the instance due to error 4020 Fri Jan 25   11:06:46 2019 System state dump   requested by (instance=1, osid=56177 (LGWR)), summary=[abnormal instance   termination]. System State   dumped to trace file /u01/app/oracle/diag/rdbms/eracpdg01/ERAC/trace/ERAC_diag_56159_20190125110646.trc  

可以看到持续大量的 ORA-03170 死锁告警, 11:06:46 分的时候实例被终止。在贴出的 alert 日志开头生成了一个 trace 文件,首先分析一下这个文件。

Vi 编辑 /u01/app/oracle/diag/rdbms/eracpdg01/ERAC/trace/ERAC_ora_42298.trc 文件,搜索关键字 final ,定位到下面这段:

      SO:   0xa1cccae30, type: 4, owner: 0xa0c7c8da8 , flag: INIT/-/-/0x00 if: 0x3 c: 0x3      proc=0xa0c7c8da8, name=session,   file=ksu.h LINE:12729, pg=0     (session) sid: 2228 ser: 9 trans: (nil),   creator: 0xa0c7c8da8               flags: (0x45) USR/- flags_idl:   (0x1) BSY/-/-/-/-/-               flags2: (0x40009) -/-/INC               DID: , short-term DID:               txn branch: (nil)               edition#: 100              oct: 3, prv: 0, sql:   0xa33cf5e08, psql: 0xa3e7cdec0, user: 627/CONVERT_R     ksuxds FALSE at location: 0     service name: SYS$USERS     client details:       O/S info: user: EM-H6VQXD2$, term:   unknown, ospid: 1234       machine: EM-H6VQXD2 program: JDBC Thin   Client       application name: JDBC Thin Client,   hash value=2546894660     Current Wait Stack:      0: waiting for 'library cache lock'         handle address=0xa3ff9eb18, lock   address=0x50de690a0, 100*mode+namespace=0x1004a0002         wait_id=945878 seq_num=28394   snap_id=1         wait times: snap=2 min 53 sec, exc=2   min 53 sec, total=2 min 53 sec         wait times: max=15 min 0 sec, heur=2   min 53 sec         wait counts: calls=60 os=60         in_wait=1 iflags=0x15a2

There is at least one session blocking this session.

      Dumping 1   direct blocker(s):

        inst:   1, sid: 970, ser: 1

      Dumping   final blocker:

        inst:   1, sid: 2059, ser: 26985

    Wait State:

        fixed_waits=0 flags=0x22 boundary=(nil)/-1

    Session   Wait History:

        elapsed   time of 0.000134 sec since current wait

     0: waited   for 'SQL*Net message from client'

        driver id=0x54435000, #bytes=0x1, =0x0

          wait_id=945877 seq_num=28393 snap_id=1

        wait   times: snap=3.470941 sec, exc=3.470941 sec, total=3.470941 sec

        wait   times: max=infinite

        wait   counts: calls=0 os=0

          occurred after 0.000001 sec of elapsed time

     1: waited   for 'SQL*Net message to client'

        driver   id=0x54435000, #bytes=0x1, =0x0

          wait_id=945876 seq_num=28392 snap_id=1

        wait   times: snap=0.000001 sec, exc=0.000001 sec, total=0.000001 sec

        wait   times: max=infinite

        wait   counts: calls=0 os=0

          occurred after 0.000048 sec of elapsed time

可以看到最源头的 blocker sid: 2059, ser: 26985 这个 session

这里的 SO 0xa1cccae30, type: 4 说明是会话信息 , owner: 0xa0c7c8da8 是这个 SO 0xa1cccae30 )的父节点。

搜索这个位置所在的 PROCESS 进程,发现正是 LGWR 进程

PROCESS 17: LGWR

 

然后直接搜索 sid: 2059 ser: 26985

      SO: 0xa18befcd0, type:   4, owner: 0xa18786ed8, flag: INIT/-/-/0x00 if: 0x3 c: 0x3      proc=0xa18786ed8, name=session,   file=ksu.h LINE:12729, pg=0     (session) sid: 2059 ser: 26985 trans:   0xa1b361290, creator: 0xa18786ed8               flags: (0x41) USR/- flags_idl:   (0x1) BSY/-/-/-/-/-               flags2: (0x40009) -/-/INC               DID: , short-term DID:               txn branch: 0xa0f20f998               edition#: 100              oct: 3, prv: 0, sql:   0xa345bb8d0, psql: 0xa2fe3c0a8, user: 510/READER     ksuxds FALSE at location: 0     service name: SYS$USERS     client details:       O/S info: user: Administrator, term:   ADMIN-PC, ospid: 8960:8832       machine: WORKGROUP\ADMIN-PC program:   plsqldev.exe       application name: PL/SQL Developer,   hash value=1190136663       action name: SQL 窗口 - ISIN 香港净值 .sql , hash value=506963986     Current Wait Stack:      0: waiting for 'SQL*Net message from dblink '         driver id=0x28444553, #bytes=0x1,   =0x0         wait_id=474 seq_num=475 snap_id=1         wait times: snap=4 min 45 sec, exc=4   min 45 sec, total=4 min 45 sec         wait times: max=infinite, heur=4 min   45 sec         wait counts: calls=0 os=0         in_wait=1 iflags=0x5a0     There are 28 sessions blocked by this   session.

可以看到是 reader 用户从一台 Windows PL/SQL developer 发起的一段 sql 文本导致的。而且这段 sql 使用了 db_link 。已经持续了 4 45 秒,接近 5 分钟的时间。该会话阻塞了 28 个会话。

目前为止完整的阻塞情况可能并没有全部得到,但是已经得到了 LGWR 进程被一个 PL/SQL developer 会话所阻塞的结论,在 sql 文本执行了接近五分钟的时候,实例被 LGWR 终止。

这里为什么是五分钟呢?这是因为之前发生过的其他故障,那时修改了一个隐含参数 _adg_parselock_timeout

SQL> col name   for a30 SQL> col value   for a20 SQL> col   description for a70 SQL> set line   150 SQL> select   a.ksppinm name,b.ksppstvl value,a.ksppdesc description   2        from x$ksppi a,x$ksppcv b   3       where a.inst_id = USERENV ('Instance')   4         and b.inst_id = USERENV ('Instance')   5         and a.indx = b.indx   6         and upper(a.ksppinm) LIKE upper('%adg_parselock_timeout%')   7         order by name;   NAME                           VALUE                DESCRIPTION ------------------------------   --------------------   ---------------------------------------------------------------------- _adg_parselock_timeout         30000                timeout for parselock get on   ADG in centiseconds _adg_parselock_timeout_sleep   100                  sleep duration after a   parselock timeout on ADG in milliseconds   SQL> col name   for a30; SQL> col value   for a10; SQL> select *   from   2    (select   3    x.ksppinm name,   4    y.ksppstvl value,   5    y.ksppstdf isdefault,   6    decode(bitand(y.ksppstvf,7),1,'MODIFIED',4,'SYSTEM_MOD','FALSE')   ismod,   7    decode(bitand(y.ksppstvf,2),2,'TRUE','FALSE') isadj   8    from   9    sys.x$ksppi x,  10    sys.x$ksppcv y  11    where  12    x.inst_id = userenv('Instance') and  13    y.inst_id = userenv('Instance') and  14    x.indx = y.indx  15    order by  16    translate(x.ksppinm, ' _', ' ')) T where T.name like   '%adg_parselock_timeout%';   NAME                           VALUE      ISDEFAULT                   ISMOD                          ISADJ ------------------------------   ---------- --------------------------- ------------------------------   --------------- _adg_parselock_timeout         30000      FALSE                       FALSE                          FALSE _adg_parselock_timeout_sleep   100          TRUE                        FALSE                          FALSE

_adg_parselock_timeout 该值设置为了 30000 厘秒,约等于 300 秒,也就是五分钟。

搜索 MOS 后确认为 BUG

Bug 18515268 ORA-4020 in ADG Standby Database causing   instance crash by LGWR

Solution

Apply interim patch 18515268, if available for your platform and   Oracle version. Fixed releases: 11.2.0.4.BP20 12.1.0.2 12.2.0.1

 

相关推荐