ADG Instance Abnormal Termination: Fault Analysis Report

Source: 这里教程网    Date: 2026-03-03 19:04:01

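Issue Handling

Problem Description

On 2023-05-25, the xcrmdb1 ADG database reported a fault. SSC tracked and analyzed it in earlier emails and in the related SR. This report summarizes the incident and gives the root cause and recommendations.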

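Fault Analysis

1.  The lmsc process on Instance2 received a UDP packet with a bad sequence number. This critical process exited, which forced the instance to terminate.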

>>>alert_crmdb12.log
Thu May 25 05:52:38 2023
Archived Log entry 48804 added for thread 1 sequence 77393 ID 0x3d8c0bae dest 1:
Thu May 25 05:58:06 2023
Errors in file /oracle/app/oracle/diag/rdbms/xcrmdb1/crmdb12/trace/crmdb12_lmsc_4967.trc  (incident=384241):
ORA-00600: internal error code, arguments: [kjctr_pbmsg:badseq], [32], [0], [16777216], [], [], [], [], [], [], [], []
    <<< lmsc received a bad message packet; the instance must now be restarted to protect DB integrity
Incident details in: /oracle/app/oracle/diag/rdbms/xcrmdb1/crmdb12/incident/incdir_384241/crmdb12_lmsc_4967_i384241.trc
Thu May 25 05:58:10 2023
Dumping diagnostic data in directory=[cdmp_20170525055810], requested by (instance=2, osid=680322 (LMSC)), summary=[incident=384241].
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Thu May 25 05:58:12 2023
Errors in file /oracle/app/oracle/diag/rdbms/xcrmdb1/crmdb12/trace/crmdb12_lmsc_4967.trc:
ORA-00600: internal error code, arguments: [kjctr_pbmsg:badseq], [32], [0], [16777216], [], [], [], [], [], [], [], []
Thu May 25 05:58:12 2023
USER (ospid: 4967): terminating the instance due to error 484
    <<< the lms process failed, so the instance has to be terminated
Thu May 25 05:58:14 2023
License high water mark = 866
Thu May 25 05:58:17 2023
Instance terminated by USER, pid = 4967
Thu May 25 05:58:18 2023
USER (ospid: 17292): terminating the instance
Thu May 25 05:58:18 2023
Instance terminated by USER, pid = 17292
Thu May 25 05:59:11 2023
Starting ORACLE instance (normal) (OS id: 17791)

>>>alert_crmdb11.log
Thu May 25 05:58:14 2023
Reconfiguration started (old inc 6, new inc 8)
List of instances (total 1) :
 1
Dead instances (total 1) :
 2
My inst 1
    <<< Instance1 detects that Instance2 has exited and starts Reconfiguration

Timeline:
05:58:06 Instance2: [kjctr_pbmsg:badseq]
05:58:12 Instance2: [kjctr_pbmsg:badseq]
05:58:12 Instance2: terminating the instance
05:58:14 Instance1: Reconfiguration started
05:58:18 Instance2: terminating the instance
05:59:11 Instance2: Starting ORACLE instance
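The sequence of events across the two alert logs can be rebuilt mechanically. The following is a minimal Python sketch, not part of the original analysis: the event lists are copied by hand from the log excerpts above, the labels crmdb12 and crmdb11 are simply the alert log names, and merge_timeline is a hypothetical helper name.

```python
from datetime import datetime

# Events copied by hand from the alert log excerpts: (timestamp, event text).
ALERT_CRMDB12 = [
    ("Thu May 25 05:58:06 2023", "ORA-00600 [kjctr_pbmsg:badseq]"),
    ("Thu May 25 05:58:12 2023", "ORA-00600 [kjctr_pbmsg:badseq]"),
    ("Thu May 25 05:58:12 2023", "terminating the instance due to error 484"),
    ("Thu May 25 05:58:18 2023", "terminating the instance"),
    ("Thu May 25 05:59:11 2023", "Starting ORACLE instance"),
]
ALERT_CRMDB11 = [
    ("Thu May 25 05:58:14 2023", "Reconfiguration started"),
]

# Alert log timestamp layout, e.g. "Thu May 25 05:58:06 2023".
STAMP_FORMAT = "%a %b %d %H:%M:%S %Y"


def merge_timeline(logs):
    """Merge per-instance event lists into one list sorted by time.

    `logs` maps an instance label to a list of (timestamp, text) pairs.
    The sort is stable, so events with equal timestamps keep log order.
    """
    events = []
    for instance, entries in logs.items():
        for stamp, text in entries:
            when = datetime.strptime(stamp, STAMP_FORMAT)
            events.append((when, instance, text))
    return sorted(events, key=lambda event: event[0])


if __name__ == "__main__":
    logs = {"crmdb12": ALERT_CRMDB12, "crmdb11": ALERT_CRMDB11}
    for when, instance, text in merge_timeline(logs):
        print(when.strftime("%H:%M:%S"), instance, text)
```

Sorting by parsed timestamps rather than by string keeps the merge correct if the excerpts ever span midnight or a month boundary.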

 

2.  lmsc aborted at kjctr_pbmsg:badseq after receiving a bad packet.

>>>crmdb12_lmsc_4967.trc
*** 2023-05-25 05:58:06.823
ORA-00600: internal error code, arguments: [kjctr_pbmsg:badseq], [32], [0], [16777216], [], [], [], [], [], [], [], []
kjmsm: caught non-fatal error 600
lms abort after exception 600
    <<< lmsc hit exception 600 and aborted (error 484)
KJC Communication Dump:
-------------------8<-------------------

 

>>>crmdb12_lmsc_4967_i384241.trc
Error Stack: ORA-600[kjctr_pbmsg:badseq]
Main Stack:
    kjctr_pbmsg <- kjctr_watq <- kjctr_rksxp <- kjctrcv <- kjcsrmg <- kjmsm <- ksbrdp <- opirip
    <- opidrv <- sou2o <- opimai_real <- ssthrdmain <- main <- main_opd_entry

----- Incident Context Dump -----
Address: 0x9fffffffffff59f0
Incident ID: 384241
Problem Key: ORA 600 [kjctr_pbmsg:badseq]
Error: ORA-600 [kjctr_pbmsg:badseq] [32] [0] [16777216] [] [] [] [] [] [] [] []
[00]: dbgexProcessError [diag_dde]
[01]: dbgeExecuteForError [diag_dde]
[02]: dbgePostErrorKGE [diag_dde]
[03]: dbkePostKGE_kgsf [rdbms_dde]
[04]: kgeadse []
[05]: kgerinv_internal []
[06]: kgerinv []
[07]: kgeasnmierr []
[08]: kjctr_pbmsg []<-- Signaling
[09]: kjctr_watq []
[10]: kjctr_rksxp []
[11]: kjctrcv []
[12]: kjcsrmg []
[13]: kjmsm [RAC_MLMDS]
[14]: ksbrdp [background_proc]
[15]: opirip [OPI]
[16]: opidrv [OPI]
[17]: sou2o []
[18]: opimai_real [OPI]
[19]: ssthrdmain []
[20]: main []
[21]: main_opd_entry []
    <<< the failing function is kjctr_pbmsg

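3.  Three time windows, 05:16:43~05:21:45, 05:42:51~05:45:52 and 05:57:56~05:58:26, saw sudden bursts of fragmented traffic. These bursts caused fragment timeout drops, fragment queue overflow, and UDP checksum errors. The last burst was the smallest and the shortest, but as a matter of probability it was the one in which lmsc received a bad packet and aborted, which ultimately terminated Instance2.

4.  A closer look at the details (in the second chart, overflow is red, checksum errors are green, and timeout drops are blue).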
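That the abort fell inside the third burst can be checked against the 05:58:06 timestamp of the first ORA-00600 in the alert log. The following Python sketch is only an illustration; window_index and window_seconds are hypothetical helper names, and the window boundaries are the ones quoted in the text.

```python
from datetime import time

# The three burst windows quoted in the analysis (start, end), same day.
BURST_WINDOWS = [
    (time(5, 16, 43), time(5, 21, 45)),
    (time(5, 42, 51), time(5, 45, 52)),
    (time(5, 57, 56), time(5, 58, 26)),
]


def window_index(moment, windows=BURST_WINDOWS):
    """Return the 1-based index of the window containing `moment`, or None."""
    for index, (start, end) in enumerate(windows, start=1):
        if start <= moment <= end:
            return index
    return None


def window_seconds(start, end):
    """Length of a window in seconds (both ends on the same day)."""
    def to_seconds(t):
        return t.hour * 3600 + t.minute * 60 + t.second
    return to_seconds(end) - to_seconds(start)


if __name__ == "__main__":
    first_ora600 = time(5, 58, 6)  # first badseq error in alert_crmdb12.log
    print("burst window:", window_index(first_ora600))
    print("durations (s):", [window_seconds(s, e) for s, e in BURST_WINDOWS])
```

The durations also confirm that the third window is by far the shortest of the three.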

5.  The relevant statistics are as follows:

udp:
    20067 incomplete headers
    9124 bad checksums                                <<< UDP checksum errors
ip:
    872460826 fragments received
    16823 fragments dropped (dup or out of space)     <<< fragment queue overflow
    3712 fragments dropped after timeout              <<< fragment timeout drops
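Because these counters are cumulative, they are most useful when two snapshots are compared around a burst. The following is a minimal Python sketch of how such output could be parsed and differenced. It assumes the layout shown above (a section line such as "udp:" followed by "count label" lines); parse_netstat and delta are hypothetical helper names, not part of OSWatcher.

```python
import re

# The counters quoted in this report, without the <<< annotations.
SAMPLE = """\
udp:
    20067 incomplete headers
    9124 bad checksums
ip:
    872460826 fragments received
    16823 fragments dropped (dup or out of space)
    3712 fragments dropped after timeout
"""

SECTION = re.compile(r"^(\w+):\s*$")          # e.g. "udp:"
COUNTER = re.compile(r"^\s*(\d+)\s+(.+?)\s*$")  # e.g. "9124 bad checksums"


def parse_netstat(text):
    """Return {(section, label): count} for every counter line in `text`."""
    counters = {}
    section = None
    for raw in text.splitlines():
        line = raw.split("<<<")[0].rstrip()  # drop any trailing annotation
        if not line.strip():
            continue
        match = SECTION.match(line)
        if match:
            section = match.group(1)
            continue
        match = COUNTER.match(line)
        if match and section:
            counters[(section, match.group(2))] = int(match.group(1))
    return counters


def delta(before, after):
    """Per-counter increase between two parsed snapshots."""
    return {key: after[key] - before.get(key, 0) for key in after}


if __name__ == "__main__":
    for (section, label), count in parse_netstat(SAMPLE).items():
        print(f"{section:4} {count:>12} {label}")
```

Calling delta on snapshots taken just before and just after a burst window shows how many drops and checksum errors that burst alone contributed.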

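6.  Bursts of fragmented traffic overflow the fragment queue, which in turn leads to fragment timeout drops and UDP checksum errors. Under these conditions there is some probability that a DB process in the upper layer receives a bad packet, which triggers the instance failure. We recommend adjusting the following OS kernel parameters:

ip_fragment_timeout    set to 1s
ip_reass_mem_limit     set to 10M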
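The report gives the targets as 1s and 10M without stating the units the tunables expect. The following Python sketch converts them under two explicit assumptions that should be verified on the target system: that ip_fragment_timeout takes milliseconds and ip_reass_mem_limit takes bytes, and that "10M" means 10 x 1024 x 1024. The helper names are hypothetical, and the generated `ndd -set /dev/ip` command line is the usual HP-UX form, shown here as an assumption rather than as the procedure used in this case.

```python
def to_milliseconds(value):
    """Convert '1s' or '500ms' to an integer number of milliseconds."""
    text = value.strip().lower()
    if text.endswith("ms"):
        return int(text[:-2])
    if text.endswith("s"):
        return int(float(text[:-1]) * 1000)
    return int(text)  # already a bare number, taken as milliseconds


def to_bytes(value, mega=1024 * 1024):
    """Convert '10M' to bytes; `mega` is an assumption (could be 10**6)."""
    text = value.strip().upper()
    if text.endswith("M"):
        return int(text[:-1]) * mega
    return int(text)  # already a bare number, taken as bytes


def ndd_command(name, value):
    """Format a command string only; nothing is executed here."""
    return f"ndd -set /dev/ip {name} {value}"


if __name__ == "__main__":
    print(ndd_command("ip_fragment_timeout", to_milliseconds("1s")))
    print(ndd_command("ip_reass_mem_limit", to_bytes("10M")))
```

Settings applied this way would normally also need to be made persistent across reboots, which this sketch does not cover.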

Issue Summary

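Problem description:

The instance's lms process aborts with "ORA-00600: internal error code, arguments: [kjctr_pbmsg:badseq], [32], [0], [16777216], [], [], [], [], [], [], [], []", which terminates the instance.

Scope:

All multi-node RAC configurations. The fault mainly occurs in the lms process and is not limited to a particular DB version.

Symptoms:

The lms process receives a bad message packet, and the instance has to be restarted.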

>>>alert_crmdb12.log
Thu May 25 05:52:38 2023
Archived Log entry 48804 added for thread 1 sequence 77393 ID 0x3d8c0bae dest 1:
Thu May 25 05:58:06 2023
Errors in file /oracle/app/oracle/diag/rdbms/xcrmdb1/crmdb12/trace/crmdb12_lmsc_4967.trc  (incident=384241):
ORA-00600: internal error code, arguments: [kjctr_pbmsg:badseq], [32], [0], [16777216], [], [], [], [], [], [], [], []
    <<< lmsc received a bad message packet; the instance must now be restarted to protect DB integrity
Incident details in: /oracle/app/oracle/diag/rdbms/xcrmdb1/crmdb12/incident/incdir_384241/crmdb12_lmsc_4967_i384241.trc
Thu May 25 05:58:10 2023
Dumping diagnostic data in directory=[cdmp_20170525055810], requested by (instance=2, osid=680322 (LMSC)), summary=[incident=384241].
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Thu May 25 05:58:12 2023
Errors in file /oracle/app/oracle/diag/rdbms/xcrmdb1/crmdb12/trace/crmdb12_lmsc_4967.trc:
ORA-00600: internal error code, arguments: [kjctr_pbmsg:badseq], [32], [0], [16777216], [], [], [], [], [], [], [], []
Thu May 25 05:58:12 2023
USER (ospid: 4967): terminating the instance due to error 484
    <<< the lms process failed, so the instance has to be terminated
Thu May 25 05:58:14 2023
License high water mark = 866
Thu May 25 05:58:17 2023
Instance terminated by USER, pid = 4967
Thu May 25 05:58:18 2023
USER (ospid: 17292): terminating the instance
Thu May 25 05:58:18 2023
Instance terminated by USER, pid = 17292
Thu May 25 05:59:11 2023
Starting ORACLE instance (normal) (OS id: 17791)

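OSW netstat statistics:

udp:
    20067 incomplete headers
    9124 bad checksums                                <<< UDP checksum errors
ip:
    872460826 fragments received
    16823 fragments dropped (dup or out of space)     <<< fragment queue overflow
    3712 fragments dropped after timeout              <<< fragment timeout drops

Root cause: Bursts of fragmented traffic overflow the fragment queue, which in turn leads to fragment timeout drops and UDP checksum errors. Under these conditions there is some probability that a DB process in the upper layer receives a bad packet, which triggers the instance failure.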

 

Solution:

On HP-UX, adjust the kernel parameters that govern the fragment queue. These are usually the following two:

ip_fragment_timeout    set to 1s

ip_reass_mem_limit     set to 10M

 
