数据库监听夯故障分析

来源:这里教程网 时间:2026-03-03 19:04:13 作者:

背景情况简述

2023年 11 05 16:24:11   N CRM2 2   C rash

故障分析

full   picture

2023-11-05 15:05:00    节点 1 开始出现 IP fragments dropped

2023-11-05 16:17:09   udp overflow 29741 to 231111

2023-11-05 16:18:08   LCK1 (ospid: 11324) waits for event

                  ~16:20:29   'libcache interrupt action by LCK' for 216 secs    <<<<< 16:20:29 减去 216 秒为 16:16:53   约为网络 udp   overflow 的时间点。

2023-11-05 16:22:39    节点 1 LMD0 (ospid: 16620) has detected no messaging activity from instance 2

2023-11-05 16:22:57    节点 1   IPC Send timeout detected. Sender: ospid 16620

2023-11-05 16:24:10   节点 1 Reconfiguration started (old inc 4, new inc 8)

2023-11-05 16:24:11    节点 2 Instance terminated by LMSh, pid = 11314

2023-11-05 16:23:14   Initiating reconfig, reason=3

2023-11-05 16:24:39  Reconfiguration complete (total time 28.5 secs) .

2 osw 分析

N etstat 中在故障时间点存在严重的 UDP socket overflows

 

详细数据   …… 6:14:08 udp:   29741 socket overflows 16:14:38 udp:   29741 socket overflows 16:15:08 udp:   29741 socket overflows 16:15:38 udp:   29741 socket overflows 16:16:08 udp:   29741 socket overflows 16:16:38 udp:   29741 socket overflows 16:17:09 udp:   231111 socket overflows 16:17:39 udp:   231114 socket overflows 16:18:09 udp:   231121 socket overflows 16:18:39 udp:   231121 socket overflows 16:19:09 udp:   231121 socket overflows 16:19:39 udp:   231121 socket overflows 16:20:10 udp:   231121 socket overflows ……

3 A lert   log

节点2的LCK1 hung了216秒,之后crash 节点1检测到之后,发起reconfiguration

节点 1: …… Thu Nov 05 16:22:39 2023 LMD0 (ospid: 16620) has detected no messaging activity from instance 2 Thu Nov 05 16:22:57 2023 IPC Send timeout detected. Sender: ospid 16620 [oracle@ncrmdb21 (LMD0)] Receiver: inst 2 binc 1349400 ospid 165544 Thu Nov 05 16:23:14 2023 Errors in file /oracle/app/oracle/diag/rdbms/ncrmdb2/crmdb21/trace/crmdb21_lmd0_16620.trc: ORA-27508: IPC error sending a message Thu Nov 05 16:23:14 2023 IPC Send timeout to 2.0 inc 4 for msg type 65518 from opid 14 Thu Nov 05 16:23:14 2023 Communications reconfiguration: instance_number 2 by ospid 16620 Thu Nov 05 16:24:03 2023 Detected an inconsistent instance membership by instance 1 Evicting instance 2 from cluster Waiting for instances to leave: 2 Thu Nov 05 16:24:10 2023 Reconfiguration started (old inc 4, new inc 8) List of instances (total 1) : 1 Dead instances (total 1) : 2 My inst 1    ……

节点 2 …… Thu Nov 05 16:18:08 2023 LCK1 (ospid: 11324) waits for event 'libcache interrupt action by LCK' for 75 secs. Thu Nov 05 16:18:08 2023 LCK1 (ospid: 11324) is hung in an acceptable location (libcache 0x41.02). Thu Nov 05 16:20:29 2023 LCK1 (ospid: 11324) waits for event 'libcache interrupt action by LCK' for 216 secs. Thu Nov 05 16:20:29 2023 LCK1 (ospid: 11324) is hung in an acceptable location (libcache 0x41.02). Thu Nov 05 16:24:03 2023 Detected an inconsistent instance membership by instance 1 Thu Nov 05 16:24:04 2023 Received an instance abort message from instance 1 Thu Nov 05 16:21:02 2023 opiodr aborting process unknown ospid (20468) as a result of ORA-609 Thu Nov 05 16:22:39 2023 LMD0 (ospid: 16620) has detected no messaging activity from instance 2 Thu Nov 05 16:22:57 2023 IPC Send timeout detected. Sender: ospid 16620 [oracle@ncrmdb21 (LMD0)] Receiver: inst 2 binc 1349400 ospid 165544 Thu Nov 05 16:23:14 2023 Errors in file /oracle/app/oracle/diag/rdbms/ncrmdb2/crmdb21/trace/crmdb21_lmd0_16620.trc: ORA-27508: IPC error sending a message Thu Nov 05 16:23:14 2023 IPC Send timeout to 2.0 inc 4 for msg type 65518 from opid 14 Thu Nov 05 16:23:14 2023 Communications reconfiguration: instance_number 2 by ospid 16620 Thu Nov 05 16:24:03 2023 Detected an inconsistent instance membership by instance 1 Evicting instance 2 from cluster Waiting for instances to leave: 2 …… Thu Nov 05 16:24:04 2023 LMSh (ospid: 11314): terminating the instance due to error 481 ……  

ashdump 分析

1 )故障时间点的 ashdump 存在缺失,数据库整体 hang

 

5. Lmon trace

lmon trace中显示reson为3(reconflguration reason is 3 - which suggested a communication issue) …… *** 2023-11-05 16:23:14.943 2023-11-05 16:23:14.943607 : kjxgrrcfg: done (device ok) - ret = 3  hist 0x1679a (initial rsn: 3, cachd rsn: 0) kjxgrrcfgchk: Initiating reconfig, reason=3 kjxgrrcfgchk: COMM rcfg - Disk Vote Required kjfmReceiverHealthCB_CheckAll: Recievers are healthy. 2023-11-05 16:23:14.943885 : kjxgrnetchk: start 0x8c02716f, end 0x8c0327b9 2023-11-05 16:23:14.943901 : kjxgrnetchk: Network Validation wait: 46 sec kjxgrnetchk: ce-event: from inst 1 to inst 2 ver 0x149939 kjxgrrcfgchk: prev pstate 6  mapsz 512 kjxgrrcfgchk: new  bmp: 1 2 kjxgrrcfgchk: work bmp: 1 2 kjxgrrcfgchk: rr  bmp: 1 2 ……

官方解释:

Reason 0 = No reconfiguration

Reason 1 = The Node Monitor generated the reconfiguration.

Reason 2 = An instance death was detected.

Reason 3 = Communications Failure

Reason 4 = Reconfiguration after suspend

 

Reason 3: Communications Failure. This can happen if:

a) The LMON processes lose communication between one another.

b) One instance loses communications with the LMS, LMD, process of another instance.

c) The LCK processes lose communication between one another.

d) A process like LMON, LMD, LMS, or LCK is blocked, spinning, or stuck and is not responding to remote requests.

分析结果

本次节点 2的crash是因为 IPC Send timeout 导致,而IPC Send timeout又和主机上的UDP socket overflows激增的时间点匹配,判断为网络上的UDP socket overflows导致节点驱逐。

相关推荐