数据库监听夯故障分析

来源：这里教程网时间：2026-03-03 19:04:13 作者：

背景情况简述

2023年 11 月 05 日 16:24:11 ， N CRM2 2 C rash

故障分析

full picture

2023-11-05 15:05:00 节点 1 开始出现 IP fragments dropped

2023-11-05 16:17:09 udp overflow 29741 to 231111

2023-11-05 16:18:08 LCK1 (ospid: 11324) waits for event

~16:20:29 'libcache interrupt action by LCK' for 216 secs <<<<< 16:20:29 减去 216 秒为 16:16:53 约为网络 udp overflow 的时间点。

2023-11-05 16:22:39 节点 1 LMD0 (ospid: 16620) has detected no messaging activity from instance 2 。

2023-11-05 16:22:57 节点 1 IPC Send timeout detected. Sender: ospid 16620 。

2023-11-05 16:24:10 节点 1 ： Reconfiguration started (old inc 4, new inc 8) 。

2023-11-05 16:24:11 节点 2 ： Instance terminated by LMSh, pid = 11314 。

2023-11-05 16:23:14 Initiating reconfig, reason=3 。

2023-11-05 16:24:39 Reconfiguration complete (total time 28.5 secs) .

2 、 osw 分析

N etstat 中在故障时间点存在严重的 UDP socket overflows ：

详细数据 …… 6:14:08 udp: 29741 socket overflows 16:14:38 udp: 29741 socket overflows 16:15:08 udp: 29741 socket overflows 16:15:38 udp: 29741 socket overflows 16:16:08 udp: 29741 socket overflows 16:16:38 udp: 29741 socket overflows 16:17:09 udp: 231111 socket overflows 16:17:39 udp: 231114 socket overflows 16:18:09 udp: 231121 socket overflows 16:18:39 udp: 231121 socket overflows 16:19:09 udp: 231121 socket overflows 16:19:39 udp: 231121 socket overflows 16:20:10 udp: 231121 socket overflows ……

3 、 A lert log

节点2的LCK1 hung了216秒，之后crash ；节点1检测到之后，发起reconfiguration

节点 1： …… Thu Nov 05 16:22:39 2023 LMD0 (ospid: 16620) has detected no messaging activity from instance 2 Thu Nov 05 16:22:57 2023 IPC Send timeout detected. Sender: ospid 16620 [oracle@ncrmdb21 (LMD0)] Receiver: inst 2 binc 1349400 ospid 165544 Thu Nov 05 16:23:14 2023 Errors in file /oracle/app/oracle/diag/rdbms/ncrmdb2/crmdb21/trace/crmdb21_lmd0_16620.trc: ORA-27508: IPC error sending a message Thu Nov 05 16:23:14 2023 IPC Send timeout to 2.0 inc 4 for msg type 65518 from opid 14 Thu Nov 05 16:23:14 2023 Communications reconfiguration: instance_number 2 by ospid 16620 Thu Nov 05 16:24:03 2023 Detected an inconsistent instance membership by instance 1 Evicting instance 2 from cluster Waiting for instances to leave: 2 Thu Nov 05 16:24:10 2023 Reconfiguration started (old inc 4, new inc 8) List of instances (total 1) : 1 Dead instances (total 1) : 2 My inst 1 ……

节点 2 ： …… Thu Nov 05 16:18:08 2023 LCK1 (ospid: 11324) waits for event 'libcache interrupt action by LCK' for 75 secs. Thu Nov 05 16:18:08 2023 LCK1 (ospid: 11324) is hung in an acceptable location (libcache 0x41.02). Thu Nov 05 16:20:29 2023 LCK1 (ospid: 11324) waits for event 'libcache interrupt action by LCK' for 216 secs. Thu Nov 05 16:20:29 2023 LCK1 (ospid: 11324) is hung in an acceptable location (libcache 0x41.02). Thu Nov 05 16:24:03 2023 Detected an inconsistent instance membership by instance 1 Thu Nov 05 16:24:04 2023 Received an instance abort message from instance 1 Thu Nov 05 16:21:02 2023 opiodr aborting process unknown ospid (20468) as a result of ORA-609 Thu Nov 05 16:22:39 2023 LMD0 (ospid: 16620) has detected no messaging activity from instance 2 Thu Nov 05 16:22:57 2023 IPC Send timeout detected. Sender: ospid 16620 [oracle@ncrmdb21 (LMD0)] Receiver: inst 2 binc 1349400 ospid 165544 Thu Nov 05 16:23:14 2023 Errors in file /oracle/app/oracle/diag/rdbms/ncrmdb2/crmdb21/trace/crmdb21_lmd0_16620.trc: ORA-27508: IPC error sending a message Thu Nov 05 16:23:14 2023 IPC Send timeout to 2.0 inc 4 for msg type 65518 from opid 14 Thu Nov 05 16:23:14 2023 Communications reconfiguration: instance_number 2 by ospid 16620 Thu Nov 05 16:24:03 2023 Detected an inconsistent instance membership by instance 1 Evicting instance 2 from cluster Waiting for instances to leave: 2 …… Thu Nov 05 16:24:04 2023 LMSh (ospid: 11314): terminating the instance due to error 481 ……

ashdump 分析

（ 1 ）故障时间点的 ashdump 存在缺失，数据库整体 hang ：

5. Lmon trace

lmon trace中显示reson为3(reconflguration reason is 3 - which suggested a communication issue) …… *** 2023-11-05 16:23:14.943 2023-11-05 16:23:14.943607 : kjxgrrcfg: done (device ok) - ret = 3 hist 0x1679a (initial rsn: 3, cachd rsn: 0) kjxgrrcfgchk: Initiating reconfig, reason=3 kjxgrrcfgchk: COMM rcfg - Disk Vote Required kjfmReceiverHealthCB_CheckAll: Recievers are healthy. 2023-11-05 16:23:14.943885 : kjxgrnetchk: start 0x8c02716f, end 0x8c0327b9 2023-11-05 16:23:14.943901 : kjxgrnetchk: Network Validation wait: 46 sec kjxgrnetchk: ce-event: from inst 1 to inst 2 ver 0x149939 kjxgrrcfgchk: prev pstate 6 mapsz 512 kjxgrrcfgchk: new bmp: 1 2 kjxgrrcfgchk: work bmp: 1 2 kjxgrrcfgchk: rr bmp: 1 2 ……