主机重启排查: 查看一节点操作系统日志/var/log/message,操作系统在17点36分有重启迹象: Nov 27 17:33:57 abcdb01 kernel: martian source 172.16.14.253 from 0.0.0.0, on dev bondeth0 Nov 27 17:33:57 abcdb01 kernel: ll header: ff:ff:ff:ff:ff:ff:00:00:00:00:fe:00:08:00 Nov 27 17:33:57 abcdb01 kernel: martian source 172.16.18.253 from 0.0.0.0, on dev eth0 Nov 27 17:33:57 abcdb01 kernel: ll header: ff:ff:ff:ff:ff:ff:00:00:00:00:fe:00:08:00 Nov 27 17:34:02 abcdb01 kernel: net_ratelimit: 173 callbacks suppressed Nov 27 17:34:02 abcdb01 kernel: martian source 172.16.14.253 from 0.0.0.0, on dev bondeth0 Nov 27 17:34:02 abcdb01 kernel: ll header: ff:ff:ff:ff:ff:ff:00:00:00:00:fe:00:08:00 Nov 27 17:34:02 abcdb01 kernel: martian source 172.16.14.0 from 0.0.0.0, on dev bondeth0 Nov 27 17:34:02 abcdb01 kernel: ll header: ff:ff:ff:Nov 27 17:36:10 abcdb01 kernel: imklog 5.8.10, log source = /proc/kmsg started. Nov 27 17:36:10 abcdb01 rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="6713" x-info="http://www.rsyslog.com"] start Nov 27 17:36:10 abcdb01 kernel: Initializing cgroup subsys cpuset Nov 27 17:36:10 abcdb01 kernel: Initializing cgroup subsys cpu Nov 27 17:36:10 abcdb01 kernel: Linux version 2.6.39-400.248.3.el6uek.x86_64 (mockbuild@ca-build44.us.oracle.com) (gcc version 4.4.6 20110731 (Red Hat 4.4.6-3) (GCC) ) #1 SMP Wed Mar 11 18:04:34 PDT 2015 Nov 27 17:36:10 abcdb01 kernel: Command line: root=LABEL=DBSYS bootarea=dbsys bootfrom=BOOT ro loglevel=7 panic=60 debug pci=noaer log_buf_len=1m nmi_watchdog=0 nomce transparent_hugepage=never rd_NO_PLYMOUTH audit=1 console=tty1 conso le=ttyS0,115200n8 crashkernel=380M@128M numa=on 继续确认重启的时间通过命令last: root pts/0 172.16.13.22 Wed Nov 27 17:44 - 17:44 (00:00) root pts/1 172.16.13.22 Wed Nov 27 17:43 - 17:44 (00:00) root pts/0 172.16.13.22 Wed Nov 27 17:43 - 17:44 (00:00) reboot system boot 2.6.39-400.248.3 Wed Nov 27 17:35 - 11:28 (17:53) root pts/2 172.16.13.22 Wed Nov 27 17:32 - 17:33 (00:00) root pts/5 172.16.13.22 Wed Nov 27 17:27 - 17:27 (00:00) root pts/2 172.16.13.22 Wed Nov 27 17:26 - 17:27 (00:01) 从以上last输出日志可以看出重启时间发生在17点35分 进一步排查一节点数据库日志: Wed Nov 27 17:29:52 2024 TT01: Standby redo logfile selected for thread 1 sequence 356065 for destination LOG_ARCHIVE_DEST_10 Wed Nov 27 17:30:07 2024 Global transaction identifier hash collision detected. in_gti: GDPDB.a5117bd.193.33.346308p dx_gti: UPSPDB.353b8c41.88.6.459703pWed Nov 27 17:31:07 2024 Global transaction identifier hash collision detected. Wed Nov 27 17:31:08 2024 Fatal NI connect error 12170. Fatal NI connect error 12170. ………… Fatal NI connect error 12170. Tns error struct: ns main err code: 12535 Time: 27-NOV-2024 17:31:38 Tns error struct: ns main err code: 12535 Tns error struct: ns main err code: 12535 TNS-12535: TNS:operation timed out ns secondary err code: 12606 ns secondary err code: 12606 ns secondary err code: 12606 ns secondary err code: 12606 nt main err code: 0 ns secondary err code: 12606 nt main err code: 0 nt main err code: 0 nt main err code: 0 nt main err code: 0 TNS-12535: TNS:operation timed out ns secondary err code: 12606 nt main err code: 0 nt secondary err code: 0 nt OS err code: 0 Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=172.16.14.3)(PORT=23409)) Wed Nov 27 17:34:19 2024 WARNING: inbound connection timed out (ORA-3136) Wed Nov 27 17:41:52 2024 Adjusting the default value of parameter inmemory_max_populate_servers from 36 to 30 based on PGA parameters Starting ORACLE instance (normal) (OS id: 30803) 由于类似日志较多,截取部分,从1节点数据库日志可以看出实例在17点31分陆续出现通讯超时的告警,同时数据库在17点41分发生了实例重启。 进一步排查,集群alert告警日志信息: 2024-11-27 17:06:15.997 [OCTSSD(21408)]CRS-2409: The clock on host abcdb01 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode. 2024-11-27 17:29:07.430 [ORAROOTAGENT(48149)]CRS-5818: Aborted command 'check' for resource 'ora.storage'. Details at (:CRSAGF00113:) {0:15:2} in /u01/app/grid/diag/crs/abcdb01/crs/trace/ohasd_orarootagent_root.trc. 2024-11-27 17:31:39.031 [ORAROOTAGENT(48149)]CRS-5818: Aborted command 'check' for resource 'ora.drivers.acfs'. Details at (:CRSAGF00113:) {0:15:2} in /u01/app/grid/diag/crs/abcdb01/crs/trace/ohasd_orarootagent_root.trc. 2024-11-27 17:31:44.987 [ORAROOTAGENT(48149)]CRS-5818: Aborted command 'check' for resource 'ora.storage'. Details at (:CRSAGF00113:) {0:15:2} in /u01/app/grid/diag/crs/abcdb01/crs/trace/ohasd_orarootagent_root.trc. 2024-11-27 17:32:07.371 [ORAROOTAGENT(48149)]CRS-5014: Agent "ORAROOTAGENT" timed out starting process "/u01/app/12.1.0.2/grid/bin/acfsload" for action "check": details at "(:CLSN00009:)" in "/u01/app/grid/diag/crs/abcdb01/crs/trace/oha sd_orarootagent_root.trc" 2024-11-27 17:32:54.168 [ORAROOTAGENT(48149)]CRS-5832: Agent '/u01/app/12.1.0.2/grid/bin/orarootagent_root' was unable to process commands. Details at (:CRSAGF00128:) {0:0:3325} in /u01/app/grid/diag/crs/abcdb01/crs/trace/ohasd_oraroota gent_root.trc. 2024-11-27 17:40:07.258 [OHASD(17067)]CRS-8500: Oracle Clusterware OHASD process is starting with operating system process ID 17067 2024-11-27 17:40:07.289 [OHASD(17067)]CRS-0714: Oracle Clusterware Release 12.1.0.2.0. 2024-11-27 17:40:07.312 [OHASD(17067)]CRS-2112: The OLR service started on node abcdb01. 从集群的告警日志中可以看出17点29分跟存储磁盘相关的资源检测异常,17点40分集群重新启动。 进一步下钻排查/u01/app/grid/diag/crs/abcdb01/crs/trace/ohasd_orarootagent_root.trc日志信息: 2024-11-27 17:29:06.580783 :CLSDYNAM:3361801984: [ora.diskmon]{0:15:2} [check] DiskmonAgent::check { ………… 2024-11-27 17:29:13.472582 : AGFW:3370206976: {0:15:2} Command: check for resource: ora.storage 1 1 completed with status: TIMEDOUT 2024-11-27 17:29:13.491383 : AGFW:3368105728: {0:15:2} ora.storage 1 1 state changed from: ONLINE to: UNKNOWN 2024-11-27 17:29:13.491443 : AGFW:3368105728: {0:15:2} ora.storage 1 1 would be continued to monitored! 2024-11-27 17:29:13.491456 : AGFW:3368105728: {0:15:2} Switching offline monitor to intermedite one 2024-11-27 17:29:13.499020 : AGFW:3368105728: {0:15:2} Started implicit monitor for [ora.storage 1 1] interval=60000 delay=60000 2024-11-27 17:29:13.499034 : AGFW:3368105728: {0:15:2} Resource ora.storage 1 1 has transitioned to INTERMEDIATE 2024-11-27 17:29:13.499068 : AGFW:3368105728: {0:15:2} Started timer for [ora.storage 1 1] delay=600000 2024-11-27 17:29:13.499090 : AGFW:3368105728: {0:15:2} ora.storage 1 1 state details has changed from: to: CHECK TIMED OUT 2024-11-27 17:29:13.499104 : AGFW:3368105728: {0:15:5} Generating new Tint for unplanned state change. Original Tint: {0:15:2} ………… 2024-11-27 17:32:11.054930 : AGFW:3368105728: {0:15:6} State = 1 2024-11-27 17:32:11.054935 : AGFW:3368105728: {0:15:6} Check = 0 2024-11-27 17:32:11.054966 : AGFW:3368105728: {0:15:6} Agent sending message to PE: RESOURCE_STATUS[Proxy] ID 20481:17310408 2024-11-27 17:32:13.805676 :CLSDYNAM:3359700736: [ora.diskmon]{0:15:2} [check] DiskmonAgent::check { 2024-11-27 17:32:13.805766 :CLSDYNAM:3359700736: [ora.diskmon]{0:15:2} [check] DiskmonAgent::check } - 0 2024-11-27 17:32:14.399974 : USRTHRD:2604566272: HAIP: event GIPCD_METRIC_UPDATE 2024-11-27 17:32:14.400046 : USRTHRD:2606667520: {0:15:2} dequeue change event 0x7fab783ede40, GIPCD_METRIC_UPDATE 2024-11-27 17:32:14.400093 : USRTHRD:2606667520: {0:15:2} InitializeHaIps[ 1] infList 'inf ib1, ip 192.168.10.2, sub 192.168.8.0' 2024-11-27 17:32:14.400102 : USRTHRD:2606667520: {0:15:2} InitializeHaIps[ 0] infList 'inf ib0, ip 192.168.10.1, sub 192.168.8.0' 2024-11-27 17:32:14.400108 : USRTHRD:2606667520: {0:15:2} HAIP: updateCssMbrData for ib0:89;ib1:88 2024-11-27 17:32:14.400120 : USRTHRD:2606667520: {0:15:2} HAIP: enter UpdatecssMbrdata 2024-11-27 17:32:14.400142 : USRTHRD:2606667520: {0:15:2} HAIP: to Update member info HAIP1;192.168.8.0#0;192.168.8.0#1 2024-11-27 17:32:14.400174 : USRTHRD:2606667520: {0:15:2} to verify routes 2024-11-27 17:32:14.400182 : USRTHRD:2606667520: {0:15:2} to verify start completion 2 2024-11-27 17:32:49.811980 :CLSDYNAM:2617243392: [ora.diskmon]{0:15:2} [check] DiskmonAgent::check } - 0 2024-11-27 17:32:52.812672 :CLSDYNAM:3359700736: [ora.diskmon]{0:15:2} [check] DiskmonAgent::check { 2024-11-27 17:32:52.812787 :CLSDYNAM:3359700736: [ora.diskmon]{0:15:2} [check] DiskmonAgent::check } - 0 2024-11-27 17:32:54.155776 : AGFW:3368105728: {0:0:3325} Agent received the message: AGENT_HB[Engine] ID 12293:10097628 2024-11-27 17:32:54.164370 : AGFW:3368105728: {0:0:3325} Command: check for resource: ora.storage 1 1 has excedeed the maximum allowed time limit! 2024-11-27 17:32:54.177730 : AGENT:3368105728: {0:0:3325} {0:0:3325} Created alert : (:CRSAGF00128:) : Command executor module is unable to process the commands. 2024-11-27 17:32:54.181205 : AGENT:3368105728: {0:0:3325} Agfw calling user exitCB, will exit on return 2024-11-27 17:32:54.189009 : AGENT:3368105728: {0:0:3325} returned from user exitCB, exiting 2024-11-27 17:32:54.189044 : AGFW:3368105728: {0:0:3325} Agent is exiting with exit code: 1 2024-11-27 17:33:01.596879 : USRTHRD:2602465024: {0:15:2} mbr 1 changes mbr private data 2024-11-27 17:33:01.597385 : USRTHRD:2602465024: {0:15:2} enqueue private data change event 0x7fab7ce17e00, CssMbrData, HAIP1;192.168.8.0#0;192.168.8.0#1 2024-11-27 17:33:01.597403 : USRTHRD:2602465024: {0:15:2} cssEventThread: it is 6 event 2024-11-27 17:33:01.597437 : USRTHRD:2606667520: {0:15:2} dequeue change event 0x7fab7ce17e00, CssMbrData ………… 2024-11-27 17:33:01.597665 : USRTHRD:2606667520: {0:15:2} HAIP: hag size 2 2024-11-27 17:33:01.597671 : USRTHRD:2606667520: {0:15:2} HAIP: i 0, haip 0, hag size 2, ip 192.168.10.1 2024-11-27 17:33:01.597678 : USRTHRD:2606667520: {0:15:2} HAIP: i 1, haip 1, hag size 2, ip 192.168.10.2 2024-11-27 17:33:01.597684 : USRTHRD:2606667520: {0:15:2} to verify routes 2024-11-27 17:33:01.597693 : USRTHRD:2606667520: {0:15:2} to verify start completion 2 2024-11-27 17:33:01.599070 : USRTHRD:2606667520: {0:15:2} HAIP: assigned ip '169.254.145.189' 2024-11-27 17:33:01.599077 : USRTHRD:2606667520: {0:15:2} HAIP: check ip '169.254.145.189' 2024-11-27 17:33:01.600240 : USRTHRD:2606667520: {0:15:2} HAIP: assigned ip '169.254.53.70' 2024-11-27 17:33:01.600247 : USRTHRD:2606667520: {0:15:2} HAIP: check ip '169.254.53.70' 2024-11-27 17:33:01.600252 : USRTHRD:2606667520: {0:15:2} Start: 2 HAIP assignment, 2, 1 Trace file /u01/app/grid/diag/crs/abcdb01/crs/trace/ohasd_orarootagent_root.trc Oracle Database 12c Clusterware Release 12.1.0.2.0 - Production Copyright 1996, 2014 Oracle. All rights reserved. CLSB:30690816: Argument count (argc) for this daemon is 1 CLSB:30690816: Argument 0 is: /u01/app/12.1.0.2/grid/bin/orarootagent.bin 从ohasd_orarootagent_root日志中可以详细看出,集群在17点29分资源检测异常,持续监测到17点33分超过最大允许时间后,伴随着17点40分集群重新启动。 进一步排查二节点集群日志: 2024-11-27 17:35:17.597 [OCSSD(17237)]CRS-1612: Network communication with node abcdb01 (1) missing for 50% of timeout interval. Removal of this node from cluster in 29.310 seconds 2024-11-27 17:35:32.599 [OCSSD(17237)]CRS-1611: Network communication with node abcdb01 (1) missing for 75% of timeout interval. Removal of this node from cluster in 14.310 seconds 2024-11-27 17:35:41.600 [OCSSD(17237)]CRS-1610: Network communication with node abcdb01 (1) missing for 90% of timeout interval. Removal of this node from cluster in 5.310 seconds 2024-11-27 17:35:45.124 [OHASD(16344)]CRS-8011: reboot advisory message from host: abcdb01, component: cssagent, with time stamp: L-2024-11-27-17:35:45.087 2024-11-27 17:35:45.129 [OHASD(16344)]CRS-8013: reboot advisory message text: Rebooting HUB node after limit 58180 exceeded; disk timeout 58180, network timeout 57620, last heartbeat from CSSD at epoch seconds 1732700086.866, 58210 milli seconds ago based on invariant clock value of 606340488 2024-11-27 17:35:45.129 [OHASD(16344)]CRS-8011: reboot advisory message from host: abcdb01, component: cssmonit, with time stamp: L-2024-11-27-17:35:45.092 2024-11-27 17:35:45.129 [OHASD(16344)]CRS-8013: reboot advisory message text: Rebooting HUB node after limit 58180 exceeded; disk timeout 58180, network timeout 57620, last heartbeat from CSSD at epoch seconds 1732700086.866, 58211 milli seconds ago based on invariant clock value of 606340487 2024-11-27 17:35:46.911 [OCSSD(17237)]CRS-1632: Node abcdb01 is being removed from the cluster in cluster incarnation 345421125 2024-11-27 17:35:47.027 [OCSSD(17237)]CRS-1601: CSSD Reconfiguration complete. Active nodes are abcdb02 . 2024-11-27 17:35:47.085 [CRSD(20928)]CRS-5504: Node down event reported for node 'abcdb01'. 2024-11-27 17:36:25.047 [CRSD(20928)]CRS-2773: Server 'abcdb01' has been removed from pool 'Generic'. 2024-11-27 17:36:25.048 [CRSD(20928)]CRS-2773: Server 'abcdb01' has been removed from pool 'ora.dbm'. 2024-11-27 17:36:25.049 [CRSD(20928)]CRS-2773: Server 'abcdb01' has been removed from pool 'ora.orcl'. 2024-11-27 17:36:25.049 [CRSD(20928)]CRS-2773: Server 'abcdb01' has been removed from pool 'ora.orcl_gdsvc'. 2024-11-27 17:36:25.049 [CRSD(20928)]CRS-2773: Server 'abcdb01' has been removed from pool 'ora.orcl_ghtest1'. 2024-11-27 17:36:25.049 [CRSD(20928)]CRS-2773: Server 'abcdb01' has been removed from pool 'ora.orcl_hdsvc'. 2024-11-27 17:36:25.049 [CRSD(20928)]CRS-2773: Server 'abcdb01' has been removed from pool 'ora.orcl_hdtest1'. 2024-11-27 17:36:25.050 [CRSD(20928)]CRS-2773: Server 'abcdb01' has been removed from pool 'ora.orcl_hdtest2'. 2024-11-27 17:36:25.050 [CRSD(20928)]CRS-2773: Server 'abcdb01' has been removed from pool 'ora.orcl_mdsvc'. 2024-11-27 17:36:25.050 [CRSD(20928)]CRS-2773: Server 'abcdb01' has been removed from pool 'ora.orcl_pdsvc'. 从二节点的集群日志中可以看出,在集群通讯方面二节点在17点35分17秒检测到一节点无响应,随后发起多次检测超时后发起了对一节点的驱逐。 至此,我们可以发现由于一节点自身(磁盘存储)资源检测异常,集群整体状态出现问题,二节点尝试多次与一节点通讯,无果后二节点发起了对一节点的驱逐重启。 接下来我们需要进一步定位当时主机资源特别是存储磁盘资源的情况,参考OSW实时资源跟踪,vmstat日志: zzz <11/27/2024 17:29:51> Count:596 procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu----- r b swpd free buff cache si so bi bo in cs us sy id wa st 16 21 25165820 1295912 15468 20386028 0 0 0 12 0 0 4 2 94 0 0 11 24 25165820 1051788 14616 20369332 399 57 1788 1401 108391 149853 6 6 77 11 0 zzz <11/27/2024 17:29:56> Count:597 procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu----- r b swpd free buff cache si so bi bo in cs us sy id wa st 7 24 25165820 1051772 14616 20369332 0 0 0 12 0 0 4 2 94 0 0 5 24 25165820 1048716 11212 20332012 46 10 1188 1646 154885 225450 1 4 79 6 0 zzz <11/27/2024 17:30:09> Count:598 procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu----- r b swpd free buff cache si so bi bo in cs us sy id wa st 1 11 25165820 1048732 5824 20269800 0 0 0 12 0 0 4 2 94 0 0 2 3 25165820 1048984 4320 20226432 1088 203 7271 2862 337276 543128 1 3 90 7 0 zzz <11/27/2024 17:31:51> Count:599 procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu----- r b swpd free buff cache si so bi bo in cs us sy id wa st 3 40 25165820 1062400 5736 20260464 0 0 0 12 0 0 4 2 94 0 0 4 65 25165820 1048532 8260 20267880 258 42 6561 494 50303 71478 1 3 48 48 0 zzz <11/27/2024 17:32:58> Count:600 procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu----- r b swpd free buff cache si so bi bo in cs us sy id wa st 3 31 25165820 1054632 2812 20224560 0 0 0 12 0 0 4 2 94 0 0 1 32 25165820 1052064 2624 20220740 13 2 578 3 12474 18985 0 2 63 35 0 从vmstat日志我们可以看到17点29分开始陆续出现B队列等待,17点31分堵塞达到了惊人的65,同时伴随着大量IOWAIT等待。 top - 17:28:19 up 305 days, 7:58, 1 user, load average: 50.87, 21.70, 13.98 Tasks: 9906 total, 4 running, 9901 sleeping, 1 stopped, 0 zombie Cpu(s): 1.3%us, 3.9%sy, 0.0%ni, 69.3%id, 25.4%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 264153140k total, 263096780k used, 1056360k free, 11624k buffers Swap: 25165820k total, 25165820k used, 0k free, 20285488k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 14243 oracle 20 0 60.3g 55m 42m S 71.8 0.0 407:35.14 oracle_14243_gh 20116 grid RT 0 2447m 210m 87m S 42.8 0.1 93360:25 ocssd.bin top - 17:28:43 up 305 days, 7:59, 1 user, load average: 101.02, 37.48, 19.55 Tasks: 9920 total, 8 running, 9910 sleeping, 1 stopped, 1 zombie Cpu(s): 1.0%us, 3.7%sy, 0.0%ni, 71.0%id, 24.3%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 264153140k total, 263104216k used, 1048924k free, 3840k buffers Swap: 25165820k total, 25165820k used, 0k free, 20250240k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 14243 oracle 20 0 60.3g 55m 42m R 100.0 0.0 407:51.93 oracle_14243_gh 20116 grid RT 0 2447m 210m 87m S 66.1 0.1 93360:35 ocssd.bin 31850 nscd 20 0 1186m 1584 1048 S 48.1 0.0 7:12.23 nscd 34464 oracle 20 0 60.3g 507m 504m S 9.1 0.2 12086:19 ora_lck0_orcl1 34293 oracle 20 0 60.3g 1.9g 1.9g S 6.3 0.8 4094:29 ora_mmnl_orcl1 top - 17:29:13 up 305 days, 7:59, 1 user, load average: 109.97, 43.90, 22.05 Tasks: 9833 total, 13 running, 9819 sleeping, 1 stopped, 0 zombie Cpu(s): 3.7%us, 5.9%sy, 0.1%ni, 65.1%id, 25.1%wa, 0.0%hi, 0.1%si, 0.0%st Mem: 264153140k total, 261737756k used, 2415384k free, 13532k buffers Swap: 25165820k total, 25165820k used, 0k free, 20379676k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 14243 oracle 20 0 60.3g 55m 42m S 54.4 0.0 408:12.89 oracle_14243_gh 20116 grid RT 0 2447m 210m 87m S 30.0 0.1 93360:46 ocssd.bin 31850 nscd 20 0 1186m 1544 1008 S 22.9 0.0 7:21.05 nscd 277 root 20 0 0 0 0 R 7.8 0.0 121:10.21 kswapd1 124729 root 30 10 23144 11m 872 S 7.7 0.0 0:02.95 lsof 276 root 20 0 0 0 0 R 6.6 0.0 687:19.19 kswapd0 top - 17:30:29 up 305 days, 8:00, 1 user, load average: 115.15, 57.27, 28.25 Tasks: 10048 total, 1 running, 10046 sleeping, 1 stopped, 0 zombie Cpu(s): 0.9%us, 2.6%sy, 0.0%ni, 78.4%id, 18.1%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 264153140k total, 263104216k used, 1048924k free, 10220k buffers Swap: 25165820k total, 25165820k used, 0k free, 20283224k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 20116 grid RT 0 2447m 210m 87m S 70.5 0.1 93361:16 ocssd.bin 34464 oracle 20 0 60.3g 507m 504m S 8.6 0.2 12086:24 ora_lck0_orcl1 80675 oracle -2 0 80.2g 48m 19m S 5.8 0.0 17204:57 ora_lms1_jtcdb1 80666 oracle -2 0 80.2g 53m 18m S 5.7 0.0 19828:09 ora_lms0_jtcdb1 80686 oracle -2 0 80.2g 56m 19m S 5.5 0.0 20126:18 ora_lms2_jtcdb1 80694 oracle -2 0 80.2g 45m 18m S 5.5 0.0 19078:46 ora_lms3_jtcdb1 top - 17:31:43 up 305 days, 8:02, 1 user, load average: 169.11, 87.11, 40.94 Tasks: 10083 total, 1 running, 10078 sleeping, 1 stopped, 3 zombie Cpu(s): 0.9%us, 2.7%sy, 0.0%ni, 85.5%id, 10.9%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 264153140k total, 263103900k used, 1049240k free, 6188k buffers Swap: 25165820k total, 25165820k used, 0k free, 20286868k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 20116 grid RT 0 2447m 210m 87m S 30.0 0.1 93361:41 ocssd.bin 34464 oracle 20 0 60.3g 507m 504m S 3.9 0.2 12086:28 ora_lck0_orcl1 18961 grid 20 0 687m 22m 15m S 3.1 0.0 9414:41 gipcd.bin 16841 root 20 0 1838m 50m 19m S 2.7 0.0 1439:35 ohasd.bin 20098 grid 20 0 1128m 19m 12m S 2.6 0.0 790:58.34 diskmon 1541 oracle 20 0 60.3g 32m 29m S 0.9 0.0 0:01.07 oracle_1541_ghc 1651 oracle 20 0 60.3g 24m 21m S 0.9 0.0 0:00.84 oracle_1651_ghc 1703 oracle 20 0 60.3g 32m 29m D 0.9 0.0 0:00.74 oracle_1703_ghc 1727 oracle 20 0 106m 1812 1232 S 0.9 0.0 0:00.82 emCCR 28871 oracle 20 0 80.2g 28m 25m S 0.9 0.0 28:29.88 ora_p003_jtcdb1 top - 17:31:50 up 305 days, 8:02, 1 user, load average: 169.21, 89.84, 42.32 Tasks: 10112 total, 5 running, 10107 sleeping, 0 stopped, 0 zombie Cpu(s): 1.2%us, 4.0%sy, 0.1%ni, 70.5%id, 24.2%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 264153140k total, 263099340k used, 1053800k free, 6924k buffers Swap: 25165820k total, 25165820k used, 0k free, 20278328k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 20116 grid RT 0 2447m 210m 87m S 52.0 0.1 93361:44 ocssd.bin 18961 grid 20 0 687m 23m 15m S 10.9 0.0 9414:41 gipcd.bin top - 17:32:24 up 305 days, 8:02, 1 user, load average: 184.37, 101.11, 47.59 Tasks: 10128 total, 3 running, 10120 sleeping, 1 stopped, 4 zombie Cpu(s): 0.7%us, 2.6%sy, 0.0%ni, 47.1%id, 49.6%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 264153140k total, 263095852k used, 1057288k free, 6780k buffers Swap: 25165820k total, 25165820k used, 0k free, 20262628k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 20116 grid RT 0 2447m 210m 87m S 38.7 0.1 93361:57 ocssd.bin 34464 oracle 20 0 60.3g 507m 504m S 4.9 0.2 12086:30 ora_lck0_orcl1 18961 grid 20 0 687m 23m 15m S 4.2 0.0 9414:43 gipcd.bin 17110 root 30 10 1269m 76m 1320 S 3.4 0.0 2:34.80 perl 107051 root 20 0 1064m 211m 2136 T 3.3 0.1 16819:31 titanagent top - 17:32:40 up 305 days, 8:03, 2 users, load average: 178.61, 105.18, 50.07 Tasks: 10119 total, 2 running, 10115 sleeping, 1 stopped, 1 zombie Cpu(s): 0.5%us, 2.5%sy, 0.1%ni, 41.3%id, 55.5%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 264153140k total, 263099960k used, 1053180k free, 7852k buffers Swap: 25165820k total, 25165820k used, 0k free, 20246324k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 20116 grid RT 0 2447m 210m 87m S 30.2 0.1 93362:03 ocssd.bin 108179 oracle 20 0 60.3g 2.1g 2.1g D 7.7 0.8 1:06.72 oracle_108179_g 107051 root 20 0 1064m 212m 2820 T 7.5 0.1 16819:33 titanagent 20098 grid 20 0 1128m 19m 12m S 7.1 0.0 791:00.15 diskmon 123551 root 30 10 20932 8768 860 R 6.2 0.0 0:08.95 top top - 17:33:13 up 305 days, 8:03, 1 user, load average: 186.23, 115.36, 55.52 Tasks: 10139 total, 1 running, 10137 sleeping, 1 stopped, 0 zombie Cpu(s): 0.6%us, 2.7%sy, 0.0%ni, 37.9%id, 58.8%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 264153140k total, 263106456k used, 1046684k free, 2376k buffers Swap: 25165820k total, 25165820k used, 0k free, 20216232k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 20116 grid RT 0 2447m 210m 87m S 37.7 0.1 93362:17 ocssd.bin 3251 root 20 0 6063m 203m 1468 S 5.4 0.1 1801:42 java 34464 oracle 20 0 60.3g 507m 504m S 4.7 0.2 12086:32 ora_lck0_orcl1 18961 grid 20 0 687m 23m 15m S 4.1 0.0 9414:44 gi 同样我们从TOP日志我们可以看到17点28分开始陆续出现IOWAIT,17点33更是接近60%。 由此我们可以看出这次一节点重启资源检测异常的原因由于磁盘访问存在大量堵塞或响应慢超过了资源检测的最大容忍程度最终导致集群被逐出重启。 备注:查看二节点故障时按段资源不存在任何IO相关的堵塞。
Oracle 12.1 RAC 一节点重启排查
来源:这里教程网
时间:2026-03-03 20:56:56
作者:
编辑推荐:
下一篇:
相关推荐
-
雷神推出 MIX PRO II 迷你主机:基于 Ultra 200H,玻璃上盖 + ARGB 灯效
2 月 9 日消息,雷神 (THUNDEROBOT) 现已宣布推出基于英
-
制造商 Musnap 推出彩色墨水屏电纸书 Ocean C:支持手写笔、第三方安卓应用
2 月 10 日消息,制造商 Musnap 现已在海外推出一款 Oce
热文推荐
- delete 删除2600万数据
delete 删除2600万数据
26-03-03 - Oracle数据恢复—底层解析数据文件恢复Oracle数据库数据的案例
Oracle数据恢复—底层解析数据文件恢复Oracle数据库数据的案例
26-03-03 - 数据库管理-第266期 MOS改版,该如何登录(20241126)
数据库管理-第266期 MOS改版,该如何登录(20241126)
26-03-03 - 全 网 第 一 份JSON二元性“写操作”的实践
全 网 第 一 份JSON二元性“写操作”的实践
26-03-03 - 第15期Oracle调用DBMS_JOB.SUBMIT报错:ORA-27486:权限不足
- Oracle数据库 Truncate慢分析
Oracle数据库 Truncate慢分析
26-03-03 - 长沙家具宝藏店法拉利沙发,奢华优雅的家居典范
长沙家具宝藏店法拉利沙发,奢华优雅的家居典范
26-03-03 - oracle数据恢复—Oracle数据库文件大小变为0kb怎么恢复数据?
oracle数据恢复—Oracle数据库文件大小变为0kb怎么恢复数据?
26-03-03 - 一万字,浅谈Oracle数据库truncate原理,无备份,如何恢复!
一万字,浅谈Oracle数据库truncate原理,无备份,如何恢复!
26-03-03 - 湖南家具铂金钻奢石餐桌,简约与优雅的完美融合
湖南家具铂金钻奢石餐桌,简约与优雅的完美融合
26-03-03
