Oracle 12.1 RAC 一节点重启排查

来源:这里教程网 时间:2026-03-03 20:56:56 作者:

主机重启排查: 查看一节点操作系统日志/var/log/message,操作系统在17点36分有重启迹象: Nov 27 17:33:57 abcdb01 kernel: martian source 172.16.14.253 from 0.0.0.0, on dev bondeth0 Nov 27 17:33:57 abcdb01 kernel: ll header: ff:ff:ff:ff:ff:ff:00:00:00:00:fe:00:08:00 Nov 27 17:33:57 abcdb01 kernel: martian source 172.16.18.253 from 0.0.0.0, on dev eth0 Nov 27 17:33:57 abcdb01 kernel: ll header: ff:ff:ff:ff:ff:ff:00:00:00:00:fe:00:08:00 Nov 27 17:34:02 abcdb01 kernel: net_ratelimit: 173 callbacks suppressed Nov 27 17:34:02 abcdb01 kernel: martian source 172.16.14.253 from 0.0.0.0, on dev bondeth0 Nov 27 17:34:02 abcdb01 kernel: ll header: ff:ff:ff:ff:ff:ff:00:00:00:00:fe:00:08:00 Nov 27 17:34:02 abcdb01 kernel: martian source 172.16.14.0 from 0.0.0.0, on dev bondeth0 Nov 27 17:34:02 abcdb01 kernel: ll header: ff:ff:ff:Nov 27 17:36:10 abcdb01 kernel: imklog 5.8.10, log source = /proc/kmsg started. Nov 27 17:36:10 abcdb01 rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="6713" x-info="http://www.rsyslog.com"] start Nov 27 17:36:10 abcdb01 kernel: Initializing cgroup subsys cpuset Nov 27 17:36:10 abcdb01 kernel: Initializing cgroup subsys cpu Nov 27 17:36:10 abcdb01 kernel: Linux version 2.6.39-400.248.3.el6uek.x86_64 (mockbuild@ca-build44.us.oracle.com) (gcc version 4.4.6 20110731 (Red Hat 4.4.6-3) (GCC) ) #1 SMP Wed Mar 11 18:04:34 PDT 2015 Nov 27 17:36:10 abcdb01 kernel: Command line: root=LABEL=DBSYS  bootarea=dbsys bootfrom=BOOT ro loglevel=7 panic=60 debug pci=noaer log_buf_len=1m nmi_watchdog=0 nomce transparent_hugepage=never rd_NO_PLYMOUTH audit=1 console=tty1 conso le=ttyS0,115200n8 crashkernel=380M@128M numa=on 继续确认重启的时间通过命令last: root     pts/0        172.16.13.22     Wed Nov 27 17:44 - 17:44  (00:00)     root     pts/1        172.16.13.22     Wed Nov 27 17:43 - 17:44  (00:00)     root     pts/0        172.16.13.22     Wed Nov 27 17:43 - 17:44  (00:00)     reboot   system boot  2.6.39-400.248.3 Wed Nov 27 17:35 - 11:28  (17:53)     root     pts/2        172.16.13.22     Wed Nov 27 17:32 - 17:33  (00:00)     root     pts/5        172.16.13.22     Wed Nov 27 17:27 - 17:27  (00:00)     root     pts/2        172.16.13.22     Wed Nov 27 17:26 - 17:27  (00:01)         从以上last输出日志可以看出重启时间发生在17点35分 进一步排查一节点数据库日志: Wed Nov 27 17:29:52 2024 TT01: Standby redo logfile selected for thread 1 sequence 356065 for destination LOG_ARCHIVE_DEST_10 Wed Nov 27 17:30:07 2024 Global transaction identifier hash collision detected.  in_gti: GDPDB.a5117bd.193.33.346308p dx_gti: UPSPDB.353b8c41.88.6.459703pWed Nov 27 17:31:07 2024 Global transaction identifier hash collision detected. Wed Nov 27 17:31:08 2024 Fatal NI connect error 12170. Fatal NI connect error 12170. ………… Fatal NI connect error 12170.   Tns error struct:     ns main err code: 12535   Time: 27-NOV-2024 17:31:38   Tns error struct:     ns main err code: 12535   Tns error struct:     ns main err code: 12535 TNS-12535: TNS:operation timed out     ns secondary err code: 12606     ns secondary err code: 12606     ns secondary err code: 12606     ns secondary err code: 12606     nt main err code: 0     ns secondary err code: 12606     nt main err code: 0     nt main err code: 0     nt main err code: 0     nt main err code: 0 TNS-12535: TNS:operation timed out     ns secondary err code: 12606     nt main err code: 0     nt secondary err code: 0     nt OS err code: 0   Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=172.16.14.3)(PORT=23409)) Wed Nov 27 17:34:19 2024 WARNING: inbound connection timed out (ORA-3136) Wed Nov 27 17:41:52 2024 Adjusting the default value of parameter inmemory_max_populate_servers  from 36 to 30 based on PGA parameters Starting ORACLE instance (normal) (OS id: 30803) 由于类似日志较多,截取部分,从1节点数据库日志可以看出实例在17点31分陆续出现通讯超时的告警,同时数据库在17点41分发生了实例重启。 进一步排查,集群alert告警日志信息: 2024-11-27 17:06:15.997 [OCTSSD(21408)]CRS-2409: The clock on host abcdb01 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode. 2024-11-27 17:29:07.430 [ORAROOTAGENT(48149)]CRS-5818: Aborted command 'check' for resource 'ora.storage'. Details at (:CRSAGF00113:) {0:15:2} in /u01/app/grid/diag/crs/abcdb01/crs/trace/ohasd_orarootagent_root.trc. 2024-11-27 17:31:39.031 [ORAROOTAGENT(48149)]CRS-5818: Aborted command 'check' for resource 'ora.drivers.acfs'. Details at (:CRSAGF00113:) {0:15:2} in /u01/app/grid/diag/crs/abcdb01/crs/trace/ohasd_orarootagent_root.trc. 2024-11-27 17:31:44.987 [ORAROOTAGENT(48149)]CRS-5818: Aborted command 'check' for resource 'ora.storage'. Details at (:CRSAGF00113:) {0:15:2} in /u01/app/grid/diag/crs/abcdb01/crs/trace/ohasd_orarootagent_root.trc. 2024-11-27 17:32:07.371 [ORAROOTAGENT(48149)]CRS-5014: Agent "ORAROOTAGENT" timed out starting process "/u01/app/12.1.0.2/grid/bin/acfsload" for action "check": details at "(:CLSN00009:)" in "/u01/app/grid/diag/crs/abcdb01/crs/trace/oha sd_orarootagent_root.trc" 2024-11-27 17:32:54.168 [ORAROOTAGENT(48149)]CRS-5832: Agent '/u01/app/12.1.0.2/grid/bin/orarootagent_root' was unable to process commands. Details at (:CRSAGF00128:) {0:0:3325} in /u01/app/grid/diag/crs/abcdb01/crs/trace/ohasd_oraroota gent_root.trc. 2024-11-27 17:40:07.258 [OHASD(17067)]CRS-8500: Oracle Clusterware OHASD process is starting with operating system process ID 17067 2024-11-27 17:40:07.289 [OHASD(17067)]CRS-0714: Oracle Clusterware Release 12.1.0.2.0. 2024-11-27 17:40:07.312 [OHASD(17067)]CRS-2112: The OLR service started on node abcdb01. 从集群的告警日志中可以看出17点29分跟存储磁盘相关的资源检测异常,17点40分集群重新启动。 进一步下钻排查/u01/app/grid/diag/crs/abcdb01/crs/trace/ohasd_orarootagent_root.trc日志信息: 2024-11-27 17:29:06.580783 :CLSDYNAM:3361801984: [ora.diskmon]{0:15:2} [check] DiskmonAgent::check { ………… 2024-11-27 17:29:13.472582 :    AGFW:3370206976: {0:15:2} Command: check for resource: ora.storage 1 1 completed with status: TIMEDOUT 2024-11-27 17:29:13.491383 :    AGFW:3368105728: {0:15:2} ora.storage 1 1 state changed from: ONLINE to: UNKNOWN 2024-11-27 17:29:13.491443 :    AGFW:3368105728: {0:15:2} ora.storage 1 1 would be continued to monitored! 2024-11-27 17:29:13.491456 :    AGFW:3368105728: {0:15:2} Switching offline monitor to intermedite one 2024-11-27 17:29:13.499020 :    AGFW:3368105728: {0:15:2} Started implicit monitor for [ora.storage 1 1] interval=60000 delay=60000 2024-11-27 17:29:13.499034 :    AGFW:3368105728: {0:15:2} Resource ora.storage 1 1 has transitioned to INTERMEDIATE 2024-11-27 17:29:13.499068 :    AGFW:3368105728: {0:15:2} Started timer for [ora.storage 1 1]  delay=600000 2024-11-27 17:29:13.499090 :    AGFW:3368105728: {0:15:2} ora.storage 1 1 state details has changed from:  to: CHECK TIMED OUT 2024-11-27 17:29:13.499104 :    AGFW:3368105728: {0:15:5} Generating new Tint for unplanned state change. Original Tint: {0:15:2} ………… 2024-11-27 17:32:11.054930 :    AGFW:3368105728: {0:15:6} State = 1 2024-11-27 17:32:11.054935 :    AGFW:3368105728: {0:15:6} Check = 0 2024-11-27 17:32:11.054966 :    AGFW:3368105728: {0:15:6} Agent sending message to PE: RESOURCE_STATUS[Proxy] ID 20481:17310408 2024-11-27 17:32:13.805676 :CLSDYNAM:3359700736: [ora.diskmon]{0:15:2} [check] DiskmonAgent::check { 2024-11-27 17:32:13.805766 :CLSDYNAM:3359700736: [ora.diskmon]{0:15:2} [check] DiskmonAgent::check } - 0 2024-11-27 17:32:14.399974 : USRTHRD:2604566272:  HAIP: event GIPCD_METRIC_UPDATE 2024-11-27 17:32:14.400046 : USRTHRD:2606667520: {0:15:2} dequeue change event 0x7fab783ede40, GIPCD_METRIC_UPDATE 2024-11-27 17:32:14.400093 : USRTHRD:2606667520: {0:15:2} InitializeHaIps[ 1]  infList 'inf ib1, ip 192.168.10.2, sub 192.168.8.0' 2024-11-27 17:32:14.400102 : USRTHRD:2606667520: {0:15:2} InitializeHaIps[ 0]  infList 'inf ib0, ip 192.168.10.1, sub 192.168.8.0' 2024-11-27 17:32:14.400108 : USRTHRD:2606667520: {0:15:2} HAIP: updateCssMbrData for ib0:89;ib1:88 2024-11-27 17:32:14.400120 : USRTHRD:2606667520: {0:15:2} HAIP: enter UpdatecssMbrdata 2024-11-27 17:32:14.400142 : USRTHRD:2606667520: {0:15:2} HAIP:  to Update member info HAIP1;192.168.8.0#0;192.168.8.0#1 2024-11-27 17:32:14.400174 : USRTHRD:2606667520: {0:15:2} to verify routes 2024-11-27 17:32:14.400182 : USRTHRD:2606667520: {0:15:2} to verify start completion 2 2024-11-27 17:32:49.811980 :CLSDYNAM:2617243392: [ora.diskmon]{0:15:2} [check] DiskmonAgent::check } - 0 2024-11-27 17:32:52.812672 :CLSDYNAM:3359700736: [ora.diskmon]{0:15:2} [check] DiskmonAgent::check { 2024-11-27 17:32:52.812787 :CLSDYNAM:3359700736: [ora.diskmon]{0:15:2} [check] DiskmonAgent::check } - 0 2024-11-27 17:32:54.155776 :    AGFW:3368105728: {0:0:3325} Agent received the message: AGENT_HB[Engine] ID 12293:10097628 2024-11-27 17:32:54.164370 :    AGFW:3368105728: {0:0:3325} Command: check for resource: ora.storage 1 1 has excedeed the maximum allowed time limit! 2024-11-27 17:32:54.177730 :   AGENT:3368105728: {0:0:3325} {0:0:3325} Created alert : (:CRSAGF00128:) :  Command executor module is unable to process the commands. 2024-11-27 17:32:54.181205 :   AGENT:3368105728: {0:0:3325} Agfw calling user exitCB, will exit on return 2024-11-27 17:32:54.189009 :   AGENT:3368105728: {0:0:3325} returned from user exitCB, exiting 2024-11-27 17:32:54.189044 :    AGFW:3368105728: {0:0:3325} Agent is exiting with exit code: 1 2024-11-27 17:33:01.596879 : USRTHRD:2602465024: {0:15:2} mbr 1 changes mbr private data 2024-11-27 17:33:01.597385 : USRTHRD:2602465024: {0:15:2} enqueue private data change event 0x7fab7ce17e00, CssMbrData, HAIP1;192.168.8.0#0;192.168.8.0#1 2024-11-27 17:33:01.597403 : USRTHRD:2602465024: {0:15:2} cssEventThread: it is 6 event 2024-11-27 17:33:01.597437 : USRTHRD:2606667520: {0:15:2} dequeue change event 0x7fab7ce17e00, CssMbrData ………… 2024-11-27 17:33:01.597665 : USRTHRD:2606667520: {0:15:2} HAIP: hag size 2 2024-11-27 17:33:01.597671 : USRTHRD:2606667520: {0:15:2} HAIP: i 0, haip 0, hag size 2, ip 192.168.10.1 2024-11-27 17:33:01.597678 : USRTHRD:2606667520: {0:15:2} HAIP: i 1, haip 1, hag size 2, ip 192.168.10.2 2024-11-27 17:33:01.597684 : USRTHRD:2606667520: {0:15:2} to verify routes 2024-11-27 17:33:01.597693 : USRTHRD:2606667520: {0:15:2} to verify start completion 2 2024-11-27 17:33:01.599070 : USRTHRD:2606667520: {0:15:2} HAIP: assigned ip '169.254.145.189' 2024-11-27 17:33:01.599077 : USRTHRD:2606667520: {0:15:2} HAIP: check ip '169.254.145.189' 2024-11-27 17:33:01.600240 : USRTHRD:2606667520: {0:15:2} HAIP: assigned ip '169.254.53.70' 2024-11-27 17:33:01.600247 : USRTHRD:2606667520: {0:15:2} HAIP: check ip '169.254.53.70' 2024-11-27 17:33:01.600252 : USRTHRD:2606667520: {0:15:2} Start: 2 HAIP assignment, 2, 1 Trace file /u01/app/grid/diag/crs/abcdb01/crs/trace/ohasd_orarootagent_root.trc Oracle Database 12c Clusterware Release 12.1.0.2.0 - Production Copyright 1996, 2014 Oracle. All rights reserved.     CLSB:30690816: Argument count (argc) for this daemon is 1     CLSB:30690816: Argument 0 is: /u01/app/12.1.0.2/grid/bin/orarootagent.bin 从ohasd_orarootagent_root日志中可以详细看出,集群在17点29分资源检测异常,持续监测到17点33分超过最大允许时间后,伴随着17点40分集群重新启动。 进一步排查二节点集群日志: 2024-11-27 17:35:17.597 [OCSSD(17237)]CRS-1612: Network communication with node abcdb01 (1) missing for 50% of timeout interval.  Removal of this node from cluster in 29.310 seconds 2024-11-27 17:35:32.599 [OCSSD(17237)]CRS-1611: Network communication with node abcdb01 (1) missing for 75% of timeout interval.  Removal of this node from cluster in 14.310 seconds 2024-11-27 17:35:41.600 [OCSSD(17237)]CRS-1610: Network communication with node abcdb01 (1) missing for 90% of timeout interval.  Removal of this node from cluster in 5.310 seconds 2024-11-27 17:35:45.124 [OHASD(16344)]CRS-8011: reboot advisory message from host: abcdb01, component: cssagent, with time stamp: L-2024-11-27-17:35:45.087 2024-11-27 17:35:45.129 [OHASD(16344)]CRS-8013: reboot advisory message text: Rebooting HUB node after limit 58180 exceeded; disk timeout 58180, network timeout 57620, last heartbeat from CSSD at epoch seconds 1732700086.866, 58210 milli seconds ago based on invariant clock value of 606340488 2024-11-27 17:35:45.129 [OHASD(16344)]CRS-8011: reboot advisory message from host: abcdb01, component: cssmonit, with time stamp: L-2024-11-27-17:35:45.092 2024-11-27 17:35:45.129 [OHASD(16344)]CRS-8013: reboot advisory message text: Rebooting HUB node after limit 58180 exceeded; disk timeout 58180, network timeout 57620, last heartbeat from CSSD at epoch seconds 1732700086.866, 58211 milli seconds ago based on invariant clock value of 606340487 2024-11-27 17:35:46.911 [OCSSD(17237)]CRS-1632: Node abcdb01 is being removed from the cluster in cluster incarnation 345421125 2024-11-27 17:35:47.027 [OCSSD(17237)]CRS-1601: CSSD Reconfiguration complete. Active nodes are abcdb02 . 2024-11-27 17:35:47.085 [CRSD(20928)]CRS-5504: Node down event reported for node 'abcdb01'. 2024-11-27 17:36:25.047 [CRSD(20928)]CRS-2773: Server 'abcdb01' has been removed from pool 'Generic'. 2024-11-27 17:36:25.048 [CRSD(20928)]CRS-2773: Server 'abcdb01' has been removed from pool 'ora.dbm'. 2024-11-27 17:36:25.049 [CRSD(20928)]CRS-2773: Server 'abcdb01' has been removed from pool 'ora.orcl'. 2024-11-27 17:36:25.049 [CRSD(20928)]CRS-2773: Server 'abcdb01' has been removed from pool 'ora.orcl_gdsvc'. 2024-11-27 17:36:25.049 [CRSD(20928)]CRS-2773: Server 'abcdb01' has been removed from pool 'ora.orcl_ghtest1'. 2024-11-27 17:36:25.049 [CRSD(20928)]CRS-2773: Server 'abcdb01' has been removed from pool 'ora.orcl_hdsvc'. 2024-11-27 17:36:25.049 [CRSD(20928)]CRS-2773: Server 'abcdb01' has been removed from pool 'ora.orcl_hdtest1'. 2024-11-27 17:36:25.050 [CRSD(20928)]CRS-2773: Server 'abcdb01' has been removed from pool 'ora.orcl_hdtest2'. 2024-11-27 17:36:25.050 [CRSD(20928)]CRS-2773: Server 'abcdb01' has been removed from pool 'ora.orcl_mdsvc'. 2024-11-27 17:36:25.050 [CRSD(20928)]CRS-2773: Server 'abcdb01' has been removed from pool 'ora.orcl_pdsvc'. 从二节点的集群日志中可以看出,在集群通讯方面二节点在17点35分17秒检测到一节点无响应,随后发起多次检测超时后发起了对一节点的驱逐。 至此,我们可以发现由于一节点自身(磁盘存储)资源检测异常,集群整体状态出现问题,二节点尝试多次与一节点通讯,无果后二节点发起了对一节点的驱逐重启。 接下来我们需要进一步定位当时主机资源特别是存储磁盘资源的情况,参考OSW实时资源跟踪,vmstat日志: zzz <11/27/2024 17:29:51> Count:596 procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st 16 21 25165820 1295912  15468 20386028    0    0     0    12    0    0  4  2 94  0  0 11 24 25165820 1051788  14616 20369332  399   57  1788  1401 108391 149853  6  6 77 11  0 zzz <11/27/2024 17:29:56> Count:597 procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st  7 24 25165820 1051772  14616 20369332    0    0     0    12    0    0  4  2 94  0  0  5 24 25165820 1048716  11212 20332012   46   10  1188  1646 154885 225450  1  4 79 6  0 zzz <11/27/2024 17:30:09> Count:598 procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st  1 11 25165820 1048732   5824 20269800    0    0     0    12    0    0  4  2 94  0  0  2  3 25165820 1048984   4320 20226432 1088  203  7271  2862 337276 543128  1  3 90  7  0 zzz <11/27/2024 17:31:51> Count:599 procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st  3 40 25165820 1062400   5736 20260464    0    0     0    12    0    0  4  2 94  0  0  4 65 25165820 1048532   8260 20267880  258   42  6561   494 50303 71478  1  3 48 48  0 zzz <11/27/2024 17:32:58> Count:600 procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st  3 31 25165820 1054632   2812 20224560    0    0     0    12    0    0  4  2 94  0  0  1 32 25165820 1052064   2624 20220740   13    2   578     3 12474 18985  0  2 63 35  0 从vmstat日志我们可以看到17点29分开始陆续出现B队列等待,17点31分堵塞达到了惊人的65,同时伴随着大量IOWAIT等待。 top - 17:28:19 up 305 days,  7:58,  1 user,  load average: 50.87, 21.70, 13.98 Tasks: 9906 total,   4 running, 9901 sleeping,   1 stopped,   0 zombie Cpu(s):  1.3%us,  3.9%sy,  0.0%ni, 69.3%id, 25.4%wa,  0.0%hi,  0.0%si,  0.0%st Mem:  264153140k total, 263096780k used,  1056360k free,    11624k buffers Swap: 25165820k total, 25165820k used,        0k free, 20285488k cached    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND  14243 oracle    20   0 60.3g  55m  42m S 71.8  0.0 407:35.14 oracle_14243_gh  20116 grid      RT   0 2447m 210m  87m S 42.8  0.1  93360:25 ocssd.bin top - 17:28:43 up 305 days,  7:59,  1 user,  load average: 101.02, 37.48, 19.55 Tasks: 9920 total,   8 running, 9910 sleeping,   1 stopped,   1 zombie Cpu(s):  1.0%us,  3.7%sy,  0.0%ni, 71.0%id, 24.3%wa,  0.0%hi,  0.0%si,  0.0%st Mem:  264153140k total, 263104216k used,  1048924k free,     3840k buffers Swap: 25165820k total, 25165820k used,        0k free, 20250240k cached    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND  14243 oracle    20   0 60.3g  55m  42m R 100.0  0.0 407:51.93 oracle_14243_gh  20116 grid      RT   0 2447m 210m  87m S 66.1  0.1  93360:35 ocssd.bin  31850 nscd      20   0 1186m 1584 1048 S 48.1  0.0   7:12.23 nscd  34464 oracle    20   0 60.3g 507m 504m S  9.1  0.2  12086:19 ora_lck0_orcl1  34293 oracle    20   0 60.3g 1.9g 1.9g S  6.3  0.8   4094:29 ora_mmnl_orcl1 top - 17:29:13 up 305 days,  7:59,  1 user,  load average: 109.97, 43.90, 22.05 Tasks: 9833 total,  13 running, 9819 sleeping,   1 stopped,   0 zombie Cpu(s):  3.7%us,  5.9%sy,  0.1%ni, 65.1%id, 25.1%wa,  0.0%hi,  0.1%si,  0.0%st Mem:  264153140k total, 261737756k used,  2415384k free,    13532k buffers Swap: 25165820k total, 25165820k used,        0k free, 20379676k cached    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND  14243 oracle    20   0 60.3g  55m  42m S 54.4  0.0 408:12.89 oracle_14243_gh  20116 grid      RT   0 2447m 210m  87m S 30.0  0.1  93360:46 ocssd.bin  31850 nscd      20   0 1186m 1544 1008 S 22.9  0.0   7:21.05 nscd    277 root      20   0     0    0    0 R  7.8  0.0 121:10.21 kswapd1 124729 root      30  10 23144  11m  872 S  7.7  0.0   0:02.95 lsof    276 root      20   0     0    0    0 R  6.6  0.0 687:19.19 kswapd0  top - 17:30:29 up 305 days,  8:00,  1 user,  load average: 115.15, 57.27, 28.25 Tasks: 10048 total,   1 running, 10046 sleeping,   1 stopped,   0 zombie Cpu(s):  0.9%us,  2.6%sy,  0.0%ni, 78.4%id, 18.1%wa,  0.0%hi,  0.0%si,  0.0%st Mem:  264153140k total, 263104216k used,  1048924k free,    10220k buffers Swap: 25165820k total, 25165820k used,        0k free, 20283224k cached    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND  20116 grid      RT   0 2447m 210m  87m S 70.5  0.1  93361:16 ocssd.bin  34464 oracle    20   0 60.3g 507m 504m S  8.6  0.2  12086:24 ora_lck0_orcl1  80675 oracle    -2   0 80.2g  48m  19m S  5.8  0.0  17204:57 ora_lms1_jtcdb1  80666 oracle    -2   0 80.2g  53m  18m S  5.7  0.0  19828:09 ora_lms0_jtcdb1  80686 oracle    -2   0 80.2g  56m  19m S  5.5  0.0  20126:18 ora_lms2_jtcdb1  80694 oracle    -2   0 80.2g  45m  18m S  5.5  0.0  19078:46 ora_lms3_jtcdb1  top - 17:31:43 up 305 days,  8:02,  1 user,  load average: 169.11, 87.11, 40.94 Tasks: 10083 total,   1 running, 10078 sleeping,   1 stopped,   3 zombie Cpu(s):  0.9%us,  2.7%sy,  0.0%ni, 85.5%id, 10.9%wa,  0.0%hi,  0.0%si,  0.0%st Mem:  264153140k total, 263103900k used,  1049240k free,     6188k buffers Swap: 25165820k total, 25165820k used,        0k free, 20286868k cached    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND  20116 grid      RT   0 2447m 210m  87m S 30.0  0.1  93361:41 ocssd.bin  34464 oracle    20   0 60.3g 507m 504m S  3.9  0.2  12086:28 ora_lck0_orcl1  18961 grid      20   0  687m  22m  15m S  3.1  0.0   9414:41 gipcd.bin  16841 root      20   0 1838m  50m  19m S  2.7  0.0   1439:35 ohasd.bin  20098 grid      20   0 1128m  19m  12m S  2.6  0.0 790:58.34 diskmon   1541 oracle    20   0 60.3g  32m  29m S  0.9  0.0   0:01.07 oracle_1541_ghc   1651 oracle    20   0 60.3g  24m  21m S  0.9  0.0   0:00.84 oracle_1651_ghc   1703 oracle    20   0 60.3g  32m  29m D  0.9  0.0   0:00.74 oracle_1703_ghc   1727 oracle    20   0  106m 1812 1232 S  0.9  0.0   0:00.82 emCCR  28871 oracle    20   0 80.2g  28m  25m S  0.9  0.0  28:29.88 ora_p003_jtcdb1    top - 17:31:50 up 305 days,  8:02,  1 user,  load average: 169.21, 89.84, 42.32 Tasks: 10112 total,   5 running, 10107 sleeping,   0 stopped,   0 zombie Cpu(s):  1.2%us,  4.0%sy,  0.1%ni, 70.5%id, 24.2%wa,  0.0%hi,  0.0%si,  0.0%st Mem:  264153140k total, 263099340k used,  1053800k free,     6924k buffers Swap: 25165820k total, 25165820k used,        0k free, 20278328k cached    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND  20116 grid      RT   0 2447m 210m  87m S 52.0  0.1  93361:44 ocssd.bin  18961 grid      20   0  687m  23m  15m S 10.9  0.0   9414:41 gipcd.bin   top - 17:32:24 up 305 days,  8:02,  1 user,  load average: 184.37, 101.11, 47.59 Tasks: 10128 total,   3 running, 10120 sleeping,   1 stopped,   4 zombie Cpu(s):  0.7%us,  2.6%sy,  0.0%ni, 47.1%id, 49.6%wa,  0.0%hi,  0.0%si,  0.0%st Mem:  264153140k total, 263095852k used,  1057288k free,     6780k buffers Swap: 25165820k total, 25165820k used,        0k free, 20262628k cached    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND  20116 grid      RT   0 2447m 210m  87m S 38.7  0.1  93361:57 ocssd.bin  34464 oracle    20   0 60.3g 507m 504m S  4.9  0.2  12086:30 ora_lck0_orcl1  18961 grid      20   0  687m  23m  15m S  4.2  0.0   9414:43 gipcd.bin  17110 root      30  10 1269m  76m 1320 S  3.4  0.0   2:34.80 perl 107051 root      20   0 1064m 211m 2136 T  3.3  0.1  16819:31 titanagent   top - 17:32:40 up 305 days,  8:03,  2 users,  load average: 178.61, 105.18, 50.07 Tasks: 10119 total,   2 running, 10115 sleeping,   1 stopped,   1 zombie Cpu(s):  0.5%us,  2.5%sy,  0.1%ni, 41.3%id, 55.5%wa,  0.0%hi,  0.0%si,  0.0%st Mem:  264153140k total, 263099960k used,  1053180k free,     7852k buffers Swap: 25165820k total, 25165820k used,        0k free, 20246324k cached    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND  20116 grid      RT   0 2447m 210m  87m S 30.2  0.1  93362:03 ocssd.bin 108179 oracle    20   0 60.3g 2.1g 2.1g D  7.7  0.8   1:06.72 oracle_108179_g 107051 root      20   0 1064m 212m 2820 T  7.5  0.1  16819:33 titanagent  20098 grid      20   0 1128m  19m  12m S  7.1  0.0 791:00.15 diskmon 123551 root      30  10 20932 8768  860 R  6.2  0.0   0:08.95 top        top - 17:33:13 up 305 days,  8:03,  1 user,  load average: 186.23, 115.36, 55.52 Tasks: 10139 total,   1 running, 10137 sleeping,   1 stopped,   0 zombie Cpu(s):  0.6%us,  2.7%sy,  0.0%ni, 37.9%id, 58.8%wa,  0.0%hi,  0.0%si,  0.0%st Mem:  264153140k total, 263106456k used,  1046684k free,     2376k buffers Swap: 25165820k total, 25165820k used,        0k free, 20216232k cached    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND  20116 grid      RT   0 2447m 210m  87m S 37.7  0.1  93362:17 ocssd.bin   3251 root      20   0 6063m 203m 1468 S  5.4  0.1   1801:42 java  34464 oracle    20   0 60.3g 507m 504m S  4.7  0.2  12086:32 ora_lck0_orcl1  18961 grid      20   0  687m  23m  15m S  4.1  0.0   9414:44 gi 同样我们从TOP日志我们可以看到17点28分开始陆续出现IOWAIT,17点33更是接近60%。 由此我们可以看出这次一节点重启资源检测异常的原因由于磁盘访问存在大量堵塞或响应慢超过了资源检测的最大容忍程度最终导致集群被逐出重启。 备注:查看二节点故障时按段资源不存在任何IO相关的堵塞。

相关推荐