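Customer environment:
Operating system version: RHEL 6.9 x86_64
Database version: Oracle RAC 11.2.0.4
Customer's question: why did the storage link failure bring down node 2 while node 1 was unaffected?
ASM error messages reported by the customer: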
2018-05-23 16:21:16.296: [ SKGFD][4072036096]Handle 0x7fffe009f230 from lib :UFS:: for disk :/dev/mapper/ssdredo_1:
2018-05-23 16:21:16.296: [ SKGFD][4072036096]Handle 0x7fffe009fe90 from lib :UFS:: for disk :/dev/mapper/data_1:
2018-05-23 16:21:16.296: [ SKGFD][4072036096]ERROR: -9(Error 27061, OS Error (Linux-x86_64 Error: 5: Input/output error Additional information: -1 Additional information: 512))
2018-05-23 16:21:16.296: [ SKGFD][4072036096]Lib :UFS:: closing handle 0x7fffe00993b0 for disk :/dev/mapper/data_2:
2018-05-23 16:21:16.296: [ SKGFD][4072036096]ERROR: -9(Error 27061, OS Error (Linux-x86_64 Error: 5: Input/output error Additional information: -1 Additional information: 512) )
2018-05-23 16:21:16.296: [ SKGFD][4072036096]Lib :UFS:: closing handle 0x7fffe0099d50 for disk :/dev/mapper/data_3:
2018-05-23 16:21:16.296: [ SKGFD][4072036096]ERROR: -9(Error 27061, OS Error (Linux-x86_64 Error: 5: Input/output error Additional information: -1 Additional information: 512) )
2018-05-23 16:21:16.296: [ SKGFD][4072036096]Lib :UFS:: closing handle 0x7fffe009a970 for disk :/dev/mapper/fra_1:
2018-05-23 16:21:16.296: [ SKGFD][4072036096]ERROR: -9(Error 27061, OS Error (Linux-x86_64 Error: 5: Input/output error Additional information: -1 Additional information: 512) )
2018-05-23 16:21:16.296: [ SKGFD][4072036096]Lib :UFS:: closing handle 0x7fffe009b590 for disk :/dev/mapper/ocr_1:
2018-05-23 16:21:16.296: [ SKGFD][4072036096]ERROR: -9(Error 27061, OS Error (Linux-x86_64 Error: 5: Input/output error Additional information: -1 Additional information: 512) )
2018-05-23 16:21:16.296: [ SKGFD][4072036096]Lib :UFS:: closing handle 0x7fffe009c1b0 for disk :/dev/mapper/ocr_2:
2018-05-23 16:21:16.296: [ SKGFD][4072036096]ERROR: -9(Error 27061, OS Error (Linux-x86_64 Error: 5: Input/output error Additional information: -1 Additional information: 512) )
2018-05-23 16:21:16.296: [ SKGFD][4072036096]Lib :UFS:: closing handle 0x7fffe009cdd0 for disk :/dev/mapper/ocr_3:
2018-05-23 16:21:16.296: [ SKGFD][4072036096]ERROR: -9(Error 27061, OS Error (Linux-x86_64 Error: 5: Input/output error Additional information: -1 Additional information: 512))
2018-05-23 16:21:16.296: [ SKGFD][4072036096]Lib :UFS:: closing handle 0x7fffe009d9f0 for disk :/dev/mapper/ssddata_1:
2018-05-23 16:21:16.296: [ SKGFD][4072036096]ERROR: -9(Error 27061, OS Error (Linux-x86_64 Error: 5: Input/output error Additional information: -1 Additional information: 512))
2018-05-23 16:21:16.296: [ SKGFD][4072036096]Lib :UFS:: closing handle 0x7fffe009e610 for disk :/dev/mapper/ssddata_2:
2018-05-23 16:21:16.296: [ SKGFD][4072036096]ERROR: -9(Error 27061, OS Error (Linux-x86_64 Error: 5: Input/output error Additional information: -1 Additional information: 512) )
2018-05-23 16:21:16.296: [ SKGFD][4072036096]Lib :UFS:: closing handle 0x7fffe009f230 for disk :/dev/mapper/ssdredo_1:
2018-05-23 16:21:16.296: [ SKGFD][4072036096]ERROR: -9(Error 27061, OS Error (Linux-x86_64 Error: 5: Input/output error Additional information: -1 Additional information: 512))
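The trace above repeats the same error for many devices, so a quick summary helps. The following is a minimal sketch, not an Oracle tool: the function name `summarize_skgfd` and both regular expressions are assumptions inferred only from the lines quoted above. It lists the distinct device paths mentioned and counts the OS error numbers reported with Error 27061.

```python
import re
from collections import Counter

# Both patterns are inferred from the excerpt above (assumption),
# not taken from any documented Oracle trace format.
DISK_RE = re.compile(r"for disk :(?P<disk>[^:\s]+):")
ERR_RE = re.compile(r"Error 27061, OS Error \(Linux-x86_64 Error: (?P<errno>\d+):")


def summarize_skgfd(lines):
    """Return (sorted distinct disk paths, Counter of OS errno values)."""
    disks = set()
    errnos = Counter()
    for line in lines:
        disk_match = DISK_RE.search(line)
        if disk_match:
            disks.add(disk_match.group("disk"))
        err_match = ERR_RE.search(line)
        if err_match:
            errnos[int(err_match.group("errno"))] += 1
    return sorted(disks), errnos
```

An errno of 5 is the Linux EIO (Input/output error) shown in the trace. The sketch does not try to pair each error with a specific disk, because the excerpt does not make that pairing unambiguous.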
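The analysis of the single-node Oracle RAC outage reported by the customer is as follows:
The ASM trace above reports I/O errors, so the operating system logs of the cluster nodes need to be checked for storage link problems before the failure. The node 1 operating system log shows: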
May 23 03:37:02 soadb1 rhsmd: In order for Subscription Manager to provide your system with updates, your system must be registered with the Customer Portal. Please enter your Red Hat login to ensure your system is up-to-date.
May 23 16:21:19 soadb1 kernel: lpfc 0000:c4:00.1: 1:1305 Link Down Event xa received Data: xa x20 x800110 x0 x0
May 23 16:21:23 soadb1 kernel: sd 2:0:0:1: rejecting I/O to offline device
May 23 16:21:23 soadb1 kernel: sd 2:0:0:3: rejecting I/O to offline device
May 23 16:21:23 soadb1 kernel: sd 2:0:0:4: rejecting I/O to offline device
May 23 16:21:23 soadb1 kernel: sd 2:0:0:6: rejecting I/O to offline device
May 23 16:21:23 soadb1 kernel: sd 2:0:0:7: rejecting I/O to offline device
May 23 16:21:23 soadb1 kernel: sd 2:0:0:9: rejecting I/O to offline device
May 23 16:21:23 soadb1 kernel: device-mapper: multipath: Failing path 8:176.
May 23 16:21:23 soadb1 kernel: device-mapper: multipath: Failing path 8:208.
May 23 16:21:23 soadb1 kernel: device-mapper: multipath: Failing path 8:224.
May 23 16:21:23 soadb1 kernel: device-mapper: multipath: Failing path 65:0.
May 23 16:21:23 soadb1 kernel: device-mapper: multipath: Failing path 65:16.
May 23 16:21:23 soadb1 kernel: device-mapper: multipath: Failing path 65:208.
The node 2 operating system log before the failure:
May 23 03:44:03 oadb2 rhsmd: In order for Subscription Manager to provide your system with updates, your system must be registered with the Customer Portal. Please enter your Red Hat login to ensure your system is up-to-date.
May 23 16:17:45 oadb2 kernel: lpfc 0000:c4:00.1: 1:1305 Link Down Event x2 received Data: x2 x20 x800110 x0 x0
May 23 16:17:49 oadb2 multipathd: 8:176: mark as failed
May 23 16:17:49 oadb2 multipathd: data_2: Entering recovery mode: max_retries=6
May 23 16:17:49 oadb2 multipathd: data_2: remaining active paths: 0
May 23 16:17:49 oadb2 kernel: sd 2:0:0:1: rejecting I/O to offline device
May 23 16:17:49 oadb2 kernel: device-mapper: multipath: Failing path 8:176.
May 23 16:17:49 oadb2 kernel: sd 2:0:0:4: rejecting I/O to offline device
May 23 16:17:49 oadb2 kernel: sd 2:0:0:4: [sdo] killing request
May 23 16:17:49 oadb2 kernel: sd 2:0:0:5: rejecting I/O to offline device
May 23 16:17:49 oadb2 kernel: sd 2:0:0:5: [sdp] killing request
May 23 16:17:49 oadb2 kernel: sd 2:0:0:6: rejecting I/O to offline device
May 23 16:17:49 oadb2 kernel: sd 2:0:0:6: [sdq] killing request
May 23 16:17:49 oadb2 kernel: sd 2:0:0:4: [sdo]
May 23 16:17:49 oadb2 kernel: sd 2:0:0:5: [sdp] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
May 23 16:17:49 oadb2 kernel: sd 2:0:0:4: [sdo] CDB: Write(10): 2a 00 00 05 00 12 00 00 01 00
May 23 16:17:49 oadb2 kernel: end_request: I/O error, dev sdo, sector 327698
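Taken together, the operating system logs and the ASM alert log indicate that node 2 of the Rural Credit (农信) Oracle cluster went down because the storage link failed, which caused I/O errors on node 2's OCR disks. The sequence of events was:
1. May 23 16:17:45: the node 2 operating system detected the link down event: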
May 23 16:17:45 oadb2 kernel: lpfc 0000:c4:00.1: 1:1305 Link Down Event x2 received Data: x2 x20 x800110 x0 x0
2. May 23 16:17:49: the node 2 kernel reported an I/O request error on device sdo, sector 327698:
May 23 16:17:49 oadb2 kernel: end_request: I/O error, dev sdo, sector 327698
3. May 23 16:17:50: the node 2 operating system log shows multipath paths being failed:
May 23 16:17:50 oadb2 kernel: device-mapper: multipath: Failing path 8:192.
May 23 16:17:50 oadb2 kernel: rport-2:0-8: blocked FC remote port time out: removing target and saving binding
4. May 23 16:17:51: the node 2 operating system log shows multipathd unable to flush the OCR maps:
May 23 16:17:51 oadb2 multipathd: ocr_1: map in use
May 23 16:17:51 oadb2 multipathd: ocr_1: can't flush
May 23 16:17:51 oadb2 multipathd: ocr_1: load table [0 2097152 multipath 1 queue_if_no_path 0 0 0]
May 23 16:17:51 oadb2 multipathd: ocr_1: Entering recovery mode: max_retries=6
May 23 16:17:51 oadb2 multipathd: sdo [8:224]: path removed from map ocr_1
May 23 16:17:51 oadb2 multipathd: sdp: remove path (uevent)
May 23 16:17:51 oadb2 multipathd: ocr_2: map in use
May 23 16:17:51 oadb2 multipathd: ocr_2: can't flush
May 23 16:17:51 oadb2 multipathd: ocr_2: load table [0 2097152 multipath 1 queue_if_no_path 0 0 0]
May 23 16:17:51 oadb2 multipathd: ocr_2: Entering recovery mode: max_retries=6
May 23 16:17:51 oadb2 multipathd: sdp [8:240]: path removed from map ocr_2
May 23 16:17:51 oadb2 multipathd: sdq: remove path (uevent)
May 23 16:17:51 oadb2 multipathd: ocr_3: map in use
May 23 16:17:51 oadb2 multipathd: ocr_3: can't flush
May 23 16:17:51 oadb2 multipathd: ocr_3: load table [0 2097152 multipath 1 queue_if_no_path 0 0 0]
May 23 16:17:51 oadb2 multipathd: ocr_3: Entering recovery mode: max_retries=6
May 23 16:17:51 oadb2 multipathd: sdq [65:0]: path removed from map ocr_3
5. May 23 16:18: multipathd on node 2 disabled queueing on the OCR maps, making the OCR disks unavailable:
May 23 16:18:20 oadb2 multipathd: ocr_1: Disable queueing
May 23 16:18:20 oadb2 multipathd: ocr_2: Disable queueing
May 23 16:18:21 oadb2 multipathd: ocr_3: Disable queueing
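Steps 4 and 5 bracket the period during which multipathd was still queueing I/O for the OCR maps. The following is a minimal sketch for measuring that period from syslog lines. The function name `queueing_windows`, the regular expression, and the default year (syslog lines carry none, so 2018 is taken from the ASM log) are all assumptions for illustration.

```python
import re
from datetime import datetime

# Pattern inferred from the multipathd lines quoted above (assumption).
EVENT_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ multipathd: (?P<map>\S+): "
    r"(?P<event>Entering recovery mode|Disable queueing)"
)


def queueing_windows(lines, year=2018):
    """Seconds from the first 'Entering recovery mode' to the first
    'Disable queueing' for each multipath map."""
    started = {}
    windows = {}
    for line in lines:
        match = EVENT_RE.match(line)
        if not match:
            continue
        # Syslog timestamps have no year, so one is supplied by the caller.
        stamp = datetime.strptime(
            f"{year} {match.group('ts')}", "%Y %b %d %H:%M:%S"
        )
        name = match.group("map")
        if match.group("event") == "Entering recovery mode":
            started.setdefault(name, stamp)  # keep the earliest entry only
        elif name in started and name not in windows:
            windows[name] = (stamp - started[name]).total_seconds()
    return windows
```

Applied to the node 2 lines quoted in steps 4 and 5, this gives 29 seconds for ocr_1 and ocr_2 and 30 seconds for ocr_3, after which the log shows queueing being disabled (the same lines report `max_retries=6`).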
6. May 23 16:21: the node 2 operating system log shows the core Oracle RAC cluster process ocssd failing and dumping core:
May 23 16:21:08 oadb2 abrt[52037]: Saved core dump of pid 27479 (/u01/app/11.2.0/grid/bin/ocssd.bin) to /var/spool/abrt/ccpp-2018-05-23-16:21:07-27479 (101912576 bytes)
May 23 16:21:08 oadb2 abrtd: Directory 'ccpp-2018-05-23-16:21:07-27479' creation detected
May 23 16:21:09 oadb2 abrtd: Executable '/u01/app/11.2.0/grid/bin/ocssd.bin' doesn't belong to any package and ProcessUnpackaged is set to 'no'
May 23 16:21:09 oadb2 abrtd: 'post-create' on '/var/spool/abrt/ccpp-2018-05-23-16:21:07-27479' exited with 1
May 23 16:21:09 oadb2 abrtd: Deleting problem directory '/var/spool/abrt/ccpp-2018-05-23-16:21:07-27479'
May 23 16:21:11 oadb2 ntpd[8087]: Deleting interface #8 bond1:1, 169.254.211.6#123, interface stats: received=0, sent=0, dropped=0, active_time=3515469 secs
7. May 23 16:21: the node 2 ASM log shows that, after repeated failed attempts to mount the DATA disk group, the instance was terminated by PMON:
Wed May 23 16:18:52 2018
ERROR: no read quorum in group: required 2, found 0 disks
NOTE: cache dismounting (clean) group 1/0x5ABDEE06 (DATA)
NOTE: messaging CKPT to quiesce pins Unix process pid: 29399, image: oracle@soadb2 (TNS V1-V3)
NOTE: dbwr not being msg'd to dismount
NOTE: lgwr not being msg'd to dismount
NOTE: cache dismounted group 1/0x5ABDEE06 (DATA)
NOTE: cache ending mount (fail) of group DATA number=1 incarn=0x5abdee06
NOTE: cache deleting context for group DATA 1/0x5abdee06
GMON dismounting group 1 at 37 for pid 29, osid 29399
ERROR: diskgroup DATA was not mounted
ORA-15032: not all alterations performed
ORA-15017: diskgroup "DATA" cannot be mounted
ORA-15040: diskgroup is incomplete
ORA-27061: waiting for async I/Os failed
Linux-x86_64 Error: 5: Input/output error
Additional information: -1
Additional information: 4096
ORA-27061: waiting for async I/Os failed
Linux-x86_64 Error: 5: Input/output error
Additional information: -1
Additional information: 4096
ORA-27061: waiting for async I/Os failed
Linux-x86_64 Error: 5: Input/output error
Additional information: -1
ERROR: ALTER DISKGROUP DATA MOUNT /* asm agent *//* {0:7:10794} */
Wed May 23 16:21:07 2018
NOTE: ASMB process exiting, either shutdown is in progress
NOTE: or foreground connected to ASMB was killed.
Wed May 23 16:21:07 2018
PMON (ospid: 28806): terminating the instance due to error 481
Wed May 23 16:21:08 2018
ORA-1092 : opitsk aborting process
Wed May 23 16:21:08 2018
License high water mark = 12
Instance terminated by PMON, pid = 28806
USER (ospid: 52046): terminating the instance
Instance terminated by USER, pid = 52046
Summary: for the OA system, node 1's log, unlike node 2's, shows no "Disable queueing" for the OCR disk group, so the DB and GI instances on node 1 were not affected.
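The comparison in the summary can be repeated mechanically on each node's syslog. This is a minimal sketch under stated assumptions: the function name `maps_with_queueing_disabled` is hypothetical, and the pattern is inferred from the multipathd lines quoted in this article.

```python
import re

# Pattern inferred from the quoted multipathd messages (assumption).
DISABLE_RE = re.compile(r"multipathd: (?P<map>\S+): Disable queueing")


def maps_with_queueing_disabled(lines):
    """Return the sorted names of multipath maps that logged 'Disable queueing'."""
    names = set()
    for line in lines:
        match = DISABLE_RE.search(line)
        if match:
            names.add(match.group("map"))
    return sorted(names)
```

Run against each node's syslog (typically `/var/log/messages` on RHEL 6), an empty result on node 1 and the three OCR maps on node 2 would match the difference described in the summary.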