Recently, while testing a production database, we found that the open phase was very slow, usually lasting about 20 minutes and sometimes 1-2 hours. The alert log looked like this:

Fri Nov 23 10:53:51 2018
SMON: enabling tx recovery
Database Characterset is AL32UTF8
Fri Nov 23 10:54:02 2018
Decreasing number of real time LMS from 2 to 0
Fri Nov 23 10:54:05 2018
No Resource Manager plan active
Fri Nov 23 10:54:47 2018
Starting background process GTX0
Fri Nov 23 10:54:47 2018
GTX0 started with pid=41, OS id=14549130
Starting background process RCBG
Fri Nov 23 10:54:48 2018                 <==================
RCBG started with pid=43, OS id=14811234
replication_dependency_tracking turned off (no async multimaster replication found)
Fri Nov 23 14:07:01 2018                 <==================
Starting background process QMNC
Fri Nov 23 14:07:01 2018
QMNC started with pid=31, OS id=28246190
Fri Nov 23 14:07:57 2018
Completed: alter database open

Over three hours (10:54:48 to 14:07:01) elapsed between the start of the RCBG process and the start of the QMNC process, while the rest of the open phase took only about one minute.

To dig further into why the RCBG-to-QMNC step was so slow, the following tracing was done.

1. Oracle hanganalyze

Start the database to MOUNT state and issue the open. When the alert log begins printing the lines shown above, collect dumps three times from another SQL*Plus window, 2-3 minutes apart. The steps:

sqlplus / as sysdba
oradebug setmypid
oradebug -G all dump systemstate 267
oradebug -G all hanganalyze 2
-- wait 2-3 minutes
oradebug -G all dump systemstate 267
oradebug -G all hanganalyze 2
-- wait 2-3 minutes
oradebug -G all dump systemstate 267
oradebug -G all hanganalyze 2
oradebug tracefile_name

The alert log recorded the dump:

No Resource Manager plan active
Fri Nov 23 15:09:25 2018
Starting background process GTX0
Fri Nov 23 15:09:25 2018
GTX0 started with pid=41, OS id=8585312
Starting background process RCBG
Fri Nov 23 15:09:26 2018
RCBG started with pid=43, OS id=18809034
replication_dependency_tracking turned off (no async multimaster replication found)
Fri Nov 23 15:39:33 2018
System State dumped to trace file /u01/app/oracle/diag/rdbms/erpdb/erpdb1/trace/erpdb1_diag_31850618.trc

From the trace file, the waits seen during the open were as follows.
-- Trace File --
Starting Systemstate 1
.........................................
Ass.Awk Version 1.0.31
~~~~~~~~~~~~~~~~~~~~~~
Source file : symbols1_diag_31850618.trc

System State 1 (2018-11-23 15:39:32.932)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 1: [DEAD]
 2: waiting for 'pmon timer'
 3: waiting for 'rdbms ipc message'
 4: waiting for 'VKTM Logical Idle Wait'
 5: waiting for 'rdbms ipc message'
 6:
 7: waiting for 'rdbms ipc message'
 8: waiting for 'PING'
 9: waiting for 'rdbms ipc message'
10: waiting for 'DIAG idle wait'
11: waiting for 'rdbms ipc message'
12: waiting for 'ges remote message'      <=== LMD
13: waiting for 'gcs remote message'      <=== LMS
14: waiting for 'gcs remote message'      <=== LMS
15: waiting for 'rdbms ipc message'
16: waiting for 'GCR sleep'
17: waiting for 'rdbms ipc message'
18: waiting for 'rdbms ipc message'
19: waiting for 'rdbms ipc message'
20: waiting for 'rdbms ipc message'
21: waiting for 'rdbms ipc message'
22: waiting for 'smon timer'
23: waiting for 'rdbms ipc message'
24: waiting for 'rdbms ipc message'
25: waiting for 'ASM background timer'
26: waiting for 'rdbms ipc message'
27: waiting for 'rdbms ipc message'
28:
29:
30: waiting for 'wait for unread message on broadcast channel'
31: waiting for 'class slave wait'
32: waiting for 'rdbms ipc message'
33: waiting for 'rdbms ipc message'
34: waiting for 'db file scattered read' (0x4f,0xa2766,0xb)
35: waiting for 'rdbms ipc reply' (from_process=6)[IPC 6]
37: waiting for 'rdbms ipc message'
38: waiting for 'rdbms ipc message'
39: waiting for 'rdbms ipc message'
40: waiting for 'rdbms ipc message'
41: waiting for 'rdbms ipc message'
43: waiting for 'wait for unread message on broadcast channel'
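With dozens of processes sitting in idle waits, it helps to tally the wait events rather than read the list line by line. A small sketch (not from the original post; the sample lines are condensed from the dump above) that counts each distinct 'waiting for' event:

```shell
# Tally the per-process wait lines of an Ass.awk-style systemstate summary
# so the unusual waits stand out among the idle 'rdbms ipc message' entries.
summary=$(mktemp)
cat > "$summary" <<'EOF'
12: waiting for 'ges remote message'
13: waiting for 'gcs remote message'
14: waiting for 'gcs remote message'
33: waiting for 'rdbms ipc message'
34: waiting for 'db file scattered read' (0x4f,0xa2766,0xb)
35: waiting for 'rdbms ipc reply' (from_process=6)[IPC 6]
EOF

# Extract the quoted event name, then count occurrences, most frequent first.
counts=$(sed -n "s/.*waiting for '\([^']*\)'.*/\1/p" "$summary" |
         sort | uniq -c | sort -rn)
echo "$counts"
rm -f "$summary"
```

Running the same tally over each of the three systemstate snapshots makes it easy to see which non-idle waits (here 'gcs remote message' and 'db file scattered read') persist across the 2-3 minute intervals.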
For one of the LMS processes, the process-state section of the dump showed:

flags2: (0x409) -/-/INC
DID: , short-term DID:
txn branch: 0x0
edition#: 0
oct: 0, prv: 0, sql: 0x0, psql: 0x0, user: 0/SYS
ksuxds FALSE at location: 0
service name: SYS$BACKGROUND
Current Wait Stack:
 0: waiting for 'gcs remote message'
    waittime=0x1e, poll=0x0, event=0x0
    wait_id=51479 seq_num=51480 snap_id=1
    wait times: snap=0.000577 sec, exc=0.000577 sec, total=0.000577 sec
    wait times: max=0.030000 sec, heur=31 min 42 sec    <==============

The heur value of 31 min 42 sec should be the cumulative wait time from instance startup to the time of the dump, which suggested that GCS message transfer over the private interconnect might have a problem. To rule out the private NIC, OSWatcher was deployed; the data it collected showed no problem with either the network or system resources.

2. To pin down the cause further, the open was traced with truss:

truss -aefdo /tmp/truss.out sqlplus / as sysdba
alter database open;

Analysis of the truss log shows:

14876838: 56950981: 2.0001: kread(9, "01 -\0\006\0\0\0\0\011 i".., 8208) (sleeping...)
21954596: 92471451: 2.0001: kread(0, "\n l t e r d a t a b a".., 4096) (sleeping...)
21954596: 92471451: kread(0, " e x i t\n d a t a b a".., 4096) = 5
21954596: 92471451: 1344.6921: kfcntl(1, F_GETFL, 0x0000000000000008) = 67110914    <===

Since the kread just before it read "exit" from the terminal, the kfcntl above is an OS call issued after the open had already completed and the session was exiting, so its 1344.6921 seconds must not be counted against the open. With that excluded, the main issue is asynchronous I/O: listio64 was called as many as 123,644 times, with individual waits ranging from 0.0003 s to 0.5271 s:

14876838: 56950981: kpread(24, 0x07000107C5755000, 4096, 0x0000004FBAD2A000, 0x000000000000007D) = 4096
14876838: 56950981: kpread(24, 0x07000107C5E14000, 4096, 0x000000E41E85F000, 0x000000000000007D) = 4096
14876838: 56950981: 0.5271: listio64(0x0000000010000004, 0x000000000FFFFFFF, 0x00000000FFFCC1E0, 0x0000000000000001, 0x0000000000000000, 0x0000000000000000, 0x00000001116E89A0, 0x0000000000000040) = 0x0000000000000000

One thing remained odd: by open time, crash recovery has already completed, so the data files should only be read, not written, yet why so many kpread calls appeared here was still puzzling. And the original question stood: why was the interval from the start of RCBG to the start of QMNC so long, when the other background-process steps of the open took only about a minute in total?

The analysis had come full circle. Could it be an Oracle bug? To confirm, a search on MOS turned up the following note:

Procedure to Manually Coalesce All the IOTs / Indexes Associated with Advanced Queueing Tables to Maintain Enqueue / Dequeue Performance; Reduce QMON CPU Usage and Redo Generation (Doc ID 271855.1)

To verify the note, the fix was applied using the script it provides. The database then opened normally, and quickly. The alert log:

Database Characterset is AL32UTF8
No Resource Manager plan active
Starting background process GTX0
Fri Nov 23 16:39:33 2018
GTX0 started with pid=36, OS id=16056540
Starting background process RCBG
Fri Nov 23 16:41:23 2018
RCBG started with pid=37, OS id=26214482
replication_dependency_tracking turned off (no async multimaster replication found)
Starting background process QMNC
Fri Nov 23 16:42:23 2018
QMNC started with pid=39, OS id=14287094
Fri Nov 23 16:43:33 2018
Completed: ALTER DATABASE OPEN

The follow-up to this case: if Advanced Queueing is used heavily, the database's AQ objects can be defragmented (coalesced) periodically.
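That periodic maintenance can be scripted. As an illustrative sketch only: the SYS owner, the AQ$_<queue_table>_I/_T/_H IOT naming, and the ALERT_QT queue table name below are assumptions for demonstration; the authoritative object list and coalesce script come from MOS note 271855.1 itself.

```shell
# Hypothetical helper: print COALESCE DDL for the IOTs commonly associated
# with an AQ queue table. Owner and suffixes are assumed for illustration;
# MOS 271855.1 derives the real object list from the data dictionary.
emit_coalesce() {
  qt=$1
  for suffix in I T H; do
    printf 'ALTER TABLE SYS.AQ$_%s_%s COALESCE;\n' "$qt" "$suffix"
  done
}

ddl=$(emit_coalesce ALERT_QT)   # ALERT_QT is a made-up queue table name
echo "$ddl"
```

The generated statements could then be reviewed and run in SQL*Plus during a maintenance window; an online COALESCE avoids the locking that a rebuild would take.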

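A closing note on the truss analysis: the per-call timings that pointed at async I/O can be aggregated instead of read one by one. A sketch, with sample lines condensed from the truss output above (the layout assumes truss ran with -d, so the elapsed time is the third colon-separated field):

```shell
# Count listio64 calls in a truss log and sum their elapsed times.
trussout=$(mktemp)
cat > "$trussout" <<'EOF'
14876838: 56950981: 0.5271: listio64(0x0000000010000004, 0x000000000FFFFFFF) = 0
14876838: 56950981: 0.0003: listio64(0x0000000010000004, 0x000000000FFFFFFF) = 0
14876838: 56950981: kpread(24, 0x07000107C5755000, 4096) = 4096
EOF

# Split on ':' or '(' so $3 is the elapsed-time field for timed calls.
stats=$(awk -F'[:(]' '/listio64/ { n++; t += $3 }
        END { printf "listio64 calls=%d total_wait=%.4fs", n, t }' "$trussout")
echo "$stats"
rm -f "$trussout"
```

On the real log this immediately yields the call count (123,644 here) and the total time spent in listio64, which is a far quicker way to attribute the stall than paging through the raw trace.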