Diagnosing a Slow Database Open: A Case Analysis


Recently, while testing a production database, I found that the database spent a very long time in the open phase: usually around 20 minutes, sometimes as long as one or two hours. The alert log looked like this:

Fri Nov 23 10:53:51 2018
SMON: enabling tx recovery
Database Characterset is AL32UTF8
Fri Nov 23 10:54:02 2018
Decreasing number of real time LMS from 2 to 0
Fri Nov 23 10:54:05 2018
No Resource Manager plan active
Fri Nov 23 10:54:47 2018
Starting background process GTX0
Fri Nov 23 10:54:47 2018
GTX0 started with pid=41, OS id=14549130
Starting background process RCBG
Fri Nov 23 10:54:48 2018 <==================
RCBG started with pid=43, OS id=14811234
replication_dependency_tracking turned off (no async multimaster replication found)
Fri Nov 23 14:07:01 2018 <==================== just over three hours (about 190 minutes) elapsed between the start of RCBG and the start of QMNC, while every other step of the open took about a minute in total.
Starting background process QMNC
Fri Nov 23 14:07:01 2018
QMNC started with pid=31, OS id=28246190
Fri Nov 23 14:07:57 2018
Completed: alter database open

To dig further into why RCBG and QMNC were slow to start, I collected the following traces:

1. Take Oracle hanganalyze and systemstate dumps. With the database started to the mount state, when the alert log prints the lines shown above, collect dumps three times from a separate sqlplus window, one round every 2-3 minutes. The steps:

sqlplus / as sysdba
oradebug setmypid
oradebug -G all dump systemstate 267
oradebug -G all hanganalyze 2
(wait 2-3 minutes)
oradebug -G all dump systemstate 267
oradebug -G all hanganalyze 2
(wait 2-3 minutes)
oradebug -G all dump systemstate 267
oradebug -G all hanganalyze 2
oradebug tracefile_name

The alert log recorded the following:

No Resource Manager plan active
Fri Nov 23 15:09:25 2018
Starting background process GTX0
Fri Nov 23 15:09:25 2018
GTX0 started with pid=41, OS id=8585312
Starting background process RCBG
Fri Nov 23 15:09:26 2018
RCBG started with pid=43, OS id=18809034
replication_dependency_tracking turned off (no async multimaster replication found)
Fri Nov 23 15:39:33 2018
System State dumped to trace file /u01/app/oracle/diag/rdbms/erpdb/erpdb1/trace/erpdb1_diag_31850618.trc
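As an aside, the three rounds of dumps above can be scripted rather than typed by hand. A minimal sketch of the same sequence, run as sysdba from a second session while the instance hangs in open; "oradebug unlimit" (to lift the trace size cap) and the HOST sleeps are my additions, not part of the original steps:

-- collect_hang.sql: a sketch of the manual collection above.
oradebug setmypid
oradebug unlimit
oradebug -G all dump systemstate 267
oradebug -G all hanganalyze 2
host sleep 150
oradebug -G all dump systemstate 267
oradebug -G all hanganalyze 2
host sleep 150
oradebug -G all dump systemstate 267
oradebug -G all hanganalyze 2
oradebug tracefile_name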

From the trace file, Oracle was waiting on the following during the open:

1. -- Trace File --
Starting Systemstate 1
.........................................
Ass.Awk Version 1.0.31
~~~~~~~~~~~~~~~~~~~~~~
Source file : symbols1_diag_31850618.trc

System State 1 (2018-11-23 15:39:32.932)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 1: [DEAD]
 2: waiting for 'pmon timer'
 3: waiting for 'rdbms ipc message'
 4: waiting for 'VKTM Logical Idle Wait'
 5: waiting for 'rdbms ipc message'
 6:
 7: waiting for 'rdbms ipc message'
 8: waiting for 'PING'
 9: waiting for 'rdbms ipc message'
10: waiting for 'DIAG idle wait'
11: waiting for 'rdbms ipc message'
12: waiting for 'ges remote message'   <=== lmd
13: waiting for 'gcs remote message'   <=== lms
14: waiting for 'gcs remote message'   <=== lms
15: waiting for 'rdbms ipc message'
16: waiting for 'GCR sleep'
17: waiting for 'rdbms ipc message'
18: waiting for 'rdbms ipc message'
19: waiting for 'rdbms ipc message'
20: waiting for 'rdbms ipc message'
21: waiting for 'rdbms ipc message'
22: waiting for 'smon timer'
23: waiting for 'rdbms ipc message'
24: waiting for 'rdbms ipc message'
25: waiting for 'ASM background timer'
26: waiting for 'rdbms ipc message'
27: waiting for 'rdbms ipc message'
28:
29:
30: waiting for 'wait for unread message on broadcast channel'
31: waiting for 'class slave wait'
32: waiting for 'rdbms ipc message'
33: waiting for 'rdbms ipc message'
34: waiting for 'db file scattered read' (0x4f,0xa2766,0xb)
35: waiting for 'rdbms ipc reply' (from_process=6)[IPC 6]
37: waiting for 'rdbms ipc message'
38: waiting for 'rdbms ipc message'
39: waiting for 'rdbms ipc message'
40: waiting for 'rdbms ipc message'
41: waiting for 'rdbms ipc message'
43: waiting for 'wait for unread message on broadcast channel'
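Most of these are routine idle background waits; the ones that stand out are process 34 on 'db file scattered read' and the LMS processes (13, 14) on 'gcs remote message'. As an aside, the same picture can be watched live rather than read out of a dump; a minimal sketch of my own (not from the original case), usable from another sysdba session since the v$ fixed views are readable while the instance is still only mounted:

-- Watch background-process waits live while the open hangs.
SELECT s.sid, s.program, w.event, w.seconds_in_wait
FROM   v$session s, v$session_wait w
WHERE  s.sid  = w.sid
AND    s.type = 'BACKGROUND'
ORDER  BY w.seconds_in_wait DESC;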

2. flags2: (0x409) -/-/INC
DID: , short-term DID:
txn branch: 0x0
edition#: 0  oct: 0, prv: 0, sql: 0x0, psql: 0x0, user: 0/SYS
ksuxds FALSE at location: 0
service name: SYS$BACKGROUND
Current Wait Stack:
 0: waiting for 'gcs remote message'
    waittime=0x1e, poll=0x0, event=0x0
    wait_id=51479 seq_num=51480 snap_id=1
    wait times: snap=0.000577 sec, exc=0.000577 sec, total=0.000577 sec
    wait times: max=0.030000 sec, heur=31 min 42 sec <============== heur should be the cumulative wait time from instance startup up to the dump; it suggests that gcs message transfer over the private interconnect might have a problem.

To rule out the private network cards, OSWatcher was deployed; the data it collected showed nothing wrong with either the network or system resources.

2. To pin the cause down further, the open was traced with truss:

truss -aefdo /tmp/truss.out sqlplus / as sysdba
alter database open;

Analysis of the truss log shows:

14876838: 56950981: 2.0001: kread(9, "01 -\0\006\0\0\0\0\011 i".., 8208) (sleeping...)
21954596: 92471451: 2.0001: kread(0, "\n l t e r d a t a b a".., 4096) (sleeping...)
21954596: 92471451: kread(0, " e x i t\n d a t a b a".., 4096) = 5
21954596: 92471451: 1344.6921: kfcntl(1, F_GETFL, 0x0000000000000008) = 67110914 <===

Since the terminal input read by the kread just before it is "exit", this kfcntl must be an OS call made after the open had already completed and sqlplus was exiting, so its 1344.6921 seconds cannot be charged to the open itself. With that excluded, the main suspect is asynchronous I/O: listio64 was called as many as 123644 times, with individual waits ranging from 0.0003 to 0.5271 seconds.

14876838: 56950981: kpread(24, 0x07000107C5755000, 4096, 0x0000004FBAD2A000, 0x000000000000007D) = 4096
14876838: 56950981: kpread(24, 0x07000107C5E14000, 4096, 0x000000E41E85F000, 0x000000000000007D) = 4096
14876838: 56950981: 0.5271: listio64(0x0000000010000004, 0x000000000FFFFFFF, 0x00000000FFFCC1E0, 0x0000000000000001, 0x0000000000000000, 0x0000000000000000, 0x00000001116E89A0, 0x0000000000000040) = 0x0000000000000000 ---Continue---

One thing remained puzzling: by the time of the open, crash recovery has already completed, so the data files should only be read, not written, yet a surprising number of kpread calls show up here. And why is the gap between the start of RCBG and the start of QMNC so long, when the other background processes take about a minute in total during the open?

After all the analysis above, the problem was back at square one. Could it be an Oracle bug? To confirm, a search on MOS turned up the following note:

Procedure to Manually Coalesce All the IOTs / Indexes Associated with Advanced Queueing Tables to Maintain Enqueue / Dequeue Performance; Reduce QMON CPU Usage and Redo Generation (Doc ID 271855.1)

To verify the note, the fix was applied with the scripts it provides (a rough sketch of the idea is given at the end of this article). The database then opened normally, and quickly. The alert log:

Database Characterset is AL32UTF8
No Resource Manager plan active
Starting background process GTX0
Fri Nov 23 16:39:33 2018
GTX0 started with pid=36, OS id=16056540
Starting background process RCBG
Fri Nov 23 16:41:23 2018
RCBG started with pid=37, OS id=26214482
replication_dependency_tracking turned off (no async multimaster replication found)
Starting background process QMNC
Fri Nov 23 16:42:23 2018
QMNC started with pid=39, OS id=14287094
Fri Nov 23 16:43:33 2018
Completed: ALTER DATABASE OPEN

The follow-up to this case: if Advanced Queueing is used heavily, the AQ objects can be coalesced (defragmented) from time to time.
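The actual procedure and scripts are in MOS Doc ID 271855.1; what follows is only a rough, hypothetical sketch of the same idea. The AQ$_% name filter and the spool file name coalesce_aq.sql are illustrative assumptions, and the generated DDL should be reviewed before it is run:

-- NOT the script from Doc ID 271855.1: generate COALESCE statements
-- for the AQ-related IOTs and indexes, run as a DBA.
SET pagesize 0 linesize 200 feedback off
SPOOL coalesce_aq.sql
SELECT 'ALTER TABLE "' || owner || '"."' || table_name || '" COALESCE;'
FROM   dba_tables
WHERE  iot_type = 'IOT'
AND    table_name LIKE 'AQ$\_%' ESCAPE '\';
SELECT 'ALTER INDEX "' || owner || '"."' || index_name || '" COALESCE;'
FROM   dba_indexes
WHERE  index_name LIKE 'AQ$\_%' ESCAPE '\';
SPOOL OFF
-- Review coalesce_aq.sql, then run it: @coalesce_aq.sql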

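If AQ traffic is steady, the coalesce can also be put on a schedule rather than run by hand. A hypothetical sketch; the procedure name aq_coalesce, the job name AQ_COALESCE_JOB, and the monthly frequency are my own choices, not from the case:

-- Wrap the IOT coalesce in a procedure and schedule it monthly.
CREATE OR REPLACE PROCEDURE aq_coalesce AS
BEGIN
  FOR t IN (SELECT owner, table_name
            FROM   dba_tables
            WHERE  iot_type = 'IOT'
            AND    table_name LIKE 'AQ$\_%' ESCAPE '\')
  LOOP
    EXECUTE IMMEDIATE 'ALTER TABLE "' || t.owner || '"."'
                      || t.table_name || '" COALESCE';
  END LOOP;
END;
/
BEGIN
  dbms_scheduler.create_job(
    job_name        => 'AQ_COALESCE_JOB',
    job_type        => 'STORED_PROCEDURE',
    job_action      => 'AQ_COALESCE',
    repeat_interval => 'FREQ=MONTHLY',
    enabled         => TRUE);
END;
/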