环境介绍:
Oracle 版本:11.2.0.4.191015 ,操作系统:Redhat linux server 7.5 ,系统架构:两节点RAC 。
故障现象
业务层反馈数据库无法正常写入,无法切换归档日志。
问题分析
1 、Mon Oct 12 12:26:02 开始多次报错,内存坏块,持续到13 号
Mon Oct 12 12:26:02 2020
Errors in file /opt/oracle/diag/rdbms/cwgkuap/cwgkuap2/trace/cwgkuap2_ora_183325.trc (incident=314289):
ORA-00600: 内部错误代码, 参数: [kdsgrp1], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /opt/oracle/diag/rdbms/cwgkuap/cwgkuap2/incident/incdir_314289/cwgkuap2_ora_183325_i314289.trc
Non critical error ORA-48913 caught while writing to trace file "/opt/oracle/diag/rdbms/cwgkuap/cwgkuap2/incident/incdir_314289/cwgkuap2_ora_183325_i314289.trc"
Error message: ORA-48913: 写入跟踪文件失败, 达到了文件大小限制 [5000000]
Writing to the above trace file is disabled for now on...
Mon Oct 12 12:26:08 2020
Tue Oct 13 08:53:38 2020
Errors in file /opt/oracle/diag/rdbms/cwgkuap/cwgkuap2/trace/cwgkuap2_ora_162939.trc (incident=314046):
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], [], [], [], [], []
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details. 2 、Thu Oct 29 10:11:06 报错, (34, 632163) 块corruption ,触发SQL 是一个更新操作,后续一直报错,并伴随blockrecover 。
Thu Oct 29 10:11:06 2020
Errors in file /opt/oracle/diag/rdbms/cwgkuap/cwgkuap2/trace/cwgkuap2_ora_368867.trc (incident=313977):
ORA-00600: 内部错误代码, 参数: [17182], [0x7F780EE5EF08], [], [], [], [], [], [], [], [], [], []
Incident details in: /opt/oracle/diag/rdbms/cwgkuap/cwgkuap2/incident/incdir_313977/cwgkuap2_ora_368867_i313977.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Non critical error ORA-48913 caught while writing to trace file "/opt/oracle/diag/rdbms/cwgkuap/cwgkuap2/incident/incdir_313977/cwgkuap2_ora_368867_i313977.trc"
Error message: ORA-48913: 写入跟踪文件失败, 达到了文件大小限制 [5000000]
Writing to the above trace file is disabled for now on...
Block after image is corrupt:
buffer tsn: 6 rdba: 0x002cffff (0/2949119)
scn: 0x340d.8f0e2302 seq: 0xc3 flg: 0x26 tail: 0xfc760601
frmt: 0x07 chkval: 0x4243 type: 0xff=unknown
Hex dump of corrupt header 4 = CORRUPT
Opcodes *.*
DBAs (file#, block#):
(34, 632163) .
SCNs: scn: 0x0000.00000000 thru scn: 0xffff.ffffffff
**NOTE: Only Dumping Redo less then 12 hours**
Times: 10/28/2020 22:11:12 thru eternity
Initial buffer sizes: read 1024K, overflow 832K, change 805K
UPDATE TRDATADETAIL4304 SET FN_95 = null , FN_96 = null , FN_97 = null , FN_98 = null , FN_99 = null , FN_100 = null , FN_101 = null , FN_102 = null , FN_103 = null , FN_104 = null , FN_105 = null , FN_106 = null , FN_107 = null , FN_108 = null , FN_109 = null , FN_110 = null , FN_111 = null , FN_112 = null , FN_113 = null , FN_114 = null , FN_115 = null , FN_116 = null , FN_117 = null , FN_118 = null , FN_119 = null , FN_120 = null , FN_121 = null , FN_122 = null , FN_123 = null , FN_124 = null , FN_125 =
2 、 Thu Oct 29 11:41:20 无法进行检查点,所有的redo 都为active ,redo 无法切换, 直到Thu Oct 29 15:54:14 重启之后才能正常归档,业务正常进行。
Thu Oct 29 11:41:20 2020
Thread 2 cannot allocate new log, sequence 9655
Checkpoint not complete 3 、运行一段时间后,业务第二次反馈问题,检查发现无法进行检查点,所有的redo 都为active ,redo 无法切换。
Thu Oct 29 17:13:58 2020
Thread 2 cannot allocate new log, sequence 9678
Checkpoint not complete 4 、Thu Oct 29 19:14:21 节点2 shutdown abort ,开始人工处理,检查出坏块,并修复。
Some indexes or index [sub]partitions of table FMIS0400.TRDATADETAIL5733_1 have been marked unusable
Some indexes or index [sub]partitions of table FMIS0400.TRDATADETAIL4304 have been marked unusable
Some indexes or index [sub]partitions of table FMIS0400.TRDATADETAIL3815_2 have been marked unusable
5 、Thu Oct 29 20:22:06 处理完成,启动二节点,后续无报错。
问题综述
10-12 号有多次内存坏块报错,并没有影响到业务。29 号内存坏块,导致无法正常写入数据文件,无法进行检查点,redo 无法循环复写,数据库最终夯死, 重启后恢复,由于逻辑坏块没有修复,当天第二次触发,再次导致数据库hang ,修复完成后,后台无报错,恢复正常。
后续处理方案
1、 使用dbv工具检查全库,看是否还有坏块,排除风险。
2、 继续观察几天,如果有报错及时处理,收集诊断信息(见参考)。
3、 建议应用最新的补丁,规避潜在风险的BUG
4、 定期备份,并验证备份的有效性
5 、定期巡检,及时发现问题。
参考
1 、校验数据库坏块
--rman 校验坏块
# 数据库层面
VALIDATE DATABASE;
backup validate check logical database ;
# 数据文件层面
VALIDATE DATAFILE #file_id;
# 数据块层面
VALIDATE DATAFILE 1 BLOCK 10;
--dbv 校验坏块
# 校验数据文件
dbv userid={system/<password>} file={full path filename} logfile={output filename}
# 校验具体的表段和分区段等
dbv userid=system/manager SEGMENT_ID=2.5.37767
select t.ts#, s.header_file, s.header_block
from v$tablespace t, dba_segments s
where s.segment_name='TAB1'
and t.name = s.tablespace_name;
2 、查询存在的坏块
-- 抽取损坏的数据文件段区块信息(快速查询)
create table corrupt_extents as
select *
from dba_extents
where file_id in (select file# from v$database_block_corruption);
create table corrupt_free_space
select *
from dba_free_space
where file_id in (select file# from v$database_block_corruption);
-- 查找出存在坏块的对象信息和块信息
set lines 200 pages 10000
col segment_name format a30
SELECT e.owner, e.segment_type, e.segment_name, e.partition_name, c.file#
, greatest(e.block_id, c.block#) corr_start_block#
, least(e.block_id+e.blocks-1, c.block#+c.blocks-1) corr_end_block#
, least(e.block_id+e.blocks-1, c.block#+c.blocks-1)
- greatest(e.block_id, c.block#) + 1 blocks_corrupted
, null description
FROM corrupt_extents e, v$database_block_corruption c
WHERE e.file_id = c.file#
AND e.block_id <= c.block# + c.blocks - 1
AND e.block_id + e.blocks - 1 >= c.block#
UNION
SELECT s.owner, s.segment_type, s.segment_name, s.partition_name, c.file#
, header_block corr_start_block#
, header_block corr_end_block#
, 1 blocks_corrupted
, 'Segment Header' description
FROM dba_segments s, v$database_block_corruption c
WHERE s.header_file = c.file#
AND s.header_block between c.block# and c.block# + c.blocks - 1
UNION
SELECT null owner, null segment_type, null segment_name, null partition_name, c.file#
, greatest(f.block_id, c.block#) corr_start_block#
, least(f.block_id+f.blocks-1, c.block#+c.blocks-1) corr_end_block#
, least(f.block_id+f.blocks-1, c.block#+c.blocks-1)
- greatest(f.block_id, c.block#) + 1 blocks_corrupted
, 'Free Block' description
FROM corrupt_free_space f, v$database_block_corruption c
WHERE f.file_id = c.file#
AND f.block_id <= c.block# + c.blocks - 1
AND f.block_id + f.blocks - 1 >= c.block#
ORDER BY file#, corr_start_block#;
3 、校验相关的坏块和段信息
-- 临时段再次校验
-- 表段分区段校验
#dbv 工具校验具体
select t.ts#, s.header_file, s.header_block
from v$tablespace t, dba_segments s
where s.segment_name='TAB1'
and t.name = s.tablespace_name;
dbv userid=system/manager SEGMENT_ID=2.5.37767
# 针对表段或者分区段的单块进行检查
select segment_name, segment_type, owner
from corrupt_extents
where file_id = <Absolute file number>
and <corrupted block number> between block_id
and block_id + blocks -1;
# 通过语句查询统计具体的表段或者分区段
Select count(*) from biao;
10235 事件跟踪SGA 信息 *** WARNING *********************************************************** *** This event should only EVER be set at the request of Oracle Support. *** It can impact performance on most types of system. *** Level 2 and above can impact latch contention. *** Level 3 and above can have a *SEVERE* impact on performance. *** Level 65536 is special and has minimal performance impact *** but has a different purpose to other levels.
ALTER SYSTEM SET EVENTS '10235 trace name context forever, level 65536';
ALTER SYSTEM SET EVENTS '10235 trace name context off';
参考文档:
Causes and Solutions for ora-600 [kdsgrp1] (Doc ID 1332252.1)
Troubleshooting Assistant: Oracle Database Error ORA-600 [kdsgrp1] (Interactive Guide to Resolving Your Error) (Doc ID 1569722.2)
Important Customer information about using Numeric Events (Doc ID 75713.1)
