核心系统数据库坏块

来源:这里教程网 时间:2026-03-03 18:50:27 作者:

环境介绍:

Oracle 版本:11.2.0.4.191015 ,操作系统:Redhat linux server 7.5 ,系统架构:两节点RAC

故障现象

业务层反馈数据库无法正常写入,无法切换归档日志。  

问题分析

1 Mon Oct 12 12:26:02 开始多次报错,内存坏块,持续到13  

Mon Oct 12 12:26:02 2020

Errors in file /opt/oracle/diag/rdbms/cwgkuap/cwgkuap2/trace/cwgkuap2_ora_183325.trc  (incident=314289):

ORA-00600: 内部错误代码, 参数: [kdsgrp1], [], [], [], [], [], [], [], [], [], [], []

Incident details in: /opt/oracle/diag/rdbms/cwgkuap/cwgkuap2/incident/incdir_314289/cwgkuap2_ora_183325_i314289.trc

Non critical error ORA-48913 caught while writing to trace file "/opt/oracle/diag/rdbms/cwgkuap/cwgkuap2/incident/incdir_314289/cwgkuap2_ora_183325_i314289.trc"

Error message: ORA-48913: 写入跟踪文件失败, 达到了文件大小限制 [5000000]

Writing to the above trace file is disabled for now on...

Mon Oct 12 12:26:08 2020

 

Tue Oct 13 08:53:38 2020

Errors in file /opt/oracle/diag/rdbms/cwgkuap/cwgkuap2/trace/cwgkuap2_ora_162939.trc  (incident=314046):

ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], [], [], [], [], []

Use ADRCI or Support Workbench to package the incident.

See Note 411.1 at My Oracle Support for error and packaging details. 2 Thu Oct 29 10:11:06 报错,  (34, 632163) corruption ,触发SQL 是一个更新操作,后续一直报错,并伴随blockrecover

Thu Oct 29 10:11:06 2020

Errors in file /opt/oracle/diag/rdbms/cwgkuap/cwgkuap2/trace/cwgkuap2_ora_368867.trc  (incident=313977):

ORA-00600: 内部错误代码, 参数: [17182], [0x7F780EE5EF08], [], [], [], [], [], [], [], [], [], []

Incident details in: /opt/oracle/diag/rdbms/cwgkuap/cwgkuap2/incident/incdir_313977/cwgkuap2_ora_368867_i313977.trc

Use ADRCI or Support Workbench to package the incident.

See Note 411.1 at My Oracle Support for error and packaging details.

Non critical error ORA-48913 caught while writing to trace file "/opt/oracle/diag/rdbms/cwgkuap/cwgkuap2/incident/incdir_313977/cwgkuap2_ora_368867_i313977.trc"

Error message: ORA-48913: 写入跟踪文件失败, 达到了文件大小限制 [5000000]

Writing to the above trace file is disabled for now on...

 

 

Block after image is corrupt:

buffer tsn: 6 rdba: 0x002cffff (0/2949119)

scn: 0x340d.8f0e2302 seq: 0xc3 flg: 0x26 tail: 0xfc760601

frmt: 0x07 chkval: 0x4243 type: 0xff=unknown

Hex dump of corrupt header 4 = CORRUPT

 

Opcodes *.*

 DBAs (file#, block#):

 (34, 632163) .

 SCNs: scn: 0x0000.00000000 thru scn: 0xffff.ffffffff

 **NOTE: Only Dumping Redo less then 12 hours**

 Times: 10/28/2020 22:11:12 thru eternity

Initial buffer sizes: read 1024K, overflow 832K, change 805K

 

        UPDATE TRDATADETAIL4304 SET FN_95 = null , FN_96 = null , FN_97 = null , FN_98 = null , FN_99 = null , FN_100 = null , FN_101 = null , FN_102 = null , FN_103 = null , FN_104 = null , FN_105 = null , FN_106 = null , FN_107 = null , FN_108 = null , FN_109 = null , FN_110 = null , FN_111 = null , FN_112 = null , FN_113 = null , FN_114 = null , FN_115 = null , FN_116 = null , FN_117 = null , FN_118 = null , FN_119 = null , FN_120 = null , FN_121 = null , FN_122 = null , FN_123 = null , FN_124 = null , FN_125 = 

  2 Thu Oct 29 11:41:20 无法进行检查点,所有的redo 都为active redo 无法切换, 直到Thu Oct 29 15:54:14 重启之后才能正常归档,业务正常进行。 

Thu Oct 29 11:41:20 2020

Thread 2 cannot allocate new log, sequence 9655

Checkpoint not complete 3 、运行一段时间后,业务第二次反馈问题,检查发现无法进行检查点,所有的redo 都为active redo 无法切换。 

Thu Oct 29 17:13:58 2020

Thread 2 cannot allocate new log, sequence 9678

Checkpoint not complete 4 Thu Oct 29 19:14:21 节点2 shutdown abort ,开始人工处理,检查出坏块,并修复。   

Some indexes or index [sub]partitions of table FMIS0400.TRDATADETAIL5733_1 have been marked unusable

Some indexes or index [sub]partitions of table FMIS0400.TRDATADETAIL4304 have been marked unusable

Some indexes or index [sub]partitions of table FMIS0400.TRDATADETAIL3815_2 have been marked unusable

 

  5 Thu Oct 29 20:22:06 处理完成,启动二节点,后续无报错。  

问题综述

10-12 号有多次内存坏块报错,并没有影响到业务。29 号内存坏块,导致无法正常写入数据文件,无法进行检查点,redo 无法循环复写,数据库最终夯死, 重启后恢复,由于逻辑坏块没有修复,当天第二次触发,再次导致数据库hang ,修复完成后,后台无报错,恢复正常。

后续处理方案

1、 使用dbv工具检查全库,看是否还有坏块,排除风险。

2、 继续观察几天,如果有报错及时处理,收集诊断信息(见参考)。

3、 建议应用最新的补丁,规避潜在风险的BUG

4、 定期备份,并验证备份的有效性

5 、定期巡检,及时发现问题。

参考

1 、校验数据库坏块

--rman 校验坏块

# 数据库层面

VALIDATE DATABASE;

backup validate check logical database ;

# 数据文件层面

VALIDATE DATAFILE #file_id;

# 数据块层面

VALIDATE DATAFILE 1 BLOCK 10;

 

--dbv 校验坏块

# 校验数据文件

dbv userid={system/<password>} file={full path filename} logfile={output filename}

# 校验具体的表段和分区段等

 

 

dbv userid=system/manager SEGMENT_ID=2.5.37767

 

select t.ts#, s.header_file, s.header_block

    from v$tablespace t, dba_segments s

    where s.segment_name='TAB1'

and t.name = s.tablespace_name;

 2 、查询存在的坏块

-- 抽取损坏的数据文件段区块信息(快速查询)

create table corrupt_extents as

select *

  from dba_extents

 where file_id in (select file# from v$database_block_corruption);

 

create table corrupt_free_space

select *

  from dba_free_space

 where file_id in (select file# from v$database_block_corruption);

 

-- 查找出存在坏块的对象信息和块信息

 

set lines 200 pages 10000

col segment_name format a30

 

SELECT e.owner, e.segment_type, e.segment_name, e.partition_name, c.file#

, greatest(e.block_id, c.block#) corr_start_block#

, least(e.block_id+e.blocks-1, c.block#+c.blocks-1) corr_end_block#

, least(e.block_id+e.blocks-1, c.block#+c.blocks-1)

- greatest(e.block_id, c.block#) + 1 blocks_corrupted

, null description

FROM corrupt_extents e, v$database_block_corruption c

WHERE e.file_id = c.file#

AND e.block_id <= c.block# + c.blocks - 1

AND e.block_id + e.blocks - 1 >= c.block#

UNION

SELECT s.owner, s.segment_type, s.segment_name, s.partition_name, c.file#

, header_block corr_start_block#

, header_block corr_end_block#

, 1 blocks_corrupted

, 'Segment Header' description

FROM dba_segments s, v$database_block_corruption c

WHERE s.header_file = c.file#

AND s.header_block between c.block# and c.block# + c.blocks - 1

UNION

SELECT null owner, null segment_type, null segment_name, null partition_name, c.file#

, greatest(f.block_id, c.block#) corr_start_block#

, least(f.block_id+f.blocks-1, c.block#+c.blocks-1) corr_end_block#

, least(f.block_id+f.blocks-1, c.block#+c.blocks-1)

- greatest(f.block_id, c.block#) + 1 blocks_corrupted

, 'Free Block' description

FROM corrupt_free_space f, v$database_block_corruption c

WHERE f.file_id = c.file#

AND f.block_id <= c.block# + c.blocks - 1

AND f.block_id + f.blocks - 1 >= c.block#

ORDER BY file#, corr_start_block#;

 

 

 3 、校验相关的坏块和段信息 

-- 临时段再次校验

 

 

 

-- 表段分区段校验

#dbv 工具校验具体

select t.ts#, s.header_file, s.header_block

    from v$tablespace t, dba_segments s

    where s.segment_name='TAB1'

and t.name = s.tablespace_name;

 

dbv userid=system/manager SEGMENT_ID=2.5.37767

 

# 针对表段或者分区段的单块进行检查

select segment_name, segment_type, owner

       from corrupt_extents

      where file_id = <Absolute file number>

        and <corrupted block number> between block_id

            and block_id + blocks -1;

 

# 通过语句查询统计具体的表段或者分区段

 

Select count(*) from biao;

   10235 事件跟踪SGA 信息 *** WARNING *********************************************************** *** This event should only EVER be set at the request of Oracle Support. *** It can impact performance on most types of system. *** Level 2 and above can impact latch contention. *** Level 3 and above can have a *SEVERE* impact on performance. *** Level 65536 is special and has minimal performance impact *** but has a different purpose to other levels.

 

ALTER SYSTEM SET EVENTS '10235 trace name context forever, level 65536';

ALTER SYSTEM SET EVENTS '10235 trace name context off';

参考文档:

 

Causes and Solutions for ora-600 [kdsgrp1] (Doc ID 1332252.1)

Troubleshooting Assistant: Oracle Database Error ORA-600 [kdsgrp1] (Interactive Guide to Resolving Your Error) (Doc ID 1569722.2)

Important Customer information about using Numeric Events (Doc ID 75713.1)

 

相关推荐