一、环境介绍
Oracle Database for linux , 版本为: 11.2.0.4.0。 操作系统版本 Linux 2.6.32-754.12.1.el6.x86_64 #1 SMP Tue Apr 9 14:52:26 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux。 此节点为虚拟化部署的虚拟机环境,操作系统和存储系统均由运维人员划分控制。
二、现象
近一个月以来告警系统时而不时收到活动会数异常,但下一个采集周期(3分钟)后自动恢复。
三、ASH/AWR分析
3.1 查看告警时间点5分钟内活动会话数量情况
select dbid, instance_number, sample_id, sample_time, count(*) session_count from dba_hist_active_sess_history t where sample_time>=to_timestamp('2022-12-06 10:40:00','yyyy-mm-dd hh24:mi:ss') and sample_time< to_timestamp('2022-12-06 10:45:00','yyyy-mm-dd hh24:mi:ss') group by dbid, instance_number, sample_id, sample_time order by dbid, instance_number, sample_time;
3.2 通过上述 精确时间点查找TOP holder、wait事件
select t.dbid, t.sample_id, t.sample_time, t.instance_number, t.event, t.session_state, t.c session_count from (select t.*, rank() over(partition by dbid, instance_number, sample_time order by c desc) r from (select t.*, count(*) over(partition by dbid, instance_number, sample_time, event) c, row_number() over(partition by dbid, instance_number, sample_time, event order by 1) r1 from dba_hist_active_sess_history t where sample_time >to_timestamp('2022-12-06 10:41:50','yyyy-mm-dd hh24:mi:ss') and sample_time < to_timestamp('2022-12-06 10:41:55','yyyy-mm-dd hh24:mi:ss') ) t where r1 = 1) t where r < 5 order by dbid, instance_number, sample_time, r;
3.3 查看等待链路
select level lv, connect_by_isleaf isleaf, connect_by_iscycle iscycle, t.dbid, t.sample_id, t.sample_time, t.instance_number, t.session_id, t.sql_id, t.session_type, t.event, t.session_state, t.blocking_inst_id, t.blocking_session, t.blocking_session_status from (select * from dba_hist_active_sess_history where sample_time >to_timestamp('2022-12-06 10:41:50','yyyy-mm-dd hh24:mi:ss') and sample_time < to_timestamp('2022-12-06 10:41:55','yyyy-mm-dd hh24:mi:ss') ) t start with blocking_session is not null connect by nocycle prior dbid = dbid and prior sample_time = sample_time and prior blocking_session = session_id and prior blocking_session_serial# = session_serial# order siblings by dbid, sample_time;
3.4 堵塞的根源
select t.lv, t.iscycle, t.dbid, t.sample_id, t.sample_time, t.instance_number, t.session_id, t.sql_id, t.session_type, t.event, t.seq#, t.session_state, t.blocking_inst_id, t.blocking_session, t.blocking_session_status, t.c blocking_session_count from (select t.*, row_number() over(partition by dbid, instance_number, sample_time order by c desc) r from (select t.*, count(*) over(partition by dbid, instance_number, sample_time, session_id) c, row_number() over(partition by dbid, instance_number, sample_time, session_id order by 1) r1 from (select level lv, connect_by_isleaf isleaf, connect_by_iscycle iscycle, t.* from dba_hist_active_sess_history t where sample_time >to_timestamp('2022-12-06 10:41:50','yyyy-mm-dd hh24:mi:ss') and sample_time < to_timestamp('2022-12-06 10:41:55','yyyy-mm-dd hh24:mi:ss') start with blocking_session is not null connect by nocycle prior dbid = dbid and prior blocking_inst_id = instance_number and prior blocking_session = session_id and prior blocking_session_serial# = session_serial#) t where t.isleaf = 1) t where r1 = 1) t where r < 3 order by dbid, sample_time, r;
3.5 结果分析
主要是control file sequential read/log file parallel write两种事件,而且均为后台进程。
初步怀疑 与存储性能抖动相关。
3.6 AWR报告
由于snapshot的间隔默认为 60 分钟,所以利用 awrrpt.sql 和 awrddrpt.sql 获取的AWR 报告,没有特别可参考的内容。主要的等待事件为: log file sync 也没有特别消耗资源的业务SQL。
3.7 部署iostat
命令如下:#
nohup iostat -x 10 > /tmp/iostat_chensh.log
日志格式如下:
通过部署iostat 的日志进行定位,在8006781行, 定位到 %util 最大值为
100.03%,且当前的CPU 的%iowait 达
26%。
3.7.1 计算日志中行数对应的分钟数(每10秒采集一次,每次38行)
select 8006781/6/38 from dual; -- 35117
3.7.2 确认活动会话异常的时间点
根据部署iostat的初始时间为 2022-11-12 01:24:52,结合上一步骤获取的分钟数,准确定位异常的时间点: select date_add(date_format(' 2022-11-12 01:24:52','%Y-%m-%d %H:%i:%s'), interval 35117 minute); -- 2022-12-06 10:41:52 此时间点与Oracle数据字典 v$active_session_history 定位的并发会话数的高峰时间点完全吻合。
3.8 结论
经与虚拟机运维人员确认,虚拟化运维查看了记录,近一个月在Oracle所在vsan集群有多台虚机重装系统,重装后会有数据回写,数据回写会抢占IO。
4 部署建议
虚拟化虽然成为潮流, 但对于高并发的Oracle数据库业务应用,应该独立部署在独立而非共享的存储之上。因而,建议将此数据库虚拟节点关联的存储,升级为 独占型磁盘;此外,在迁移任务完成之前尽量避免涉及虚拟机存储池的过频繁的 I/O。
