[20190506]视图嵌套与绑定变量.txt
--//生产系统上线遇到的问题,5月1日上线,因为放假的缘故使用的人少,问题没有暴露出来,5月5号大爆发,我今天5月6日早上开始介入查看.
--//实际上昨天下午就开始查,因为快下班,同事给的sys秘密不对,无法登录数据库.
--//今天上午拿到登录用户也是普通用户,不过权限可以查看许多dba视图,应该足够.不过不能抽取awr报表.
--//另外使用system用户抽取awr报表马上报ora-03113错误,注视乎是有问题那段时间抽取都报错.
1.环境:
> @ ver1
PORT_STRING VERSION BANNER
------------------------------ -------------- ----------------------------------------------------------------
IBMPC/WIN_NT-8.1.0 10.2.0.3.0 Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - Prod
--//10.2.0.3版本,什么现在上线还使用10g的产品,问了一下,居然还是32位版本,据说windows 2008R2的版本对方安装64版失败.
--//结果使用这个版本.顺便看了一下硬件配置32G内存,64位版本.
2.问题:
--//实际上同事已经给出主要等待事件是出现大量'library cache lock',导致大量用户无法执行sql语句.
--//我开始想也许开发可能修改某个存储过程,导致包失效,看来定位不难.唯独缺点是现在登录已经看不到这个等待事件.
--//只能做事后分析.
SELECT event, COUNT (*)
FROM DBA_HIST_ACTIVE_SESS_HISTORY
WHERE sample_time BETWEEN '2019/5/5 08:00:00' AND '2019/5/5 12:00:00'
GROUP BY event
ORDER BY 2 DESC;
EVENT COUNT(*)
--------------------------- --------
library cache lock 10445
cursor: pin S wait on X 4693
1735
db file scattered read 300
library cache pin 264
read by other session 168
latch: shared pool 149
db file sequential read 74
latch: library cache 24
latch free 3
control file parallel write 2
log file parallel write 1
direct path write 1
log file sync 1
os thread startup 1
--//主要集中在library cache lock和cursor: pin S wait on X.
--//时间放大10倍.
--//10445*10=1066064,1066064/3600 = 296小时.
> @ ev_name 'library cache lock'
old 1: select * from v$event_name where lower(name) like lower('%&&1%')
new 1: select * from v$event_name where lower(name) like lower('%library cache lock%')
EVENT# EVENT_ID NAME PARAMETER1 PARAMETER2 PARAMETER3 WAIT_CLASS_ID WAIT_CLASS# WAIT_CLASS
---------- ---------- ---------------------------------------- -------------------- -------------------- -------------------- ------------- ----------- --------------------
213 2032051689 latch: library cache lock address number tries 3875070507 4 Concurrency
216 916468430 library cache lock handle address lock address 100*mode+namespace 3875070507 4 Concurrency
--//10g居然在library cache lock有一个latch.
SELECT p1,count(*)
FROM DBA_HIST_ACTIVE_SESS_HISTORY
WHERE sample_time BETWEEN '2019/5/5 08:00:00' AND '2019/5/5 12:00:00' and EVENT ='library cache lock'
group by p1
order by 2 desc;
P1 COUNT(*)
---------- ----------
2727911292 7953
2506679444 1216
2729104776 552
2655119196 284
2688420528 263
2222045940 121
2311090624 54
2717330580 1
2734098764 1
> @ 10to16 2727911292
10 to 16 HEX REVERSE16
---------------- -------------------
00000000a2989f7c 0x7c9f98a2-00000000
> select kglnaown "Owner", kglnaobj "Object" from x$kglob where kglhdadr='00000000a2989f7c';
no rows selected
--//已经无法查询到,也就是已经不再共享池,实际上重启数据库多次.
3.分析:
--//看到通过sql_id能否发现问题:
SELECT *
FROM v$sqlstats
WHERE sql_id IN (SELECT DISTINCT sql_id
FROM DBA_HIST_ACTIVE_SESS_HISTORY
WHERE sample_time BETWEEN '2019/5/5 08:00:00'
AND '2019/5/5 12:00:00'
AND EVENT = 'library cache lock'
AND sql_id IS NOT NULL);
--//仅仅看到2条语句:
SELECT COUNT (*)
FROM (SELECT *
FROM scm_port_dept_order
WHERE state = 7
AND store_id = '19'
AND EXISTS
(SELECT b.*
FROM zdp_SCM_ORDER_DEPT b
WHERE b.SOLUTION_ID = 'SCM_ORDER_DEPT'
AND b.UNIT_ID = 'DEPARTMENT_SCM'
AND b.USER_ID = '0000001775'
AND b.OPTION_ITEM_ID =
scm_port_dept_order.dept_id)) t;
--//下面开始出现灾难...我在toad下按ctrl+e看执行计划,toad马上hang在哪里.我尝试几次都是一样.
--//我自己尝试执行1次,再看执行计划就很快,我才想起toad设置被我设置为可以直接看真实的执行计划的模式,参考链接:
http://blog.itpub.net/267265/viewspace-2220688/
--//这个时候我才发现执行计划N复杂,我才发现里面的zdp_SCM_ORDER_DEPT实际上一个异常复杂的视图.(明显命名规则不合理)
> select * from dba_objects where object_name='ZDP_SCM_ORDER_DEPT';
OWNER OBJECT_NAME SUBOBJECT_ OBJECT_ID DATA_OBJECT_ID OBJECT_TYPE CREATED LAST_DDL_TIME TIMESTAMP STATUS T G S
---------- -------------------- ---------- ---------- -------------- ------------------- ------------------- ------------------- ------------------- ------- - - -
xxxxxxxxxx ZDP_SCM_ORDER_DEPT 203756 VIEW 2019-04-18 17:57:42 2019-05-06 15:44:04 2019-05-06:12:47:11 VALID N N N
--//OBJECT_ID=203756.
Select a.object_id, a.object_type, a.object_name,
b.owner ref_owner, b.object_type ref_type, b.object_name ref_name, b.object_id ref_id, b.status ref_status
from sys.DBA_OBJECTS a,
sys.DBA_OBJECTS b,
(Select object_id, referenced_object_id
from (select object_id, referenced_object_id
from public_dependency
where referenced_object_id <> object_id) pd
start with object_id = 203756
connect by prior referenced_object_id = object_id) c
where a.object_id = c.object_id
and b.object_id = c.referenced_object_id
and a.owner not in ('SYS', 'SYSTEM')
and b.owner not in ('SYS', 'SYSTEM')
and a.object_name <> 'DUAL'
and b.object_name <> 'DUAL'
--//注:这条语句我写不出来,我使用toad自带SQL Tracker跟踪在toad的schema browser浏览deps(users)结果我不贴出来了,实际上有511行.
--//实际上的情况是视图里面1堆视图,在看还是一堆视图,一直嵌套7,8层之多才是正是的表,即使最后里面还是有1些是视图.
--//我仅仅贴一个图,仅仅是冰山一角.
--//正是这样的复杂视图,导致做硬分析耗费大量的cpu资源.加上前面的语句都是文字变量,存在大量的硬分析.出现'library cache lock', --//'cursor: pin S wait on X'就不足为怪. --//解决方法很简单执行: alter system set cursor_sharing=force scope=memory; --//alter system set cursor_sharing=force scope=spfile; --//这样一定程度环境减少硬解析.不过这些仅仅是治标不是治本.只要分析表,简直就是悲剧的开始... --//我真心佩服开发,怎么能想出这么复杂的视图.加上大量非绑定变量,导致大量硬解析.而且32位共享池不会很大,这样导致许多语句反复解析. > show sga Total System Global Area 1258291200 bytes Fixed Size 1374076 bytes Variable Size 665454724 bytes Database Buffers 587202560 bytes Redo Buffers 4259840 bytes > select sum( BYTES) from v$sgastat where POOL ='shared pool'; SUM(BYTES) ---------- 629169996 --//600M --//顺便说一下下午已经重启数据库: > select OPEN_TIME from v$thread ; OPEN_TIME ------------------- 2019-05-06 14:37:41 > select sysdate from dual; SYSDATE ------------------- 2019-05-06 16:27:28 SELECT event, COUNT (*) FROM V$ACTIVE_SESSION_HISTORY WHERE sample_time BETWEEN '2019/5/6 14:00:00' AND sysdate GROUP BY event ORDER BY 2 DESC; EVENT COUNT(*) ---------------------------------------- ---------- 6030 db file sequential read 333 cursor: pin S wait on X 181 db file scattered read 131 log file sync 31 log file parallel write 13 control file parallel write 6 SQL*Net more data from client 5 direct path read 4 db file parallel write 2 SQL*Net more data to client 2 control file sequential read 2 direct path write temp 2 latch: cache buffers chains 2 log buffer space 1 library cache lock 1 sort segment request 1 os thread startup 1 direct path write 1 19 rows selected. --//library cache lock 等待事件已经消失. --//很明显这个数据库IO还有问题,cursor: pin S wait on X高也是正常的.扫描一下一大堆sql语句要优化. --//不知道对方如何初始化数据库的,一大堆全表扫描,有一些表仅仅不到1000条记录,占了300M(高水位问题) --//看了这样的项目真心无语..................... 总结: --//建议:估计我讲没用... 1.不要把视图定义搞得这么复杂.改根本行不通...^_^. 2.合理的使用绑定变量. 3.不要安装32位版本,至少选择11.2.0.4 64位版本. 4.重新整理数据导入数据库,里面一些表仅仅几条记录,确占用大量磁盘空间.不知道对方如何初始化数据. 5.优化sql语句. 6.疑问:我很奇怪这样的项目在别的医院如何跑起来的,只能一种解析,良好的硬件就是一块遮羞布,掩盖一切应用的问题.
