客户产品数据库上午出现了严重的性能问题,简单记录一下问题的诊断和解决过程。
可以看到,数据库的DB TIME已经涨到了非常高的地步,这说明系统正经受着非常严重的性能问题:
Snap Id |
Snap Time |
Sessions |
Cursors/Session |
|
Begin Snap: |
7170 |
31-3月 -12 08:00:50 |
395 |
10.5 |
End Snap: |
7172 |
31-3月 -12 09:00:27 |
689 |
11.7 |
Elapsed: |
59.61 (mins) |
|||
DB Time: |
17,270.93 (mins) |
从TOP 5看,等待事件中最明显的是latch: cache buffers chains,从当前系统的状态也可以看到这一点:
SQL> SELECT EVENT, COUNT(*) FROM GV$SESSION GROUP BY EVENT HAVING COUNT(*) > 5 ORDER BY 2 DESC; EVENT COUNT(*) ---------------------------------------------------------------- ---------- latch: cache buffers chains 1043 SQL*Net message FROM client 500 SQL*Net message TO client 35 rdbms ipc message 29 gc cr request 25 latch free 13 gcs remote message 10 gc buffer busy 7 8 ROWS selected. |
系统中部分SQL的执行时间已经长达几个小时:
SQL ordered by Elapsed Time
Elapsed Time (s) |
CPU Time (s) |
Executions |
Elap per Exec (s) |
% Total DB Time |
SQL Id |
SQL Module |
SQL Text |
341,821 |
11,917 |
112 |
3051.97 |
32.99 |
|
JDBC Thin Client |
SELECT count(*) total FROM ( s… |
244,752 |
3,380 |
147 |
1664.98 |
23.62 |
|
JDBC Thin Client |
SELECT count(*) total FROM ( s… |
52,419 |
11,331 |
3 |
17472.91 |
5.06 |
|
JDBC Thin Client |
select cf.cardid, (CASE WHEN c… |
38,767 |
532 |
11 |
3524.24 |
3.74 |
|
JDBC Thin Client |
SELECT count(*) total FROM ( s… |
37,146 |
2,524 |
2 |
18573.15 |
3.58 |
JDBC Thin Client |
SELECT count(*) total FROM ( s… |
|
30,796 |
2,331 |
5 |
6159.20 |
2.97 |
|
JDBC Thin Client |
SELECT count(*) total FROM ( s… |
29,991 |
2,506 |
1 |
29991.41 |
2.89 |
|
JDBC Thin Client |
SELECT count(*) total FROM ( s… |
24,762 |
875 |
3 |
8254.06 |
2.39 |
|
JDBC Thin Client |
SELECT count(*) total FROM (SE… |
17,845 |
699 |
2 |
8922.50 |
1.72 |
|
JDBC Thin Client |
SELECT count(*) total FROM (SE… |
17,470 |
1,295 |
0 |
1.69 |
|
JDBC Thin Client |
select * from (select aa.*, ro… |
其实根据这些现象,基本上可以判断问题了。肯定是SQL执行计划的改变导致了当前的性能问题。而类似绑定变量窥探之类的问题只会影响个别的SQL,而这么大面积的执行计划的改变,几乎可以断定是统计信息造成的。
询问了一下客户最近的操作,原来昨天夜里通过数据库链的方式导入了一部分数据。而今天一早问题就出现了。
其实问题已经很明显了,在通过数据库链加载数据后,并没有重新收集统计信息,且由于加载时间是在半夜,这批数据也没有赶上每天22:00的统计信息自动收集的工作。这就使得Oracle在生成执行计划时,会依赖数据加载前的统计信息,从而造成了错误的执行计划。
首先解决问题的根源问题,对加载过数据的表重新收集统计:
SQL> SELECT 'EXEC DBMS_STATS.GATHER_TABLE_STATS(''USER1'', ''' || TABLE_NAME || ''', CASCADE => TRUE)' FROM DBA_TABLES WHERE OWNER = 'USER1' AND LAST_ANALYZED > TRUNC(SYSDATE); 'EXECDBMS_STATS.GATHER_TABLE_STATS(''USER1'','''||TABLE_NAME||''',CASCADE=>TRUE)' ----------------------------------------------------------------------------------------- EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', 'TABLE_1', CASCADE => TRUE) EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', 'TABLE_2', CASCADE => TRUE) . . . EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', 'TABLE_3', CASCADE => TRUE) EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', 'TABLE_4', CASCADE => TRUE) 12 ROWS selected. SQL> EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', ' TABLE_1', CASCADE => TRUE) PL/SQL PROCEDURE successfully completed. SQL> EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', ' TABLE_2', CASCADE => TRUE) PL/SQL PROCEDURE successfully completed. . . . SQL> EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', ' TABLE_3', CASCADE => TRUE) PL/SQL PROCEDURE successfully completed. SQL> EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', ' TABLE_4', CASCADE => TRUE) PL/SQL PROCEDURE successfully completed. |
虽然问题的根源已经被解决,但是当前运行的SQL并不会改变执行计划,因此还需要找到问题的SQL,从后台将其结束:
SQL> SELECT 'kill -9 ' || spid FROM v$session s, v$process p WHERE s.username = 'USER2' AND s.paddr = p.addr AND event = 'latch: cache buffers chains'; 'KILL-9'||SPID -------------------- KILL -9 28321 KILL -9 25384 KILL -9 23697 KILL -9 7239 . . . KILL -9 9331 KILL -9 13759 740 ROWS selected. SQL> SELECT 'kill -9 ' || spid FROM gv$session s, gv$process p WHERE s.username = 'USER2' AND s.paddr = p.addr AND event = 'latch: cache buffers chains' AND s.inst_id = p.inst_id AND s.inst_id = 2; 'KILL-9'||SPID -------------------- KILL -9 23992 KILL -9 5289 KILL -9 21067 KILL -9 16816 KILL -9 16820 KILL -9 26767 . . . KILL -9 14981 KILL -9 26678 KILL -9 26682 258 ROWS selected. |
分别在两个节点杀掉这些执行计划存在问题的会话,释放被大量占用的系统资源。
由于Oracle的执行计划并非在收集统计信息后马上生效,因此还有个别的SQL仍然沿用错误的执行计划:
SQL> SELECT DISTINCT inst_id, sql_id FROM gv$session WHERE event = 'latch: cache buffers chains'; INST_ID SQL_ID ---------- ------------- 1 39gvg7vbcm8jx 1 a6aqkm30u7p90 SQL> SELECT address, hash_value FROM v$sqlarea WHERE sql_id = 'a6aqkm30u7p90'; ADDRESS HASH_VALUE ---------------- ---------- C000000EB7ED3420 3248739616 SQL> EXEC dbms_shared_pool.purge('C000000EB7ED3420,3248739616','C') PL/SQL PROCEDURE successfully completed. SQL> SELECT address, hash_value FROM v$sqlarea WHERE sql_id ='39gvg7vbcm8jx'; ADDRESS HASH_VALUE ---------------- ---------- C000001037B8E308 3603538493 SQL> EXEC dbms_shared_pool.purge('C000001037B8E308’, ‘3603538493', 'C') PL/SQL PROCEDURE successfully completed. |
由于当前的数据库版本是10.2.0.5,因此可以很方便的使用dbms_shared_pool将执行计划错误的SQL清除出共享池,强制其重新生成执行计划。
SQL> SELECT event, COUNT(*) FROM gv$session WHERE username LIKE != USER GROUP BY EVENT ORDER BY 2 DESC; EVENT COUNT(*) ---------------------------------------------------------------- ---------- SQL*Net message FROM client 370 SQL*Net message TO client 15 gc cr request 10 latch free 4 Streams AQ: waiting FOR messages IN the queue 1 5 ROWS selected. |
数据库中SQL执行计划错误除了导致大量的latch: cache buffers chains等待以外,还存在gc cr request和latch free等这些等待事件,经分析同样是由于错误的执行计划所致。将这些会话采用相同的方法清除后,系统负载恢复到正常范围:
Snap Id |
Snap Time |
Sessions |
Cursors/Session |
|
Begin Snap: |
7188 |
31-3月 -12 14:55:17 |
257 |
20.3 |
End Snap: |
7189 |
31-3月 -12 15:03:07 |
256 |
19.9 |
Elapsed: |
7.84 (mins) |
|||
DB Time: |
70.46 (mins) |