客户数据库出现大量cache buffer chains latch

客户产品数据库上午出现了严重的性能问题,简单记录一下问题的诊断和解决过程。

可以看到,数据库的DB TIME已经涨到了非常高的地步,这说明系统正经受着非常严重的性能问题:

Snap Id

Snap Time

Sessions

Cursors/Session

Begin Snap:

7170

31-3月 -12 08:00:50

395

10.5

End Snap:

7172

31-3月 -12 09:00:27

689

11.7

Elapsed:

59.61 (mins)

DB Time:

17,270.93 (mins)

 

从TOP 5看,等待事件中最明显的是latch: cache buffers chains,从当前系统的状态也可以看到这一点:

SQL> SELECT EVENT, COUNT(*) FROM GV$SESSION GROUP BY EVENT HAVING COUNT(*) > 5 ORDER BY 2 DESC;
 
EVENT                                                              COUNT(*)
---------------------------------------------------------------- ----------
latch: cache buffers chains                                            1043
SQL*Net message FROM client                                             500
SQL*Net message TO client                                                35
rdbms ipc message                                                        29
gc cr request                                                            25
latch free                                                               13
gcs remote message                                                       10
gc buffer busy                                                            7
 
8 ROWS selected.

系统中部分SQL的执行时间已经长达几个小时:

SQL ordered by Elapsed Time

Elapsed Time (s)

CPU Time (s)

Executions

Elap per Exec   (s)

% Total DB Time

SQL Id

SQL Module

SQL Text

341,821

11,917

112

3051.97

32.99

JDBC Thin Client

SELECT count(*)   total FROM ( s…

244,752

3,380

147

1664.98

23.62

JDBC Thin Client

SELECT count(*)   total FROM ( s…

52,419

11,331

3

17472.91

5.06

JDBC Thin Client

select   cf.cardid, (CASE WHEN c…

38,767

532

11

3524.24

3.74

JDBC Thin Client

SELECT count(*)   total FROM ( s…

37,146

2,524

2

18573.15

3.58

JDBC Thin Client

SELECT count(*)   total FROM ( s…

30,796

2,331

5

6159.20

2.97

JDBC Thin Client

SELECT count(*)   total FROM ( s…

29,991

2,506

1

29991.41

2.89

JDBC Thin Client

SELECT count(*)   total FROM ( s…

24,762

875

3

8254.06

2.39

JDBC Thin Client

SELECT count(*)   total FROM (SE…

17,845

699

2

8922.50

1.72

JDBC Thin Client

SELECT count(*)   total FROM (SE…

17,470

1,295

0

1.69

JDBC Thin Client

select * from   (select aa.*, ro…

 

 

其实根据这些现象,基本上可以判断问题了。肯定是SQL执行计划的改变导致了当前的性能问题。而类似绑定变量窥探之类的问题只会影响个别的SQL,而这么大面积的执行计划的改变,几乎可以断定是统计信息造成的。

询问了一下客户最近的操作,原来昨天夜里通过数据库链的方式导入了一部分数据。而今天一早问题就出现了。

其实问题已经很明显了,在通过数据库链加载数据后,并没有重新收集统计信息,且由于加载时间是在半夜,这批数据也没有赶上每天22:00的统计信息自动收集的工作。这就使得Oracle在生成执行计划时,会依赖数据加载前的统计信息,从而造成了错误的执行计划。

首先解决问题的根源问题,对加载过数据的表重新收集统计:

SQL> SELECT 'EXEC DBMS_STATS.GATHER_TABLE_STATS(''USER1'', ''' || TABLE_NAME || ''', CASCADE => TRUE)' FROM DBA_TABLES WHERE OWNER = 'USER1' AND LAST_ANALYZED > TRUNC(SYSDATE);
 
'EXECDBMS_STATS.GATHER_TABLE_STATS(''USER1'','''||TABLE_NAME||''',CASCADE=>TRUE)'
-----------------------------------------------------------------------------------------
EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', 'TABLE_1', CASCADE => TRUE)
EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', 'TABLE_2', CASCADE => TRUE)
.
.
.
EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', 'TABLE_3', CASCADE => TRUE)
EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', 'TABLE_4', CASCADE => TRUE)
 
12 ROWS selected.
 
SQL> EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', ' TABLE_1', CASCADE => TRUE)
 
PL/SQL PROCEDURE successfully completed.
 
SQL> EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', ' TABLE_2', CASCADE => TRUE)
 
PL/SQL PROCEDURE successfully completed.
 
.
.
.
 
SQL> EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', ' TABLE_3', CASCADE => TRUE)
 
PL/SQL PROCEDURE successfully completed.
 
SQL> EXEC DBMS_STATS.GATHER_TABLE_STATS('USER1', ' TABLE_4', CASCADE => TRUE)
 
PL/SQL PROCEDURE successfully completed.

虽然问题的根源已经被解决,但是当前运行的SQL并不会改变执行计划,因此还需要找到问题的SQL,从后台将其结束:

SQL> SELECT 'kill -9 ' || spid FROM v$session s, v$process p WHERE s.username = 'USER2' AND s.paddr = p.addr AND event = 'latch: cache buffers chains';
 
'KILL-9'||SPID
--------------------
KILL -9 28321
KILL -9 25384
KILL -9 23697
KILL -9 7239
.
.
.
KILL -9 9331
KILL -9 13759
 
740 ROWS selected.
 
SQL> SELECT 'kill -9 ' || spid FROM gv$session s, gv$process p WHERE s.username = 'USER2' AND s.paddr = p.addr AND event = 'latch: cache buffers chains' AND s.inst_id = p.inst_id AND s.inst_id = 2;
 
'KILL-9'||SPID
--------------------
KILL -9 23992
KILL -9 5289
KILL -9 21067
KILL -9 16816
KILL -9 16820
KILL -9 26767
.
.
.
KILL -9 14981
KILL -9 26678
KILL -9 26682
 
258 ROWS selected.

分别在两个节点杀掉这些执行计划存在问题的会话,释放被大量占用的系统资源。

由于Oracle的执行计划并非在收集统计信息后马上生效,因此还有个别的SQL仍然沿用错误的执行计划:

SQL> SELECT DISTINCT inst_id, sql_id FROM gv$session WHERE event = 'latch: cache buffers chains';
 
INST_ID SQL_ID
---------- -------------
1 39gvg7vbcm8jx
1 a6aqkm30u7p90
 
SQL> SELECT address, hash_value FROM v$sqlarea WHERE sql_id = 'a6aqkm30u7p90';
 
ADDRESS HASH_VALUE
---------------- ----------
C000000EB7ED3420 3248739616
 
SQL> EXEC dbms_shared_pool.purge('C000000EB7ED3420,3248739616','C')
 
PL/SQL PROCEDURE successfully completed.
 
SQL> SELECT address, hash_value FROM v$sqlarea WHERE sql_id ='39gvg7vbcm8jx';
 
ADDRESS HASH_VALUE
---------------- ----------
C000001037B8E308 3603538493
 
SQL> EXEC dbms_shared_pool.purge('C000001037B8E308’, ‘3603538493', 'C')
 
PL/SQL PROCEDURE successfully completed.

由于当前的数据库版本是10.2.0.5,因此可以很方便的使用dbms_shared_pool将执行计划错误的SQL清除出共享池,强制其重新生成执行计划。

SQL> SELECT event, COUNT(*) FROM gv$session WHERE username LIKE != USER GROUP BY EVENT ORDER BY 2 DESC;
 
EVENT                                                              COUNT(*)
---------------------------------------------------------------- ----------
SQL*Net message FROM client                                             370
SQL*Net message TO client                                                15
gc cr request                                                            10
latch free                                                                4
Streams AQ: waiting FOR messages IN the queue                             1
 
5 ROWS selected.

数据库中SQL执行计划错误除了导致大量的latch: cache buffers chains等待以外,还存在gc cr request和latch free等这些等待事件,经分析同样是由于错误的执行计划所致。将这些会话采用相同的方法清除后,系统负载恢复到正常范围:

Snap Id

Snap Time

Sessions

Cursors/Session

Begin Snap:

7188

31-3月 -12 14:55:17

257

20.3

End Snap:

7189

31-3月 -12 15:03:07

256

19.9

Elapsed:

7.84 (mins)

DB Time:

70.46 (mins)

 

This entry was posted in ORACLE and tagged , , , . Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *