Global Transaction Lock Wait Analysis

A customer's database ran into a severe performance problem: the application saw widespread timeouts and a surge in sessions. Repeated attempts to kill sessions could not fully resolve the issue, and the system only returned to normal after a restart.

From the AWR report for the time of the failure, the database's main waits during the problem window were Global transaction acquire instance locks and enq: TX - row lock contention.

 

Event                                       Waits Time(s) Avg wait (ms) % DB time Wait Class
Global transaction acquire instance locks   5,342   5,343          1000     74.09 Configuration
enq: TX - row lock contention                    5   1,437        287308     19.92 Application
DB CPU                                                 331                   4.59
direct path read                            37,708      72             2      0.99 User I/O
log file sync                                7,817      12             2      0.16 Commit

 

The TX - row lock contention wait is very common, and it is easy to understand how it blocks an application. Global transaction acquire instance locks, however, is not a common wait; read literally, it means a global transaction is trying to acquire instance locks. In terms of time, this wait consumed nearly 75% of DB time.
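
For a system that is still up, a quick way to see which sessions are currently stuck on these two events is to query gv$session directly. The following is a minimal sketch (run with DBA privileges; the event names are taken from the AWR report above):

select inst_id, sid, serial#, event, blocking_session, seconds_in_wait
  from gv$session
 where event in ('Global transaction acquire instance locks',
                 'enq: TX - row lock contention')
 order by event, seconds_in_wait desc;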

Of course, the most severe wait in the Top 5 is not necessarily the root cause of the problem.

Looking at the ASH data for the problem window, the Global transaction acquire instance locks waits appeared first, and the row lock waits only started afterwards:

 

SQL> select to_char(sample_time, 'hh24miss'), session_id, event, blocking_session
  2    from dba_hist_active_sess_history
  3   where sample_time >= to_date('201810180652', 'yyyymmddhh24mi')
  4     and sample_time <= to_date('201810180700', 'yyyymmddhh24mi')
  5     and instance_number = 1
  6   order by 1, 2;

 

TO_CHA SESSION_ID EVENT                                      BLOCKING_SESSION
------ ---------- ------------------------------------------ ----------------
065204         69 Global transaction acquire instance locks
065214         69 Global transaction acquire instance locks
065224         69 Global transaction acquire instance locks
065224        535
065234         69 Global transaction acquire instance locks
065234        232
065234        535
065234        763 Global transaction acquire instance locks
065244         69 Global transaction acquire instance locks
065244        535 direct path read
065244        695 enq: TX - row lock contention                           763
065244        763 Global transaction acquire instance locks
065254         69 Global transaction acquire instance locks
065254        535
065254        695 enq: TX - row lock contention                           763
065254        763 Global transaction acquire instance locks
065304        136 Global transaction acquire instance locks
065304        695 enq: TX - row lock contention                           763
065304        763 Global transaction acquire instance locks
065314        136 Global transaction acquire instance locks
065314        695 enq: TX - row lock contention                           763
065314        763 Global transaction acquire instance locks
065324         69 Global transaction acquire instance locks
065324        136 Global transaction acquire instance locks
065324        695 enq: TX - row lock contention                           763
065324        763 Global transaction acquire instance locks
065334         69 Global transaction acquire instance locks
065334        136 Global transaction acquire instance locks
065334        695 enq: TX - row lock contention                           763
065344         69 Global transaction acquire instance locks
065344        136 Global transaction acquire instance locks
065344        434 enq: TX - row lock contention                           763
065344        695 enq: TX - row lock contention                           763
065354         69 Global transaction acquire instance locks
065354        136 Global transaction acquire instance locks
065354        434 enq: TX - row lock contention                           763
065354        695 enq: TX - row lock contention                           763
065404         69 Global transaction acquire instance locks
065404        136 Global transaction acquire instance locks
065404        434 enq: TX - row lock contention                           763
065404        695 enq: TX - row lock contention                           763
.
.
.
065944        302 enq: TX - row lock contention                           763
065944        336 enq: TX - row lock contention                           763
065944        434 enq: TX - row lock contention                           763
065944        695 enq: TX - row lock contention                           763
065954         71 Global transaction acquire instance locks
065954        302 enq: TX - row lock contention                           763
065954        336 enq: TX - row lock contention                           763
065954        434 enq: TX - row lock contention                           763
065954        695 enq: TX - row lock contention                           763

216 rows selected.

 

The first session to run into trouble was session 763, which was waiting on Global transaction acquire instance locks and was not shown as blocked by any other session. Row lock waits appeared afterwards, and every session waiting on enq: TX - row lock contention was blocked by session 763.

Clearly, although what ultimately hung the business on a large scale in this incident were the enq: TX - row lock contention waits, the real root of the problem was the Global transaction acquire instance locks wait.
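
To double-check that session 763 really is the head of the blocking chain, the same ASH data can be aggregated by blocker. A minimal sketch over the same time window and instance as the query above:

select blocking_session, event, count(*) as samples
  from dba_hist_active_sess_history
 where sample_time >= to_date('201810180652', 'yyyymmddhh24mi')
   and sample_time <= to_date('201810180700', 'yyyymmddhh24mi')
   and instance_number = 1
   and blocking_session is not null
 group by blocking_session, event
 order by samples desc;

If session 763 dominates the output, the blocking chain confirms the reading above.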

Almost at the same time, the alert log started reporting a large number of errors:

 

Thu Oct 18 06:53:33 2018
opiodr aborting process unknown ospid (26428) as a result of ORA-24756
Thu Oct 18 06:53:36 2018
Errors in file /oracle/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_pmon_30688.trc:
ORA-24756: transaction does not exist
Errors in file /oracle/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_pmon_30688.trc:
ORA-24756: transaction does not exist
Errors in file /oracle/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_pmon_30688.trc:
ORA-24756: transaction does not exist
Errors in file /oracle/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_pmon_30688.trc:
ORA-24756: transaction does not exist

 

The waits in the database first appeared at 06:52, and since a wait goes through a timeout before the error is raised, the waits in the database and the ORA-24756 errors in the alert log are closely related.

Searching MOS with ORA-24756 as the keyword turns up an article that matches the current symptoms very closely: PMON cleanup fails with ORA-24756: transaction does not exist (Doc ID 2299927.1).

The errors described in that document match the current problem, and the database version is close as well. Oracle's view is that the error occurs when the PMON process gets stuck cleaning up a transaction that no longer exists. Oracle lists several similar bugs, but they are all either closed as not reproducible or suspended, so there is no clear conclusion for this kind of problem:

Bug 20676168 - PMON GOT CONTINUOUS ORA-24756 DURING SHUTDOWN <<<<<<<<<< Closed, Could Not Reproduce

Bug 16317766 - EXADATA : ORA-24756: TRANSACTION DOES NOT EXIST <<<<<<<<<<<< Suspended, Req'd Info not Avail

Bug 9757979 - PMON CLEANUP FAILS WITH ORA-24756: TRANSACTION DOES NOT EXIST <<<<<<<<<<< Closed, Could Not Reproduce

 

Both the Global transaction acquire instance locks waits and the recurring ORA-24756 errors in the alert log point to Oracle's locking and global (distributed) transaction handling.

Starting with 11g, global transactions in RAC are maintained automatically by the GTXn background processes. Whether these processes start is controlled by the initialization parameter global_txn_processes, whose default value is 1, which means a GTXn process is started by default when the database starts.
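
Whether GTXn is actually running can be checked on each instance. A minimal sketch, assuming DBA access:

-- Current setting of the parameter
show parameter global_txn_processes

-- GTXn background processes per instance (an empty result means none are running)
select inst_id, pname, spid
  from gv$process
 where pname like 'GTX%'
 order by inst_id, pname;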

We asked the customer whether the application or the database had been changed recently. The application had not been changed at all, and no similar problem had occurred before. However, shortly before the incident a switchover drill had been carried out between the primary database and its Data Guard standby, and the failure appeared as soon as the drill finished and normal business resumed. The switchover drill was therefore the prime suspect.

A detailed review of the alert log showed that when the database was switched back to the primary role after the Data Guard drill, the GTXn process was not started:

 

Thu Oct 18 02:36:18 2018
alter database commit to switchover to physical standby with session shutdown
ALTER DATABASE COMMIT TO SWITCHOVER TO PHYSICAL STANDBY [Process Id: 4366] (orcl1)
.
.
.
Switchover: Primary controlfile converted to standby controlfile succesfully.
Switchover: Complete - Database shutdown required
Completed: alter database commit to switchover to physical standby with session shutdown
Thu Oct 18 02:45:27 2018
.
.
.
RFS[8]: Assigned to RFS process 3216
RFS[8]: Identified database type as 'physical standby': Client is ARCH pid 11990
Thu Oct 18 03:11:13 2018
alter database open
AUDIT_TRAIL initialization parameter is changed to OS, as DB is NOT compatible for database opened with read-only access
This instance was first to open
Beginning standby crash recovery.
Serial Media Recovery started
.
.
.
Thu Oct 18 03:11:13 2018
SMON: enabling cache recovery
Dictionary check beginning
Dictionary check complete
Database Characterset is ZHS16GBK
No Resource Manager plan active
Starting background process GTX0
Thu Oct 18 03:11:14 2018
GTX0 started with pid=51, OS id=5041
replication_dependency_tracking turned off (no async multimaster replication found)
Physical standby database opened for read only access.
Completed: alter database open
.
.
.
Thu Oct 18 04:57:19 2018
alter database commit to switchover to primary with session shutdown
ALTER DATABASE SWITCHOVER TO PRIMARY (orcl1)
Maximum wait for role transition is 15 minutes.
Switchover: Media recovery is still active
Role Change: Canceling MRP - no more redo to apply
.
.
.
Switchover: Complete - Database mounted as primary
Completed: alter database commit to switchover to primary with session shutdown
Thu Oct 18 04:57:39 2018
alter database open
This instance was first to open
Picked broadcast on commit scheme to generate SCNs
.
.
.
Thu Oct 18 04:57:44 2018
QMNC started with pid=41, OS id=22585
LOGSTDBY: Validating controlfile with logical metadata
LOGSTDBY: Validation complete
Completed: alter database open
.
.
.
Thu Oct 18 06:53:33 2018
opiodr aborting process unknown ospid (26428) as a result of ORA-24756
Thu Oct 18 06:53:36 2018
Errors in file /oracle/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_pmon_30688.trc:
ORA-24756: transaction does not exist
Errors in file /oracle/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_pmon_30688.trc:
.
.
.
Starting background process GTX0
Thu Oct 18 09:51:20 2018
GTX0 started with pid=36, OS id=32470
Starting background process RCBG
Thu Oct 18 09:51:20 2018

 

From the excerpt of the alert log above, the primary executed a SWITCHOVER at 02:36 to start the switchover drill.

At 03:11 the database was opened; the alert log notes that it was now a physical standby opened in read-only mode, and the lines that follow show the background process GTX0 being started.

At 04:57 another SWITCHOVER was executed. This time the database was not fully shut down but was converted straight to MOUNT mode, and the OPEN completed at 04:57:44. No GTXn process startup appears in the alert log for this period, however.

The next time GTX0 appears in the alert log is at 09:51, when instance 1 was restarted after the failure.

The problem is now clear: during the switch from physical standby back to primary, the GTXn process was not started. Oracle was therefore unable to handle the locking for distributed transactions, foreground sessions ended up waiting on Global transaction acquire instance locks, and PMON could not clean up sessions and transactions properly.

Because instance 2 was shut down and restarted as part of the SWITCHOVER, its GTXn process started normally, which is why the failure no longer appeared after the application was moved to instance 2.

To avoid this kind of problem: first, after a Data Guard switchover drill, once the database has been switched back to the PRIMARY role, restart the database one more time to make sure all necessary background processes are started. Second, if the application really needs to modify tables in another database, it should do so over a separate database connection rather than updating the remote tables directly through a database link.
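
If the GTXn processes turn out to be missing after a role change (see the gv$process check earlier), it is also worth noting that global_txn_processes is a dynamic parameter, so raising it should cause additional GTXn processes to be spawned without a restart; this is only a hedged sketch, and the value shown is illustrative:

-- Illustrative only: raise the number of GTXn processes on all RAC instances.
alter system set global_txn_processes = 2 scope=both sid='*';

-- Re-check; if the processes still do not appear, restart the instance as recommended above.
select inst_id, pname from gv$process where pname like 'GTX%';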

 
