ORA-600 (ksqsgn:join) and ORA-7445 (BREAKPOINT) Errors

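While a node was being added to a RAC 10.2.0.4 cluster on Windows, the ASM instance on the existing node raised these two errors.
The alert log shows the following: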

Sat Jun 09 11:24:21 2012
 Starting ORACLE instance (normal)
 LICENSE_MAX_SESSION = 0
 LICENSE_SESSIONS_WARNING = 0
 Interface TYPE 1 GB2 192.168.7.0 configured FROM OCR FOR USE AS a cluster interconnect
 Interface TYPE 1 GB1 172.16.7.0 configured FROM OCR FOR USE AS a public interface
 Picked latch-free SCN scheme 3
 USING LOG_ARCHIVE_DEST_1 parameter DEFAULT VALUE AS C:\app\oracle\product\10.2.0\db_1\RDBMS
 Autotune OF undo retention IS turned off. 
LICENSE_MAX_USERS = 0
 SYS auditing IS disabled
 ksdpec: called FOR event 13740 prior TO event GROUP initialization
 Starting up ORACLE RDBMS Version: 10.2.0.4.0.
 System parameters WITH non-DEFAULT VALUES:
 large_pool_size = 12582912
 instance_type = asm
 cluster_database = TRUE
 instance_number = 1
 remote_login_passwordfile= EXCLUSIVE
 background_dump_dest = C:\APP\ORACLE\PRODUCT\10.2.0\ADMIN\+ASM\BDUMP
 user_dump_dest = C:\APP\ORACLE\PRODUCT\10.2.0\ADMIN\+ASM\UDUMP
 core_dump_dest = C:\APP\ORACLE\PRODUCT\10.2.0\ADMIN\+ASM\CDUMP
 ifile = C:\app\oracle\product\10.2.0\admin\+ASM\pfile\init.ora
 asm_diskgroups = DATA
 Cluster communication IS configured TO USE the following interface(s) FOR this instance
 172.16.7.213
 Sat Jun 09 11:24:21 2012
 cluster interconnect IPC version:Oracle 9i Winsock2 TCP/IP IPC
 IPC Vendor 0 proto 0
 Version 0.0
 PMON started WITH pid=2, OS id=6744
 DIAG started WITH pid=3, OS id=1504
 PSP0 started WITH pid=4, OS id=4216
 LMON started WITH pid=5, OS id=5304
 LMD0 started WITH pid=6, OS id=4164
 LMS0 started WITH pid=7, OS id=4132
 MMAN started WITH pid=8, OS id=5868
 DBW0 started WITH pid=9, OS id=5412
 LGWR started WITH pid=10, OS id=4212
 CKPT started WITH pid=11, OS id=7132
 SMON started WITH pid=12, OS id=6192
 RBAL started WITH pid=13, OS id=5924
 GMON started WITH pid=14, OS id=6900
 Sat Jun 09 11:24:47 2012
 lmon registered WITH NM - instance id 1 (internal mem no 0)
 Sat Jun 09 11:26:24 2012
 Error: KGXGN polling error (15)
 Sat Jun 09 11:26:24 2012
 Errors IN file c:\app\oracle\product\10.2.0\admin\+asm\bdump\+asm1_lmon_5304.trc:
 ORA-29702: error occurred IN Cluster GROUP Service operation
 
LMON: terminating instance due TO error 29702
 Sat Jun 09 11:26:24 2012
 Errors IN file c:\app\oracle\product\10.2.0\admin\+asm\udump\+asm1_ora_3564.trc:
 ORA-00600: internal error code, arguments: [ksqsgn:JOIN], [error IN lmon process], [32], [], [], [], [], []
 
Sat Jun 09 11:26:24 2012
 Errors IN file c:\app\oracle\product\10.2.0\admin\+asm\bdump\+asm1_gmon_6900.trc:
 ORA-29702: error occurred IN Cluster GROUP Service operation
 
Sat Jun 09 11:26:24 2012
 Errors IN file c:\app\oracle\product\10.2.0\admin\+asm\bdump\+asm1_smon_6192.trc:
 ORA-29702: error occurred IN Cluster GROUP Service operation
 
Sat Jun 09 11:26:24 2012
 Errors IN file c:\app\oracle\product\10.2.0\admin\+asm\bdump\+asm1_ckpt_7132.trc:
 ORA-29702: error occurred IN Cluster GROUP Service operation
 
Sat Jun 09 11:26:24 2012
 Errors IN file c:\app\oracle\product\10.2.0\admin\+asm\bdump\+asm1_lgwr_4212.trc:
 ORA-29702: error occurred IN Cluster GROUP Service operation
 
Sat Jun 09 11:26:24 2012
 Errors IN file c:\app\oracle\product\10.2.0\admin\+asm\bdump\+asm1_dbw0_5412.trc:
 ORA-29702: error occurred IN Cluster GROUP Service operation
 
Sat Jun 09 11:26:25 2012
 Errors IN file c:\app\oracle\product\10.2.0\admin\+asm\bdump\+asm1_lmd0_4164.trc:
 ORA-29702: error occurred IN Cluster GROUP Service operation
 
Sat Jun 09 11:26:25 2012
 Errors IN file c:\app\oracle\product\10.2.0\admin\+asm\bdump\+asm1_lms0_4132.trc:
 ORA-29702: error occurred IN Cluster GROUP Service operation
 
Sat Jun 09 11:26:25 2012
 Errors IN file c:\app\oracle\product\10.2.0\admin\+asm\bdump\+asm1_mman_5868.trc:
 ORA-29702: error occurred IN Cluster GROUP Service operation
 
Sat Jun 09 11:26:25 2012
 Errors IN file c:\app\oracle\product\10.2.0\admin\+asm\bdump\+asm1_psp0_4216.trc:
 ORA-29702: error occurred IN Cluster GROUP Service operation
 
Sat Jun 09 11:26:25 2012
 Errors IN file c:\app\oracle\product\10.2.0\admin\+asm\bdump\+asm1_pmon_6744.trc:
 ORA-29702: error occurred IN Cluster GROUP Service operation
 
Sat Jun 09 11:26:25 2012
 Errors IN file c:\app\oracle\product\10.2.0\admin\+asm\bdump\+asm1_rbal_5924.trc:
 ORA-29702: error occurred IN Cluster GROUP Service operation
 
Sat Jun 09 11:26:38 2012
 Errors IN file c:\app\oracle\product\10.2.0\admin\+asm\bdump\+asm1_diag_1504.trc:
 ORA-07445: exception encountered: core dump [BREAKPOINT] [unable_to_trans_pc] [PC:0x774F4EA0] [] [] []
 
Sat Jun 09 11:26:39 2012
 Instance TERMINATED BY LMON, pid = 5304

Besides the ORA-600 [ksqsgn:join] and ORA-7445 [BREAKPOINT] errors mentioned above, the log contains several other conspicuous messages, including ORA-29702, Error: KGXGN polling error (15), and Instance terminated by LMON.
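With this many messages in one alert log, a quick tally of the ORA- codes in order of first appearance helps show which error came first and which ones merely followed. The snippet below is a minimal sketch of that idea; the function name summarize_ora_errors is hypothetical, and it assumes the alert log has already been read into a list of lines.

```python
import re
from collections import Counter

# Matches five-digit Oracle error codes such as ORA-29702 or ORA-00600.
ORA_RE = re.compile(r"\bORA-(\d{5})\b")


def summarize_ora_errors(alert_lines):
    """Return (code, count) pairs in order of first appearance.

    Hypothetical helper for illustration only; it looks at nothing
    but the ORA-nnnnn tokens in each line.
    """
    counts = Counter()
    first_seen = []
    for line in alert_lines:
        for code in ORA_RE.findall(line):
            key = f"ORA-{code}"
            if key not in counts:
                first_seen.append(key)
            counts[key] += 1
    return [(key, counts[key]) for key in first_seen]
```

Run against the log above, a tally like this puts ORA-29702 first, which is consistent with treating it as the error to chase before the ORA-600 and ORA-7445.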

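Start with the ORA-600 error. Metalink has no record of it at all. RAC on Windows is uncommon, but it is still unlikely for an error to have not a single related entry in MOS. When that happens, the cause is usually either an unusual version or operating system, or some basic mistake that triggers an error which rarely occurs under normal conditions. Judging by the function name in the error and its second argument, [error in lmon process], this error should be directly related to the ORA-29702 error that precedes it.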

Now the ORA-7445 error. Here there is a highly relevant bug note: ORA-07445: Exception Encountered: Core Dump [Breakpoint] [Unable_to_trans_pc] [ID 1214796.1]. That problem, however, was fixed in 10.2.0.2 Patch 9 for Windows, so in theory it should not appear in 10.2.0.4.

The trace file for the ORA-7445 error, however, contains the following:

*** 2012-06-09 11:24:47.376
 IPCConnect: unable TO CONNECT TO addr [+AS : 3420 : 2720 : 2314930], err 258
 IPCGetRequestInfo: failed a request rqh(0xd234bc0), TYPE(1), STATUS(2), bytes(0)
 Target port [node=1] IS no longer valid
 kjzgmapbcast:error encounter WHEN broadcasting

The trace file of the first ORA-29702 error, which as analyzed above led to the ORA-600 error, contains similar content:

*** 2012-06-09 11:24:47.345
 IPCConnect: unable TO CONNECT TO addr [+AS : 3420 : 3212 : 2314930], err 258
 IPCGetRequestInfo: failed a request rqh(0x126b2da0), TYPE(1), STATUS(2), bytes(0)
 kjfcpiora: published my fusion master weight 5418
 kjfcpiora: publish my flogb 9
 kjxggpoll: CHANGE poll TIME TO 50 ms

This indicates that IPC communication between the nodes was failing.
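Both trace excerpts share the same IPCConnect failure line, so the pattern can be pulled out of several trace files mechanically. The following is a minimal sketch, assuming the trace text is available as a list of lines; the function name find_ipc_connect_failures is hypothetical, and the regular expressions are case-insensitive only because the keywords in the excerpts above appear in upper case.

```python
import re

# Matches lines such as:
#   IPCConnect: unable TO CONNECT TO addr [+AS : 3420 : 2720 : 2314930], err 258
IPC_CONNECT_RE = re.compile(
    r"IPCConnect:\s+unable\s+to\s+connect\s+to\s+addr\s+\[([^\]]*)\],\s+err\s+(\d+)",
    re.IGNORECASE,
)
# Matches trace timestamps such as: *** 2012-06-09 11:24:47.376
TIMESTAMP_RE = re.compile(r"^\*\*\*\s+(\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2}\.\d+)")


def find_ipc_connect_failures(trace_lines):
    """Return (timestamp, address, error_number) for each IPCConnect failure.

    Hypothetical helper for illustration; the timestamp is the most recent
    '***' marker seen before the failure line, or None if there was none.
    """
    failures = []
    current_ts = None
    for raw in trace_lines:
        line = raw.strip()
        ts = TIMESTAMP_RE.match(line)
        if ts:
            current_ts = ts.group(1)
            continue
        hit = IPC_CONNECT_RE.search(line)
        if hit:
            failures.append((current_ts, hit.group(1), int(hit.group(2))))
    return failures
```

Applied to the DIAG and LMON traces, this yields one failure per file with the same error number and nearly identical timestamps, which is what points toward a shared communication problem rather than two unrelated faults.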

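Going back over the alert log reveals the root cause. Before the ASM instance started, Oracle determined from the cluster configuration that the 192 network segment was the cluster interconnect and the 172 segment the public interface. Once the instance was up, however, it used the 172 segment (172.16.7.213) for cluster communication, which ultimately led to ORA-29702, Error: KGXGN polling error (15), and the other errors related to cluster communication timeouts.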
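The mismatch described here can be checked mechanically from the alert log itself. The snippet below is a rough sketch rather than a supported tool: the function name check_interconnect is hypothetical, and since the alert log prints only the network address, the /24 prefix length is an assumption that should be replaced with the real netmask.

```python
import ipaddress
import re

# Matches lines such as:
#   Interface TYPE 1 GB2 192.168.7.0 configured FROM OCR FOR USE AS a cluster interconnect
INTERFACE_RE = re.compile(
    r"Interface\s+type\s+\d+\s+(\S+)\s+(\d+\.\d+\.\d+\.\d+)\s+configured\s+from\s+OCR"
    r"\s+for\s+use\s+as\s+a\s+(cluster interconnect|public interface)",
    re.IGNORECASE,
)
USED_HEADER_RE = re.compile(r"Cluster communication is configured to use", re.IGNORECASE)
IP_ONLY_RE = re.compile(r"^\s*(\d+\.\d+\.\d+\.\d+)\s*$")


def check_interconnect(alert_lines, prefix_len=24):
    """Return (ip, role) for each address the instance uses for cluster traffic.

    role is the OCR role of the subnet containing that address, or "unknown".
    prefix_len is an assumption: the alert log does not print the netmask.
    """
    subnets = {}
    used = []
    expect_ip = False
    for line in alert_lines:
        m = INTERFACE_RE.search(line)
        if m:
            network = ipaddress.ip_network(f"{m.group(2)}/{prefix_len}", strict=False)
            subnets[network] = m.group(3).lower()
            expect_ip = False
            continue
        if USED_HEADER_RE.search(line):
            expect_ip = True  # the following lines list the addresses in use
            continue
        if expect_ip:
            ip_match = IP_ONLY_RE.match(line)
            if ip_match:
                used.append(ipaddress.ip_address(ip_match.group(1)))
            else:
                expect_ip = False
    results = []
    for ip in used:
        role = next((r for net, r in subnets.items() if ip in net), "unknown")
        results.append((str(ip), role))
    return results
```

Any address reported with a role other than "cluster interconnect" is worth investigating; for the log in this post, 172.16.7.213 falls inside the subnet registered as the public interface.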

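On Windows, changes to the cluster and public network adapter configuration take effect only after a reboot. After the system was rebooted and the private and public IP settings were confirmed to be correct, the cluster and the database were started and the problem disappeared.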
