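While adding a node to a RAC 10.2.0.4 cluster on Windows, the ASM instance on the existing node hit two errors: ORA-600 [ksqsgn:join] and ORA-7445 [BREAKPOINT].
The alert log shows the following: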
Sat Jun 09 11:24:21 2012
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Interface type 1 GB2 192.168.7.0 configured from OCR for use as a cluster interconnect
Interface type 1 GB1 172.16.7.0 configured from OCR for use as a public interface
Picked latch-free SCN scheme 3
Using LOG_ARCHIVE_DEST_1 parameter default value as C:\app\oracle\product\10.2.0\db_1\RDBMS
Autotune of undo retention is turned off.
LICENSE_MAX_USERS = 0
SYS auditing is disabled
ksdpec: called for event 13740 prior to event group initialization
Starting up ORACLE RDBMS Version: 10.2.0.4.0.
System parameters with non-default values:
  large_pool_size = 12582912
  instance_type = asm
  cluster_database = TRUE
  instance_number = 1
  remote_login_passwordfile= EXCLUSIVE
  background_dump_dest = C:\APP\ORACLE\PRODUCT\10.2.0\ADMIN\+ASM\BDUMP
  user_dump_dest = C:\APP\ORACLE\PRODUCT\10.2.0\ADMIN\+ASM\UDUMP
  core_dump_dest = C:\APP\ORACLE\PRODUCT\10.2.0\ADMIN\+ASM\CDUMP
  ifile = C:\app\oracle\product\10.2.0\admin\+ASM\pfile\init.ora
  asm_diskgroups = DATA
Cluster communication is configured to use the following interface(s) for this instance
  172.16.7.213
Sat Jun 09 11:24:21 2012
cluster interconnect IPC version:Oracle 9i Winsock2 TCP/IP IPC
IPC Vendor 0 proto 0
  Version 0.0
PMON started with pid=2, OS id=6744
DIAG started with pid=3, OS id=1504
PSP0 started with pid=4, OS id=4216
LMON started with pid=5, OS id=5304
LMD0 started with pid=6, OS id=4164
LMS0 started with pid=7, OS id=4132
MMAN started with pid=8, OS id=5868
DBW0 started with pid=9, OS id=5412
LGWR started with pid=10, OS id=4212
CKPT started with pid=11, OS id=7132
SMON started with pid=12, OS id=6192
RBAL started with pid=13, OS id=5924
GMON started with pid=14, OS id=6900
Sat Jun 09 11:24:47 2012
lmon registered with NM - instance id 1 (internal mem no 0)
Sat Jun 09 11:26:24 2012
Error: KGXGN polling error (15)
Sat Jun 09 11:26:24 2012
Errors in file c:\app\oracle\product\10.2.0\admin\+asm\bdump\+asm1_lmon_5304.trc:
ORA-29702: error occurred in Cluster Group Service operation
LMON: terminating instance due to error 29702
Sat Jun 09 11:26:24 2012
Errors in file c:\app\oracle\product\10.2.0\admin\+asm\udump\+asm1_ora_3564.trc:
ORA-00600: internal error code, arguments: [ksqsgn:join], [error in lmon process], [32], [], [], [], [], []
Sat Jun 09 11:26:24 2012
Errors in file c:\app\oracle\product\10.2.0\admin\+asm\bdump\+asm1_gmon_6900.trc:
ORA-29702: error occurred in Cluster Group Service operation
Sat Jun 09 11:26:24 2012
Errors in file c:\app\oracle\product\10.2.0\admin\+asm\bdump\+asm1_smon_6192.trc:
ORA-29702: error occurred in Cluster Group Service operation
Sat Jun 09 11:26:24 2012
Errors in file c:\app\oracle\product\10.2.0\admin\+asm\bdump\+asm1_ckpt_7132.trc:
ORA-29702: error occurred in Cluster Group Service operation
Sat Jun 09 11:26:24 2012
Errors in file c:\app\oracle\product\10.2.0\admin\+asm\bdump\+asm1_lgwr_4212.trc:
ORA-29702: error occurred in Cluster Group Service operation
Sat Jun 09 11:26:24 2012
Errors in file c:\app\oracle\product\10.2.0\admin\+asm\bdump\+asm1_dbw0_5412.trc:
ORA-29702: error occurred in Cluster Group Service operation
Sat Jun 09 11:26:25 2012
Errors in file c:\app\oracle\product\10.2.0\admin\+asm\bdump\+asm1_lmd0_4164.trc:
ORA-29702: error occurred in Cluster Group Service operation
Sat Jun 09 11:26:25 2012
Errors in file c:\app\oracle\product\10.2.0\admin\+asm\bdump\+asm1_lms0_4132.trc:
ORA-29702: error occurred in Cluster Group Service operation
Sat Jun 09 11:26:25 2012
Errors in file c:\app\oracle\product\10.2.0\admin\+asm\bdump\+asm1_mman_5868.trc:
ORA-29702: error occurred in Cluster Group Service operation
Sat Jun 09 11:26:25 2012
Errors in file c:\app\oracle\product\10.2.0\admin\+asm\bdump\+asm1_psp0_4216.trc:
ORA-29702: error occurred in Cluster Group Service operation
Sat Jun 09 11:26:25 2012
Errors in file c:\app\oracle\product\10.2.0\admin\+asm\bdump\+asm1_pmon_6744.trc:
ORA-29702: error occurred in Cluster Group Service operation
Sat Jun 09 11:26:25 2012
Errors in file c:\app\oracle\product\10.2.0\admin\+asm\bdump\+asm1_rbal_5924.trc:
ORA-29702: error occurred in Cluster Group Service operation
Sat Jun 09 11:26:38 2012
Errors in file c:\app\oracle\product\10.2.0\admin\+asm\bdump\+asm1_diag_1504.trc:
ORA-07445: exception encountered: core dump [BREAKPOINT] [unable_to_trans_pc] [PC:0x774F4EA0] [] [] []
Sat Jun 09 11:26:39 2012
Instance terminated by LMON, pid = 5304
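Besides the ORA-600 [ksqsgn:join] and ORA-7445 [BREAKPOINT] errors mentioned above, the log contains several other conspicuous messages, including ORA-29702, Error: KGXGN polling error (15), and Instance terminated by LMON.
Start with the ORA-600 error. It has no record on Metalink at all. RAC on Windows is admittedly uncommon, but it is still unlikely that MOS (My Oracle Support, formerly Metalink) would hold not a single related note. When that happens, the cause is usually either that the version or operating system is unusual, or that a basic mistake has triggered an error that rarely occurs in practice. Judging by the function name in the error and its second argument, [error in lmon process], this ORA-600 is most likely a direct consequence of the ORA-29702 error that precedes it.
Now the ORA-7445 error. Here there is a closely matching bug note: ORA-07445: Exception Encountered: Core Dump [Breakpoint] [Unable_to_trans_pc] [ID 1214796.1]. However, that problem was fixed in 10.2.0.2 Patch 9 for Windows, so in theory it should not appear in 10.2.0.4.
Even so, the trace file for the ORA-7445 error contains the following: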
*** 2012-06-09 11:24:47.376
IPCConnect: unable to connect to addr [+AS : 3420 : 2720 : 2314930], err 258
IPCGetRequestInfo: failed a request rqh(0xd234bc0), type(1), status(2), bytes(0)
Target port [node=1] is no longer valid
kjzgmapbcast:error encounter when broadcasting
The trace file for the first ORA-29702 error, which the analysis above identified as the trigger for the ORA-600, contains very similar content:
*** 2012-06-09 11:24:47.345
IPCConnect: unable to connect to addr [+AS : 3420 : 3212 : 2314930], err 258
IPCGetRequestInfo: failed a request rqh(0x126b2da0), type(1), status(2), bytes(0)
kjfcpiora: published my fusion master weight 5418
kjfcpiora: publish my flogb 9
kjxggpoll: change poll time to 50 ms
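This shows that IPC communication between the nodes was failing.
Going back over the alert log reveals the root cause. Before starting the ASM instance, Oracle determined from the cluster configuration in the OCR that the 192.168.7.0 subnet was the cluster interconnect and the 172.16.7.0 subnet was the public interface. Yet once the instance started, it used an address on the 172 subnet (172.16.7.213) for cluster communication. That mismatch ultimately produced ORA-29702, Error: KGXGN polling error (15), and the other errors related to cluster communication timeouts.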
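The mismatch described above can be checked mechanically. The following is a minimal Python sketch, not an Oracle tool: the function name `check_interconnect` is made up for illustration, and the /24 prefix is an assumption because the alert log does not print the netmask. It pulls the OCR-configured interconnect subnet and the address the instance actually used out of alert-log text and reports any address that falls outside the subnet.

```python
import ipaddress
import re

# Assumption: the subnets use a /24 mask; the alert log does not show the netmask.
ASSUMED_PREFIX = 24

IPV4 = r"\d+\.\d+\.\d+\.\d+"


def check_interconnect(alert_text, prefix=ASSUMED_PREFIX):
    """Return (interconnect_subnet, used_addresses, mismatched_addresses)."""
    # Matching is case-insensitive because pasted logs may have altered keyword casing.
    configured = re.search(
        r"Interface type \d+ \S+ (" + IPV4 + r") configured from OCR "
        r"for use as a cluster interconnect",
        alert_text,
        re.IGNORECASE,
    )
    if configured is None:
        raise ValueError("no OCR cluster interconnect line found")
    subnet = ipaddress.ip_network(f"{configured.group(1)}/{prefix}", strict=False)

    # The addresses actually chosen follow the "following interface(s)" line.
    chosen = re.search(
        r"following interface\(s\) for this instance\s+((?:" + IPV4 + r"\s*)+)",
        alert_text,
        re.IGNORECASE,
    )
    used = [ipaddress.ip_address(a) for a in chosen.group(1).split()] if chosen else []
    mismatched = [a for a in used if a not in subnet]
    return subnet, used, mismatched


# Lines taken from the alert log in this article.
sample = """Interface type 1 GB2 192.168.7.0 configured from OCR for use as a cluster interconnect
Interface type 1 GB1 172.16.7.0 configured from OCR for use as a public interface
Cluster communication is configured to use the following interface(s) for this instance
  172.16.7.213
Sat Jun 09 11:24:21 2012
"""

subnet, used, mismatched = check_interconnect(sample)
print("configured interconnect:", subnet)
print("addresses outside it:", [str(a) for a in mismatched])
```

For this log, the address the instance chose is reported as lying outside the configured interconnect subnet, which is exactly the inconsistency that explains the IPC failures.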
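On Windows, changes to the cluster (private) and public network adapter configuration take effect only after a reboot. After restarting the system and confirming that the private and public IP settings were correct, the cluster and the database were started and the problem disappeared.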
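One way to confirm the settings after the reboot is to look at what the clusterware has registered, for example with `oifcfg getif`, which prints one line per interface as name, subnet, scope, and type. The sketch below only parses such text; it does not run the command. The sample text is hypothetical, built from the interface names and subnets in the alert log above, and the helper names and the /24 prefix are assumptions.

```python
import ipaddress


def parse_getif(text):
    """Map interface type (e.g. 'cluster_interconnect') to (name, subnet)."""
    roles = {}
    for line in text.splitlines():
        # Split from the right so that interface names containing spaces survive.
        parts = line.rsplit(None, 3)
        if len(parts) != 4:
            continue
        name, subnet, _scope, if_type = parts
        roles[if_type] = (name, subnet)
    return roles


def address_on(subnet, address, prefix=24):
    """True if address lies in subnet/prefix (the /24 default is an assumption)."""
    network = ipaddress.ip_network(f"{subnet}/{prefix}", strict=False)
    return ipaddress.ip_address(address) in network


# Hypothetical output text, reconstructed from the alert log; not captured from the system.
sample_getif = """GB1  172.16.7.0  global  public
GB2  192.168.7.0  global  cluster_interconnect
"""

roles = parse_getif(sample_getif)
name, subnet = roles["cluster_interconnect"]
print(name, subnet, address_on(subnet, "172.16.7.213"))
```

An address that fails this check for the `cluster_interconnect` entry is a sign that the instance would again pick the wrong network for cluster traffic.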