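A customer's database hit the error ORA-600 [kfnsBackground03].
The database is 10.2.0.3 RAC on HP-UX 11.23. The error can appear in either the ASM instance or the database instance. When it occurs in the ASM instance, the ASM instance does not crash; when it occurs in the database instance, the database instance is forcibly shut down: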
Tue May 15 10:28:05 2012
NOTE: database ORCL1:ORCL failed during msg 19, reply 2
Tue May 15 10:32:50 2012
NOTE: database ORCL1:ORCL failed during msg 19, reply 2
Tue May 15 10:33:05 2012
NOTE: database ORCL1:ORCL failed during msg 19, reply 2
Tue May 15 10:34:44 2012
NOTE: database ORCL1:ORCL failed during msg 19, reply 2
Tue May 15 10:43:05 2012
NOTE: database ORCL1:ORCL failed during msg 19, reply 2
Tue May 15 10:46:13 2012
Errors in file /u01/app/oracle/admin/+ASM/udump/+asm1_ora_18846.trc:
ORA-00600: internal error code, arguments: [kfnsBackground03], [], [], [], [], [], [], []
Tue May 15 10:46:14 2012
Trace dumping is performing id=[cdmp_20120515104614]
The above is from the ASM instance's alert log. Below is the database instance's alert log for the same period:
Tue May 15 10:38:12 2012
kkjcre1p: unable to spawn jobq slave process
Tue May 15 10:38:12 2012
Errors in file /u01/app/oracle/admin/ORCL/bdump/orcl1_cjq0_17957.trc:
Tue May 15 10:42:19 2012
PMON failed to acquire latch, see PMON dump
Tue May 15 10:43:04 2012
found dead shared server 'S006', pid = (90, 4)
Tue May 15 10:43:10 2012
Errors in file /u01/app/oracle/admin/ORCL/bdump/orcl1_j000_19938.trc:
ORA-12012: error on auto execute of job 42579
ORA-27468: "EXFSYS.RLM$EVTCLEANUP" is locked by another process
Tue May 15 10:45:06 2012
Errors in file /u01/app/oracle/admin/ORCL/bdump/orcl1_j002_23628.trc:
ORA-12012: error on auto execute of job 8888975
ORA-27468: "ORCL.P_DATA_C" is locked by another process
Tue May 15 10:45:10 2012
Errors in file /u01/app/oracle/admin/ORCL/bdump/orcl1_j003_23959.trc:
ORA-12012: error on auto execute of job 8855572
ORA-27468: "ORCL.P_DATA" is locked by another process
Tue May 15 10:46:14 2012
Errors in file /u01/app/oracle/admin/ORCL/bdump/orcl1_asmb_18844.trc:
ORA-15064: communication failure with ASM instance
ORA-00600: internal error code, arguments: [kfnsBackground03], [], [], [], [], [], [], []
Tue May 15 10:46:14 2012
ASMB: terminating instance due to error 15064
Tue May 15 10:46:15 2012
System state dump is made for local instance
System State dumped to trace file /u01/app/oracle/admin/ORCL/bdump/orcl1_diag_17903.trc
Tue May 15 10:46:16 2012
Shutting down instance (abort)
License high water mark = 52
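Judging from this particular crash, the problem appears to be related to resource exhaustion on the host. Before the failure, the database instance had already reported "kkjcre1p: unable to spawn jobq slave process" and "PMON failed to acquire latch".
However, on other occasions when this error occurred, there was no clear sign of a resource shortage: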
Sat May 26 09:47:49 2012
NOTE: database ORCL1:ORCL failed during msg 19, reply 2
Sat May 26 09:49:44 2012
NOTE: database ORCL1:ORCL failed during msg 19, reply 2
Sat May 26 09:52:23 2012
Errors in file /u01/app/oracle/admin/+ASM/udump/+asm1_ora_21722.trc:
ORA-00600: internal error code, arguments: [kfnsBackground03], [], [], [], [], [], [], []
Sat May 26 09:52:25 2012
Trace dumping is performing id=[cdmp_20120526095225]
The database alert log for the same moment shows:
Sat May 26 09:52:24 2012
Errors in file /u01/app/oracle/admin/ORCL/bdump/orcl1_asmb_21720.trc:
ORA-15064: communication failure with ASM instance
ORA-00600: internal error code, arguments: [kfnsBackground03], [], [], [], [], [], [], []
Sat May 26 09:52:24 2012
ASMB: terminating instance due to error 15064
Sat May 26 09:52:25 2012
System state dump is made for local instance
System State dumped to trace file /u01/app/oracle/admin/ORCL/bdump/orcl1_diag_20837.trc
Sat May 26 09:52:26 2012
Shutting down instance (abort)
License high water mark = 46
Sat May 26 09:52:30 2012
Instance terminated by ASMB, pid = 21720
Sat May 26 09:52:31 2012
Instance terminated by USER, pid = 536
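This time the error appeared with no other accompanying messages, and the database instance simply went down. However, every time the error occurs, the ASM instance's alert log contains the message "NOTE: database ORCL1:ORCL failed during msg 19, reply 2". This shows that communication between the ASM instance and the database instance had run into trouble. kfnsBackground stands for Kernel Files Network Service Background. MSG 19 refers to IOSTAT and reply 2 means TIMEOUT, which indicates that a timeout occurred while ASM was performing an I/O operation; this caused the ASM error and, in turn, the crash of the database instance.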
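Since the "failed during msg 19, reply 2" notes precede the ORA-600 in both incidents, it may help to scan the ASM alert log for them automatically. Below is a minimal Python sketch of that idea. It assumes the alert log layout seen in the excerpts (a timestamp on its own line, followed by the message lines); the function names `scan_alert_log` and `precursors`, the embedded sample, and the window length are illustrative choices, not part of any Oracle tool.

```python
import re
from datetime import datetime

# Assumed layout: "Tue May 15 10:28:05 2012" alone on a line, messages below it.
TS_FORMAT = "%a %b %d %H:%M:%S %Y"
TS_RE = re.compile(r"^[A-Z][a-z]{2} [A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2} \d{4}$")
NOTE_RE = re.compile(r"failed during msg (\d+), reply (\d+)", re.IGNORECASE)
ORA600_RE = re.compile(r"ORA-00600:.*\[kfnsBackground03\]")

# Sample taken from the ASM alert log excerpt in this article.
SAMPLE = """\
Tue May 15 10:28:05 2012
NOTE: database ORCL1:ORCL failed during msg 19, reply 2
Tue May 15 10:32:50 2012
NOTE: database ORCL1:ORCL failed during msg 19, reply 2
Tue May 15 10:33:05 2012
NOTE: database ORCL1:ORCL failed during msg 19, reply 2
Tue May 15 10:34:44 2012
NOTE: database ORCL1:ORCL failed during msg 19, reply 2
Tue May 15 10:43:05 2012
NOTE: database ORCL1:ORCL failed during msg 19, reply 2
Tue May 15 10:46:13 2012
Errors in file /u01/app/oracle/admin/+ASM/udump/+asm1_ora_18846.trc:
ORA-00600: internal error code, arguments: [kfnsBackground03], [], [], [], [], [], [], []
"""


def scan_alert_log(lines):
    """Return (notes, crashes).

    notes   -- list of (timestamp, msg, reply) for each "failed during msg" note
    crashes -- list of timestamps of ORA-00600 [kfnsBackground03] entries
    """
    current = None  # timestamp of the most recent timestamp line
    notes, crashes = [], []
    for raw in lines:
        line = raw.strip()
        if TS_RE.match(line):
            current = datetime.strptime(line, TS_FORMAT)
            continue
        if current is None:
            continue  # ignore anything before the first timestamp
        match = NOTE_RE.search(line)
        if match:
            notes.append((current, int(match.group(1)), int(match.group(2))))
        elif ORA600_RE.search(line):
            crashes.append(current)
    return notes, crashes


def precursors(notes, crashes, window_minutes=30):
    """Map each crash timestamp to the notes logged within the window before it."""
    limit = window_minutes * 60
    return {
        crash: [n for n in notes if 0 <= (crash - n[0]).total_seconds() <= limit]
        for crash in crashes
    }


if __name__ == "__main__":
    notes, crashes = scan_alert_log(SAMPLE.splitlines())
    for crash, before in precursors(notes, crashes).items():
        print(f"ORA-600 at {crash}: {len(before)} timeout note(s) in the prior 30 minutes")
```

To run it against a real log, the lines of the ASM alert log file could be passed to `scan_alert_log` in place of `SAMPLE.splitlines()`. A check like this could also drive an alert when the notes start to appear, before ASMB terminates the database instance.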
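This error is fairly rare. In all of METALINK only three articles relate to it. Two of them describe cases where insufficient space in the archive log destination hung the system, which eventually led to an I/O timeout and this error; the third gives no further information. The three cases involved 10.2.0.4 for AIX, 10.2.0.4 for Solaris, and 10.2.0.3 for HP-UX, which suggests that the error is not platform-specific but is concentrated in versions 10.2.0.3 and 10.2.0.4.
Based on the analysis above, an operating system monitoring tool should be deployed so that system resource usage can be observed at any time and used as supporting evidence when a similar error occurs. Since there is no record of this problem occurring in 10.2.0.5, upgrading the database to 10.2.0.5 may avoid it.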
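As a rough illustration of what such monitoring could record, here is a small Python sketch that periodically appends the load average and process count to a file, so the values can later be lined up against alert log timestamps. It is only a sketch under assumptions: the names `count_processes`, `format_sample`, and `run_sampler` are made up for this example, `os.getloadavg()` is assumed to be available on the host, and `ps -e` is assumed to print one header line followed by one line per process. A dedicated OS monitoring tool would collect far more than this.

```python
import os
import subprocess
import time
from datetime import datetime


def count_processes():
    """Count processes using `ps -e` (assumed: one header line, then one per process)."""
    result = subprocess.run(["ps", "-e"], capture_output=True, text=True, check=True)
    return max(len(result.stdout.splitlines()) - 1, 0)


def format_sample(ts, load, nproc):
    """Format one sample line; the timestamp comes first so it sorts with alert log times."""
    one, five, fifteen = load
    return f"{ts:%Y-%m-%d %H:%M:%S} load={one:.2f},{five:.2f},{fifteen:.2f} procs={nproc}"


def run_sampler(path, interval_seconds=30, iterations=10):
    """Append `iterations` samples to `path`, sleeping `interval_seconds` between them."""
    for i in range(iterations):
        line = format_sample(datetime.now(), os.getloadavg(), count_processes())
        with open(path, "a") as handle:
            handle.write(line + "\n")
        if i < iterations - 1:
            time.sleep(interval_seconds)
```

For example, `run_sampler("/tmp/os_samples.log", interval_seconds=30, iterations=2880)` would cover roughly one day. The process count is included because "unable to spawn jobq slave process" in the first incident hints at a process or memory limit being reached.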