最近已经碰到多起客户数据库无法打开的情况,这就是其中一次。
这是一个10201 for Windows 64bit数据库,在一次掉电后,数据库无法启动,在后台告警日志中出现下列的错误:
Mon Oct 31 00:00:28 2011 ALTER DATABASE OPEN Mon Oct 31 00:00:28 2011 Beginning crash recovery OF 1 threads parallel recovery started WITH 16 processes Mon Oct 31 00:00:28 2011 Started redo scan Mon Oct 31 00:00:28 2011 Completed redo scan 12587 redo blocks READ, 161 DATA blocks need recovery Mon Oct 31 00:00:29 2011 Started redo application at Thread 1: logseq 661, block 34225 Mon Oct 31 00:00:29 2011 Recovery OF Online Redo Log: Thread 1 GROUP 3 Seq 661 Reading mem 0 Mem# 0 errs 0: D:\ORACLE\PRODUCT\10.2.0\ORADATA\YCBG\REDO03.LOG Mon Oct 31 00:00:29 2011 Completed redo application Mon Oct 31 00:00:29 2011 Errors IN file e:\oracle\product\10.2.0\admin\ycbg\bdump\ycbg_p002_2196.trc: ORA-00600: internal error code, arguments: [2037], [21598604], [2441871360], [0], [0], [0], [3729260873], [536936448] Mon Oct 31 00:00:29 2011 Errors IN file e:\oracle\product\10.2.0\admin\ycbg\bdump\ycbg_p006_2212.trc: ORA-00600: internal error code, arguments: [2037], [4198106], [249167872], [229], [159], [0], [1456275520], [100740646] . . . Mon Oct 31 00:00:30 2011 Errors IN file e:\oracle\product\10.2.0\admin\ycbg\bdump\ycbg_p002_2196.trc: ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [kcbs_dump_adv_state+1127] [PC:0x60D4E1] [ADDR:0xFFFFFFFFFFFFFFFF] [UNABLE_TO_READ] [] ORA-00600: internal error code, arguments: [2037], [21598604], [2441871360], [0], [0], [0], [3729260873], [536936448] . . . Mon Oct 31 00:00:31 2011 Errors IN file e:\oracle\product\10.2.0\admin\ycbg\bdump\ycbg_p012_2236.trc: ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [ksl_cleanup+1220] [PC:0x49DA9C] [ADDR:0xFFFFFFFFFFFFFFFF] [UNABLE_TO_READ] [] ORA-00081: address range [0x179B63600000, 0x179B63600004) IS NOT readable ORA-00600: internal error code, arguments: [2037], [21649715], [1496514560], [228], [10], [0], [3855679818], [100820580] Mon Oct 31 00:00:31 2011 Errors IN file e:\oracle\product\10.2.0\admin\ycbg\bdump\ycbg_p007_2216.trc: ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [kslwte_tm+754] [PC:0x49FC16] [ADDR:0x82FD2228AB8] [UNABLE_TO_WRITE] [] . . . Mon Oct 31 00:00:35 2011 Errors IN file e:\oracle\product\10.2.0\admin\ycbg\udump\ycbg_ora_2256.trc: ORA-00600: internal error code, arguments: [ksuapc1], [0x110000078], [0x7FFD63F4E90], [131072], [0x7FFD62F4E98], [], [], [] . . . Corrupt block relative dba: 0x014895b9 (file 5, block 562617) Bad header found during crash/instance recovery DATA IN bad block: TYPE: 91 format: 5 rdba: 0x95b90000 LAST CHANGE scn: 0x0229.e7210148 seq: 0x0 flg: 0x00 spare1: 0x6 spare2: 0xa2 spare3: 0x8b1d consistency VALUE IN tail: 0x06010000 CHECK VALUE IN block header: 0x401 block checksum disabled . . . Mon Oct 31 00:01:23 2011 Errors IN file e:\oracle\product\10.2.0\admin\ycbg\bdump\ycbg_ora_2164.trc: ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [ksliwat+3150] [PC:0x4A1D92] [ADDR:0xFFFFFFFFFFFFFFFF] [UNABLE_TO_READ] [] . . . Mon Oct 31 00:01:26 2011 Errors IN file e:\oracle\product\10.2.0\admin\ycbg\bdump\ycbg_reco_2148.trc: ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [kslwte_tm+754] [PC:0x49FC16] [ADDR:0x82FD2228AB8] [UNABLE_TO_WRITE] [] . . . Mon Oct 31 00:09:13 2011 USER: terminating instance due TO error 472 Mon Oct 31 00:09:13 2011 Errors IN file e:\oracle\product\10.2.0\admin\ycbg\udump\ycbg_ora_2772.trc: ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [ksuitm+1655] [PC:0x417C41] . . . Mon Oct 31 00:41:26 2011 Errors IN file e:\oracle\product\10.2.0\admin\ycbg\bdump\ycbg_ora_2080.trc: ORA-00600: internal error code, arguments: [kmcpsched:Bad caller], [], [], [], [], [], [], [] . . . Mon Oct 31 04:58:05 2011 Errors IN file e:\oracle\product\10.2.0\admin\ycbg\bdump\ycbg_p007_2960.trc: ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [kksOnErrorMutexCleanup+53] [PC:0x4A7ED5] [ADDR:0xFFFFFFFFFFFFFFFF] [UNABLE_TO_READ] [] ORA-00081: address range [0x163A63600000, 0x163A63600004) IS NOT readable ORA-00600: internal error code, arguments: [ksfdrmms1], [0x1111B5080], [], [], [], [], [], [] . . . Mon Oct 31 05:04:57 2011 Errors IN file e:\oracle\product\10.2.0\admin\ycbg\bdump\ycbg_p004_3920.trc: ORA-00600: internal error code, arguments: [ksfdchkfob1], [0x1111B8138], [0x1112694580000], [], [], [], [], [] . . . Mon Oct 31 05:04:58 2011 Errors IN file e:\oracle\product\10.2.0\admin\ycbg\bdump\ycbg_mmon_2384.trc: ORA-00600: internal error code, arguments: [KSSRMP2], [0x1112697F8], [0], [0], [], [], [], [] ORA-27041: unable TO OPEN file OSD-04001: 逻辑块大小无效 (OS 33554432) . . . Mon Oct 31 05:15:08 2011 Errors IN file e:\oracle\product\10.2.0\admin\ycbg\bdump\ycbg_p013_2484.trc: ORA-07445: exception encountered: core dump [ACCESS_VIOLATION] [kksOnErrorMutexCleanup+53] [PC:0x4A7ED5] [ADDR:0xFFFFFFFFFFFFFFFF] [UNABLE_TO_READ] [] ORA-00081: address range [0x17F663600000, 0x17F663600004) IS NOT readable ORA-00600: internal error code, arguments: [545], [0x110563848], [0], [16], [], [], [], [] |
这里只是记录了不重复错误,每个错误都会出现很多次,简直是ORA-600和ORA-7445错误的大集合。由于一个错误,一次性引发这么多的不同ORA-600和ORA-7445,我还是第一次碰到。
虽然错误信息很多,但是后面的大部分错误,除了个别几个在MOS中没有记录外,一些错误与不正常的启动有关,一些错误与实例崩溃有关,另一些是与文件损坏有关。也就是说,这些错误的产生只是现象,而问题的主要原因还要从第一个错误上入手。
*** 2011-10-31 00:00:29.259 ksedmp: internal OR fatal error ORA-00600: internal error code, arguments: [2037], [21598604], [2441871360], [0], [0], [0], [3729260873], [536936448] ----- Call Stack Trace ----- calling CALL entry argument VALUES IN hex location TYPE point (? means dubious VALUE) -------------------- -------- -------------------- ---------------------------- ksedst+55 CALL??? ksedst1+573 970A00000000 00001D688 014BBDE10 00001E5F0 ksedmp+663 CALL??? ksedst+55 003A750A8 000000000 014BBC888 000000000 ksfdmp+19 CALL??? ksedmp+663 000000003 013EE4650 013A78A88 003A975A0 kgeriv+184 CALL??? ksfdmp+19 00001E5F0 013EE4010 000AEBD23 27F00001FA0 kgesiv+102 CALL??? kgeriv+184 000000000 000000000 013EE4010 000000000 ksesic7+125 CALL??? kgesiv+102 7FFD2001EC0 232FBC020 400000000 000000000 kcoexam+249 CALL??? ksesic7+125 1000007F5 000000000 00149918C 000000000 kcbtema+647 CALL??? kcoexam+249 01B7C5FA0 10F776000 000000001 000000007 kcrpap+356 CALL??? kcbtema+647 1A1FBB198 7FF9EFC0B88 7FF9EFC0604 013A7A558 kcrpdv+1521 CALL??? kcrpap+356 003E5B9A8 00000000B 000000000 000000000 kxfprdp+1360 CALL??? kcrpdv+1521 000000001 003A7AC60 000000000 000000000 opirip+1250 CALL??? kxfprdp+1360 30315C740000001E 003A782E8 014BBFA70 000000000 opidrv+860 CALL??? opirip+1250 000000032 000000004 014BBFDA0 000000000 sou2o+52 CALL??? opidrv+860 000000032 000000004 014BBFDA0 0068679D2 opimai_real+272 CALL??? sou2o+52 000000000 0139C0000 000000828 000000000 opimai+96 CALL??? opimai_real+272 000000000 000000000 000000000 000000000 BackgroundThreadSta CALL??? opimai+96 014BBFEF0 000000001 000000000 rt+530 000000000 0000000078D3B6DA CALL??? BackgroundThreadSta 0068675C0 000000000 000000000 rt+530 014BBFFA8 --------------------- Binary Stack Dump --------------------- |
查询MOS后发现,文档During Startup (Open Database) Alert Log Shows ORA-600[2037] and ORA-7445[kcbs_dump_adv_state] [ID 551993.1]和当前描述的现象最为接近:不但2037错误和kcbs_dump_adv_state错误信息一致,数据库版本相符,连文档中提到的kcoexam kcbtema kcrpap kcrpdv kxfprdp堆栈信息在TRACE中也可以一点不差的找到。
如果在处理分布式事务的时候使用了IMU(In Memory Undo)技术,那么如果这个时候数据库出现了崩溃,那么很不幸,在Oracle重启的时候尝试实例恢复的过程中,就会导致REDO或UNDO的损坏,从而导致数据库无法打开。而当前数据库的报错集中发生在PNNN进程上,这说明Oracle在启用并行恢复,而错误信息中还可以看到RECO进程,说明Oracle在处理分布式事务的恢复,所有的现象都很好的符合了bug的描述。
这个Bug 4899479影响10.2.0.1到10.2.0.3版本,在10.2.0.4中被FIXED。而解决这个问题的常规手段只有通过备份进行及时点不完全恢复。