出现错误的数据库为Oracle 10.2.0.4 RAC for HP-UX。
在RAC的一个节点上出现了ORA-600[1616]错误和ORA-569错误:
Sat Apr 7 03:03:09 2012 Errors IN file /u01/app/oracle/admin/ORCL/bdump/orcl2_smon_27471.trc: ORA-00569: Failed TO acquire global enqueue. Sat Apr 7 03:04:14 2012 Errors IN file /u01/app/oracle/admin/ORCL/bdump/orcl2_smon_27471.trc: ORA-00569: Failed TO acquire global enqueue. Sat Apr 7 03:04:15 2012 Errors IN file /u01/app/oracle/admin/ORCL/bdump/orcl2_smon_27471.trc: ORA-00569: Failed TO acquire global enqueue. . . . Sat Apr 7 03:06:35 2012 Errors IN file /u01/app/oracle/admin/ORCL/bdump/orcl2_smon_27471.trc: ORA-00569: Failed TO acquire global enqueue. Sat Apr 7 03:07:42 2012 Errors IN file /u01/app/oracle/admin/ORCL/bdump/orcl2_smon_27471.trc: ORA-00569: Failed TO acquire global enqueue. Sat Apr 7 03:07:42 2012 Errors IN file /u01/app/oracle/admin/ORCL/bdump/orcl2_mmon_27483.trc: ORA-00600: internal error code, arguments: [1616], [30], [1], [], [], [], [], [] Sat Apr 7 03:07:44 2012 Trace dumping IS performing id=[cdmp_20120407030744] Sat Apr 7 03:07:44 2012 ORA-600 encountered WHEN generating server alert SMG-3503 Sat Apr 7 03:07:57 2012 Reconfiguration started (OLD inc 24, NEW inc 26) List OF nodes: 1 Global Resource Directory frozen * dead instance detected - DOMAIN 0 invalid = TRUE Communication channels reestablished Master broadcasted resource hash VALUE bitmaps Non-LOCAL Process blocks cleaned OUT Sat Apr 7 03:08:03 2012 LMS 0: 1 GCS shadows cancelled, 0 closed Sat Apr 7 03:08:03 2012 LMS 1: 1 GCS shadows cancelled, 0 closed SET master node info Submitted ALL remote-enqueue requests Dwn-cvts replayed, VALBLKs dubious ALL grantable enqueues GRANTED Sat Apr 7 03:08:13 2012 Reconfiguration started (OLD inc 26, NEW inc 28) List OF nodes: 0 1 Nested reconfiguration detected. Global Resource Directory frozen Communication channels reestablished * DOMAIN 0 valid = 0 according TO instance 0 Sat Apr 7 03:08:13 2012 Master broadcasted resource hash VALUE bitmaps Non-LOCAL Process blocks cleaned OUT Sat Apr 7 03:08:17 2012 LMS 1: 0 GCS shadows cancelled, 0 closed Sat Apr 7 03:08:17 2012 LMS 0: 0 GCS shadows cancelled, 0 closed SET master node info Sat Apr 7 03:08:39 2012 Submitted ALL remote-enqueue requests Dwn-cvts replayed, VALBLKs dubious ALL grantable enqueues GRANTED Sat Apr 7 03:08:51 2012 LMS 0: 7542 GCS shadows traversed, 4001 replayed Sat Apr 7 03:08:51 2012 LMS 1: 7472 GCS shadows traversed, 4001 replayed Sat Apr 7 03:08:52 2012 LMS 0: 7481 GCS shadows traversed, 4001 replayed Sat Apr 7 03:08:52 2012 LMS 1: 7522 GCS shadows traversed, 4001 replayed Sat Apr 7 03:08:52 2012 LMS 0: 7450 GCS shadows traversed, 4001 replayed . . . Sat Apr 7 03:08:52 2012 LMS 1: 7550 GCS shadows traversed, 4001 replayed Sat Apr 7 03:08:52 2012 LMS 0: 6396 GCS shadows traversed, 3431 replayed Sat Apr 7 03:08:52 2012 LMS 1: 5339 GCS shadows traversed, 2878 replayed Sat Apr 7 03:08:52 2012 Submitted ALL GCS remote-cache requests Sat Apr 7 03:08:52 2012 Instance recovery: looking FOR dead threads Fix WRITE IN gcs resources Sat Apr 7 03:08:52 2012 Beginning instance recovery OF 1 threads Reconfiguration complete Sat Apr 7 03:08:53 2012 parallel recovery started WITH 7 processes Sat Apr 7 03:08:53 2012 Started redo scan Sat Apr 7 03:08:53 2012 Completed redo scan 2550 redo blocks READ, 152 DATA blocks need recovery Sat Apr 7 03:08:54 2012 Started redo application at Thread 1: logseq 75561, block 256542 Sat Apr 7 03:08:54 2012 Recovery OF Online Redo Log: Thread 1 GROUP 1 Seq 75561 Reading mem 0 Mem# 0: +DATA/orcl/onlinelog/group_1.261.611597975 Sat Apr 7 03:08:54 2012 Completed redo application Sat Apr 7 03:08:54 2012 Completed instance recovery at Thread 1: logseq 75561, block 259092, scn 52674829295 149 DATA blocks READ, 137 DATA blocks written, 2550 redo blocks READ Switch log FOR thread 1 TO SEQUENCE 75562 |
虽然ORA-600[1616]错误在MOS中完全没有记载,但是根据错误信息分析,这个ORA-600错误应该是ORA-569错误导致的,而ORA-569的错误显然和另外一个实例的状态有关。可以看到在问题发生不久之后,另外一个节点发生了关闭和启动的操作。
那么检查一下另外一个节点对应时刻的日志:
Thu Apr 5 23:06:35 2012 Thread 1 advanced TO log SEQUENCE 75470 CURRENT log# 6 seq# 75470 mem# 0: +DATA/orcl/onlinelog/group_6.258.611597995 Thu Apr 5 23:59:38 2012 Thread 1 advanced TO log SEQUENCE 75471 CURRENT log# 2 seq# 75471 mem# 0: +DATA/orcl/onlinelog/group_2.260.611597981 Fri Apr 6 00:20:12 2012 SOME indexes OR INDEX [sub]partitions OF TABLE ORCLLJ.OLD_000 have been marked unusable Fri Apr 6 00:20:15 2012 SOME indexes OR INDEX [sub]partitions OF TABLE ORCLLJ.OLD_000 have been marked unusable Fri Apr 6 00:20:15 2012 SOME indexes OR INDEX [sub]partitions OF TABLE ORCLLJ.OLD_000 have been marked unusable Fri Apr 6 00:20:16 2012 SOME indexes OR INDEX [sub]partitions OF TABLE ORCLLJ.OLD_000 have been marked unusable Fri Apr 6 00:20:16 2012 SOME indexes OR INDEX [sub]partitions OF TABLE ORCLLJ.OLD_000 have been marked unusable Fri Apr 6 00:20:17 2012 SOME indexes OR INDEX [sub]partitions OF TABLE ORCLLJ.OLD_000 have been marked unusable Fri Apr 6 00:20:17 2012 SOME indexes OR INDEX [sub]partitions OF TABLE ORCLLJ.OLD_000 have been marked unusable Fri Apr 6 00:20:18 2012 SOME indexes OR INDEX [sub]partitions OF TABLE ORCLLJ.OLD_000 have been marked unusable Fri Apr 6 00:20:18 2012 SOME indexes OR INDEX [sub]partitions OF TABLE ORCLLJ.OLD_000 have been marked unusable Fri Apr 6 00:20:19 2012 SOME indexes OR INDEX [sub]partitions OF TABLE ORCLLJ.OLD_000 have been marked unusable Fri Apr 6 00:20:19 2012 SOME indexes OR INDEX [sub]partitions OF TABLE ORCLLJ.OLD_000 have been marked unusable Fri Apr 6 00:20:20 2012 SOME indexes OR INDEX [sub]partitions OF TABLE ORCLLJ.OLD_000 have been marked unusable Fri Apr 6 00:20:20 2012 SOME indexes OR INDEX [sub]partitions OF TABLE ORCLLJ.OLD_000 have been marked unusable Fri Apr 6 00:20:20 2012 SOME indexes OR INDEX [sub]partitions OF TABLE ORCLLJ.OLD_000 have been marked unusable Fri Apr 6 00:20:21 2012 SOME indexes OR INDEX [sub]partitions OF TABLE ORCLLJ.OLD_000 have been marked unusable Fri Apr 6 00:20:21 2012 SOME indexes OR INDEX [sub]partitions OF TABLE ORCLLJ.OLD_000 have been marked unusable Fri Apr 6 00:20:22 2012 SOME indexes OR INDEX [sub]partitions OF TABLE ORCLLJ.OLD_000 have been marked unusable Fri Apr 6 00:20:22 2012 SOME indexes OR INDEX [sub]partitions OF TABLE ORCLLJ.OLD_000 have been marked unusable Fri Apr 6 00:20:23 2012 SOME indexes OR INDEX [sub]partitions OF TABLE ORCLLJ.OLD_000 have been marked unusable . . . Fri Apr 6 10:55:20 2012 SOME indexes OR INDEX [sub]partitions OF TABLE ORCLTJ.OLD_000 have been marked unusable Fri Apr 6 10:55:22 2012 SOME indexes OR INDEX [sub]partitions OF TABLE ORCLTJ.OLD_000 have been marked unusable Fri Apr 6 10:58:54 2012 Thread 1 advanced TO log SEQUENCE 75530 CURRENT log# 6 seq# 75530 mem# 0: +DATA/orcl/onlinelog/group_6.258.611597995 Fri Apr 6 11:20:34 2012 Thread 1 advanced TO log SEQUENCE 75531 CURRENT log# 2 seq# 75531 mem# 0: +DATA/orcl/onlinelog/group_2.260.611597981 . . . Fri Apr 6 16:53:40 2012 Thread 1 advanced TO log SEQUENCE 75543 CURRENT log# 2 seq# 75543 mem# 0: +DATA/orcl/onlinelog/group_2.260.611597981 Fri Apr 6 17:07:15 2012 Errors IN file /u01/app/oracle/admin/ORCL/bdump/orcl1_smon_15533.trc: ORA-00604: error occurred at recursive SQL level 1 ORA-04031: unable TO allocate 4120 bytes OF shared memory ("shared pool","select increment$,minvalue,m...","Typecheck","kgghteInit") Fri Apr 6 17:07:22 2012 Errors IN file /u01/app/oracle/admin/ORCL/bdump/orcl1_smon_15533.trc: ORA-00604: error occurred at recursive SQL level 1 ORA-04031: unable TO allocate 4120 bytes OF shared memory ("shared pool","select name,online$,contents...","Typecheck","kgghteInit") Fri Apr 6 17:07:23 2012 Errors IN file /u01/app/oracle/admin/ORCL/bdump/orcl1_smon_15533.trc: ORA-00604: error occurred at recursive SQL level 1 ORA-04031: unable TO allocate 4120 bytes OF shared memory ("shared pool","select name,online$,contents...","Typecheck","kgghteInit") Fri Apr 6 17:07:40 2012 Errors IN file /u01/app/oracle/admin/ORCL/bdump/orcl1_smon_15533.trc: ORA-00604: error occurred at recursive SQL level 1 ORA-04031: unable TO allocate 4120 bytes OF shared memory ("shared pool","select name,online$,contents...","Typecheck","kgghteInit") Fri Apr 6 17:07:41 2012 Errors IN file /u01/app/oracle/admin/ORCL/bdump/orcl1_smon_15533.trc: ORA-00604: error occurred at recursive SQL level 1 ORA-04031: unable TO allocate 4120 bytes OF shared memory ("shared pool","select name,online$,contents...","Typecheck","kgghteInit") Fri Apr 6 17:15:53 2012 Thread 1 advanced TO log SEQUENCE 75544 CURRENT log# 5 seq# 75544 mem# 0: +DATA/orcl/onlinelog/group_5.259.611597989 Fri Apr 6 17:34:28 2012 Thread 1 advanced TO log SEQUENCE 75545 CURRENT log# 1 seq# 75545 mem# 0: +DATA/orcl/onlinelog/group_1.261.611597975 . . . Sat Apr 7 03:00:22 2012 Thread 1 advanced TO log SEQUENCE 75561 CURRENT log# 1 seq# 75561 mem# 0: +DATA/orcl/onlinelog/group_1.261.611597975 Sat Apr 7 03:03:18 2012 WARNING: ran OUT OF shared pool FOR GES resource object. Sat Apr 7 03:03:49 2012 WARNING: ran OUT OF shared pool FOR GES resource object. Sat Apr 7 03:04:20 2012 WARNING: ran OUT OF shared pool FOR GES resource object. Sat Apr 7 03:04:51 2012 WARNING: ran OUT OF shared pool FOR GES resource object. Sat Apr 7 03:05:22 2012 WARNING: ran OUT OF shared pool FOR GES resource object. Sat Apr 7 03:05:54 2012 WARNING: ran OUT OF shared pool FOR GES resource object. Sat Apr 7 03:06:24 2012 WARNING: ran OUT OF shared pool FOR GES resource object. Sat Apr 7 03:07:42 2012 WARNING: ran OUT OF shared pool FOR GES resource object. Sat Apr 7 03:07:44 2012 Trace dumping IS performing id=[cdmp_20120407030744] Sat Apr 7 03:07:45 2012 Errors IN file /u01/app/oracle/admin/ORCL/bdump/orcl1_lmd0_15494.trc: ORA-00600: internal error code, arguments: [kjrpar:noalloc], [1], [385], [0x60000000000C09B8], [0xC0000002137539D8], [0xC000000213752748], [], [] Sat Apr 7 03:07:46 2012 Errors IN file /u01/app/oracle/admin/ORCL/bdump/orcl1_lmd0_15494.trc: ORA-00600: internal error code, arguments: [kjrpar:noalloc], [1], [385], [0x60000000000C09B8], [0xC0000002137539D8], [0xC000000213752748], [], [] Sat Apr 7 03:07:54 2012 Errors IN file /u01/app/oracle/admin/ORCL/bdump/orcl1_lmd0_15494.trc: ORA-00600: internal error code, arguments: [kjrpar:noalloc], [1], [385], [0x60000000000C09B8], [0xC0000002137539D8], [0xC000000213752748], [], [] Sat Apr 7 03:07:54 2012 LMD0: terminating instance due TO error 482 Sat Apr 7 03:07:54 2012 Shutting down instance (abort) License high water mark = 56 Sat Apr 7 03:07:58 2012 Instance TERMINATED BY LMD0, pid = 15494 Sat Apr 7 03:07:59 2012 Instance TERMINATED BY USER, pid = 8118 Sat Apr 7 03:08:05 2012 Starting ORACLE instance (normal) Sat Apr 7 03:08:05 2012 Specified VALUE OF sga_max_size IS too small, bumping TO 6627000320 LICENSE_MAX_SESSION = 0 LICENSE_SESSIONS_WARNING = 0 Interface TYPE 1 lan901 192.168.0.0 configured FROM OCR FOR USE AS a cluster interconnect Interface TYPE 1 lan900 198.5.1.0 configured FROM OCR FOR USE AS a public interface Picked latch-free SCN scheme 3 Autotune OF undo retention IS turned off. LICENSE_MAX_USERS = 0 SYS auditing IS disabled My pid IS 8415, priority IS -179, running WITH SCHED_NOAGE policy 8. ksdpec: called FOR event 13740 prior TO event GROUP initialization Starting up ORACLE RDBMS Version: 10.2.0.3.0. System parameters WITH non-DEFAULT VALUES: processes = 300 sessions = 335 timed_statistics = FALSE sga_max_size = 6627000320 lock_sga = FALSE shared_pool_size = 4294967296 large_pool_size = 67108864 java_pool_size = 33554432 streams_pool_size = 16777216 shared_pool_reserved_size= 263402290 spfile = +DATA/orcl/spfileorcl.ora sga_target = 0 control_files = +DATA/orcl/controlfile/current1.ora, +ORCL_DATA/orcl/controlfile/current2.ora db_block_size = 8192 db_cache_size = 2147483648 db_writer_processes = 2 compatible = 10.2.0.2.0 log_archive_dest_1 = LOCATION=/u01/app/oradata/orcl_1_arch1/arch2tape log_archive_dest_2 = LOCATION=/u01/app/oradata/orcl_1_arch2/arch2tape log_archive_format = ORCL_%t_%s_%r.ARC log_buffer = 52428800 log_checkpoint_timeout = 1800 db_file_multiblock_read_count= 8 cluster_database = TRUE cluster_database_instances= 2 db_create_file_dest = +ORCL_DATA db_create_online_log_dest_1= +DATA thread = 1 fast_start_mttr_target = 600 instance_number = 1 undo_management = AUTO undo_tablespace = UNDOTBS1 undo_retention = 10800 remote_login_passwordfile= EXCLUSIVE db_domain = dispatchers = (PROTOCOL=TCP)(DISPATCHERS=1) max_shared_servers = 32 max_dispatchers = 2 local_listener = remote_listener = LISTENERS_ORCL job_queue_processes = 10 background_dump_dest = /u01/app/oracle/admin/ORCL/bdump user_dump_dest = /u01/app/oracle/admin/ORCL/udump max_dump_file_size = 52428800 core_dump_dest = /u01/app/oracle/admin/ORCL/cdump audit_file_dest = /u01/app/oracle/admin/ORCL/adump hpux_sched_noage = 178 open_links = 30 db_name = ORCL open_cursors = 300 star_transformation_enabled= TRUE pga_aggregate_target = 1073741824 Cluster communication IS configured TO USE the following interface(s) FOR this instance 192.168.0.111 Sat Apr 7 03:08:10 2012 cluster interconnect IPC version:Oracle UDP/IP (generic) IPC Vendor 1 proto 2 PMON started WITH pid=2, OS id=8596 DIAG started WITH pid=4, OS id=8606 PSP0 started WITH pid=6, OS id=8608 LMON started WITH pid=8, OS id=8614 LMD0 started WITH pid=10, OS id=8616 LMS0 started WITH pid=12, OS id=8618 LMS1 started WITH pid=3, OS id=8620 MMAN started WITH pid=14, OS id=8623 DBW0 started WITH pid=16, OS id=8625 DBW1 started WITH pid=5, OS id=8644 LGWR started WITH pid=7, OS id=8657 CKPT started WITH pid=18, OS id=8659 SMON started WITH pid=20, OS id=8661 RECO started WITH pid=22, OS id=8663 CJQ0 started WITH pid=24, OS id=8665 MMON started WITH pid=26, OS id=8667 Sat Apr 7 03:08:12 2012 starting up 1 dispatcher(s) FOR network address '(ADDRESS=(PARTIAL=YES)(PROTOCOL=TCP))'... MMNL started WITH pid=28, OS id=8669 |
从节点1上的告警日志基本上就可以判断出问题的原因了。显然在问题发生的前一天夜里,执行了大量的分区维护操作,致使大量的分区索引失效。这个分区维护的操作导致大量的分区源数据被加载到共享池中,随后不久,Oracle就出现了共享池不足的ORA-4031错误。在节点2出现ORA-569错误的同时,节点1出现了WARNING: ran out of shared pool for GES resource object信息,说明这时节点1上共享池不足导致了节点2上全局队列失败。
而后节点1由于资源不足导致了ORA-600[kjrpar:noalloc]错误的出现,并导致了实例的崩溃,关于这个错误的更详细描述可以参考:https://yangtingkun.net/?p=620。
这两个错误实际上还是由于RAC另外节点上共享池资源不足导致的,要避免这两个错误,还是要解决ORA-4031的错误。