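A customer's Oracle 10.2.0.4 RAC environment on Solaris 10 suddenly had its instances restart.
The database ran normally until about 3 p.m., after which both nodes restarted in turn, and the instance on one of them failed to start automatically. The alert logs of the two instances show that, before the restart, both nodes logged conspicuous ORA-27504 errors: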
Wed Apr 10 15:00:05 2013
Errors in file /oracle/admin/orcl/udump/orcl1_ora_10997.trc:
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:if_not_found failed with status: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: skgxpvaddr9
ORA-27303: additional information: requested interface 192.168.168.3 not found. Check output from ifconfig command
Wed Apr 10 15:00:06 2013
Errors in file /oracle/admin/orcl/udump/orcl1_ora_11007.trc:
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:if_not_found failed with status: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: skgxpvaddr9
ORA-27303: additional information: requested interface 192.168.168.3 not found. Check output from ifconfig command
Wed Apr 10 15:00:06 2013
Errors in file /oracle/admin/orcl/udump/orcl1_ora_11009.trc:
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:if_not_found failed with status: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: skgxpvaddr9
ORA-27303: additional information: requested interface 192.168.168.3 not found. Check output from ifconfig command
Wed Apr 10 15:00:06 2013
Errors in file /oracle/admin/orcl/udump/orcl1_ora_11011.trc:
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:if_not_found failed with status: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: skgxpvaddr9
ORA-27303: additional information: requested interface 192.168.168.3 not found. Check output from ifconfig command
. . .
Wed Apr 10 15:07:08 2013
IPC Send timeout detected.Sender: ospid 25688
Receiver: inst 2 binc 427282 ospid 11838
Wed Apr 10 15:07:08 2013
IPC Send timeout detected.Sender: ospid 25724
Wed Apr 10 15:07:08 2013
IPC Send timeout detected.Sender: ospid 25680
Receiver: inst 2 binc 431591 ospid 11822
Receiver: inst 2 binc 431795 ospid 11874
Wed Apr 10 15:07:08 2013
IPC Send timeout detected.Sender: ospid 25684
Receiver: inst 2 binc 428985 ospid 11826
Wed Apr 10 15:07:08 2013
IPC Send timeout detected.Sender: ospid 25708
Receiver: inst 2 binc 430048 ospid 11858
Wed Apr 10 15:07:09 2013
ospid 25678: network interface with IP address 192.168.168.3 no longer operational
requested interface 192.168.168.3 not found. Check output from ifconfig command
Wed Apr 10 15:07:35 2013
IPC Send timeout to 1.1 inc 4 for msg type 44 from opid 7
Wed Apr 10 15:07:35 2013
IPC Send timeout to 1.12 inc 4 for msg type 44 from opid 21
Wed Apr 10 15:07:35 2013
IPC Send timeout to 1.2 inc 4 for msg type 44 from opid 8
Wed Apr 10 15:07:35 2013
IPC Send timeout to 1.3 inc 4 for msg type 44 from opid 10
Wed Apr 10 15:07:35 2013
IPC Send timeout to 1.8 inc 4 for msg type 44 from opid 15
Wed Apr 10 15:08:13 2013
ospid 25678: network interface with IP address 192.168.168.3 no longer operational
requested interface 192.168.168.3 not found. Check output from ifconfig command
Wed Apr 10 15:08:16 2013
IPC Send timeout detected.Sender: ospid 25748
Receiver: inst 2 binc 430164 ospid 11890
. . .
Wed Apr 10 15:08:53 2013
IPC Send timeout to 1.13 inc 4 for msg type 36 from opid 176
Wed Apr 10 15:08:53 2013
IPC Send timeout to 1.15 inc 4 for msg type 36 from opid 167
Wed Apr 10 15:08:57 2013
IPC Send timeout to 1.4 inc 4 for msg type 32 from opid 180
. . .
Wed Apr 10 15:15:51 2013
Evicting instance 2 from cluster
Wed Apr 10 15:16:09 2013
ospid 25678: network interface with IP address 192.168.168.3 no longer operational
requested interface 192.168.168.3 not found. Check output from ifconfig command
Wed Apr 10 15:16:40 2013
Waiting for instances to leave: 2
Wed Apr 10 15:17:00 2013
Waiting for instances to leave: 2
Wed Apr 10 15:17:09 2013
ospid 25678: network interface with IP address 192.168.168.3 no longer operational
requested interface 192.168.168.3 not found. Check output from ifconfig command
Wed Apr 10 15:17:20 2013
Waiting for instances to leave: 2
The errors on node 2 are similar:
. . .
Wed Apr 10 15:19:07 2013
Errors in file /oracle/admin/orcl/udump/orcl2_ora_14065.trc:
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:if_not_found failed with status: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: skgxpvaddr9
ORA-27303: additional information: requested interface 192.168.168.4 not found. Check output from ifconfig command
Wed Apr 10 15:19:08 2013
Errors in file /oracle/admin/orcl/udump/orcl2_ora_14057.trc:
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:if_not_found failed with status: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: skgxpvaddr9
ORA-27303: additional information: requested interface 192.168.168.4 not found. Check output from ifconfig command
Wed Apr 10 15:19:46 2013
ospid 11820: network interface with IP address 192.168.168.4 no longer operational
requested interface 192.168.168.4 not found. Check output from ifconfig command
Wed Apr 10 15:20:46 2013
ospid 11820: network interface with IP address 192.168.168.4 no longer operational
requested interface 192.168.168.4 not found. Check output from ifconfig command
Wed Apr 10 15:20:55 2013
Errors in file /oracle/admin/orcl/bdump/orcl2_lmon_11818.trc:
ORA-29740: evicted by member 0, group incarnation 6
Wed Apr 10 15:20:55 2013
LMON: terminating instance due to error 29740
Wed Apr 10 15:20:55 2013
Errors in file /oracle/admin/orcl/bdump/orcl2_smon_11924.trc:
ORA-29740: evicted by member , group incarnation
Wed Apr 10 15:20:55 2013
Errors in file /oracle/admin/orcl/bdump/orcl2_lmse_11886.trc:
ORA-29740: evicted by member , group incarnation
Wed Apr 10 16:11:37 2013
Starting ORACLE instance (normal)
Wed Apr 10 16:11:45 2013
sculkget: failed to lock /oracle/products/10.2/db_1/dbs/lkinstorcl2 exclusive
Wed Apr 10 16:11:45 2013
sculkget: lock held by PID: 6912
Wed Apr 10 16:11:45 2013
Oracle Instance Startup operation failed. Another process may be attempting to startup or shutdown this Instance.
Wed Apr 10 16:11:45 2013
Failed to acquire instance startup/shutdown serialization primitive
Wed Apr 10 16:11:50 2013
sculkget: failed to lock /oracle/products/10.2/db_1/dbs/lkinstorcl2 exclusive
Wed Apr 10 16:11:50 2013
sculkget: lock held by PID: 6912
Wed Apr 10 16:11:50 2013
Oracle Instance Startup operation failed. Another process may be attempting to startup or shutdown this Instance.
Wed Apr 10 16:11:50 2013
Failed to acquire instance startup/shutdown serialization primitive
Wed Apr 10 16:11:54 2013
sculkget: failed to lock /oracle/products/10.2/db_1/dbs/lkinstorcl2 exclusive
Wed Apr 10 16:11:54 2013
sculkget: lock held by PID: 6912
Wed Apr 10 16:11:54 2013
Oracle Instance Startup operation failed. Another process may be attempting to startup or shutdown this Instance.
Wed Apr 10 16:11:54 2013
Failed to acquire instance startup/shutdown serialization primitive
Wed Apr 10 16:12:29 2013
sculkget: failed to lock /oracle/products/10.2/db_1/dbs/lkinstorcl2 exclusive
Wed Apr 10 16:12:29 2013
sculkget: lock held by PID: 6912
Wed Apr 10 16:12:29 2013
Oracle Instance Startup operation failed. Another process may be attempting to startup or shutdown this Instance.
Wed Apr 10 16:12:29 2013
Failed to acquire instance startup/shutdown serialization primitive
Wed Apr 10 16:12:47 2013
sculkget: failed to lock /oracle/products/10.2/db_1/dbs/lkinstorcl2 exclusive
Wed Apr 10 16:12:47 2013
sculkget: lock held by PID: 6912
Wed Apr 10 16:12:47 2013
Oracle Instance Startup operation failed. Another process may be attempting to startup or shutdown this Instance.
Wed Apr 10 16:12:47 2013
Failed to acquire instance startup/shutdown serialization primitive
Wed Apr 10 16:12:52 2013
sculkget: failed to lock /oracle/products/10.2/db_1/dbs/lkinstorcl2 exclusive
Wed Apr 10 16:12:52 2013
sculkget: lock held by PID: 6912
Wed Apr 10 16:12:52 2013
Oracle Instance Startup operation failed. Another process may be attempting to startup or shutdown this Instance.
Wed Apr 10 16:12:52 2013
Failed to acquire instance startup/shutdown serialization primitive
Wed Apr 10 16:12:56 2013
sculkget: failed to lock /oracle/products/10.2/db_1/dbs/lkinstorcl2 exclusive
Wed Apr 10 16:12:56 2013
sculkget: lock held by PID: 6912
Wed Apr 10 16:12:56 2013
Oracle Instance Startup operation failed. Another process may be attempting to startup or shutdown this Instance.
Wed Apr 10 16:12:56 2013
Failed to acquire instance startup/shutdown serialization primitive
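The cause is easy to work out from these error messages: the IP address on node 2 was changed, which broke the interconnect heartbeat. Node 1 then tried to evict node 2 from the cluster, but because it could not communicate with node 2, it could only wait for node 2 to restart.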
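When an alert log is long, it can help to pull out just the moment each interconnect address first went missing. The following is a minimal Python sketch, not an Oracle tool: it assumes the alert log is in its usual on-disk layout (a timestamp line followed by message lines), and the function name `first_missing_interface` and the abridged `SAMPLE` excerpt are illustrative only.

```python
import re

# A 10g-style alert log timestamp line, e.g. "Wed Apr 10 15:00:05 2013".
TIMESTAMP = re.compile(r"^[A-Z][a-z]{2} [A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2} \d{4}$")
# The ORA-27303 detail text; matched case-insensitively so that
# re-cased copies of the log still match.
MISSING_IF = re.compile(
    r"requested interface (\d{1,3}(?:\.\d{1,3}){3}) not found", re.IGNORECASE
)


def first_missing_interface(lines):
    """Map each missing interface IP to the first timestamp that reported it."""
    first_seen = {}
    current_ts = None
    for raw in lines:
        line = raw.strip()
        if TIMESTAMP.match(line):
            current_ts = line  # remember the most recent timestamp line
            continue
        match = MISSING_IF.search(line)
        if match and match.group(1) not in first_seen:
            first_seen[match.group(1)] = current_ts
    return first_seen


# Abridged excerpt of the node 1 alert log, for illustration only.
SAMPLE = """\
Wed Apr 10 15:00:05 2013
Errors in file /oracle/admin/orcl/udump/orcl1_ora_10997.trc:
ORA-27303: additional information: requested interface 192.168.168.3 not found. Check output from ifconfig command
Wed Apr 10 15:07:09 2013
ospid 25678: network interface with IP address 192.168.168.3 no longer operational
requested interface 192.168.168.3 not found. Check output from ifconfig command
""".splitlines()

if __name__ == "__main__":
    for ip, ts in first_missing_interface(SAMPLE).items():
        print(ip, "first reported missing at", ts)
```

The first timestamp is the useful one here, because it can be compared directly with the operating system log to see what happened on the host at that moment.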
Checking the operating system log on node 2:
Apr 10 15:00:04 bj-sst-xhm-3f2-m5k-02 ip: [ID 482227 kern.notice] ip_arp_done: init failed
Apr 10 15:07:37 bj-sst-xhm-3f2-m5k-02 Had[4135]: [ID 702911 daemon.notice] VCS CRITICAL V-16-1-50086 CPU usage on bj-sst-xhm-3f2-m5k-02 is 92%
Apr 10 15:18:41 bj-sst-xhm-3f2-m5k-02 sshd[13485]: [ID 800047 auth.error] error: Failed to allocate internet-domain X11 display socket.
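The ip_arp_done: init failed message logged at 15:00:04 indicates that the network interface had been configured using the host name and that the host's IP address was changed online.
The shell history finally confirmed it: someone logged in as root intended to run ifconfig -a6 to check the IPv6 addresses, but mistyped it as ifconfig -a 6, with an extra space between the a and the 6. As a result, every IP address on the host was set to 0.0.0.0, which produced the errors above.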
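The two commands differ by a single space, which is exactly what makes the mistake hard to spot before pressing Enter. As a rough illustration of a pre-flight check, here is a Python sketch that classifies an ifconfig argument list as display-only or not. It rests on a deliberately simplified assumption about the argument grammar: the first token selects the interfaces (a flag group such as -a6, or an interface name), and any later token other than an address family keyword is treated as something that may change configuration. The helper name `is_display_only` is hypothetical and not part of any existing tool.

```python
# Address family keywords that, in this simplified model, only narrow what
# is displayed. Treat this set as an assumption, not a full grammar.
ADDRESS_FAMILIES = {"inet", "inet6"}


def is_display_only(args):
    """Return True if the ifconfig arguments look like a pure display request.

    Simplified model: args[0] selects the interfaces; every token after it
    must be an address family keyword, otherwise the command is flagged.
    """
    rest = args[1:]
    return all(token in ADDRESS_FAMILIES for token in rest)


if __name__ == "__main__":
    for cmd in ("-a6", "-a 6"):
        safe = is_display_only(cmd.split())
        verdict = "display only" if safe else "MAY CHANGE CONFIGURATION"
        print(f"ifconfig {cmd}: {verdict}")
```

Under this model the intended -a6 passes, while -a 6 is flagged because the stray 6 lands in the position where an address would be given. A check like this errs on the side of flagging, which is the safer direction for a root shell.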
This shows once again that for a privileged user such as root, any carelessness can have very serious consequences.