Instance Restarts Caused by Cleared IP Addresses

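A customer's Oracle 10.2.0.4 RAC environment on Solaris 10 suddenly experienced instance restarts.
The database ran normally until about 3 p.m., after which the two nodes restarted one after the other, and the instance on one of them failed to start automatically. The alert logs of both instances show that, before the restart, both nodes reported a clear ORA-27504 error: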

Wed Apr 10 15:00:05 2013
Errors IN file /oracle/admin/orcl/udump/orcl1_ora_10997.trc:
ORA-00603: ORACLE server SESSION TERMINATED BY fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:if_not_found failed WITH STATUS: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: skgxpvaddr9
ORA-27303: additional information: requested interface 192.168.168.3 NOT found. CHECK output FROM ifconfig command
Wed Apr 10 15:00:06 2013
Errors IN file /oracle/admin/orcl/udump/orcl1_ora_11007.trc:
ORA-00603: ORACLE server SESSION TERMINATED BY fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:if_not_found failed WITH STATUS: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: skgxpvaddr9
ORA-27303: additional information: requested interface 192.168.168.3 NOT found. CHECK output FROM ifconfig command
Wed Apr 10 15:00:06 2013
Errors IN file /oracle/admin/orcl/udump/orcl1_ora_11009.trc:
ORA-00603: ORACLE server SESSION TERMINATED BY fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:if_not_found failed WITH STATUS: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: skgxpvaddr9
ORA-27303: additional information: requested interface 192.168.168.3 NOT found. CHECK output FROM ifconfig command
Wed Apr 10 15:00:06 2013
Errors IN file /oracle/admin/orcl/udump/orcl1_ora_11011.trc:
ORA-00603: ORACLE server SESSION TERMINATED BY fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:if_not_found failed WITH STATUS: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: skgxpvaddr9
ORA-27303: additional information: requested interface 192.168.168.3 NOT found. CHECK output FROM ifconfig command
.
.
.
Wed Apr 10 15:07:08 2013
IPC Send timeout detected.Sender: ospid 25688
Receiver: inst 2 binc 427282 ospid 11838
Wed Apr 10 15:07:08 2013
IPC Send timeout detected.Sender: ospid 25724
Wed Apr 10 15:07:08 2013
IPC Send timeout detected.Sender: ospid 25680
Receiver: inst 2 binc 431591 ospid 11822
Receiver: inst 2 binc 431795 ospid 11874
Wed Apr 10 15:07:08 2013
IPC Send timeout detected.Sender: ospid 25684
Receiver: inst 2 binc 428985 ospid 11826
Wed Apr 10 15:07:08 2013
IPC Send timeout detected.Sender: ospid 25708
Receiver: inst 2 binc 430048 ospid 11858
Wed Apr 10 15:07:09 2013
ospid 25678: network interface WITH IP address 192.168.168.3 no longer operational
requested interface 192.168.168.3 NOT found. CHECK output FROM ifconfig command
Wed Apr 10 15:07:35 2013
IPC Send timeout TO 1.1 inc 4 FOR msg TYPE 44 FROM opid 7
Wed Apr 10 15:07:35 2013
IPC Send timeout TO 1.12 inc 4 FOR msg TYPE 44 FROM opid 21
Wed Apr 10 15:07:35 2013
IPC Send timeout TO 1.2 inc 4 FOR msg TYPE 44 FROM opid 8
Wed Apr 10 15:07:35 2013
IPC Send timeout TO 1.3 inc 4 FOR msg TYPE 44 FROM opid 10
Wed Apr 10 15:07:35 2013
IPC Send timeout TO 1.8 inc 4 FOR msg TYPE 44 FROM opid 15
Wed Apr 10 15:08:13 2013
ospid 25678: network interface WITH IP address 192.168.168.3 no longer operational
requested interface 192.168.168.3 NOT found. CHECK output FROM ifconfig command
Wed Apr 10 15:08:16 2013
IPC Send timeout detected.Sender: ospid 25748
Receiver: inst 2 binc 430164 ospid 11890
.
.
.
Wed Apr 10 15:08:53 2013
IPC Send timeout TO 1.13 inc 4 FOR msg TYPE 36 FROM opid 176
Wed Apr 10 15:08:53 2013
IPC Send timeout TO 1.15 inc 4 FOR msg TYPE 36 FROM opid 167
Wed Apr 10 15:08:57 2013
IPC Send timeout TO 1.4 inc 4 FOR msg TYPE 32 FROM opid 180
.
.
.
Wed Apr 10 15:15:51 2013
Evicting instance 2 FROM cluster
Wed Apr 10 15:16:09 2013
ospid 25678: network interface WITH IP address 192.168.168.3 no longer operational
requested interface 192.168.168.3 NOT found. CHECK output FROM ifconfig command
Wed Apr 10 15:16:40 2013
Waiting FOR instances TO leave: 
2 
Wed Apr 10 15:17:00 2013
Waiting FOR instances TO leave: 
2 
Wed Apr 10 15:17:09 2013
ospid 25678: network interface WITH IP address 192.168.168.3 no longer operational
requested interface 192.168.168.3 NOT found. CHECK output FROM ifconfig command
Wed Apr 10 15:17:20 2013
Waiting FOR instances TO leave: 
2

The errors on node 2 are similar:

.
.
.
Wed Apr 10 15:19:07 2013
Errors IN file /oracle/admin/orcl/udump/orcl2_ora_14065.trc:
ORA-00603: ORACLE server SESSION TERMINATED BY fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:if_not_found failed WITH STATUS: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: skgxpvaddr9
ORA-27303: additional information: requested interface 192.168.168.4 NOT found. CHECK output FROM ifconfig command
Wed Apr 10 15:19:08 2013
Errors IN file /oracle/admin/orcl/udump/orcl2_ora_14057.trc:
ORA-00603: ORACLE server SESSION TERMINATED BY fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:if_not_found failed WITH STATUS: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: skgxpvaddr9
ORA-27303: additional information: requested interface 192.168.168.4 NOT found. CHECK output FROM ifconfig command
Wed Apr 10 15:19:46 2013
ospid 11820: network interface WITH IP address 192.168.168.4 no longer operational
requested interface 192.168.168.4 NOT found. CHECK output FROM ifconfig command
Wed Apr 10 15:20:46 2013
ospid 11820: network interface WITH IP address 192.168.168.4 no longer operational
requested interface 192.168.168.4 NOT found. CHECK output FROM ifconfig command
Wed Apr 10 15:20:55 2013
Errors IN file /oracle/admin/orcl/bdump/orcl2_lmon_11818.trc:
ORA-29740: evicted BY member 0, GROUP incarnation 6
Wed Apr 10 15:20:55 2013
LMON: terminating instance due TO error 29740
Wed Apr 10 15:20:55 2013
Errors IN file /oracle/admin/orcl/bdump/orcl2_smon_11924.trc:
ORA-29740: evicted BY member , GROUP incarnation 
Wed Apr 10 15:20:55 2013
Errors IN file /oracle/admin/orcl/bdump/orcl2_lmse_11886.trc:
ORA-29740: evicted BY member , GROUP incarnation 
Wed Apr 10 16:11:37 2013
Starting ORACLE instance (normal)
Wed Apr 10 16:11:45 2013
sculkget: failed TO LOCK /oracle/products/10.2/db_1/dbs/lkinstorcl2 exclusive
Wed Apr 10 16:11:45 2013
sculkget: LOCK held BY PID: 6912
Wed Apr 10 16:11:45 2013
Oracle Instance Startup operation failed. Another process may be attempting TO startup OR shutdown this Instance.
Wed Apr 10 16:11:45 2013
Failed TO acquire instance startup/shutdown serialization primitive
Wed Apr 10 16:11:50 2013
sculkget: failed TO LOCK /oracle/products/10.2/db_1/dbs/lkinstorcl2 exclusive
Wed Apr 10 16:11:50 2013
sculkget: LOCK held BY PID: 6912
Wed Apr 10 16:11:50 2013
Oracle Instance Startup operation failed. Another process may be attempting TO startup OR shutdown this Instance.
Wed Apr 10 16:11:50 2013
Failed TO acquire instance startup/shutdown serialization primitive
Wed Apr 10 16:11:54 2013
sculkget: failed TO LOCK /oracle/products/10.2/db_1/dbs/lkinstorcl2 exclusive
Wed Apr 10 16:11:54 2013
sculkget: LOCK held BY PID: 6912
Wed Apr 10 16:11:54 2013
Oracle Instance Startup operation failed. Another process may be attempting TO startup OR shutdown this Instance.
Wed Apr 10 16:11:54 2013
Failed TO acquire instance startup/shutdown serialization primitive
Wed Apr 10 16:12:29 2013
sculkget: failed TO LOCK /oracle/products/10.2/db_1/dbs/lkinstorcl2 exclusive
Wed Apr 10 16:12:29 2013
sculkget: LOCK held BY PID: 6912
Wed Apr 10 16:12:29 2013
Oracle Instance Startup operation failed. Another process may be attempting TO startup OR shutdown this Instance.
Wed Apr 10 16:12:29 2013
Failed TO acquire instance startup/shutdown serialization primitive
Wed Apr 10 16:12:47 2013
sculkget: failed TO LOCK /oracle/products/10.2/db_1/dbs/lkinstorcl2 exclusive
Wed Apr 10 16:12:47 2013
sculkget: LOCK held BY PID: 6912
Wed Apr 10 16:12:47 2013
Oracle Instance Startup operation failed. Another process may be attempting TO startup OR shutdown this Instance.
Wed Apr 10 16:12:47 2013
Failed TO acquire instance startup/shutdown serialization primitive
Wed Apr 10 16:12:52 2013
sculkget: failed TO LOCK /oracle/products/10.2/db_1/dbs/lkinstorcl2 exclusive
Wed Apr 10 16:12:52 2013
sculkget: LOCK held BY PID: 6912
Wed Apr 10 16:12:52 2013
Oracle Instance Startup operation failed. Another process may be attempting TO startup OR shutdown this Instance.
Wed Apr 10 16:12:52 2013
Failed TO acquire instance startup/shutdown serialization primitive
Wed Apr 10 16:12:56 2013
sculkget: failed TO LOCK /oracle/products/10.2/db_1/dbs/lkinstorcl2 exclusive
Wed Apr 10 16:12:56 2013
sculkget: LOCK held BY PID: 6912
Wed Apr 10 16:12:56 2013
Oracle Instance Startup operation failed. Another process may be attempting TO startup OR shutdown this Instance.
Wed Apr 10 16:12:56 2013
Failed TO acquire instance startup/shutdown serialization primitive

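The cause is easy to work out from the error messages: the IP address on node 2 was changed, which broke the interconnect heartbeat. Node 1 tried to evict node 2 from the cluster, but because it could not communicate with node 2, it could only wait for node 2 to restart.
Checking the operating system log on node 2: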
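When an alert log is long, it can help to pull out the addresses that Oracle reports as missing rather than reading every entry. The following is a minimal sketch, not part of the original investigation: the function name `missing_interfaces` and the sample text are illustrative, and the regular expression only assumes the "requested interface ... not found" wording seen in the excerpts above.

```python
import re
from collections import Counter

# Matches the alert-log line that names the missing interconnect address.
# Case-insensitive, because the wording may appear in mixed case.
MISSING_IF = re.compile(
    r"requested interface (\d{1,3}(?:\.\d{1,3}){3}) not found", re.IGNORECASE
)

def missing_interfaces(log_text):
    """Count how often each IP address is reported as a missing interface."""
    return Counter(m.group(1) for m in MISSING_IF.finditer(log_text))

# Sample lines copied from the node 1 excerpt above.
sample = """\
ORA-27303: additional information: requested interface 192.168.168.3 NOT found. CHECK output FROM ifconfig command
ospid 25678: network interface WITH IP address 192.168.168.3 no longer operational
requested interface 192.168.168.3 NOT found. CHECK output FROM ifconfig command
"""

print(missing_interfaces(sample))
```

Running the same function over each node's alert log gives a quick view of which private addresses disappeared on which node.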

Apr 10 15:00:04 bj-sst-xhm-3f2-m5k-02 ip: [ID 482227 kern.notice] ip_arp_done: init failed
Apr 10 15:07:37 bj-sst-xhm-3f2-m5k-02 Had[4135]: [ID 702911 daemon.notice] VCS CRITICAL V-16-1-50086 CPU usage ON bj-sst-xhm-3f2-m5k-02 IS 92%
Apr 10 15:18:41 bj-sst-xhm-3f2-m5k-02 sshd[13485]: [ID 800047 auth.error] error: Failed TO allocate internet-DOMAIN X11 display socket.

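The ip_arp_done: init failed message at 15:00:04 indicates that the network interface was configured using hostname information and that the host's IP address was modified online.
Finally, the shell history confirmed that someone had logged in as root and intended to run ifconfig -a6 to check the IPv6 addresses, but mistyped it as ifconfig -a 6, with an extra space between a and 6. As a result, every IP address on the host was set to 0.0.0.0, which caused the errors above.
This shows once again that for a privileged user such as root, any carelessness can have very serious consequences.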
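One way to check that the operating system message and the database errors belong to the same event is to compare their timestamps. The sketch below does this for the syslog line above and the first ORA-27504 entry shown in the alert-log excerpts. The helper names are hypothetical, the year is passed in by hand because syslog lines omit it, and the comparison assumes the hosts' clocks are roughly in sync.

```python
from datetime import datetime

def parse_alert_ts(line):
    """Parse an alert-log timestamp such as 'Wed Apr 10 15:00:05 2013'."""
    return datetime.strptime(line.strip(), "%a %b %d %H:%M:%S %Y")

def parse_syslog_ts(line, year):
    """Parse the leading 'Apr 10 15:00:04' of a syslog line (no year in syslog)."""
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")

syslog_line = (
    "Apr 10 15:00:04 bj-sst-xhm-3f2-m5k-02 ip: "
    "[ID 482227 kern.notice] ip_arp_done: init failed"
)
alert_line = "Wed Apr 10 15:00:05 2013"

# Time between the kernel message and the first ORA-27504 entry.
gap = parse_alert_ts(alert_line) - parse_syslog_ts(syslog_line, 2013)
print(gap.total_seconds())
```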
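A simple guard can catch this particular typo before it reaches the shell. The sketch below is a simplified heuristic rather than a complete parser of ifconfig syntax: the function name `is_risky_ifconfig` is hypothetical, and treating `inet` and `inet6` as harmless address-family arguments is an assumption about how the command is normally used.

```python
import shlex

# Address-family keywords that only filter the listing (assumption).
ADDRESS_FAMILIES = {"inet", "inet6"}

def is_risky_ifconfig(command):
    """Return True if `ifconfig -a` is followed by extra arguments.

    Extra arguments after -a are applied to every interface, which is
    what turned the intended `ifconfig -a6` into a destructive command.
    """
    tokens = shlex.split(command)
    if not tokens or tokens[0].rsplit("/", 1)[-1] != "ifconfig":
        return False
    args = tokens[1:]
    if args[:1] != ["-a"]:
        return False
    extras = [a for a in args[1:] if a not in ADDRESS_FAMILIES]
    return bool(extras)

print(is_risky_ifconfig("ifconfig -a 6"))  # the mistyped command
print(is_risky_ifconfig("ifconfig -a6"))   # the intended command
```

A check like this could sit in a wrapper script or a review step for commands run as root. It does not replace care at the prompt, but it makes this specific slip harder to repeat.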
