System hang caused by delayed password verification

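Yet another problem caused by an 11g new feature.
I looked into this feature a long time ago and have run into similar problems at other customer sites. Starting with 11g, if a user tries to log in to the database with a wrong password, the wait imposed before each verification grows as the number of failed logins increases: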

SQL> SET TIME ON
18:30:54 SQL> 
18:30:58 SQL> conn test/test
Connected.
18:31:25 SQL> 
18:31:25 SQL> conn test/a
conn test/a
conn test/a
conn test/a
conn test/a
conn test/a
conn test/a
conn test/test
conn test/a
ERROR:
ORA-01017: invalid username/password; logon denied
Warning: You are no longer connected TO ORACLE.
18:31:26 SQL> ERROR:
ORA-01017: invalid username/password; logon denied
18:31:26 SQL> ERROR:
ORA-01017: invalid username/password; logon denied
18:31:26 SQL> ERROR:
ORA-01017: invalid username/password; logon denied
18:31:27 SQL> ERROR:
ORA-01017: invalid username/password; logon denied
18:31:29 SQL> ERROR:
ORA-01017: invalid username/password; logon denied
18:31:32 SQL> ERROR:
ORA-01017: invalid username/password; logon denied
18:31:36 SQL> Connected.
18:31:36 SQL> ERROR:
ORA-01017: invalid username/password; logon denied
Warning: You are no longer connected TO ORACLE.
18:31:36 SQL>

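As the transcript shows, the first three failed logins return almost immediately, and after that each further attempt waits about one second longer than the previous one (1, 2, 3 and then 4 seconds here). Even if the correct password is supplied at that point, the session still waits N seconds before it is verified. Once verification succeeds, however, the failure count is reset to zero and later failed logins are counted from scratch.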
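The delays can be read straight off the prompt timestamps in the transcript above. The short Python sketch below does only that arithmetic; the list of timestamps is copied from the prompts printed after each of the seven failed logins, and the helper name gaps_in_seconds is just an illustrative choice.

```python
from datetime import datetime

# Prompt timestamps printed after each of the seven failed logins in the transcript.
PROMPTS = ["18:31:26", "18:31:26", "18:31:26", "18:31:27",
           "18:31:29", "18:31:32", "18:31:36"]


def gaps_in_seconds(stamps):
    """Return the number of seconds between consecutive HH:MM:SS timestamps."""
    times = [datetime.strptime(s, "%H:%M:%S") for s in stamps]
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(times, times[1:])]


print(gaps_in_seconds(PROMPTS))  # [0, 0, 1, 2, 3, 4]
```

Each gap is the time the next failed login took, so the fourth to seventh attempts waited roughly 1, 2, 3 and 4 seconds.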
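That, however, is only a single session failing to log in. With two sessions failing at the same time, the verification delay quickly reaches the 10-second or 20-second range. If a large number of connections use the wrong password at the same time, logins for that user are effectively hung.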
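A customer's database ran into exactly this situation. The database was a three-node 11.2.0.3 RAC. The session count on each of the three nodes was close to the limit of 3000 set by the SESSIONS parameter, and the alert log was already reporting ORA-20 errors. Because the customer's system has only one key application user, almost no session could log in to the database normally. Inside the database, a large number of sessions showed NULL for the user name, EVENT and PROGRAM, which means they had not yet completed verification and logged in. Meanwhile CPU usage on the hosts was not high, and the processes already connected to the database were working normally. Logging in as SYSTEM or another user was quick. All of this indicates that one or more middleware servers were connecting with a wrong password, and that the delayed password verification policy was causing every subsequent connection for that user to hang.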
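A rough way to see why the session limit fills up is Little's law: the number of logins still waiting is about the arrival rate multiplied by the time each one waits. The Python sketch below is only that back-of-the-envelope estimate. The limit of 3000 comes from the incident; the retry rate of 50 attempts per second is a made-up figure for illustration, not something measured at the customer site.

```python
def pending_sessions(attempts_per_second, delay_seconds):
    """Little's law: sessions in flight = arrival rate * time spent waiting."""
    return attempts_per_second * delay_seconds


def delay_to_exhaust(session_limit, attempts_per_second):
    """Verification delay at which pending logins alone reach the session limit."""
    return session_limit / attempts_per_second


SESSIONS_LIMIT = 3000  # SESSIONS setting per node, from the incident
RETRY_RATE = 50        # hypothetical middleware retry rate (attempts per second)

print(pending_sessions(RETRY_RATE, 10))              # 500
print(delay_to_exhaust(SESSIONS_LIMIT, RETRY_RATE))  # 60.0
```

Under these assumed numbers, a one-minute verification delay would be enough for pending logins to occupy every session slot on a node.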
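Any new feature that improves performance or functionality also brings its own bugs. This security measure can clearly cause performance problems during verification, and it can even become a means of attacking the database.
My earlier posts on this topic never gave a way to switch delayed password verification off completely. One of Oracle's great strengths is that almost every function and feature has a corresponding switch, and setting event 28401 disables delayed password verification: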

SQL> ALTER SYSTEM SET EVENT = '28401 TRACE NAME CONTEXT FOREVER, LEVEL 1' SCOPE = SPFILE;

After setting the event, restart the database for it to take effect.


Instance restart caused by cleared IP addresses

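A customer's 10.2.0.4 RAC for Solaris 10 environment suddenly had its instances restart.
The database ran normally until about 3 p.m., after which the two nodes restarted one after the other, and the instance on one of them failed to start automatically. The alert logs of both instances show clear ORA-27504 errors before the restart: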

Wed Apr 10 15:00:05 2013
Errors IN file /oracle/admin/orcl/udump/orcl1_ora_10997.trc:
ORA-00603: ORACLE server SESSION TERMINATED BY fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:if_not_found failed WITH STATUS: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: skgxpvaddr9
ORA-27303: additional information: requested interface 192.168.168.3 NOT found. CHECK output FROM ifconfig command
Wed Apr 10 15:00:06 2013
Errors IN file /oracle/admin/orcl/udump/orcl1_ora_11007.trc:
ORA-00603: ORACLE server SESSION TERMINATED BY fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:if_not_found failed WITH STATUS: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: skgxpvaddr9
ORA-27303: additional information: requested interface 192.168.168.3 NOT found. CHECK output FROM ifconfig command
Wed Apr 10 15:00:06 2013
Errors IN file /oracle/admin/orcl/udump/orcl1_ora_11009.trc:
ORA-00603: ORACLE server SESSION TERMINATED BY fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:if_not_found failed WITH STATUS: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: skgxpvaddr9
ORA-27303: additional information: requested interface 192.168.168.3 NOT found. CHECK output FROM ifconfig command
Wed Apr 10 15:00:06 2013
Errors IN file /oracle/admin/orcl/udump/orcl1_ora_11011.trc:
ORA-00603: ORACLE server SESSION TERMINATED BY fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:if_not_found failed WITH STATUS: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: skgxpvaddr9
ORA-27303: additional information: requested interface 192.168.168.3 NOT found. CHECK output FROM ifconfig command
.
.
.
Wed Apr 10 15:07:08 2013
IPC Send timeout detected.Sender: ospid 25688
Receiver: inst 2 binc 427282 ospid 11838
Wed Apr 10 15:07:08 2013
IPC Send timeout detected.Sender: ospid 25724
Wed Apr 10 15:07:08 2013
IPC Send timeout detected.Sender: ospid 25680
Receiver: inst 2 binc 431591 ospid 11822
Receiver: inst 2 binc 431795 ospid 11874
Wed Apr 10 15:07:08 2013
IPC Send timeout detected.Sender: ospid 25684
Receiver: inst 2 binc 428985 ospid 11826
Wed Apr 10 15:07:08 2013
IPC Send timeout detected.Sender: ospid 25708
Receiver: inst 2 binc 430048 ospid 11858
Wed Apr 10 15:07:09 2013
ospid 25678: network interface WITH IP address 192.168.168.3 no longer operational
requested interface 192.168.168.3 NOT found. CHECK output FROM ifconfig command
Wed Apr 10 15:07:35 2013
IPC Send timeout TO 1.1 inc 4 FOR msg TYPE 44 FROM opid 7
Wed Apr 10 15:07:35 2013
IPC Send timeout TO 1.12 inc 4 FOR msg TYPE 44 FROM opid 21
Wed Apr 10 15:07:35 2013
IPC Send timeout TO 1.2 inc 4 FOR msg TYPE 44 FROM opid 8
Wed Apr 10 15:07:35 2013
IPC Send timeout TO 1.3 inc 4 FOR msg TYPE 44 FROM opid 10
Wed Apr 10 15:07:35 2013
IPC Send timeout TO 1.8 inc 4 FOR msg TYPE 44 FROM opid 15
Wed Apr 10 15:08:13 2013
ospid 25678: network interface WITH IP address 192.168.168.3 no longer operational
requested interface 192.168.168.3 NOT found. CHECK output FROM ifconfig command
Wed Apr 10 15:08:16 2013
IPC Send timeout detected.Sender: ospid 25748
Receiver: inst 2 binc 430164 ospid 11890
.
.
.
Wed Apr 10 15:08:53 2013
IPC Send timeout TO 1.13 inc 4 FOR msg TYPE 36 FROM opid 176
Wed Apr 10 15:08:53 2013
IPC Send timeout TO 1.15 inc 4 FOR msg TYPE 36 FROM opid 167
Wed Apr 10 15:08:57 2013
IPC Send timeout TO 1.4 inc 4 FOR msg TYPE 32 FROM opid 180
.
.
.
Wed Apr 10 15:15:51 2013
Evicting instance 2 FROM cluster
Wed Apr 10 15:16:09 2013
ospid 25678: network interface WITH IP address 192.168.168.3 no longer operational
requested interface 192.168.168.3 NOT found. CHECK output FROM ifconfig command
Wed Apr 10 15:16:40 2013
Waiting FOR instances TO leave: 
2 
Wed Apr 10 15:17:00 2013
Waiting FOR instances TO leave: 
2 
Wed Apr 10 15:17:09 2013
ospid 25678: network interface WITH IP address 192.168.168.3 no longer operational
requested interface 192.168.168.3 NOT found. CHECK output FROM ifconfig command
Wed Apr 10 15:17:20 2013
Waiting FOR instances TO leave: 
2

The errors on node 2 are similar:

.
.
.
Wed Apr 10 15:19:07 2013
Errors IN file /oracle/admin/orcl/udump/orcl2_ora_14065.trc:
ORA-00603: ORACLE server SESSION TERMINATED BY fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:if_not_found failed WITH STATUS: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: skgxpvaddr9
ORA-27303: additional information: requested interface 192.168.168.4 NOT found. CHECK output FROM ifconfig command
Wed Apr 10 15:19:08 2013
Errors IN file /oracle/admin/orcl/udump/orcl2_ora_14057.trc:
ORA-00603: ORACLE server SESSION TERMINATED BY fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:if_not_found failed WITH STATUS: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: skgxpvaddr9
ORA-27303: additional information: requested interface 192.168.168.4 NOT found. CHECK output FROM ifconfig command
Wed Apr 10 15:19:46 2013
ospid 11820: network interface WITH IP address 192.168.168.4 no longer operational
requested interface 192.168.168.4 NOT found. CHECK output FROM ifconfig command
Wed Apr 10 15:20:46 2013
ospid 11820: network interface WITH IP address 192.168.168.4 no longer operational
requested interface 192.168.168.4 NOT found. CHECK output FROM ifconfig command
Wed Apr 10 15:20:55 2013
Errors IN file /oracle/admin/orcl/bdump/orcl2_lmon_11818.trc:
ORA-29740: evicted BY member 0, GROUP incarnation 6
Wed Apr 10 15:20:55 2013
LMON: terminating instance due TO error 29740
Wed Apr 10 15:20:55 2013
Errors IN file /oracle/admin/orcl/bdump/orcl2_smon_11924.trc:
ORA-29740: evicted BY member , GROUP incarnation 
Wed Apr 10 15:20:55 2013
Errors IN file /oracle/admin/orcl/bdump/orcl2_lmse_11886.trc:
ORA-29740: evicted BY member , GROUP incarnation 
Wed Apr 10 16:11:37 2013
Starting ORACLE instance (normal)
Wed Apr 10 16:11:45 2013
sculkget: failed TO LOCK /oracle/products/10.2/db_1/dbs/lkinstorcl2 exclusive
Wed Apr 10 16:11:45 2013
sculkget: LOCK held BY PID: 6912
Wed Apr 10 16:11:45 2013
Oracle Instance Startup operation failed. Another process may be attempting TO startup OR shutdown this Instance.
Wed Apr 10 16:11:45 2013
Failed TO acquire instance startup/shutdown serialization primitive
Wed Apr 10 16:11:50 2013
sculkget: failed TO LOCK /oracle/products/10.2/db_1/dbs/lkinstorcl2 exclusive
Wed Apr 10 16:11:50 2013
sculkget: LOCK held BY PID: 6912
Wed Apr 10 16:11:50 2013
Oracle Instance Startup operation failed. Another process may be attempting TO startup OR shutdown this Instance.
Wed Apr 10 16:11:50 2013
Failed TO acquire instance startup/shutdown serialization primitive
Wed Apr 10 16:11:54 2013
sculkget: failed TO LOCK /oracle/products/10.2/db_1/dbs/lkinstorcl2 exclusive
Wed Apr 10 16:11:54 2013
sculkget: LOCK held BY PID: 6912
Wed Apr 10 16:11:54 2013
Oracle Instance Startup operation failed. Another process may be attempting TO startup OR shutdown this Instance.
Wed Apr 10 16:11:54 2013
Failed TO acquire instance startup/shutdown serialization primitive
Wed Apr 10 16:12:29 2013
sculkget: failed TO LOCK /oracle/products/10.2/db_1/dbs/lkinstorcl2 exclusive
Wed Apr 10 16:12:29 2013
sculkget: LOCK held BY PID: 6912
Wed Apr 10 16:12:29 2013
Oracle Instance Startup operation failed. Another process may be attempting TO startup OR shutdown this Instance.
Wed Apr 10 16:12:29 2013
Failed TO acquire instance startup/shutdown serialization primitive
Wed Apr 10 16:12:47 2013
sculkget: failed TO LOCK /oracle/products/10.2/db_1/dbs/lkinstorcl2 exclusive
Wed Apr 10 16:12:47 2013
sculkget: LOCK held BY PID: 6912
Wed Apr 10 16:12:47 2013
Oracle Instance Startup operation failed. Another process may be attempting TO startup OR shutdown this Instance.
Wed Apr 10 16:12:47 2013
Failed TO acquire instance startup/shutdown serialization primitive
Wed Apr 10 16:12:52 2013
sculkget: failed TO LOCK /oracle/products/10.2/db_1/dbs/lkinstorcl2 exclusive
Wed Apr 10 16:12:52 2013
sculkget: LOCK held BY PID: 6912
Wed Apr 10 16:12:52 2013
Oracle Instance Startup operation failed. Another process may be attempting TO startup OR shutdown this Instance.
Wed Apr 10 16:12:52 2013
Failed TO acquire instance startup/shutdown serialization primitive
Wed Apr 10 16:12:56 2013
sculkget: failed TO LOCK /oracle/products/10.2/db_1/dbs/lkinstorcl2 exclusive
Wed Apr 10 16:12:56 2013
sculkget: LOCK held BY PID: 6912
Wed Apr 10 16:12:56 2013
Oracle Instance Startup operation failed. Another process may be attempting TO startup OR shutdown this Instance.
Wed Apr 10 16:12:56 2013
Failed TO acquire instance startup/shutdown serialization primitive

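The cause is easy to work out from the error messages: the IP address on node 2 was changed, which broke the interconnect heartbeat. Node 1 tried to evict node 2 from the cluster, but because it could not communicate with node 2, it could only wait for node 2 to restart.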
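With alert logs this repetitive, it helps to reduce them to "which interface went missing, and when was it first reported". The Python sketch below is one way to do that, assuming the 10g alert log layout seen above (a timestamp line followed by message lines). The function name first_interface_loss and the sample, which is a handful of lines copied from the node 1 excerpt, are illustrative only.

```python
import re
from datetime import datetime

# A 10g alert log timestamp line, e.g. "Wed Apr 10 15:00:05 2013".
TS = re.compile(r"^[A-Za-z]{3} [A-Za-z]{3} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4}$")
# Case-insensitive because the pasted logs show "NOT found" in upper case.
IFACE = re.compile(r"requested interface (\d+\.\d+\.\d+\.\d+) not found",
                   re.IGNORECASE)


def first_interface_loss(lines):
    """Map each missing interface IP to the first timestamp that reports it."""
    current = None
    first_seen = {}
    for raw in lines:
        line = raw.strip()
        if TS.match(line):
            current = datetime.strptime(line, "%a %b %d %H:%M:%S %Y")
            continue
        match = IFACE.search(line)
        if match and current is not None:
            first_seen.setdefault(match.group(1), current)
    return first_seen


SAMPLE = """\
Wed Apr 10 15:00:05 2013
Errors IN file /oracle/admin/orcl/udump/orcl1_ora_10997.trc:
ORA-27303: additional information: requested interface 192.168.168.3 NOT found. CHECK output FROM ifconfig command
Wed Apr 10 15:07:09 2013
ospid 25678: network interface WITH IP address 192.168.168.3 no longer operational
requested interface 192.168.168.3 NOT found. CHECK output FROM ifconfig command
""".splitlines()

for ip, when in first_interface_loss(SAMPLE).items():
    print(ip, when.isoformat())
```

Running the same function over each node's full alert log gives the first moment each private address disappeared, which can then be lined up against the operating system log.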
Checking the operating system log on node 2:

Apr 10 15:00:04 bj-sst-xhm-3f2-m5k-02 ip: [ID 482227 kern.notice] ip_arp_done: init failed
Apr 10 15:07:37 bj-sst-xhm-3f2-m5k-02 Had[4135]: [ID 702911 daemon.notice] VCS CRITICAL V-16-1-50086 CPU usage ON bj-sst-xhm-3f2-m5k-02 IS 92%
Apr 10 15:18:41 bj-sst-xhm-3f2-m5k-02 sshd[13485]: [ID 800047 auth.error] error: Failed TO allocate internet-DOMAIN X11 display socket.

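The ip_arp_done: init failed message logged at 15:00:04 indicates that the network interface had been configured using the host name, and that the host's IP address was changed online.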
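The shell history finally confirmed it: someone logged in as root and meant to run ifconfig -a6 to check the IPv6 addresses, but mistyped it as ifconfig -a 6, with an extra space between the a and the 6. That set every IP address on the host to 0.0.0.0 and produced the errors above.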
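The whole incident comes down to how the shell splits the command line: -a6 is one flag, while -a 6 is a flag followed by a separate operand. The Python sketch below only illustrates that tokenization with the standard shlex module; it does not model what ifconfig does with the operand, and operands_after_flags is a hypothetical helper name.

```python
import shlex


def operands_after_flags(command):
    """Return the tokens after the program name that are not dash-prefixed flags."""
    tokens = shlex.split(command)[1:]
    return [token for token in tokens if not token.startswith("-")]


print(operands_after_flags("ifconfig -a6"))   # []
print(operands_after_flags("ifconfig -a 6"))  # ['6']
```

For a command that was meant to be read-only, any non-empty result is worth a second look before pressing Enter as root.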
This shows once again that for a privileged user such as root, any carelessness can have very serious consequences.


ORA-7445 (kjbcrcomplete) error

A customer's 10.2.0.5 RAC hit an ORA-7445 error while validating the logical structure of a table.
The error message is:

Sun Mar 31 03:45:16 EAT 2013
Errors IN file /oracle/app/admin/orcl/udump/orcl2_ora_4039.trc:
ORA-07445: exception encountered: core dump [kjbcrcomplete()+5521] [SIGSEGV] [Invalid permissions FOR mapped object] [0x00000002A] [] []

The detailed trace is:

Ioctl ASYNC_CONFIG error, errno = 1
*** 2013-03-31 02:20:45.846
*** ACTION NAME:() 2013-03-31 02:20:45.846
*** MODULE NAME:(sqlplus@db2 (TNS V1-V3)) 2013-03-31 02:20:45.846
*** SERVICE NAME:(SYS$USERS) 2013-03-31 02:20:45.846
*** SESSION ID:(2053.926) 2013-03-31 02:20:45.846
WARNING:Could NOT increase the asynch I/O LIMIT TO 32 FOR SQL direct I/O. It IS SET TO 0
*** 2013-03-31 03:45:16.545
Exception signal: 11 (SIGSEGV), code: 2 (Invalid permissions FOR mapped object), addr: 0x2a, PC: [0x40000000053973d1, kjbcrcomplete()+5521]
  r1: 60000000000ba268       r20:                0       br5:                0
  r2: c0000030f2636d20       r21:               21       br6: c00000000042a870
  r3: c000000028da7000       r22:                0       br7: c00000000043d720
  r4:                0       r23:         5c3412c0        ip: 40000000053973d1
  r5: c000000000000408       r24: c0000030f598d9e8      iipa:                0
  r6: c0000000000443e0       r25: 60000000000ac688       cfm:             14b1
  r7: 9fffffffbf7f8de8       r26: 60000000000ca6b8        um:               1a
  r8:                0       r27:                1       rsc:               1f
  r9: c00000145dfc536c       r28: 60000000000ac650       bsp: 9fffffffbf801600
 r10: 60000000000ca6c0       r29: c0000030f6695ec8  bspstore: 9fffffffbf801600
 r11:               20       r30: 9fffffffbf372318      rnat:                0
 r12: 9ffffffffffd9360       r31:               20       ccv: 2000000000000030
 r13: 9fffffffbf3fd4b0      NaTs:                0      unat:                0
 r14: 60000000000ac650       PRs: c000000000398309      fpsr:    9804c8a76433f
 r15:         5c3412cd       br0: 40000000053964e0       pfs: c0000000000014b1
 r16:                0       br1: c000000000299260        lc:                0
 r17: 60000000000ca6c0       br2: c00000000029ba60        ec:                0
 r18:               20       br3:                0       isr: 9fffffffbf801600
 r19: 9ffffffffffd9330       br4:                0       ifa:                0
Reason code: 0053
*** 2013-03-31 03:45:16.590
ksedmp: internal OR fatal error
ORA-07445: exception encountered: core dump [kjbcrcomplete()+5521] [SIGSEGV] [Invalid permissions FOR mapped object] [0x00000002A] [] []
CURRENT SQL statement FOR this SESSION:
analyze TABLE c_inter VALIDATE STRUCTURE CASCADE
----- Call Stack Trace -----
calling              CALL     entry                argument VALUES IN hex      
location             TYPE     point                (? means dubious VALUE)     
-------------------- -------- -------------------- ----------------------------
ksedst()+64          CALL     ksedst1()            000000001 ? 000000001 ?
ksedmp()+2176        CALL     ksedst()             000000001 ?
                                                   C000000000000D20 ?
                                                   4000000004032B40 ?
                                                   000000000 ? 000000000 ?
                                                   000000000 ?
ssexhd()+1264        CALL     ksedmp()             000000003 ?
                                                   6000000000247DA0 ?
                                                   60000000000BA268 ?
                                                   6000000000248370 ?
                                                   C000000000000B9F ?
                                                   4000000006C1DF80 ?
                                                   C00000000039B6CD ?
                                                   60000000000C7420 ?
<kernel>             CALL     ssexhd()             C0000030F54A95F0 ?
                                                   60000000000C9570 ?
                                                   C000000028DAE8C8 ?
                                                   60000000000BA268 ?
kjbcrcomplete()+552  CALL     <kernel>             600000000024C200 ?
1                                                  20000000B ?
                                                   600000000024C010 ?
                                                   000000000 ? 000000000 ?
                                                   000000000 ? 000000000 ?
                                                   000000000 ?
kclwcrs()+69392      CALL     kjbcrcomplete()      0001703A6 ? 0001F0000 ?
                                                   0000188C5 ?
                                                   C0000030F0307B48 ?
                                                   000000001 ?
                                                   C0000030F0307B48 ?
                                                   000000000 ?
                                                   9FFFFFFFFFFD9390 ?
kclgclk()+21824      CALL     kclwcrs()            C00000145DFC537A ?
                                                   000000800 ?
                                                   60000000000CAA68 ?
                                                   60000000000C6F68 ?
                                                   9FFFFFFFFFFE0A10 ?
                                                   9FFFFFFFFFFE0958 ?
                                                   9FFFFFFFFFFE0D10 ?
                                                   9FFFFFFFFFFE2460 ?
$cold_kcbzib()+8640  CALL     kclgclk()            9FFFFFFFFFFDBDD8 ?
                                                   000000800 ?
                                                   60000000000CAA68 ?
                                                   60000000000C8A38 ?
                                                   000000040 ?
                                                   9FFFFFFFFFFE0D10 ?
                                                   00000026D ?
                                                   9FFFFFFFFFFE0A10 ?
kcbgtcr()+9536       CALL     $cold_kcbzib()       C00000308A416408 ?
                                                   9FFFFFFFFFFE0D10 ?
                                                   9FFFFFFFFFFE0A10 ?
                                                   000000004 ? 000000003 ?
                                                   00000026D ? 000000000 ?
                                                   000000000 ?
ktrgtc()+1120        CALL     kcbgtcr()            9FFFFFFFFFFE0D10 ?
                                                   9FFFFFFFFFFE0A10 ?
                                                   00000026D ? 000000000 ?
                                                   60000000000BA268 ?
kdifbk()+6608        CALL     ktrgtc()             9FFFFFFFFFFE0D00 ?
                                                   9FFFFFFFFFFE2430 ?
                                                   4000000001CA99C0 ?
                                                   9FFFFFFFFFFE0C20 ?
                                                   00000026D ? 000100000 ?
                                                   4000000001CA99E0 ?
kdgvsp()+15888       CALL     kdifbk()             9FFFFFFFFFFE0C44 ?
                                                   9FFFFFFFFFFE0D2C ?
                                                   000000001 ?
                                                   9FFFFFFFFFFE0DE8 ?
                                                   000000000 ? 000000000 ?
                                                   000000000 ?
                                                   9FFFFFFFFFFE1320 ?
kdgvss()+432         CALL     kdgvsp()             9FFFFFFFFFFE8DB8 ?
                                                   000000000 ?
                                                   9FFFFFFFFFFE2468 ?
                                                   000000000 ? 00C100000 ?
                                                   C0000030FA5CBF20 ?
kdavls()+6624        CALL     kdgvss()             9FFFFFFFBF2F3780 ?
                                                   000000001 ?
                                                   9FFFFFFFFFFE9C80 ?
                                                   9FFFFFFFFFFE9170 ?
                                                   4000000001CAB230 ?
                                                   C0000030FA5CBF20 ?
                                                   9FFFFFFFBF2F3780 ?
                                                   000000001 ?
kkndgd()+2960        CALL     kdavls()             9FFFFFFFFFFEB4C0 ?
                                                   4000000006557450 ?
                                                   C000000000390061 ?
                                                   9FFFFFFFFFFE8FB4 ?
                                                   000000001 ?
                                                   60000000000AE7A8 ?
                                                   000000001 ?
                                                   9FFFFFFFBF2F3BB4 ?
kknpat()+1648        CALL     kkndgd()             C000002F2B449B18 ?
                                                   0000142B9 ? 000000002 ?
                                                   00000000A ?
                                                   C000002F2B4498DE ?
                                                   000000014 ?
                                                   C000002F2B4498A6 ?
                                                   9FFFFFFFFFFEBE40 ?
kknpob()+448         CALL     kknpat()             C000002F2B449B18 ?
                                                   9FFFFFFFFFFEBE40 ?
                                                   000000001 ?
                                                   C000002F2B6BC670 ?
                                                   9FFFFFFFFFFF2F10 ?
                                                   9FFFFFFFFFFEBD60 ?
                                                   C000000000001838 ?
                                                   400000000655AF30 ?
kknls()+1872         CALL     kknpob()             C000002F2B449B18 ?
                                                   9FFFFFFFFFFEBE40 ?
                                                   9FFFFFFFFFFEC1B0 ?
                                                   FFFFFFFFFFFFFFFF ?
                                                   000100000 ? 000100000 ?
                                                   60000000000BA268 ?
                                                   C0000029B20453AF ?
kkndrv()+64          CALL     kknls()              000000000 ?
$cold_opiexe()+7600  CALL     kkndrv()             C000000000002450 ?
                                                   4000000003637C80 ?
                                                   60000000000CAA68 ?
                                                   60000000000C8A38 ?
                                                   000000040 ?
                                                   9FFFFFFFFFFE0D10 ?
                                                   00000026D ?
                                                   9FFFFFFFFFFE0A10 ?
opiosq0()+8144       CALL     $cold_opiexe()       9FFFFFFFFFFF6130 ?
                                                   4000000002F89200 ?
                                                   00002821B ?
                                                   9FFFFFFFFFFF44B0 ?
                                                   60000000000BA268 ?
                                                   C000000000001838 ?
                                                   9FFFFFFFFFFF44B4 ?
                                                   60000000000C6CA0 ?
kpooprx()+416        CALL     opiosq0()            000000003 ?
                                                   9FFFFFFFFFFF6D90 ?
                                                   4000000002AEB2A0 ?
                                                   00002F21B ?
                                                   C000000000000815 ?
kpoal8()+1152        CALL     kpooprx()            000000003 ?
                                                   9FFFFFFFFFFF9AD0 ?
                                                   000000048 ?
                                                   9FFFFFFFFFFF6DD0 ?
                                                   000000001 ? 0000000A4 ?
                                                   60000000000BA268 ?
                                                   60000000000A7E20 ?
opiodr()+2144        CALL     kpoal8()             9FFFFFFFFFFF7590 ?
                                                   C0000000000018B7 ?
                                                   9FFFFFFFFFFF9C70 ?
                                                   9FFFFFFFFFFF6EB0 ?
                                                   60000000000BA268 ?
                                                   4000000002F33E40 ?
ttcpip()+1680        CALL     opiodr()             00000005E ? 000000017 ?
                                                   4000000001BF80B0 ?
                                                   0000046C0 ?
                                                   9FFFFFFFFFFF75A0 ?
opitsk()+2368        CALL     ttcpip()             600000000003DF40 ?
                                                   000000001 ?
                                                   9FFFFFFFFFFF9C70 ?
                                                   000000001 ?
                                                   9FFFFFFFFFFF9DE0 ?
                                                   9FFFFFFFFFFF9BD4 ?
                                                   4000000001CE0810 ?
                                                   000000000 ?
opiino()+1664        CALL     opitsk()             000000000 ? 000000000 ?
                                                   60000000000BA268 ?
                                                   400000000293B500 ?
                                                   000028089 ?
                                                   4000000001BF80C8 ?
opiodr()+2144        CALL     opiino()             00000003C ?
                                                   9FFFFFFFFFFFC630 ?
                                                   9FFFFFFFFFFFEDD0 ?
                                                   9FFFFFFFFFFFBAF0 ?
                                                   60000000000BA268 ?
                                                   C0000000000018B7 ?
opidrv()+1248        CALL     opiodr()             00000003C ? 000000004 ?
                                                   4000000001BF7B60 ?
                                                   0000046C0 ?
                                                   9FFFFFFFFFFFC640 ?
                                                   60000000000BA268 ?
sou2o()+240          CALL     opidrv()             00000003C ?
                                                   60000000000C6C98 ?
                                                   9FFFFFFFFFFFEDD0 ?
opimai_real()+496    CALL     sou2o()              9FFFFFFFFFFFEDF0 ?
                                                   00000003C ? 000000004 ?
                                                   9FFFFFFFFFFFEDD0 ?
main()+240           CALL     opimai_real()        000000000 ?
                                                   9FFFFFFFFFFFEE20 ?
main_opd_entry()+80  CALL     main()               000000002 ?
                                                   9FFFFFFFFFFFF2D8 ?
                                                   60000000000BA268 ?
                                                   C000000000000004 ?
--------------------- Binary Stack Dump ---------------------

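No record of this ORA-7445 [kjbcrcomplete] error can be found on MOS, but the detailed trace shows that the exception occurred while the structure of the table was being validated: the current SQL is ANALYZE TABLE ... VALIDATE STRUCTURE CASCADE. Evidently the table or one of its indexes contains a logically corrupt block, and Oracle trips over it while validating the logical structure.
The fix is to rebuild the table and its indexes by logical means.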
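Reading a call stack of this length is easier once the argument columns are stripped away. The Python sketch below pulls the called function out of each frame line, assuming the 10g "Call Stack Trace" layout shown above (caller, the word CALL, then the entry point). The sample is a few lines copied from the trace, and the function name callees is illustrative.

```python
import re

# A frame line looks like "kclwcrs()+69392      CALL     kjbcrcomplete()   ...".
# Continuation lines start with whitespace, and the column header has no "()".
FRAME = re.compile(r"^(?:<kernel>|[\w$]+\(\))(?:\+\d+)?\s+CALL\s+(\S+)")


def callees(lines):
    """Return the entry point named on each frame line, innermost first."""
    found = []
    for line in lines:
        match = FRAME.match(line)
        if match:
            found.append(match.group(1))
    return found


SAMPLE = [
    "calling              CALL     entry                argument VALUES IN hex",
    "ksedst()+64          CALL     ksedst1()            000000001 ? 000000001 ?",
    "ksedmp()+2176        CALL     ksedst()             000000001 ?",
    "                                                   C000000000000D20 ?",
    "kclwcrs()+69392      CALL     kjbcrcomplete()      0001703A6 ? 0001F0000 ?",
    "kdavls()+6624        CALL     kdgvss()             9FFFFFFFBF2F3780 ?",
    "kkndgd()+2960        CALL     kdavls()             9FFFFFFFFFFEB4C0 ?",
]

print(" <- ".join(callees(SAMPLE)))
```

On the full trace this gives a one-line chain that makes it easy to see kjbcrcomplete() being reached from the kd* validation routines driven by the ANALYZE statement.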


The Streams AQ: qmn coordinator waiting for slave to start wait event

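This wait event showed up in the Top 5 waits of a customer's 10.2.0.5 database.

Apart from an unreasonable parameter setting, the main reason the Streams AQ: qmn coordinator waiting for slave to start wait appears there is simply that the database is too idle: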

Event                                                    Waits  Time(s)  Avg Wait(ms)  % Total Call Time  Wait Class
CPU time                                                            372                             59.0
Streams AQ: qmn coordinator waiting for slave to start       6       34         5,667                5.4  Other
db file scattered read                                  50,528       28             1                4.4  User I/O
gc cr multi block request                               66,347       24             0                3.8  Cluster
db file sequential read                                  7,157       18             2                2.8  User I/O

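As shown, the Streams AQ: qmn coordinator waiting for slave to start wait ranks second in the Top 5 with only 34 seconds of total wait time. Its average wait, however, is more than 5 seconds, so compared with the total wait time, its effect on the performance of a single operation is far more noticeable.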
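The average wait is just total wait time divided by the number of waits. The Python sketch below redoes that division for two rows of the Top 5 table; the wait counts and times are taken from the report, and the helper name avg_wait_ms is illustrative. Because the report rounds Time(s) to whole seconds, recomputed averages for the high-volume events can differ slightly from the printed column.

```python
def avg_wait_ms(waits, time_s):
    """Average wait in milliseconds = total wait time / number of waits."""
    return time_s * 1000 / waits


# (event, waits, total wait time in seconds), copied from the Top 5 table.
ROWS = [
    ("Streams AQ: qmn coordinator waiting for slave to start", 6, 34),
    ("db file scattered read", 50528, 28),
]

for event, waits, time_s in ROWS:
    print(f"{event}: {avg_wait_ms(waits, time_s):.1f} ms")
```

Six waits spread over 34 seconds come to about 5,667 ms each, which is why an event with so little total time still stands out.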

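The main cause of this problem is that the initialization parameter AQ_TM_PROCESSES is set to 0, whereas Oracle recommends setting it to at least 1. Many built-in features, including Advanced Queuing, Streams and Data Pump, depend on the QMN processes. When AQ_TM_PROCESSES is not 0, Oracle can size the Qnnn slave processes automatically according to load to meet the needs of queue operations. When the parameter is set to 0 and this is disabled, Oracle can only start a slave process reactively once queue work shows up, which is what produces the high average wait.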

 


ORA-600 (13310) error

A customer's 10.2.0.4 RAC environment hit an ORA-600 [13310] error.
The error messages are:

Sun Aug  1 04:08:24 2010
Errors IN file /oracle/admin/ORCL/udump/orcl1_ora_14964.trc:
ORA-00603: ORACLE server SESSION TERMINATED BY fatal error
ORA-27544: Failed TO map memory region FOR export
ORA-27300: OS system dependent operation:socket failed WITH STATUS: 23
ORA-27301: OS failure message: File TABLE overflow
ORA-27302: failure occurred at: sskgxpcre1
Sun Aug  1 04:08:24 2010
Trace dumping IS performing id=[cdmp_20100801040824]
Sun Aug  1 16:08:25 2010
Trace dumping IS performing id=[cdmp_20100801160825]
Sun Aug  1 16:08:27 2010
Errors IN file /oracle/admin/ORCL/bdump/orcl1_diag_17090.trc:
ORA-27050: FUNCTION called WITH invalid FIB/IOV STRUCTURE
Additional information: 2
ORA-27041: unable TO OPEN file
HPUX-ia64 Error: 23: File TABLE overflow
Additional information: 3
Sun Aug  1 04:08:42 2010
Errors IN file /oracle/admin/ORCL/udump/orcl1_ora_15065.trc:
ORA-00603: Message 603 NOT found; No message file FOR product=RDBMS, facility=ORA
ORA-27544: Message 27544 NOT found; No message file FOR product=RDBMS, facility=ORA
ORA-27300: Message 27300 NOT found; No message file FOR product=RDBMS, facility=ORA; arguments: [socket] [23]
ORA-27301: Message 27301 NOT found; No message file FOR product=RDBMS, facility=ORA; arguments: [File TABLE overflow]
ORA-27302: Message 27302 NOT found; No message file FOR product=RDBMS, facility=ORA; arguments: [sskgxpcre1]
Sun Aug  1 04:08:48 2010
Trace dumping IS performing id=[cdmp_20100801160843]
Sun Aug  1 04:08:48 2010
Trace dumping IS performing id=[cdmp_20100801160843]
Sun Aug  1 04:08:56 2010
ospid 17096: network interface WITH IP address 192.168.1.81 no longer operational
Sun Aug  1 04:08:58 2010
Errors IN file /oracle/admin/ORCL/bdump/orcl1_ora_15148.trc:
ORA-00600: internal error code, arguments: [13310], [], [], [], [], [], [], []
Sun Aug  1 04:08:58 2010
Errors IN file /oracle/admin/ORCL/bdump/orcl1_ora_15148.trc:
ORA-00210: cannot OPEN the specified control file
ORA-00202: control file: '/dev/vgpmcs11/rr01G1001'
ORA-27041: unable TO OPEN file
HPUX-ia64 Error: 23: File TABLE overflow
Additional information: 3
ORA-00600: internal error code, arguments: [13310], [], [], [], [], [], [], []
Sun Aug  1 04:09:03 2010
Trace dumping IS performing id=[cdmp_20100801040859]

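According to MOS note ORA-600 [13310] Reported When Obtaining An OS Unique Identifier [ID 1234083.1], ORA-600 [13310] is raised when Oracle cannot obtain a unique identifier from the operating system, and the solution given in the note is to free up or add system resources such as memory or swap space.
The ORA-2730x errors that preceded the ORA-600 show that there really was a problem at the operating system level. Searching on the ORA-27544 error together with the File table overflow message identifies it as a 10.2.0.4 bug on HP-UX: hp-ux: File handles not released after upgrade to 10.2.0.3 CRS Bundle#2 or 10.2.0.4 [ID 739557.1].
The problem was introduced by the 10.2.0.3 CRS Bundle patch and is triggered by mmap. Affected versions include 10.2.0.3, 10.2.0.4 and 11.1.0.6. Oracle fixed it in 10.2.0.5 and 11.1.0.7, and the fix is also included in 10.2.0.4 CRS Bundle Patch #2 and in the latest CRS PSU for 10.2.0.4.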


ORA-600 (17147) and ORA-7445 (__lwp_kill) errors

A customer's 10.2.0.4 RAC environment hit ORA-600 [17147] and ORA-7445 [__lwp_kill] errors.
The error messages are:

Fri Dec 14 16:05:56 2012
Errors in file /oraclelog/admin/orcl/bdump/orcl2_diag_27263.trc:
ORA-07445: exception encountered: core dump [__lwp_kill()+48] [SIGIOT] [UNKNOWN code] [0x000006A7F] [] []
ORA-00600: internal error code, arguments: [17147], [0x9FFFFFFFFD3E6BB8], [], [], [], [], [], []
Fri Dec 14 16:06:06 2012
Restarting dead background process DIAG
DIAG started with pid=6, OS id=12243

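The failing process is DIAG, and the ORA-600 was accompanied by an ORA-7445. The __lwp_kill function is not documented on MOS, but Sun's documentation describes it: it is the call the operating system makes when terminating a lightweight process (LWP). The likely sequence is therefore that DIAG hit ORA-600 [17147], Oracle then tried to end the process through the __lwp_kill system call, and that call raised the exception.
According to MOS note "Bug 7028176 - Memory corruption / OERI:17147 in DIAG in RAC" [ID 7028176.8], this is a 10.2.0.4 bug: in a RAC environment, corruption of private memory causes the DIAG process to fail with ORA-600 [17147].
Oracle fixed the bug in 10.2.0.4.4 and 10.2.0.5. Errors caused by this kind of memory corruption usually do not recur and occur relatively rarely, so if they are not being reported in large numbers they can be ignored. If they are frequent, Oracle also provides one-off patches for this bug on several platforms.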
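
Whether the error is frequent enough to justify a patch can be judged by counting occurrences per day in the alert log. The following is a minimal Python sketch under stated assumptions: a 10g-style alert log in which a timestamp line such as "Fri Dec 14 16:05:56 2012" precedes each entry, English messages, and English day and month names. The function name count_errors_per_day is hypothetical.

```python
from collections import Counter
from datetime import datetime

STAMP_FORMAT = "%a %b %d %H:%M:%S %Y"  # e.g. "Fri Dec 14 16:05:56 2012"
DEFAULT_NEEDLE = "ORA-00600: internal error code, arguments: [17147]"


def count_errors_per_day(lines, needle=DEFAULT_NEEDLE):
    """Count lines containing `needle`, grouped by the date of the
    most recent timestamp line that came before them."""
    per_day = Counter()
    current_day = None
    for line in lines:
        text = line.strip()
        try:
            current_day = datetime.strptime(text, STAMP_FORMAT).date()
            continue  # timestamp lines carry no error text
        except ValueError:
            pass  # not a timestamp line, treat it as message text
        if current_day is not None and needle in text:
            per_day[current_day] += 1
    return per_day


# Usage sketch: the file name is a placeholder, not taken from the source.
# with open("alert_orcl2.log") as fh:
#     for day, hits in sorted(count_errors_per_day(fh).items()):
#         print(day, hits)
```

A handful of hits spread over months fits the "ignore it" case described above, while a steady daily count is the case where a one-off patch is worth requesting.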


ORA-600(17059) Error

A customer's 10.2.0.4 RAC database reported a large number of ORA-600 [17059] errors.
The error messages were as follows:

Tue May 21 09:55:30 2013
Errors in file /oraclelog/admin/orcl/bdump/orcl1_j000_307.trc:
ORA-00600: 内部错误代码, 参数: [17059], [0xC000001346657EB8], [], [], [], [], [], []
Tue May 21 09:55:32 2013
Errors in file /oraclelog/admin/orcl/bdump/orcl1_j000_307.trc:
ORA-00600: 内部错误代码, 参数: [17059], [0xC000001346657EB8], [], [], [], [], [], []
Tue May 21 09:55:33 2013
Errors in file /oraclelog/admin/orcl/bdump/orcl1_j000_307.trc:
ORA-00600: 内部错误代码, 参数: [17059], [0xC000001346657EB8], [], [], [], [], [], []
Tue May 21 09:55:34 2013
Errors in file /oraclelog/admin/orcl/bdump/orcl1_j000_307.trc:
ORA-00600: 内部错误代码, 参数: [17059], [0xC000001346657EB8], [], [], [], [], [], []
Tue May 21 09:55:36 2013
Errors in file /oraclelog/admin/orcl/bdump/orcl1_j000_307.trc:
ORA-00600: 内部错误代码, 参数: [17059], [0xC000001346657EB8], [], [], [], [], [], []
Tue May 21 09:55:37 2013
Errors in file /oraclelog/admin/orcl/bdump/orcl1_j000_307.trc:
ORA-00600: 内部错误代码, 参数: [17059], [0xC000001346657EB8], [], [], [], [], [], []
Tue May 21 09:55:38 2013
Errors in file /oraclelog/admin/orcl/bdump/orcl1_j000_307.trc:
ORA-00600: 内部错误代码, 参数: [17059], [0xC000001346657EB8], [], [], [], [], [], []

The detailed trace is as follows:

*** 2013-07-03 08:16:46.382
*** ACTION NAME:() 2013-07-03 08:16:46.381
*** MODULE NAME:() 2013-07-03 08:16:46.381
*** SERVICE NAME:(SYS$USERS) 2013-07-03 08:16:46.381
*** SESSION ID:(460.3380) 2013-07-03 08:16:46.380
LIBRARY OBJECT HANDLE: handle=c000001349e9a160 mtx=c000001349e9a290(0) cdp=0
name=
INSERT INTO T_R_P_OY@D_UH (S_NUM ,P_CODE ,E_CODE ,CODE_STATE , T_CATE ,O_TYPE_CODE ,P_KEY ,O_TIME , O_DEPART_ID ,O_STAFF_ID ,R_TIME ,REMARK , P_TYPE ,P_ID ,F_TYPE ,FOREGIFT ,RSV1 , RSV2 ,RSV3 ,RSV4 ,RSV5 ,RSV6 , W_CARD_TYPE ,S_CODE ,P_KEY_MODE) VALUES (:B25 ,:B24 ,:B23 ,:B22 , :B21 ,:B20 ,:B19 ,:B18 , :B17 ,:B16 ,:B15 ,:B14 , :B13 ,:B12 ,:B11 ,:B10 ,:B9 , :B8 ,:B7 ,:B6 ,:B5 ,:B4 , :B3 ,:B2 ,:B
hash=191708dff85615cba49f2013927d9b70 TIMESTAMP=05-15-2013 16:56:26
namespace=CRSR flags=RON/KGHP/TIM/OBS/PN0/LRG/DBN/MTX/[104100c1]
kkkk-dddd-llll=0000-0001-0001 LOCK=N pin=0 latch#=10 hpc=7ada hlc=7ada
lwt=c000001349e9a208[c000001349e9a208,c000001349e9a208] ltm=c000001349e9a218[c000001349e9a218,c000001349e9a218]
pwt=c000001349e9a1d0[c000001349e9a1d0,c000001349e9a1d0] ptm=c000001349e9a1e0[c000001349e9a1e0,c000001349e9a1e0]
REF=c000001349e9a238[c000001349e9a238,c000001349e9a238] lnd=c000001349e9a250[c000001355aa6bf8,c0000013374654c8]
  LOCK OWNERS:
      LOCK     USER  SESSION COUNT mode flags
  -------- -------- -------- ----- ---- ------------------------
  c00000133808f6f0 c00000103d35e2b0 c00000103d35e2b0     1 N    [00]
  LIBRARY OBJECT: object=c000001346657eb8
  TYPE=CRSR flags=EXS[0001] pflags=[0000] STATUS=VALD LOAD=0
  CHILDREN: SIZE=32768
  child#    TABLE reference   handle
  ------ -------- --------- --------
       0 c00000134fc4cf30 c00000134fc4cba0 c000001340160d80
       1 c00000134fc4cf30 c000001363a76b48 c00000135dabf960
       2 c00000134fc4cf30 c00000135a292320 c00000136988b450
       3 c00000134fc4cf30 c000001337db37a0 c00000133487dac0
       4 c00000134fc4cf30 c00000135dcf2be8 c000001346fd8448
.
.
.
*** 2013-07-03 08:16:46.734
ksedmp: internal OR fatal error
ORA-00600: 内部错误代码, 参数: [17059], [0xC000001346657EB8], [], [], [], [], [], []
No CURRENT SQL statement being executed.
----- PL/SQL Call Stack -----
  object      line  object
  handle    NUMBER  name
c000001331c31eb0        31  PROCEDURE U_T.P_D_P_O
c000001339ef8bf0         1  anonymous block
----- Call Stack Trace -----
calling              CALL     entry                argument VALUES IN hex      
location             TYPE     point                (? means dubious VALUE)     
-------------------- -------- -------------------- ----------------------------
ksedst()+64          CALL     ksedst1()            000000000 ? 000000001 ?
ksedmp()+2176        CALL     ksedst()             000000000 ?
                                                   C000000000000C9F ?
                                                   4000000003FF5020 ?
                                                   000000000 ? 000000000 ?
                                                   000000000 ?
ksfdmp()+48          CALL     ksedmp()             000000003 ?
kgeriv()+336         CALL     ksfdmp()             C000000000000695 ?
                                                   000000003 ?
                                                   400000000946EB60 ?
                                                   0000CF507 ? 000000000 ?
                                                   000000000 ? 000000000 ?
                                                   000000000 ?
kgesiv()+192         CALL     kgeriv()             6000000000031370 ?
                                                   6000000000032428 ?
                                                   40000000019EDB40 ?
                                                   000000001 ?
                                                   9FFFFFFFFFFEC3B8 ?
kgesic1()+128        CALL     kgesiv()             6000000000031370 ?
                                                   9FFFFFFFFD3114C0 ?
                                                   9FFFFFFFFD3114D0 ?
                                                   000000001 ?
                                                   9FFFFFFFFFFEC3B8 ?
$cold_kgltba()+464   CALL     kgesic1()            6000000000031370 ?
                                                   9FFFFFFFFD3114C0 ?
                                                   0000042A3 ? 000000002 ?
                                                   C000001346657EB8 ?
                                                   C000000000004F26 ?
                                                   4000000003351960 ?
                                                   0000CF54D ?
kglhdgc()+608        CALL     $cold_kgltba()       6000000000031370 ?
                                                   C000001346657EB8 ?
                                                   C0000013466581D8 ?
                                                   C00000134F0E5288 ?
                                                   60000000000314E0 ?
                                                   C0000013DE36F380 ?
                                                   C00000134F0E5288 ?
                                                   C0000013F2E679D0 ?
kglget()+5536        CALL     kglhdgc()            6000000000031370 ?
                                                   C000001346657EB8 ?
                                                   9FFFFFFFFFFEC7EE ?
                                                   000000000 ? 000000000 ?
                                                   010010000 ?
                                                   C0000013466581D8 ?
                                                   9FFFFFFFFFFEC478 ?
kxsGetLookupLock()+  CALL     kglget()             6000000000031370 ?
336                                                9FFFFFFFFFFEC7D0 ?
                                                   000000001 ? 000000003 ?
                                                   9FFFFFFFFD370460 ?
kksfbc()+15568       CALL     kxsGetLookupLock()   6000000000031370 ?
                                                   9FFFFFFFFD370438 ?
                                                   9FFFFFFFFFFEC7D0 ?
                                                   C000000000001EC5 ?
                                                   4000000002EEFB60 ?
                                                   9FFFFFFFFD370458 ?
                                                   9FFFFFFFFD370460 ?
                                                   9FFFFFFFFFFEC7F8 ?
kkspbd0()+1184       CALL     kksfbc()             9FFFFFFFFFFED910 ?
                                                   400000000324E890 ?
                                                   0000C8213 ?
                                                   9FFFFFFFFFFEC5F4 ?
                                                   000000000 ? 000000000 ?
                                                   C000001346657FD8 ?
                                                   C000001346657FD8 ?
kksParseCursor()+81  CALL     kkspbd0()            000003F50 ? 000000003 ?
6                                                  000080000 ?
                                                   60000000000C6208 ?
                                                   9FFFFFFFFD370418 ?
                                                   9FFFFFFFFD370570 ?
                                                   60000000000B5E18 ?
opiosq0()+4320       CALL     kksParseCursor()     9FFFFFFFFFFEDB80 ?
                                                   C000000000001736 ?
                                                   4000000002E99F80 ?
                                                   9FFFFFFFFD3C1AB0 ?
                                                   0FFDFFFFF ?
                                                   60000000000C4BF0 ?
                                                   60000000000C4F48 ?
                                                   60000000000C4EE0 ?
opipls()+3392        CALL     opiosq0()            000000003 ?
                                                   9FFFFFFFFFFEE620 ?
                                                   4000000003009120 ?
                                                   00000F1DF ?
                                                   9FFFFFFFFFFEDA40 ?
                                                   60000000000B5E18 ?
                                                   C00000000000224C ?
                                                   0000001ED ?
opiodr()+2128        CALL     opipls()             9FFFFFFFFFFEEF30 ?
                                                   4000000002E42BA0 ?
                                                   00000E09F ?
                                                   9FFFFFFFFFFEE680 ?
                                                   60000000000B5E18 ?
                                                   9FFFFFFFFFFEEA10 ?
rpidrus()+352        CALL     opiodr()             000000066 ? 000000006 ?
                                                   4000000001B01EC0 ?
                                                   0000046B0 ?
                                                   9FFFFFFFFFFEEF40 ?
                                                   60000000000B5E18 ?
skgmstack()+288      CALL     rpidrus()            9FFFFFFFFFFF1680 ?
                                                   9FFFFFFFFFFF10C0 ?
                                                   60000000000B5E18 ?
                                                   9FFFFFFFFFFF1640 ?
                                                   C000000000000716 ?
                                                   4000000002E819C0 ?
                                                   00000E05F ?
                                                   9FFFFFFFFFFF1120 ?
.
.
.

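Many bugs report ORA-600 [17059], but the call stack and the library cache dump printed before the ORA-600 confirm that this case matches "Bug 8922013 - OERI [17059] / excess child cursors for SQL referencing REMOTE objects" [ID 8922013.8].
When a statement accesses remote objects over a database link, its child cursors may not be reusable, which leads to large numbers of ORA-600 errors. The problem affects 10.2.0.4 and 10.2.0.5, and Oracle has still not fully resolved it. In fact, the current bug was introduced by the 10.2.0.4 fix for Bug 4581334 "Cursors accessing remote tables may be repeatedly rebuilt and not used". Oracle has not published a solution or stated which release will contain a fix. One-off patches are available for some platforms, but installing one means backing out the fix for Bug 4581334.
The bug does not cause any functional problem, but it can generate a large volume of trace files in a short time and fill the ORACLE_BASE directory.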
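
Since the practical risk is disk space rather than functionality, it helps to watch how much room the trace files take. The sketch below is a minimal Python illustration, not part of the original diagnosis: the function name trace_usage is hypothetical, and it only looks at files directly inside the given directory.

```python
import os


def trace_usage(directory, suffix=".trc"):
    """Return (file_count, total_bytes) for the files directly under
    `directory` whose names end with `suffix`."""
    count = 0
    total = 0
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file() and entry.name.endswith(suffix):
                count += 1
                total += entry.stat().st_size
    return count, total


# Usage sketch, with the bdump path that appears in the alert log above:
# files, size = trace_usage("/oraclelog/admin/orcl/bdump")
# print(f"{files} trace files, {size / 1024 ** 2:.1f} MiB")
```

Running it periodically and comparing the totals gives an early warning before the file system under ORACLE_BASE fills up.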


Severe latch: undo global data Waits on the System

A customer's 10.2.0.5 RAC environment showed severe latch: undo global data waits.

The AWR top events for the problem period were:

Event                          Waits        Time(s)    Avg Wait(ms)  % Total Call Time  Wait Class
latch: undo global data        6,245,400    1,372,583  220           22.0               Other
gc buffer busy                 114,190,782  1,329,749  12            21.3               Cluster
enq: TX - row lock contention  1,377,980    685,454    497           11.0               Application
CPU time                                    460,041                  7.4
enq: TX - index contention     602,648      285,683    474           4.6                Concurrency

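The most prominent waits are latch: undo global data and gc buffer busy. The latter is a common wait in RAC, and the SQL sections later in the report make it easy to identify the statements responsible. The former is unusual.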
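
As a quick consistency check on the AWR figures, the average wait per event is the total wait time divided by the number of waits. The short Python sketch below recomputes it from the Waits and Time(s) columns of the report. The helper name avg_wait_ms is only for illustration.

```python
# (event, waits, total wait time in seconds), copied from the AWR top events
TOP_EVENTS = [
    ("latch: undo global data", 6_245_400, 1_372_583),
    ("gc buffer busy", 114_190_782, 1_329_749),
    ("enq: TX - row lock contention", 1_377_980, 685_454),
    ("enq: TX - index contention", 602_648, 285_683),
]


def avg_wait_ms(waits, seconds):
    """Average time per wait, in milliseconds."""
    return seconds * 1000 / waits


for name, waits, seconds in TOP_EVENTS:
    print(f"{name:32s} {avg_wait_ms(waits, seconds):6.0f} ms")
```

The recomputed values round to the 220, 12, 497 and 474 ms shown in the report. An average of about 220 ms for a single latch wait is what makes this event stand out, even though gc buffer busy has far more waits.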

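According to MOS note "LATCH: UNDO GLOBAL DATA" In The Top Wait Events [ID 1451536.1], this wait is associated with insufficient undo space when the hidden parameter _undo_autotune is set to FALSE.

_undo_autotune was indeed disabled in this database. Most of the undo global data latch waits occurred at ktusm_stealext: KSLBEGIN, which shows that sessions looking for new undo extents were forced to steal unexpired undo extents.

There are three possible solutions: reduce the UNDO_RETENTION setting; increase the size of the undo tablespace named by UNDO_TABLESPACE; or set the hidden parameter _undo_autotune to TRUE.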
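
The trade-off between the first two options can be illustrated with a rough sizing calculation. The Python sketch below uses the commonly cited estimate that the undo space needed is roughly UNDO_RETENTION times the undo blocks generated per second times the block size, ignoring overhead. The function names and all input figures are hypothetical and are not taken from the customer's system. In practice the undo generation rate would be derived from V$UNDOSTAT.

```python
def undo_bytes_needed(undo_retention_s, undo_blocks_per_s, block_size=8192):
    """Rough undo space, in bytes, needed to keep `undo_retention_s` seconds
    of undo at the given generation rate (no overhead term)."""
    return undo_retention_s * undo_blocks_per_s * block_size


def max_retention_s(undo_bytes, undo_blocks_per_s, block_size=8192):
    """Longest retention, in whole seconds, that fits in `undo_bytes`."""
    return undo_bytes // (undo_blocks_per_s * block_size)


# Hypothetical figures: 3-hour retention, 100 undo blocks/s, 8 KB blocks.
needed = undo_bytes_needed(10_800, 100)
print(f"needed: {needed / 1024 ** 3:.2f} GiB")
print(f"with half the retention: {undo_bytes_needed(5_400, 100) / 1024 ** 3:.2f} GiB")
```

If the undo tablespace is smaller than the estimate for the configured retention, sessions have to steal unexpired extents, which is the situation described above. Either lowering the retention or growing the tablespace brings the two back in line.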

 


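20130712 LSI Channel Kickoff Summit

I attended the LSI Channel Kickoff Summit and ran into the CEO of Woqutech (沃趣科技) there.
Strictly speaking, Enmotech (云和恩墨) is not an LSI channel partner, but we had previously worked with LSI on how Nytro WarpDrive cards and Nytro MegaRAID improve Oracle database performance, so LSI invited us to today's channel summit as a partner.
As it happens, we are currently running an LSI POC at a customer site. The customer's database and SQL have already been tuned, and after tuning the main load falls on I/O, so we are considering caching as the next optimization. If the cache tests show a clear improvement, I will publish the comparison.
Woqutech has been working with FusionIO, so its presence at the LSI event was presumably also as a technology partner. This was the first time I had met grassbell in person. We got to know each other on ITPUB back in 2004, but after he joined Alibaba we never had a chance to meet, and I did not expect LSI to be the one to provide it.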


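Foreword to 《Oracle DBA实战攻略:运维管理、诊断优化、高可用与最佳实践》

As far as I recall I have rarely written a foreword for anyone, and only after picking up the pen did I realize that it is no easy task.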
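I first met Zhou Liang in October 2012 on a business trip to Hangzhou, introduced by a colleague there. It was our first meeting, but the conversation flowed easily. For one thing, we belong to the same circle: we had never dealt with each other before, but we have many friends in common, so it felt familiar. For another, we have a great deal in common and therefore plenty to talk about. We are both Oracle DBAs, both interested in Oracle technology, and both have spent many years in the Oracle community, so even talking only about technology we would never run out of topics. On top of that, our jobs are almost identical. Both of us currently work in operations on the service-provider side. I began managing Oracle database operations for a service provider in 2011, while Zhou Liang has been doing this far longer than I have, so in that respect his Oracle operations experience is much richer than mine, which gave us even more to discuss. That evening we went from Oracle databases to specific cases, from work to customers, from technology to teams. Had we talked any later, we would probably have moved on to life and ideals.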
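It was during that evening's conversation that I heard he was writing a book and had already been at it for several months. Eygle's new book had just been published, and several friends of mine happened to be writing books too, so the subject came up often and we talked about it a little longer. My first impression of his book was that it was compiled from cases and practical experience, and it struck me as similar to the 《DBA手记》 series.
Over the following months I heard that he was still writing steadily. I know first-hand how demanding service-provider work is. Few people manage even to jot down the key points of what they learn on top of a busy daily workload, and to keep writing a book without interruption takes real perseverance. Writing a book is different from writing a blog. Both may require squeezing out half an hour to an hour every day, but a book needs that time in one uninterrupted block, otherwise it is hard to develop a line of thought. With a blog, an idea or a case can be noted down quickly and then tested and verified later in spare moments. It may sound like the same hour, but the persistence and effort a book demands are far greater.
Only recently, when he finished the book and asked me to write the foreword, did I discover that it is not a simple collection of cases. A single thread ties the individual topics together, which makes it at least one level harder to write than a case collection such as 《DBA手记》.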
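Strictly speaking, I have never written a book, only articles. I contributed to 《Oracle数据库性能优化》 and 《DBA手记》, among others, but only by submitting chapters. The defining feature of those books is that no single thread runs through the content. Each chapter stands alone, which makes such books much easier to put together. Writing a book on your own means thinking about the overall structure, the background knowledge, the thread that runs through the whole book, and whether the chosen cases fit, which is clearly much harder than assembling cases. Even a case-based book is not simple to produce. Leaving aside the technically demanding work of choosing material and cases and judging how deep to go on each topic, revising and reviewing the whole book once the writing is done is no light task either. I have been through several rounds of such repeated revision, and each was painful enough that I still remember it vividly. If I had to write a book alone, with a clear thread, a thorough introduction to the fundamentals, coverage of the main parts of the Oracle architecture, plenty of in-depth cases as supporting evidence, and ideally some unpublished research results, just thinking about it is enough for me. So I have always admired people who can complete a book on their own, and the author is without doubt one of those experts.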
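What I admire in the author is not only his technical depth, nor only the hard work the book required, but above all his sustained focus on technology year after year. There was a recent discussion on Weibo about whether writing books is worthwhile. Given the state of technology and publishing in China, making money from technical books is Mission Impossible. Even Eygle, who publishes almost a book a year, each of them a bestseller, could not live on book income alone, let alone anyone else. Becoming known through books is also getting harder, and many readers in fact buy a book because the author is already well known. One commenter put it well: people know you for what you have done, not for what you have written. So building a reputation through books is not easy either. At present, whether the goal is fame or profit, the return on investment of writing a book is very poor. But for exactly that reason, fewer and fewer authors write for fame or profit, and writing has increasingly become a way for people who care about technology to consolidate, summarize, and improve what they know. Gold is found only after the sand has been washed away, and it is a pleasure to see that the Oracle books published recently or about to be published are all painstaking works by professionals in the field. The author's 《Oracle DBA实战攻略:运维管理、诊断优化、高可用与最佳实践》 is one of them.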
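Drawing on years of operations and diagnostic experience, the author starts with how to create a database and proceeds step by step through database startup and shutdown, configuring the listener and connecting to the database, managing and monitoring database space, adjusting and tuning the SGA, the checkpoint and SCN mechanisms together with backup and recovery, a methodology for database performance tuning, and the configuration and management of Oracle Data Guard. From the table of contents these look like very basic topics, and the book might seem to be an introductory text for Oracle beginners. It is exactly the opposite. The greatest truths are the simplest, and writing about fundamentals in a fresh way, with one's own understanding added, takes real skill. The author also draws on a large number of real cases and works his years of experience into them, using the diagnosis of complex cases to illustrate simple principles and concepts, and that is where the book is at its best. Moreover, he does not stop at case diagnosis and analysis. Drawing on the experience of many cases, he raises optimization, diagnosis, and problem solving to the level of methodology. That cannot be achieved simply by accumulating a few years of experience. It requires the author to keep thinking, analyzing, generalizing, and verifying before theory can guide practice.
Finally, I hope Zhou Liang's 《Oracle DBA实战攻略:运维管理、诊断优化、高可用与最佳实践》 will help more database enthusiasts solve the technical problems they meet day to day, and show operations staff who started out with routine tasks a path to deeper learning and improvement.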
