查询条件为ROWNUM=1仍产生长时间等待

正常情况下,指定了ROWNUM=1会迅速的返回结果,但是也有异常的情况产生。
其实ROWNUM=1之所以可以快速的返回结果,是由于执行计划中采用了STOPKEY的方式,当查询到第一条符合要求的记录后,执行就中止了。
但是ROWNUM=1并非对所有情况都适用,比如如果内存查询包含GROUP BY操作,那么ROWNUM=1就无法推到GROUP BY查询内部,因此这时的ROWNUM=1的条件对于查询速度的提高就非常有限了。还有一种情况,在指定了一个限制条件后,加上ROWNUM=1后,发现查询效率仍然不高。这时因为表中满足指定限制条件的记录非常少或者根本没有,以致于Oracle要扫描全部表数据后才能返回结果。
但是客户碰到了的现象和上面描述的几种情况都不相符,查询只是一个单表查询,唯一的限制条件就是ROWNUM=1,没有GROUP BY和ORDER BY语句,语句就是简单到不能再简单的:select * from RTDRULETRACEHISTORY where rownum = 1。
检查了这个语句的执行计划,并未发现异常之处:

Id  Operation             Name                ROWS  Bytes  Cost (%CPU) TIME  
0   SELECT STATEMENT                                       2 (100)   
1     COUNT STOPKEY            
2       TABLE ACCESS FULL RTDRULETRACEHISTORY  1    827    2 (0)       00:00:01

尝试指定索引列的条件,通过索引扫描的方式来访问这个表,查询瞬间就可以返回结果。这说明表本身没有问题。观察查询时刻的等待事件,发现绝大部分是正常的db file scattered read和gc cr multi block request等待,这说明并不是latch之类的将资源锁住。

*** 2011-10-18 10:16:35.765
*** ACTION NAME:() 2011-10-18 10:16:35.722
*** MODULE NAME:(SQL*Plus) 2011-10-18 10:16:35.722
*** SERVICE NAME:(SYS$USERS) 2011-10-18 10:16:35.722
*** SESSION ID:(1500.50317) 2011-10-18 10:16:35.722
WAIT #8: nam='SQL*Net message to client' ela= 9 driver id=1650815232 #bytes=1 p3=0 obj#=12803 tim=31019702733135
*** 2011-10-18 10:16:46.489
WAIT #8: nam='SQL*Net message from client' ela= 10471689 driver id=1650815232 #bytes=1 p3=0 obj#=12803 tim=31019713247383
=====================
PARSING IN CURSOR #4 len=50 dep=0 uid=33 oct=3 lid=33 tim=31019713253946 hv=3186239259 ad='964fa8b0'
SELECT * FROM RTDRULETRACEHISTORY WHERE rownum = 1
END OF STMT
PARSE #4:c=10000,e=5864,p=0,cr=0,cu=0,mis=1,r=0,dep=0,og=1,tim=31019713253748
BINDS #4:
EXEC #4:c=0,e=215,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,tim=31019713254416
WAIT #4: nam='SQL*Net message to client' ela= 13 driver id=1650815232 #bytes=1 p3=0 obj#=12803 tim=31019713254581
WAIT #4: nam='gc cr multi block request' ela= 2041 file#=22 block#=352168 class#=1 obj#=12803 tim=31019713257631
WAIT #4: nam='db file scattered read' ela= 1195 file#=22 block#=352164 blocks=5 obj#=12803 tim=31019713259143
WAIT #4: nam='gc cr multi block request' ela= 1888 file#=22 block#=352176 class#=1 obj#=12803 tim=31019713261925
WAIT #4: nam='db file scattered read' ela= 1732 file#=22 block#=352169 blocks=8 obj#=12803 tim=31019713264136
WAIT #4: nam='gc cr multi block request' ela= 1699 file#=22 block#=352184 class#=1 obj#=12803 tim=31019713266769
WAIT #4: nam='db file scattered read' ela= 764 file#=22 block#=352178 blocks=7 obj#=12803 tim=31019713267883
WAIT #4: nam='gc cr multi block request' ela= 1357 file#=22 block#=352192 class#=1 obj#=12803 tim=31019713270308
WAIT #4: nam='gc cr multi block request' ela= 215 file#=22 block#=352192 class#=1 obj#=12803 tim=31019713270605
WAIT #4: nam='db file scattered read' ela= 1489 file#=22 block#=352185 blocks=8 obj#=12803 tim=31019713272209
WAIT #4: nam='gc cr multi block request' ela= 46 file#=22 block#=352200 class#=1 obj#=12803 tim=31019713273728
WAIT #4: nam='gc cr multi block request' ela= 1897 file#=22 block#=352200 class#=1 obj#=12803 tim=31019713275710
WAIT #4: nam='db file scattered read' ela= 836 file#=22 block#=352194 blocks=7 obj#=12803 tim=31019713276890
WAIT #4: nam='gc cr multi block request' ela= 1555 file#=22 block#=352208 class#=1 obj#=12803 tim=31019713279226
WAIT #4: nam='db file scattered read' ela= 1397 file#=22 block#=352201 blocks=8 obj#=12803 tim=31019713280861
WAIT #4: nam='gc cr multi block request' ela= 2194 file#=22 block#=352216 class#=1 obj#=12803 tim=31019713284098
WAIT #4: nam='db file scattered read' ela= 971 file#=22 block#=352210 blocks=7 obj#=12803 tim=31019713285421
WAIT #4: nam='gc cr multi block request' ela= 1394 file#=22 block#=352224 class#=1 obj#=12803 tim=31019713287803
WAIT #4: nam='gc cr multi block request' ela= 153 file#=22 block#=352222 class#=1 obj#=12803 tim=31019713288044
WAIT #4: nam='db file scattered read' ela= 1206 file#=22 block#=352217 blocks=8 obj#=12803 tim=31019713289370
WAIT #4: nam='gc cr multi block request' ela= 1577 file#=22 block#=352232 class#=1 obj#=12803 tim=31019713291581
WAIT #4: nam='db file scattered read' ela= 1157 file#=22 block#=352226 blocks=7 obj#=12803 tim=31019713293069
WAIT #4: nam='gc cr multi block request' ela= 2132 file#=22 block#=352240 class#=1 obj#=12803 tim=31019713296012
WAIT #4: nam='db file scattered read' ela= 1285 file#=22 block#=352233 blocks=8 obj#=12803 tim=31019713297485
WAIT #4: nam='gc cr multi block request' ela= 1575 file#=22 block#=352248 class#=1 obj#=12803 tim=31019713300039
WAIT #4: nam='db file scattered read' ela= 2419 file#=22 block#=352242 blocks=7 obj#=12803 tim=31019713302788
WAIT #4: nam='gc cr multi block request' ela= 31 file#=22 block#=352255 class#=1 obj#=12803 tim=31019713303561
WAIT #4: nam='gc cr multi block request' ela= 1354 file#=22 block#=352255 class#=1 obj#=12803 tim=31019713305054
WAIT #4: nam='db file scattered read' ela= 1346 file#=22 block#=352249 blocks=8 obj#=12803 tim=31019713306635
WAIT #4: nam='db file scattered read' ela= 740 file#=22 block#=352258 blocks=4 obj#=12803 tim=31019713308265
WAIT #4: nam='db file scattered read' ela= 971 file#=22 block#=352263 blocks=2 obj#=12803 tim=31019713309864
WAIT #4: nam='db file scattered read' ela= 1146 file#=23 block#=353033 blocks=7 obj#=12803 tim=31019713311595
WAIT #4: nam='db file scattered read' ela= 1169 file#=23 block#=353042 blocks=7 obj#=12803 tim=31019713313700
WAIT #4: nam='db file scattered read' ela= 2273 file#=23 block#=353049 blocks=8 obj#=12803 tim=31019713316548
WAIT #4: nam='db file scattered read' ela= 1663 file#=24 block#=357643 blocks=16 obj#=12803 tim=31019713319295
WAIT #4: nam='db file scattered read' ela= 3674 file#=24 block#=357659 blocks=16 obj#=12803 tim=31019713324249
WAIT #4: nam='db file scattered read' ela= 2430 file#=24 block#=357675 blocks=16 obj#=12803 tim=31019713328180
WAIT #4: nam='db file scattered read' ela= 2894 file#=24 block#=357691 blocks=16 obj#=12803 tim=31019713332267
WAIT #4: nam='db file scattered read' ela= 7359 file#=24 block#=357707 blocks=16 obj#=12803 tim=31019713341148
WAIT #4: nam='db file scattered read' ela= 1682 file#=24 block#=357723 blocks=8 obj#=12803 tim=31019713343705
WAIT #4: nam='db file scattered read' ela= 1493 file#=24 block#=357732 blocks=16 obj#=12803 tim=31019713346356
WAIT #4: nam='gc cr multi block request' ela= 47 file#=24 block#=357763 class#=1 obj#=12803 tim=31019713347548
WAIT #4: nam='gc cr multi block request' ela= 1538 file#=24 block#=357763 class#=1 obj#=12803 tim=31019713349157
WAIT #4: nam='gc cr multi block request' ela= 79 file#=24 block#=357763 class#=1 obj#=12803 tim=31019713349320
WAIT #4: nam='gc cr multi block request' ela= 61 file#=24 block#=357763 class#=1 obj#=12803 tim=31019713349576
WAIT #4: nam='gc cr multi block request' ela= 62 file#=24 block#=357763 class#=1 obj#=12803 tim=31019713349688
WAIT #4: nam='db file scattered read' ela= 2292 file#=24 block#=357748 blocks=16 obj#=12803 tim=31019713352166
WAIT #4: nam='gc cr multi block request' ela= 1713 file#=24 block#=357768 class#=1 obj#=12803 tim=31019713354600
WAIT #4: nam='db file scattered read' ela= 942 file#=24 block#=357764 blocks=5 obj#=12803 tim=31019713355837
WAIT #4: nam='db file scattered read' ela= 1830 file#=22 block#=352267 blocks=16 obj#=12803 tim=31019713358515
WAIT #4: nam='db file scattered read' ela= 2326 file#=22 block#=352283 blocks=16 obj#=12803 tim=31019713362098
WAIT #4: nam='db file scattered read' ela= 1546 file#=22 block#=352299 blocks=16 obj#=12803 tim=31019713364587
WAIT #4: nam='db file scattered read' ela= 2004 file#=22 block#=352315 blocks=16 obj#=12803 tim=31019713367496
WAIT #4: nam='db file scattered read' ela= 1926 file#=22 block#=352331 blocks=16 obj#=12803 tim=31019713370250
WAIT #4: nam='db file scattered read' ela= 1750 file#=22 block#=352347 blocks=16 obj#=12803 tim=31019713373524
WAIT #4: nam='db file scattered read' ela= 1375 file#=22 block#=352363 blocks=16 obj#=12803 tim=31019713375927
WAIT #4: nam='gc cr multi block request' ela= 2280 file#=22 block#=352392 class#=1 obj#=12803 tim=31019713380187
WAIT #4: nam='db file scattered read' ela= 6749 file#=22 block#=352379 blocks=14 obj#=12803 tim=31019713387286
WAIT #4: nam='gc cr multi block request' ela= 2624 file#=23 block#=353178 class#=1 obj#=12803 tim=31019713391083
WAIT #4: nam='db file scattered read' ela= 1645 file#=23 block#=353163 blocks=16 obj#=12803 tim=31019713392893
WAIT #4: nam='gc cr multi block request' ela= 1470 file#=23 block#=353194 class#=1 obj#=12803 tim=31019713395432
WAIT #4: nam='gc cr multi block request' ela= 498 file#=23 block#=353194 class#=1 obj#=12803 tim=31019713396168
WAIT #4: nam='db file scattered read' ela= 2708 file#=23 block#=353179 blocks=16 obj#=12803 tim=31019713399190
WAIT #4: nam='gc cr multi block request' ela= 70 file#=23 block#=353210 class#=1 obj#=12803 tim=31019713401176
WAIT #4: nam='gc cr multi block request' ela= 2393 file#=23 block#=353210 class#=1 obj#=12803 tim=31019713403660
WAIT #4: nam='db file scattered read' ela= 2962 file#=23 block#=353195 blocks=16 obj#=12803 tim=31019713406784
WAIT #4: nam='gc cr multi block request' ela= 2055 file#=23 block#=353220 class#=1 obj#=12803 tim=31019713409738
WAIT #4: nam='db file scattered read' ela= 1882 file#=23 block#=353211 blocks=10 obj#=12803 tim=31019713411882
WAIT #4: nam='gc cr multi block request' ela= 43 file#=23 block#=353237 class#=1 obj#=12803 tim=31019713414032
WAIT #4: nam='gc cr multi block request' ela= 2417 file#=23 block#=353237 class#=1 obj#=12803 tim=31019713416569
WAIT #4: nam='db file scattered read' ela= 1825 file#=23 block#=353222 blocks=16 obj#=12803 tim=31019713418775
WAIT #4: nam='gc cr multi block request' ela= 1917 file#=23 block#=353245 class#=1 obj#=12803 tim=31019713421691
WAIT #4: nam='gc cr multi block request' ela= 90 file#=23 block#=353245 class#=1 obj#=12803 tim=31019713421908
WAIT #4: nam='gc cr multi block request' ela= 62 file#=23 block#=353245 class#=1 obj#=12803 tim=31019713422020
WAIT #4: nam='gc cr multi block request' ela= 84 file#=23 block#=353245 class#=1 obj#=12803 tim=31019713422253
WAIT #4: nam='db file scattered read' ela= 1067 file#=23 block#=353238 blocks=8 obj#=12803 tim=31019713423663
WAIT #4: nam='gc cr multi block request' ela= 41 file#=23 block#=353251 class#=1 obj#=12803 tim=31019713424512
WAIT #4: nam='gc cr multi block request' ela= 1167 file#=23 block#=353251 class#=1 obj#=12803 tim=31019713425733
WAIT #4: nam='gc cr multi block request' ela= 265 file#=23 block#=353251 class#=1 obj#=12803 tim=31019713426060
WAIT #4: nam='db file scattered read' ela= 1041 file#=23 block#=353247 blocks=5 obj#=12803 tim=31019713427205
WAIT #4: nam='gc cr multi block request' ela= 2468 file#=23 block#=353268 class#=1 obj#=12803 tim=31019713430616
WAIT #4: nam='db file scattered read' ela= 1562 file#=23 block#=353253 blocks=16 obj#=12803 tim=31019713432505
.
.
.
WAIT #4: nam='db file scattered read' ela= 12261 file#=23 block#=582746 blocks=16 obj#=12803 tim=31019894112271
WAIT #4: nam='db file scattered read' ela= 4545 file#=23 block#=582762 blocks=16 obj#=12803 tim=31019894118627
WAIT #4: nam='gc cr multi block request' ela= 55 file#=23 block#=582792 class#=1 obj#=12803 tim=31019894119991
WAIT #4: nam='gc cr multi block request' ela= 2321 file#=23 block#=582792 class#=1 obj#=12803 tim=31019894122435
WAIT #4: nam='db file scattered read' ela= 35721 file#=23 block#=582778 blocks=15 obj#=12803 tim=31019894158558
WAIT #4: nam='db file scattered read' ela= 2538 file#=24 block#=588562 blocks=16 obj#=12803 tim=31019894163029
FETCH #4:c=73150000,e=180908720,p=532660,cr=537132,cu=0,mis=0,r=1,dep=0,og=1,tim=31019894163425
WAIT #4: nam='SQL*Net message from client' ela= 2818 driver id=1650815232 #bytes=1 p3=0 obj#=12803 tim=31019894166521
FETCH #4:c=0,e=54,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,tim=31019894166683
WAIT #4: nam='SQL*Net message to client' ela= 3 driver id=1650815232 #bytes=1 p3=0 obj#=12803 tim=31019894166836
*** 2011-10-18 10:20:15.486
WAIT #4: nam='SQL*Net message from client' ela= 23178892 driver id=1650815232 #bytes=1 p3=0 obj#=12803 tim=31019917345799
STAT #4 id=1 cnt=1 pid=0 pos=1 obj=0 op='COUNT STOPKEY (cr=537132 pr=532660 pw=0 time=180908666 us)'
STAT #4 id=2 cnt=1 pid=1 pos=1 obj=12803 op='TABLE ACCESS FULL RTDRULETRACEHISTORY (cr=537132 pr=532660 pw=0 time=180908601 us)'
=====================
PARSING IN CURSOR #10 len=55 dep=0 uid=33 oct=42 lid=33 tim=31019917347295 hv=2217940283 ad='0'
ALTER SESSION SET events '10046 trace name context off'
END OF STMT
PARSE #10:c=0,e=633,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=0,tim=31019917347273
EXEC #10:c=0,e=283,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=0,tim=31019917347733

询问了一下这张表的业务处理逻辑,找到了最终问题所在。这是表存放的数据只会保留一段时间,因此每天都会执行删除操作来删除。大量删除的操作使得表的前部产生大量的空块,因此Oracle即使要获取一条数据,也要扫描这些空块知道找到第一条未删除的记录。这和即使表中没有数据,但是高水位线很高,COUNT(*)操作也要执行很长时间的道理是一样的。
根据表的统计信息计算,表中接近一半的大小的空的,而这些空行几种分布在表的前端,造成了这个问题的产生,解决问题的方法很简单,对于可以短时间中断业务的环境,只需要MOVE然后REBUILD索引既可,而对于业务不能停顿的系统,可以使用SHRINK来在线回收表中的空闲空间,当然如果表上的操作比较频繁,还可以使用在线重定义来接近问题。
虽然问题解决了,但是要规避这个问题的再次出现,还必须建立适合的数据生命周期管理方案,通过合理高效的方式来处理数据,避免通过大量的DELETE删除过期数据。

Posted in ORACLE | Tagged , , | Leave a comment

FAL_SERVER设置错误导致SWITCHOVER切换HANG住

客户在进行DATA GUARD的SWITCHOVER时,操作长时间没有反应。
操作如下:

SQL> ALTER DATABASE COMMIT TO SWITCHOVER TO PHYSICAL STANDBY WITH SESSION SHUTDOWN;

这个命令等待很长时间没有结束,虽然数据库的SWITCHOVER_STATUS并非TO STANDBY而是SESSION ACTIVE,但是几乎所有的LOCAL=NO的连接已经在操作系统级kill掉了,而且使用了WITH SESSION SHUTDOWN语法,理论上不应该是连接问题导致SWITCHOVER无法完成。
检查告警日志:

Fri Oct 14 23:03:45 2011
ALTER DATABASE COMMIT TO SWITCHOVER TO PHYSICAL STANDBY WITH SESSION SHUTDOWN
Fri Oct 14 23:03:45 2011
ALTER DATABASE COMMIT TO SWITCHOVER TO PHYSICAL STANDBY (db1)
Fri Oct 14 23:03:47 2011
Thread 1 cannot allocate NEW log, SEQUENCE 133983
Private strand FLUSH NOT complete
CURRENT log# 4 seq# 133982 mem# 0: /dev/rredodbs41
CURRENT log# 4 seq# 133982 mem# 1: /dev/rredodbs42
Fri Oct 14 23:03:50 2011
Thread 1 advanced TO log SEQUENCE 133983 (LGWR switch)
CURRENT log# 5 seq# 133983 mem# 0: /dev/rredodbs51
CURRENT log# 5 seq# 133983 mem# 1: /dev/rredodbs52
Fri Oct 14 23:03:50 2011
Waiting FOR ALL non-CURRENT ORLs TO be archived...
Fri Oct 14 23:03:50 2011
Waiting FOR the ORL FOR thread 1 SEQUENCE 133982 TO be archived...
Fri Oct 14 23:04:00 2011
ORL FOR thread 1 SEQUENCE 133982 has been archived...
Fri Oct 14 23:04:00 2011
ALL non-CURRENT ORLs have been archived.
Fri Oct 14 23:04:00 2011
Waiting FOR ALL FAL entries TO be archived...
Fri Oct 14 23:25:29 2011
ORA-1013 signalled during: ALTER DATABASE COMMIT TO SWITCHOVER TO PHYSICAL STANDBY WITH SESSION SHUTDOWN...

数据库在等待和FAL配置有关,由于数据库的操作一直停在了这里,最后不得以中止了SWITCHOVER的操作。
检查客户的FAL相关配置,发现FAL_SERVER的设置可能存在问题:

SQL> SHOW PARAMETER FAL 
NAME                       TYPE        VALUE
-------------------------- ----------- --------------------
fal_client                 string      db1
fal_server                 string      db1st,db1st2

由于当前主库配置了两个STANDBY数据库,分别是db1st和db2st,当前的设置明显是希望无论主库和那个备库进行切换后,通过设置两个fal_server,DATA GUARD配置都可以正常,而不需要修改配置。
但是目前看来,似乎这种设置两个参数的方式并不生效,使得在SWITCHOVER的时Oracle停了下来。
将FAL_SERVER设置为要进行SWITCHOVER的备库的TNS服务名:

SQL> ALTER SYSTEM SET fal_server='db1st';
System altered.

重启数据库,使得参数生效。虽然这个参数的修改不需要重启数据库,但是在运行SWITCHOVER的时候发现,如果不重启数据库,Oracle并没有重新获取FAL_SERVER的值,而是根据本次启动时加载的FAL_SERVER的值。
重新启动后再次运行SWITCHOVER,切换成功完成:

SQL> ALTER DATABASE COMMIT TO SWITCHOVER TO PHYSICAL STANDBY WITH SESSION SHUTDOWN;
DATABASE altered.

一直认为FAL_SERVER和FAL_CLIENT只是DATA GUARD配置用来自动获取GAP的参数,没想到这个参数的设置对于SWITCHOVER也有这么大的影响。

Posted in ORACLE | Tagged , , , | Leave a comment

DATA GUARD出现ORA-16032错误

客户在配置DATA GUARD过程中碰到ORA-1031和ORA-16032错误。
其中ORA-1031错误很直观,直接报错在归档时没有权限:

Wed Sep 28 15:43:26 2011
ALTER SYSTEM SET log_archive_dest_2='service=db1st lgwr async valid_for=(ONLINE_LOGFILES,PRIMARY_ROLE) db_unique_name=db1st' SCOPE=BOTH;
Wed Sep 28 15:43:26 2011
ALTER SYSTEM SET log_archive_dest_state_1='enable' SCOPE=BOTH;
Wed Sep 28 15:43:26 2011
ALTER SYSTEM SET log_archive_dest_state_2='enable' SCOPE=BOTH;
Wed Sep 28 15:43:26 2011
ALTER SYSTEM SET log_archive_config='DG_CONFIG=(db1,db1st)' SCOPE=BOTH;
Wed Sep 28 15:43:26 2011
ALTER SYSTEM SET fal_client='db1' SCOPE=BOTH;
Wed Sep 28 15:43:26 2011
ALTER SYSTEM SET fal_server='db1st' SCOPE=BOTH;
Wed Sep 28 15:43:26 2011
ALTER SYSTEM SET standby_file_management='auto' SCOPE=BOTH;
Wed Sep 28 15:43:26 2011
ARC0: STARTING ARCH PROCESSES
Wed Sep 28 15:43:26 2011
ALTER SYSTEM SET log_archive_max_processes=3 SCOPE=BOTH;
ARC2: Archival started
ARC0: STARTING ARCH PROCESSES COMPLETE
ARC2 started WITH pid=242, OS id=925904
Wed Sep 28 15:43:26 2011
ALTER SYSTEM SET db_file_name_convert='/dev/','/dev/' SCOPE=SPFILE;
Wed Sep 28 15:43:26 2011
ALTER SYSTEM SET log_file_name_convert='/dev/','/dev/' SCOPE=SPFILE;
Wed Sep 28 15:44:02 2011
Error 1031 received logging ON TO the standby
Wed Sep 28 15:44:02 2011
Errors IN file /oracle/admin/db1/bdump/db1_arc0_90394.trc:
ORA-01031: insufficient privileges
PING[ARC0]: Heartbeat failed TO CONNECT TO standby 'db1st'. Error IS 1031.
Wed Sep 28 15:49:03 2011
Error 1031 received logging ON TO the standby
Wed Sep 28 15:49:03 2011
Errors IN file /oracle/admin/db1/bdump/db1_arc0_90394.trc:
ORA-01031: insufficient privileges
PING[ARC0]: Heartbeat failed TO CONNECT TO standby 'db1st'. Error IS 1031.
Wed Sep 28 15:54:03 2011
Error 1031 received logging ON TO the standby
Wed Sep 28 15:54:03 2011
Errors IN file /oracle/admin/db1/bdump/db1_arc0_90394.trc:
ORA-01031: insufficient privileges
PING[ARC0]: Heartbeat failed TO CONNECT TO standby 'db1st'. Error IS 1031.
Wed Sep 28 15:57:29 2011
Archivelog restore complete. Elapsed TIME: 0:00:00 
Wed Sep 28 15:59:03 2011
Error 1031 received logging ON TO the standby
Wed Sep 28 15:59:03 2011
Errors IN file /oracle/admin/db1/bdump/db1_arc0_90394.trc:
ORA-01031: insufficient privileges
PING[ARC0]: Heartbeat failed TO CONNECT TO standby 'db1st'. Error IS 1031.

在归档的详细日志中,错误如下:

/oracle/admin/db1/bdump/db1_arc0_90394.trc
Oracle DATABASE 10g Enterprise Edition Release 10.2.0.3.0 - 64bit Production
WITH the Partitioning, OLAP AND DATA Mining options
ORACLE_HOME = /oracle/product/db10gr2
System name: AIX
Node name: ser1
Release: 3
Version: 5
Machine: 00C23C4F4C00
Instance name: db1
Redo thread mounted BY this instance: 1
Oracle process NUMBER: 14
Unix process pid: 90394, image: oracle@ser1 (ARC0)
*** 2011-09-28 15:42:05.012
*** SERVICE NAME:(SYS$BACKGROUND) 2011-09-28 15:42:05.006
*** SESSION ID:(541.1) 2011-09-28 15:42:05.006
Redo shipping client performing standby login
OCISessionBegin failed -1
.. Detailed OCI error val IS 1031 AND errmsg IS 'ORA-01031: insufficient privileges
'
*** 2011-09-28 15:42:05.101 61287 kcrr.c
Error 1031 received logging ON TO the standby
Error 1031 connecting TO destination LOG_ARCHIVE_DEST_1 standby host 'db1st'
kcrrwupirfs KCRROCIS Handle 0xfffffffffffb608 has NULL OCI servicehandle - Returning success
Error 1031 attaching TO destination LOG_ARCHIVE_DEST_1 standby host 'db1st'
ORA-01031: insufficient privileges
*** 2011-09-28 15:42:05.101 59530 kcrr.c
kcrrfail: dest:1 err:1031 force:0 blast:1
kcrrwupirfs KCRROCIS Handle 0xfffffffffffb608 has NULL OCI servicehandle - Returning success
kcrrwkx: UNKNOWN error:1031
ORA-16055: FAL request rejected
ARCH: Connecting TO console port...
ARCH: Connecting TO console port...
Redo shipping client performing standby login
OCISessionBegin failed -1
.. Detailed OCI error val IS 1031 AND errmsg IS 'ORA-01031: insufficient privileges
'
*** 2011-09-28 15:42:05.565 61287 kcrr.c
Error 1031 received logging ON TO the standby
Error 1031 connecting TO destination LOG_ARCHIVE_DEST_1 standby host 'db1st'
kcrrwupirfs KCRROCIS Handle 0xfffffffffffb608 has NULL OCI servicehandle - Returning success
Error 1031 attaching TO destination LOG_ARCHIVE_DEST_1 standby host 'db1st'
ORA-01031: insufficient privileges
*** 2011-09-28 15:42:05.565 59530 kcrr.c
kcrrfail: dest:1 err:1031 force:0 blast:1
kcrrwupirfs KCRROCIS Handle 0xfffffffffffb608 has NULL OCI servicehandle - Returning success
kcrrwkx: UNKNOWN error:1031
ORA-16055: FAL request rejected
ARCH: Connecting TO console port...
ARCH: Connecting TO console port...
Redo shipping client performing standby login
OCISessionBegin failed -1
.. Detailed OCI error val IS 1031 AND errmsg IS 'ORA-01031: insufficient privileges
'
*** 2011-09-28 15:42:05.679 61287 kcrr.c
Error 1031 received logging ON TO the standby
Error 1031 connecting TO destination LOG_ARCHIVE_DEST_1 standby host 'db1st'
kcrrwupirfs KCRROCIS Handle 0xfffffffffffb608 has NULL OCI servicehandle - Returning success
Error 1031 attaching TO destination LOG_ARCHIVE_DEST_1 standby host 'db1st'
ORA-01031: insufficient privileges

导致ORA-1031错误的问题很简单,归档进程没有权限将归档日志写到远端,只需要将主库的密码文件拷贝到远端的对应位置上既可。
不过权限问题解决后,归档又出现新的错误:

Thu Sep 29 00:06:19 2011
ARC0: Attempting destination LOG_ARCHIVE_DEST_2 network reconnect (16032)
ARC0: Destination LOG_ARCHIVE_DEST_2 network reconnect abandoned
Thu Sep 29 00:06:19 2011
Errors IN file /oracle/admin/db1/bdump/db1_arc0_90394.trc:
ORA-16032: parameter destination string cannot be translated
FAL[server, ARC0]: Error 16032 creating remote archivelog file 'db1st'
Thu Sep 29 00:06:19 2011
ARC2: Attempting destination LOG_ARCHIVE_DEST_2 network reconnect (16032)
ARC2: Destination LOG_ARCHIVE_DEST_2 network reconnect abandoned
Thu Sep 29 00:06:19 2011
Errors IN file /oracle/admin/db1/bdump/db1_arc2_925904.trc:
ORA-16032: parameter destination string cannot be translated
FAL[server, ARC2]: Error 16032 creating remote archivelog file 'db1st'
FAL[server, ARC2]: FAL archive failed, see trace file.
Thu Sep 29 00:06:19 2011
Errors IN file /oracle/admin/db1/bdump/db1_arc2_925904.trc:
ORA-16055: FAL request rejected
ARCH: FAL archive failed. Archiver continuing
Thu Sep 29 00:06:19 2011
ORACLE Instance db1 - Archival Error. Archiver continuing.

对应的归档进程的详细错误:

*** 2011-09-29 00:06:19.480
Redo shipping client performing standby login
*** 2011-09-29 00:06:19.583 65194 kcrr.c
Logged ON TO standby successfully
Client logon AND security negotiation successful!
tkcrrsarc: (WARN) Failed TO find ARCH FOR message (message:0xa)
tkcrrpa: (WARN) Failed initial attempt TO send ARCH message (message:0xa)
tkcrrsarc: (WARN) Failed TO find ARCH FOR message (message:0xa)
tkcrrpa: (WARN) Failed initial attempt TO send ARCH message (message:0xa)
tkcrrsarc: (WARN) Failed TO find ARCH FOR message (message:0xa)
tkcrrpa: (WARN) Failed initial attempt TO send ARCH message (message:0xa)
tkcrrsarc: (WARN) Failed TO find ARCH FOR message (message:0x2)
tkcrrpa: (WARN) Failed initial attempt TO send ARCH message (message:0x2)
Error 16032 creating standby archive log file at host 'db1st'
*** 2011-09-29 00:06:19.843 61287 kcrr.c
ARC0: Attempting destination LOG_ARCHIVE_DEST_2 network reconnect (16032)
*** 2011-09-29 00:06:19.843 61287 kcrr.c
ARC0: Destination LOG_ARCHIVE_DEST_2 network reconnect abandoned
ORA-16032: parameter destination string cannot be translated
*** 2011-09-29 00:06:19.852 59530 kcrr.c
kcrrfail: dest:2 err:16032 force:0 blast:1
kcrrwkx: UNKNOWN error:16032
ORA-16055: FAL request rejected
ARCH: Connecting TO console port...
ARCH: Connecting TO console port...
FAL[server]: Selected inactive destination; resetting temporarily
Redo shipping client performing standby login
*** 2011-09-29 00:06:22.106 65194 kcrr.c
Logged ON TO standby successfully
Client logon AND security negotiation successful!
Error 16032 creating standby archive log file at host 'db1st'
*** 2011-09-29 00:06:22.157 61287 kcrr.c
ARC0: Attempting destination LOG_ARCHIVE_DEST_2 network reconnect (16032)
*** 2011-09-29 00:06:22.157 61287 kcrr.c
ARC0: Destination LOG_ARCHIVE_DEST_2 network reconnect abandoned
ORA-16032: parameter destination string cannot be translated
*** 2011-09-29 00:06:22.157 59530 kcrr.c
kcrrfail: dest:2 err:16032 force:0 blast:1
kcrrwkx: UNKNOWN error:16032
ORA-16055: FAL request rejected
ARCH: Connecting TO console port...
ARCH: Connecting TO console port...
FAL[server]: Selected inactive destination; resetting temporarily
Redo shipping client performing standby login

其实导致这个ORA-16032错误的原因也很简单,从最前面的代码可以看出,在配置初始化参数的时候先配置了LOG_ARCHIVE_DEST_N,然后才配置LOG_ARCHIVE_CONFIG参数,这就使得LOG_ARCHIVE_DEST_N参数生效的时候无法从DG_CONFIG中获取到响应的配置,导致了ORA-16032的错误。
要避免这个错误很简单,只需要先配置LOG_ARCHIVE_CONFIG,然后再配置LOG_ARCHIVE_DEST_N参数既可。对于已经出现的这个错误,只需要通过参数LOG_ARCHIVE_DEST_STATE_N暂停日志,随后在启用,Oracle就会重新分析LOG_ARCHIVE_DEST_N中的配置:

Thu Sep 29 12:40:42 2011
Errors IN file /oracle/admin/db1/bdump/db1_arc0_90394.trc:
ORA-16032: parameter destination string cannot be translated
Thu Sep 29 12:40:42 2011
FAL[server, ARC0]: FAL archive failed, see trace file.
Thu Sep 29 12:40:42 2011
Errors IN file /oracle/admin/db1/bdump/db1_arc0_90394.trc:
ORA-16055: FAL request rejected
ARCH: FAL archive failed. Archiver continuing
Thu Sep 29 12:40:42 2011
ORACLE Instance db1 - Archival Error. Archiver continuing.
Thu Sep 29 12:46:24 2011
ALTER SYSTEM SET log_archive_dest_state_2='DEFER' SCOPE=BOTH;
Thu Sep 29 13:57:56 2011
FAL[server, ARC2]: FAL archive failed, see trace file.
Thu Sep 29 13:57:56 2011
Errors IN file /oracle/admin/db1/bdump/db1_arc2_925904.trc:
ORA-16055: FAL request rejected
ARCH: FAL archive failed. Archiver continuing
Thu Sep 29 13:57:56 2011
ORACLE Instance db1 - Archival Error. Archiver continuing.
Thu Sep 29 13:57:56 2011
ARCH: Possible network disconnect WITH PRIMARY DATABASE
Thu Sep 29 13:57:56 2011
ORACLE Instance db1 - Archival Error. Archiver continuing.
Thu Sep 29 13:58:26 2011
FAL[server]: Fail TO queue the whole FAL gap
GAP - thread 1 SEQUENCE 3357-3363
DBID 1488093310 branch 682019072
Thu Sep 29 14:09:42 2011
FAL[server, ARC0]: FAL archive failed, see trace file.
Thu Sep 29 14:09:42 2011
Errors IN file /oracle/admin/db1/bdump/db1_arc0_90394.trc:
ORA-16055: FAL request rejected
ARCH: FAL archive failed. Archiver continuing
Thu Sep 29 14:09:42 2011
ORACLE Instance db1 - Archival Error. Archiver continuing.
Thu Sep 29 14:09:42 2011
FAL[server, ARC2]: FAL archive failed, see trace file.
Thu Sep 29 14:09:42 2011
Errors IN file /oracle/admin/db1/bdump/db1_arc2_925904.trc:
ORA-16055: FAL request rejected
ARCH: FAL archive failed. Archiver continuing
Thu Sep 29 14:09:57 2011
ALTER SYSTEM SET log_archive_dest_state_2='ENABLE' SCOPE=BOTH;
FAL[server]: Fail TO queue the whole FAL gap
GAP - thread 1 SEQUENCE 3361-3363
DBID 1488093310 branch 682019072
LNS1 started WITH pid=128, OS id=667772
Thu Sep 29 16:14:17 2011
Thread 1 advanced TO log SEQUENCE 3366
CURRENT log# 2 seq# 3366 mem# 0: /dev/rredo02
Thu Sep 29 16:14:18 2011
******************************************************************
LGWR: Setting 'active' archival FOR destination LOG_ARCHIVE_DEST_2
******************************************************************
LNS: Standby redo logfile selected FOR thread 1 SEQUENCE 3366 FOR destination LOG_ARCHIVE_DEST_2
Thu Sep 29 16:14:23 2011
FAL[server]: Fail TO queue the whole FAL gap
GAP - thread 1 SEQUENCE 3365-3365
DBID 1488093310 branch 682019072
Thu Sep 29 16:14:25 2011
ARC2: Standby redo logfile selected FOR thread 1 SEQUENCE 3365 FOR destination LOG_ARCHIVE_DEST_2

将参数重置后,错误消失,远端归档顺利开始。

Posted in ORACLE | Tagged , , , , , | Leave a comment

DATA GUARD出现ORA-26040错误

客户在搭建好的DATA GUARD环境出现了ORA-26040错误。
在备库只读打开时,出现了下面的错误:

ORA-12801: error signaled IN parallel query server P000
ORA-01578: ORACLE DATA block corrupted (file # 75, block # 1046445)
ORA-01110: DATA file 75: '/oradata2/accdb/a_txn_data09.dbf'
ORA-26040: DATA block was loaded USING the NOLOGGING OPTION
ORA-12801: error signaled IN parallel query server P000
ORA-01578: ORACLE DATA block corrupted (file # 75, block # 1046445)
ORA-01110: DATA file 75: '/oradata2/accdb/a_txn_data09.dbf'
ORA-26040: DATA block was loaded USING the NOLOGGING OPTION

这个错误其实很容易定位,由于建立DATA GUARD时没有开启FORCE LOGGING,而主库在某些表空间运行了直径路径或者NOLOGGING的操作,导致备库一些对象不可恢复。
其实对于这个错误可以很方便的解决,首先将主库和备库分别设置FORCE LOGGING。需要注意,一般在部署DATA GUARD时,只需要在主库进行FORCE LOGGING的设置,这是因为在标准流程中,这个设置在控制文件备份之前,如果在STANDBY控制文件产生之后才对主库进行设置,那么备库也需要进行同样的设置,因为FORCE LOGGING的设置是记录在控制文件中的。
下一步就是在主库重新备份报错的数据文件,然后关闭备库并启动到MOUNT状态,重新还原并恢复这些错误的数据文件既可。
对于绝大部分的情况,备库出现了故障后都没有必要完全重建,而只需要修复故障的部分既可。

Posted in ORACLE | Tagged , , , , | Leave a comment

《Oracle DBA手记3——数据库性能优化与内部原理解析》出版

发现最近总在打广告,等待已久的《Oracle DBA手记3——数据库性能优化与内部原理解析》一书终于出版了。
在前两天我们组织的Oracle技术嘉年华大会上,有一些朋友已经拿到了这本书,而我是今天刚刚才拿到手的。
DBA手记2在筹划阶段,Eygle就要我提供稿件,当时手头事情比较多,稿件的整理慢了一步,发给Eygle的时候,他已经将其他的稿件提交给出版社了,于是这些文章就放到了DBA手记3中。
无论是自夸也好,自谦也罢,书已经出版了,作者再说什么就都没有什么意义了,还是让读者去自己评判吧。

Posted in NEWS | Leave a comment

20111021 Oracle技术嘉年华落幕

为期两天的Oracle技术嘉年华今天结束了,准备了这么久,这么快就结束还真有点不适应。
当然这种会议不太可能和Oracle的Open World那样,一搞就是4天,短短两天会议对于我们的组织方来说,已经是不小的考验了。不过庆幸的是,直到最后一场演讲的结束,会场上的空位也并不太多。对于常参加会议的都比较清楚,一到下午最后一两场演讲,基本上很多人就会提前走,而到了会议的最后一天下午就更是如此,一般来说还有一半的人就很不错了,但是今天的会议到最后一个演讲之前,抽奖就完成了,即使这样,提前退场的也是寥寥无几。
作为会议的主办方,如果大家对于这次会议满意就是我们最大的欣慰了。在此感激所有参加这次会议的朋友们。

Posted in NEWS | Leave a comment

20111020 Oracle技术嘉年华首日

今天筹划已久的“Oracle技术嘉年华”(OTN China Tour 2011)在北京国宾饭店召开。
以前参加过很多次大会,直到这次才感受到成功举办一次这种规模的大会有多么困难,当然这也和我们公司在这方面没有太多经验有关系,幸好我们的联合主办方ITPUB&IT168在会议组织方面有比较丰富的经验,在一定程度上弥补了我们的不足。即便如此,很多同事为这次大会的召开仍然付出了大量的心血,由于前一段时间比较忙,自己在大会的准备工作上,并没有投入太多的精力,在这里对他们表示感谢。
今天会议的各个主题十分精彩,这与国内外演讲者的自身水平有很大关系,而且这场会议只有一两个赞助商,所以绝大部分场次的主题都是具有一定深度的技术分享,在一般会议中被广为诟病的产品广告被压缩到最小的程度。这也是我们这次会议之所以称谓技术嘉年华的主要原因。

Posted in NEWS | Leave a comment

ORA-600(729)错误

客户数据库出现ORA-600[729]错误。
详细信息如下:

Sun Oct 16 04:31:31 2011
Errors IN file /u01/app/oracle/admin/db1/udump/db1_ora_6459722.trc:
ORA-00600: internal error code, arguments: [729], [360], [SPACE leak], [], [], [], [], []

这个ORA-600的三个参数的含义分别为:729是错误的编号,729的错误和内存泄漏相关;360是内存泄漏的字节数;space leak:对于729错误,第三个参数永远是space leak,表明这是一个内存泄漏的问题。
正常情况下,Oracle的进程会自动管理当前进程内存的分配和释放,只有当进程结束的时候,Oracle会接管进程的内存,将其全部释放。在Oracle释放进程的内存时,会检查每个内存chunk的状态,正常情况下所有的状态都是free和permanent,如果出现了其他状态,比如FREEABLE或者RECREATABLE等,说明进程没有及时的释放一些内存chunk,这就是所谓的内存泄漏。
详细的TRACE如下:

*** 2011-10-16 04:31:31.115
*** SERVICE NAME:(SYS$USERS) 2011-10-16 04:31:31.114
*** SESSION ID:(6191.1172) 2011-10-16 04:31:31.114
******** ERROR: UGA memory leak detected 360 ********
******************************************************
HEAP DUMP heap name="session heap"  DESC=1104473d8
 extent sz=0xff80 alt=32767 het=32767 rec=0 flg=2 opc=3
 parent=110198a30 owner=7000013ff452bf0 nex=0 xsz=0xff80
EXTENT 0 addr=110490080
  Chunk        110490090 sz=     7800    perm      "perm           "  alo=4480
  Chunk        110491f08 sz=    57592    free      "               "
EXTENT 1 addr=110480080
  Chunk        110480090 sz=    16184    free      "               "
  Chunk        110483fc8 sz=      192    perm      "perm           "  alo=192
  Chunk        110484088 sz=    22136    free      "               "
  Chunk        110489700 sz=      360    freeable  "K2I iter       "
Dump OF memory FROM 0x0000000110489700 TO 0x0000000110489868
110489700 00B38F00 00000169 00000001 10484088  [.......i.....H@.]
110489710 00000001 04DEFFC0 0000BEA1 0000000A  [................]
110489720 00000008 00000000 00000001 104897E4  [.............H..]
110489730 00000001 10489824 00000000 00000000  [.....H.$........]
.
.
.
110489860 00000000 00000000                    [........]        
  Chunk        110489868 sz=    26520    free      "               "
EXTENT 2 addr=110460080
  Chunk        110460090 sz=     2200    perm      "perm           "  alo=2200
  Chunk        110460928 sz=    63192    free      "               "
EXTENT 3 addr=110450098
  Chunk        1104500a8 sz=    41392    perm      "perm           "  alo=41392
  Chunk        11045a258 sz=    23976    free      "               "
Total heap SIZE    =   261544
FREE LISTS:
 Bucket 0 SIZE=56
 Bucket 1 SIZE=88
 Bucket 2 SIZE=152
 Bucket 3 SIZE=168
 Bucket 4 SIZE=280
 Bucket 5 SIZE=432
 Bucket 6 SIZE=536
 Bucket 7 SIZE=1048
 Bucket 8 SIZE=2072
 Bucket 9 SIZE=4120
 Bucket 10 SIZE=8216
  Chunk        110480090 sz=    16184    free      "               "
 Bucket 11 SIZE=16408
  Chunk        11045a258 sz=    23976    free      "               "
  Chunk        110489868 sz=    26520    free      "               "
  Chunk        110484088 sz=    22136    free      "               "
 Bucket 12 SIZE=32792
  Chunk        110460928 sz=    63192    free      "               "
  Chunk        110491f08 sz=    57592    free      "               "
 Bucket 13 SIZE=65560
 Bucket 14 SIZE=131096
 Bucket 15 SIZE=262168
 Bucket 16 SIZE=524312
 Bucket 17 SIZE=2097176
Total free SPACE   =   209600
UNPINNED RECREATABLE CHUNKS (lru FIRST):
PERMANENT CHUNKS:
  Chunk        110490090 sz=     7800    perm      "perm           "  alo=4480
  Chunk        110483fc8 sz=      192    perm      "perm           "  alo=192
  Chunk        110460090 sz=     2200    perm      "perm           "  alo=2200
  Chunk        1104500a8 sz=    41392    perm      "perm           "  alo=41392
Permanent SPACE    =    51584
******************************************************
*** 2011-10-16 04:31:31.116
ksedmp: internal OR fatal error
ORA-00600: internal error code, arguments: [729], [360], [SPACE leak], [], [], [], [], []
----- Call Stack Trace -----
calling              CALL     entry                argument VALUES IN hex      
location             TYPE     point                (? means dubious VALUE)     
-------------------- -------- -------------------- ----------------------------
ksedst+001c          bl       ksedst1              088424844 ? 041124844 ?
ksedmp+0290          bl       ksedst               104A506E8 ?
ksfdmp+0018          bl       03F4BE78             
kgeriv+0108          bl       _ptrgl               
kgesiv+0080          bl       kgeriv               10008FD0C ? 1100CAFF0 ?
                                                   000000000 ? 00000000A ?
                                                   1100CB230 ?
ksesic2+0060         bl       kgesiv               110000BF8 ? 104B032A8 ?
                                                   FFFFFFFFFFF9260 ? 000000000 ?
                                                   7000013FF453A10 ?
ksmuhe+02dc          bl       ksesic2              2D9000002D9 ? 000000000 ?
                                                   000000168 ? 000000001 ?
                                                   00000000A ? 104A4F2D8 ?
                                                   000000000 ? 000000000 ?
ksmugf+01b8          bl       03F49FA0             
ksuxds+0c04          bl       01FB5D60             
ksudel+0054          bl       ksuxds               7000013FF452BF0 ? 104B032A8 ?
opilof+0c7c          bl       03F49D84             
opiodr+0ae0          bl       _ptrgl               
ttcpip+1020          bl       _ptrgl               
opitsk+1124          bl       ttcpip               1100CB0D8 ? FFFFFFFFFFFBBB0 ?
                                                   000000001 ? 000000000 ?
                                                   FFFFFFFFFFFBB30 ? 1100E0428 ?
                                                   000000000 ? 1104B0B70 ?
opiino+0990          bl       opitsk               000000000 ? 000000000 ?
opiodr+0ae0          bl       _ptrgl               
opidrv+0484          bl       01FB7E64             
sou2o+0090           bl       opidrv               3C02AB90DC ? 440660000 ?
                                                   FFFFFFFFFFFF9D0 ?
opimai_real+01bc     bl       01FB5C2C             
main+0098            bl       opimai_real          000000000 ? 000000000 ?
__start+0098         bl       main                 000000000 ? 000000000 ?
--------------------- Binary Stack Dump ---------------------

在TRACE文件中,这个记录信息表明了内存泄漏的存在:

  Chunk        110489700 sz=      360    freeable  "K2I iter       "

而在TRACE文件中,还存在opilof函数,说明找个会话在进行LOGOFF操作。其实对于4000以下的内存泄漏,被认为是安全的,Oracle甚至提供了EVENTS 10262可以屏蔽这个错误的产生。而如果在alert文件中发现这个错误时,曾经泄漏的内存已经被释放,因此这个问题不会对实例的运行造成影响。
如果错误频繁发生,或者泄漏的空间超过4000或更高,可以进一步去分析,或者去寻找专门的补丁,Metalink文档Understanding and Diagnosing ORA-600 [729] Space Leak Errors [ID 403584.1]介绍了space leak相关的内容。

Posted in BUG | Tagged , , | Leave a comment

ORA-600(krfg_first_barrier1)错误

客户在打开FLASHBACK的时候碰到了这个错误。

详细的错误如下:

Tue Oct 18 10:55:02 2011
ALTER DATABASE MOUNT
Tue Oct 18 10:55:06 2011
Setting recovery target incarnation TO 1
Tue Oct 18 10:55:06 2011
Successful mount OF redo thread 1, WITH mount id 830459142
Tue Oct 18 10:55:06 2011
DATABASE mounted IN Exclusive Mode
Completed: ALTER DATABASE MOUNT
Tue Oct 18 13:05:08 2011
ALTER SYSTEM SET db_recovery_file_dest_size='200G' SCOPE=BOTH;
Tue Oct 18 13:05:13 2011
ALTER SYSTEM SET db_recovery_file_dest='/flashback' SCOPE=BOTH;
Tue Oct 18 13:05:17 2011
ALTER SYSTEM SET db_flashback_retention_target=4320 SCOPE=BOTH;
Tue Oct 18 13:06:00 2011
Shutting down instance: further logons disabled
Tue Oct 18 13:06:00 2011
Stopping background process CJQ0
Tue Oct 18 13:06:00 2011
Stopping background process MMNL
Tue Oct 18 13:06:01 2011
Stopping background process MMON
Tue Oct 18 13:06:02 2011
Shutting down instance (immediate)
License high water mark = 1
Tue Oct 18 13:06:02 2011
Stopping Job queue slave processes, flags = 7
Tue Oct 18 13:06:02 2011
Job queue slave processes stopped
Tue Oct 18 13:06:02 2011
ALTER DATABASE CLOSE NORMAL
ORA-1109 signalled during: ALTER DATABASE CLOSE NORMAL...
Tue Oct 18 13:06:02 2011
ALTER DATABASE DISMOUNT
Completed: ALTER DATABASE DISMOUNT
ARCH: Archival disabled due TO shutdown: 1089
Shutting down archive processes
Archiving IS disabled
Archive process shutdown avoided: 0 active
ARCH: Archival disabled due TO shutdown: 1089
Shutting down archive processes
Archiving IS disabled
Archive process shutdown avoided: 0 active
Tue Oct 18 13:07:29 2011
Starting ORACLE instance (normal)
sskgpgetexecname failed TO GET name
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Picked latch-free SCN scheme 3
Autotune OF undo retention IS turned ON. 
IMODE=BR
ILAT =616
LICENSE_MAX_USERS = 0
SYS auditing IS disabled
ksdpec: called FOR event 13740 prior TO event GROUP initialization
Starting up ORACLE RDBMS Version: 10.2.0.4.0.
System parameters WITH non-DEFAULT VALUES:
_trace_files_public = TRUE
processes = 2800
sessions = 5600
timed_statistics = TRUE
__shared_pool_size = 905969664
shared_pool_size = 838860800
__large_pool_size = 16777216
__java_pool_size = 16777216
__streams_pool_size = 0
shared_pool_reserved_size= 83886080
nls_language = american
nls_territory = america
nls_sort = BINARY
nls_date_format = DD-MON-RR
nls_numeric_characters = .,
nls_comp = BINARY
nls_length_semantics = CHAR
sga_target = 8589934592
control_files = /oradata/accdb/cntrl01.dbf, /oradata/accdb/cntrl02.dbf, /oradata/accdb/cntrl03.dbf
db_block_checksum = TRUE
db_block_size = 8192
__db_cache_size = 7633633280
compatible = 10.2.0
log_archive_config = DG_CONFIG=(ACCDB,ACCPRI)
log_archive_dest_1 = LOCATION=/archlog
log_archive_dest_2 = SERVICE=ACCDB LGWR ASYNC VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=ACCDB
log_archive_dest_state_1 = enable
log_archive_dest_state_2 = enable
log_archive_max_processes= 3
fal_client = ACCPRI
fal_server = ACCDB
log_buffer = 14230528
log_checkpoint_interval = 100000
log_checkpoint_timeout = 1200
db_files = 512
db_recovery_file_dest = /flashback
db_recovery_file_dest_size= 214748364800
standby_file_management = AUTO
log_checkpoints_to_alert = FALSE
db_flashback_retention_target= 4320
dml_locks = 10000
undo_management = AUTO
undo_tablespace = APPS_UNDOTS1
db_block_checking = FALSE
db_domain = ccic
global_names = TRUE
session_cached_cursors = 500
utl_file_dir = /usr/tmp, /usr/tmp, /oracle/accdb/10.2.0/appsutil/outbound/ACCDB_accountdb, /usr/tmp
plsql_native_library_dir = /oracle/accdb/10.2.0/plsql/nativelib
plsql_native_library_subdir_count= 149
plsql_code_type = INTERPRETED
plsql_optimize_level = 2
job_queue_processes = 2
_system_trig_enabled = TRUE
cursor_sharing = EXACT
parallel_min_servers = 0
parallel_max_servers = 8
background_dump_dest = /oracle/admin/ACCDB/bdump
user_dump_dest = /oracle/admin/ACCDB/udump
max_dump_file_size = 20480
core_dump_dest = /oracle/admin/ACCDB/cdump
db_name = ACCDB
db_unique_name = ACCPRI
open_cursors = 1000
_sort_elimination_cost_ratio= 5
_b_tree_bitmap_plans = FALSE
_fast_full_scan_enabled = FALSE
_sqlexec_progression_cost= 2147483647
_like_with_bind_as_equality= TRUE
pga_aggregate_target = 4294967296
workarea_size_policy = AUTO
optimizer_secure_view_merging= FALSE
aq_tm_processes = 1
olap_page_pool_size = 41943040
PMON started WITH pid=2, OS id=3736548
PSP0 started WITH pid=3, OS id=5767232
MMAN started WITH pid=4, OS id=2556302
DBW0 started WITH pid=5, OS id=5112274
DBW1 started WITH pid=6, OS id=4457358
DBW2 started WITH pid=7, OS id=3604948
DBW3 started WITH pid=8, OS id=4915548
LGWR started WITH pid=9, OS id=3867168
CKPT started WITH pid=10, OS id=2818092
SMON started WITH pid=11, OS id=2949892
RECO started WITH pid=12, OS id=4063580
CJQ0 started WITH pid=13, OS id=4785010
MMON started WITH pid=14, OS id=4588312
MMNL started WITH pid=15, OS id=3015526
Tue Oct 18 13:07:31 2011
ALTER DATABASE MOUNT
Tue Oct 18 13:07:35 2011
Setting recovery target incarnation TO 1
Tue Oct 18 13:07:35 2011
Successful mount OF redo thread 1, WITH mount id 830491155
Tue Oct 18 13:07:35 2011
DATABASE mounted IN Exclusive Mode
Completed: ALTER DATABASE MOUNT
Tue Oct 18 13:07:55 2011
ALTER DATABASE flashback ON
Tue Oct 18 13:07:55 2011
db_recovery_file_dest_size OF 204800 MB IS 0.01% used. This IS a
user-specified LIMIT ON the amount OF SPACE that will be used BY this
DATABASE FOR recovery-related files, AND does NOT reflect the amount OF
SPACE available IN the underlying filesystem OR ASM diskgroup.
Allocated 15265600 bytes IN shared pool FOR flashback generation buffer
Starting background process RVWR
RVWR started WITH pid=17, OS id=4260856
Tue Oct 18 13:07:55 2011
Flashback DATABASE Enabled 
Tue Oct 18 13:07:55 2011
Errors IN file /oracle/admin/ACCDB/udump/accdb_ora_4129208.trc:
ORA-00600: internal error code, arguments: [krfg_first_barrier1], [1], [], [], [], [], [], []
ORA-600 signalled during: ALTER DATABASE flashback ON...

之所以要列出这么多的记录是有意义的,对于一般的ORA-600而言,只需要后面4行而已,但是对于当前的问题是不够的,事实上如果不是篇幅,我打算把整个alert文件中的内容都贴上。
简单介绍一下情况,客户打算在STANDBY数据库上打开FLASHBACK,这样备库可以打开进行读写修改,在修改完毕后,可以恢复到打开之前的状态,再次转换成STANDBY数据库的角色,而不需要由于备库的打开,重建整个STANDBY数据库。
本来这个操作在客户这边已经进行过几次,从来没有碰到过什么问题,不过这次这个数据库的STANDBY环境是新建立的,而FLASHBACK这种新技术,碰到什么bug也不奇怪。
发现这个错误后,首先去metalink上查询,结果发现这个错误在metalink上根本没有记录,看来这时一个出现几率很低的问题。虽然FLASHBACK用的人比较少,但是从10g出现,到11.2,还是经历了一段时间了,如果是常规的问题,不太可能没有任何记录。
详细的错误日志文件:
Dump file /oracle/admin/ACCDB/udump/accdb_ora_4129208.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 – 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
ORACLE_HOME = /oracle/product/db10gr2
System name: AIX
Node name: Paccdb2
Release: 1
Version: 6
Machine: 00F67AD94C00
Instance name: ACCDB
Redo thread mounted by this instance: 1
Oracle process number: 16
Unix process pid: 4129208, image: oracle@Paccdb2 (TNS V1-V3)
*** ACTION NAME:() 2011-10-18 13:07:55.437
*** MODULE NAME:(sqlplus@Paccdb2 (TNS V1-V3)) 2011-10-18 13:07:55.437
*** SERVICE NAME:() 2011-10-18 13:07:55.437
*** SESSION ID:(5586.3) 2011-10-18 13:07:55.437
*** 2011-10-18 13:07:55.437
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [krfg_first_barrier1], [1], [], [], [], [], [], []
Current SQL statement for this session:
alter database flashback on
—– Call Stack Trace —–
calling call entry argument values in hex
location type point (? means dubious value)
——————– ——– ——————– —————————-
ksedst+001c bl ksedst1 000000000 ? 000000000 ?
ksedmp+0290 bl ksedst 104A52AA8 ?
ksfdmp+0018 bl 03F4E238
kgerinv+00dc bl _ptrgl
kgeasnmierr+004c bl kgerinv 000000000 ? 000000000 ?
000009732 ? 00147AE14 ?
700000010018078 ?
krfg_first_barrier_ bl kgeasnmierr 110196398 ? 110490040 ?
w_fba+00e8 104C03194 ? 100000001 ?
000000000 ? 000000001 ?
104C03D84 ? 1056A1010 ?
krfg_first_barrier+ bl krfg_first_barrier_ 000000000 ? 000000000 ?
001c w_fba 000000000 ? 004D25FF8 ?
krf_enable_flashbac bl krfg_first_barrier
k_db+0098
adbdrv+d9b0 bl krf_enable_flashbac 000000000 ?
k_db
opiexe+2c18 bl adbdrv
opiosq0+19f0 bl opiexe 000000000 ? 000000000 ?
FFFFFFFFFFF8EB0 ?
kpooprx+0168 bl opiosq0 3EC779438 ? 700000010003520 ?
7000001EC7793B8 ?
A4000110196398 ?
kpoal8+0400 bl kpooprx FFFFFFFFFFFB704 ?
FFFFFFFFFFFB470 ?
1B0000001B ? 100000001 ?
000000000 ? A40000000000A4 ?
000000000 ? 1103A80D8 ?
opiodr+0ae0 bl _ptrgl
ttcpip+1020 bl _ptrgl
opitsk+1124 bl ttcpip 1100CB7A0 ? FFFFFFFFFFFBA00 ?
000000000 ? 11044F7B0 ?
FFFFFFFFFFFB970 ?
442442226B736800 ?
9000000000D9B74 ? 110451E00 ?
opiino+0990 bl opitsk 1E00000000 ? 000000000 ?
opiodr+0ae0 bl _ptrgl
opidrv+0484 bl 01FBA1C8
sou2o+0090 bl opidrv 3C02ABB45C ? 4A0071EB0 ?
FFFFFFFFFFFF640 ?
opimai_real+01bc bl 01FB7F8C
main+0098 bl opimai_real 000000000 ? 000000000 ?
__start+0070 bl main 000000000 ? 000000000 ?
——————— Binary Stack Dump ———————
简略的看了一下TRACE文件,并没有发现可疑的地方,查询了一下数据库的状态:

SQL> SELECT name, log_mode, controlfile_type, open_mode, database_role, flashback_on FROM v$database;
NAME       LOG_MODE        CONTROLFILE_TY OPEN_MODE  DATABASE_ROLE   FLASHBACK_ON
---------- --------------- -------------- ---------- --------------- ---------------
ACCDB      ARCHIVELOG      BACKUP         MOUNTED    PRIMARY         YES

虽然报了ORA-600错误,但是FLASHBACK已经打开了,而且在对应的未知也发现FLASHBACK日志文件成功创建。此外还有一个疑点,就是控制文件和数据库的角色不正确,按道理来说,控制文件的状态应该是STANDBY,而数据库的角色也应该是STANDBY。于是和具体操作的DBA沟通了一下,发现了问题的所在。
原来这个STANDBY环境还根本没有建立起来,现在数据库只是利用BACKUP的控制文件恢复了控制文件而已,数据库的数据文件还根本就没有恢复,也就是说不但没有建立成STANDBY环境,数据库本身还不存在,这时候开启FLASHBACK当然会有问题。
正常应该先创建STANDBY环境,确认主库和备库之间关系建立且日志应用正常,然后在备库上开启FLASHBACK。这样看来,这个ORA-600错误在metalink上找不到也是正常的,毕竟这个错误太低级了。
不过Oracle应该将错误信息更明确一些,而不是报一个一般人都看不懂的ORA-600错误就完事了,从这个角度讲,这个问题也算是一个小bug。

Posted in BUG | Tagged , , | Leave a comment

创建磁盘组时出现ORA-15238错误

测试不同的AU_SIZE的影响时,发现了这个错误。
建立AU_SIZE大小为64M的磁盘组出现下面的错误:

SQL> CREATE diskgroup data_au64m external redundancy disk '/dev/asm-disk3' attribute 'au_size'='64m','compatible.asm'='11.2','compatible.rdbms'='11.2';
ORA-15018: diskgroup cannot be created
ORA-15238: 64m IS NOT a valid VALUE FOR attribute au_size

开始还以为是Oracle对于64m的大小有限制,并非所有的系统都支持。因为对于32K的BLOCK_SIZE就是这样,Windows环境和Linux环境都是不支持的,莫非64m的AU_SIZE也是当前Linux系统所不支持的。
后来发现原来并非是系统不支持,导致错误的原因竟然是m没有大写,将语法修改后,磁盘组创建成功:

SQL> CREATE DISKGROUP DATA_AU64M EXTERNAL REDUNDANCY DISK '/dev/asm-disk1' ATTRIBUTE 'au_size'='64M',
'compatible.asm' = '11.2.0.0.0', 'compatible.rdbms' = '11.2.0.0.0'
Diskgroup created.

没想到一个表示大小的单位而已,Oracle也要区分大小写,而且还没有明确的错误提示,这块的实现有点太不人性化了。

Posted in ORACLE | Tagged , , , | Leave a comment