ORA-600(krd_check_slaves_1)错误

物理STANDBY环境在进行恢复时碰到这个错误。

详细错误信息为:

<pre lang=’SQL’>Sun Sep 11 06:59:47 2011

alter database recover datafile list clear

Sun Sep 11 06:59:47 2011

Completed: alter database recover datafile list clear

Sun Sep 11 06:59:47 2011

alter database recover datafile list

1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 , 10 , 11 , 12 , 13 , 14 , 15 , 16 , 17 , 18

Completed: alter database recover datafile list

1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 , 10 , 11 , 12 , 13 , 14 , 15 , 16 , 17 , 18

Sun Sep 11 06:59:47 2011

alter database recover if needed

start until time ‘SEP 04 2011 06:00:54’ using backup controlfile

Media Recovery Start

Sun Sep 11 06:59:47 2011

Media Recovery failed with error 1124

ORA-283 signalled during: alter database recover if needed

start until time ‘SEP 04 2011 06:00:54′ using backup controlfile

Sun Sep 11 07:38:12 2011

Errors in file /oracle/admin/ccicmix/udump/ccicmix_ora_9699350.trc:

ORA-00600: internal error code, arguments: [krd_check_slaves_1], [], [], [],
[], [], [], []

Sun Sep 11 07:38:18 2011

Recovery interrupted!</pre>

显然这里的ORA-600错误和前面的恢复操作有直接关系。

<pre lang=’SQL’>ORA-00283 recovery session canceled due to errors

Cause: An
error during recovery was determined to be fatal enough to end the current
recovery session.

Action: More specific messages
will accompany this message. Refer to the other messages for the appropriate
action.

ORA-01124: cannot recover data file string – file is in use or recovery

Cause: An attempt to do media recovery found that the file was not available
for recovery. Either it is online and the database is open in some instance, or
another process is curently doing media recovery on the file.

Action: Do not do media recovery.</pre>

这是ORA-600错误中,ORA-283和ORA-1124错误的官方描述,这两个错误的信息都说明,在恢复过程中出现异常。

检查了metalink,发现这是一个bug,对应的文章信息为:Ora 600 Raised During Recover
Until Time When Archive Required Not Present [ID 406263.1]。

当对数据库进行基于时间的不完全恢复,开始并行恢复后,就可能碰到这个bug,而且是10.2.0.2以上的10g版本都有可能碰到。

如果不能通过应用补丁来解决这个bug,那么考虑临时关闭并行恢复应该也是一个解决方法。

 

Posted in BUG | Tagged , , , | Leave a comment

ORA-600(kmgs_pre_process_request_6)错误

这个bug比较神奇,如果SGA设置为4G的整数倍,就会导致系统CRASH。以前听说过这个bug,这次还是第一次碰到。

详细的错误如下:

<pre lang=’SQL’>Tue Oct 11 10:44:34 2011

Starting ORACLE instance (normal)

Tue Oct 11 10:44:34 2011

Specified value of sga_max_size is too small, bumping to 12884901888

LICENSE_MAX_SESSION = 0

LICENSE_SESSIONS_WARNING = 0

Picked latch-free SCN scheme 3

Autotune of undo retention is turned on.

IMODE=BR

ILAT =61

LICENSE_MAX_USERS = 0

SYS auditing is disabled

ksdpec: called for event 13740 prior to event group initialization

Starting up ORACLE RDBMS Version: 10.2.0.3.0.

System parameters with non-default values:

processes = 500

sga_max_size = 12884901888

lock_sga = FALSE

__shared_pool_size = 570425344

shared_pool_size = 536870912

__large_pool_size = 134217728

large_pool_size = 134217728

__java_pool_size = 33554432

java_pool_size = 33554432

__streams_pool_size = 0

streams_pool_size = 0

sga_target = 12884901888

control_files = /dev/rcontrol01, /dev/rcontrol02, /dev/rcontrol03

db_file_name_convert = /dev/, /dev/

log_file_name_convert = /dev/, /dev/

db_block_size = 8192

__db_cache_size = 12129927168

db_cache_size = 2147483648

compatible = 10.2.0.3.0

log_archive_config = DG_CONFIG=(oradb,orapri)

log_archive_dest_1 = location=/oraarch

log_archive_dest_2 = service=oradb lgwr async
valid_for=(ONLINE_LOGFILES,PRIMARY_ROLE) db_unique_name=oradb

log_archive_dest_state_1 = enable

log_archive_dest_state_2 = enable

log_archive_max_processes= 3

fal_client = orapri

fal_server = oradb

log_buffer = 14234624

db_file_multiblock_read_count= 64

db_recovery_file_dest =

db_recovery_file_dest_size= 107374182400

standby_file_management = AUTO

fast_start_mttr_target = 300

db_flashback_retention_target= 4320

undo_management = AUTO

undo_tablespace = UNDOTBS1

undo_retention = 3600

remote_login_passwordfile= EXCLUSIVE

db_domain =

utl_file_dir = /oracle

job_queue_processes = 0

background_dump_dest = /oracle/admin/oradb/bdump

user_dump_dest = /oracle/admin/oradb/udump

core_dump_dest = /oracle/admin/oradb/cdump

audit_file_dest = /oracle/admin/oradb/adump

sort_area_size = 1048576

db_name = oradb

db_unique_name = orapri

open_cursors = 300

_b_tree_bitmap_plans = FALSE

optimizer_index_cost_adj = 10

pga_aggregate_target = 2147483648

PMON started with pid=2, OS id=16121944

PSP0 started with pid=3, OS id=14286974

MMAN started with pid=4, OS id=11141216

DBW0 started with pid=5, OS id=14418082

DBW1 started with pid=6, OS id=14155834

DBW2 started with pid=7, OS id=11862130

LGWR started with pid=8, OS id=13828256

CKPT started with pid=9, OS id=14024704

SMON started with pid=10, OS id=12451932

RECO started with pid=11, OS id=15073406

MMON started with pid=12, OS id=12582914

MMNL started with pid=13, OS id=12517546

Tue Oct 11 10:45:03 2011

Errors in file /oracle/admin/oradb/bdump/oradb_mman_11141216.trc:

ORA-00600: internal error code, arguments: [kmgs_pre_process_request_6], [6],
[723], [128], [3], [0x70000030630E608], [], []

Tue Oct 11 10:45:04 2011

Errors in file /oracle/admin/oradb/bdump/oradb_mman_11141216.trc:

ORA-00600: internal error code, arguments: [kmgs_pre_process_request_6], [6],
[723], [128], [3], [0x70000030630E608], [], []

Tue Oct 11 10:45:04 2011

MMAN: terminating instance due to error 822

Instance terminated by MMAN, pid = 11141216</pre>

Oracle数据库重启后还没有做任何的操作,就被mman进程中止了。显然这时Oracle自己的问题,而与应用无关。

在metalink上对应的文档为MMON Terminates Instance With ORA-600 [kmgs_pre_process_request_6]
or ORA-600 [kmgs_update_target_size_1] When Resizing Caches [ID 373802.1],这个错误影响的版本是10.2.0.1到10.2.0.3,而当前正是10.2.0.3这个版本。

正如文章开头提到的,当SGA_TARGET是4G的整数倍时,就会引发这个问题。解决这个问题倒是不复杂,除了升级到10.2.0.4以上版本或者打Bug:4433838对应的patch外,稍微调整一下SGA的大小,就可以避免这个错误的产生。

 

 

 

Posted in BUG | Tagged , , , , | Leave a comment

OEM模块审计查询语句占用较大资源

发现一个导致Oracle自己的低效SQL,在AWR报告中占用较大的资源。

这个SQL在Elapsed Time、CPU Time、User I/O Wait Time、Buffer Gets、Physical Reads都会出现,其SQL模块是Oracle Enterprise Manager.Metric Engine。显然这是一个OEM自己的SQL,检查完整SQL语句:

<pre lang=’SQL’>SELECT TO_CHAR(current_timestamp AT TIME ZONE ‘GMT’, ‘YYYY-MM-DD
HH24:MI:SS TZD’) AS curr_timestamp, COUNT(username) AS failed_count FROM
sys.dba_audit_session WHERE returncode != 0 AND TO_CHAR(timestamp, ‘YYYY-MM-DD
HH24:MI:SS’) >= TO_CHAR(current_timestamp – TO_DSINTERVAL(‘0 0:30:00’),
‘YYYY-MM-DD HH24:MI:SS’)</pre>

这个检查DBA_AUDIT_SESSION的SQL语句写法很烂,以致于一开始我还不太相信是出自Oracle,不过Oracle Enterprise Manager.Metric Engine的MODULE
NAME已经说明了问题,Oracle居然自己违反对列进行操作以及不必要的转换原则。

此外,这个SQL基本上没有办法使用索引,如果DBA_AUDIT_SESSION中记录很多,那么这个SQL会非常耗时,这就是不算是OEM的bug,至少也是一种设计缺陷。

 

Posted in BUG | Tagged , , | Leave a comment

数据库出现很高的enq: DX – contention 等待

检查客户AWR报告时,发现数据库经历严重的enq: DX – contention等待。

这个数据库的版本是11.1.0.6 for Windows 64bits for AMD,在AWR报告中,TOP 5等待事件为:

Event

Waits

Time(s)

Avg wait (ms)

% DB time

Wait Class

enq: DX – contention

22,627

135,751

6000

126.41

Other

inactive transaction branch

22,625

22,626

1000

21.07

Other

db file scattered read

1,312,470

4,212

3

3.92

User I/O

DB CPU

 

3,928

 

3.66

 

db file sequential read

1,349,206

3,522

3

3.28

User I/O

其中以enq: DX – contention等待最为严重,其实是inactive transaction branch。而这两个等待事件平均等待时间分别是6秒和1秒,平均等待时间这么整,显然不是偶然,这说明这两个等待时间都经历了各自的超时时间。

查询metalink,这个问题也是BUG所致,在文档High CPU by Sessions Holding DX Enqueue; Others Waiting ‘enq: DX – contention’ [ID 1275884.1]中描述了这个错误。

导致问题的原因是一些持有分布式事务队列的用户尝试中止或退出分布式事务,Oracle在处理的时候占用了大量的CPU资源。

而用来判断是否出现这个问题的标志是,数据库中出现了明显的enq: DX – contention和inactive transaction branch等待事件。

这个问题影响10.2和11.1版本,在11.2中被FIXED。可以通过打单独的8552942的补丁来解决这个问题。

 

 

Posted in BUG | Tagged , | Leave a comment

LMD出现kjxocdr: drop duplicate open信息

在LMD进程对应的TRACE文件lmd0中,发现了这个信息。
这是一个10.2.0.4 RAC环境,详细的信息如下:

*** 2011-10-11 16:49:35.171
kjxocdr: DROP duplicate OPEN [0x1a][0x0],[US].148 0x0xdba899230 [held 5][req 5]
kjxocdr: DROP duplicate OPEN [0x1a][0x0],[US].148 0x0xdba899230 [held 5][req 5]
kjxocdr: DROP duplicate OPEN [0x1a][0x0],[US].148 0x0xdba899230 [held 5][req 5]
kjxocdr: DROP duplicate OPEN [0x1a][0x0],[US].148 0x0xdba899230 [held 5][req 5]
kjxocdr: DROP duplicate OPEN [0x1a][0x0],[US].148 0x0xdba899230 [held 5][req 5]
.
.
.

查询发现有两个BUG和当前问题相关。第一篇文档是Bug 9093300 – Repeated “kjxocdr: drop duplicate open” messages in LMD trace [ID 9093300.8]。其中问题影响的版本包括10.2.0.4。这个问题在10.2.0.4.3和10.2.0.5中被解决。
但是还有另外一篇文章Bug 9637510 – LMD0 trace file is filled with messages “drop duplicate open” [ID 9637510.8]同样描述的这个错误。其影响版本是10.2.0.4,而且文档中明确说明,即使打了9093300的补丁,问题仍然会出现。这个9637510的BUG在11.1.0.6中被解决。
由于两个bug和当前的现象都相符,无法确认当前的问题到底属于哪个bug,不过两个bug有一点描述是相同的,就是对于这个信息可以忽略掉。
这说明这个bug的影响不过是在LMD进程文件中多写入了一些不必要的内容,而对于系统本身并没有什么影响。

Posted in BUG | Tagged , , | Leave a comment

ORA-600(2103)错误

检查客户数据库告警日志时,发现ORA-600(2103)错误。
这是一个9208 RAC for Linux 32bit环境,当出现这个错误时,LGWR进程CRASH,导致数据库实例DOWN掉:

Sun DEC 7 18:15:02 2008
Errors IN file /u01/product/admin/RAC/bdump/rac2_lgwr_491766.trc:
ORA-00600: internal error code, arguments: [2103], [0], [0], [1], [900], [], [], []
Sun DEC 7 18:15:03 2008
Errors IN file /u01/product/admin/RAC/bdump/rac2_lgwr_491766.trc:
ORA-00600: internal error code, arguments: [2103], [0], [0], [1], [900], [], [], []
Sun DEC 7 18:15:03 2008
LGWR: terminating instance due TO error 600
Sun DEC 7 18:15:08 2008
System state dump IS made FOR LOCAL instance
Sun DEC 7 18:15:09 2008
Trace dumping IS performing id=[cdmp_20081207181504]
DIAG: terminating instance due TO error 1092
Instance TERMINATED BY DIAG, pid = 680052

虽然这已经是三年前发生的事情了,但是考虑到客户数据库的版本并未升级,因此还是要关注一下问题。
对于9.2及以上版本,参数2103后面紧跟的第一位参数是更新/只读标识,第二位是队列标识,第三位是是否等待标识,最后一位是TIMEOUT时间。
简单的说,Oracle等待控制文件队列超时,并最终导致了LGWR进程CRASH。
根据目前的信息判断,问题多半和18:00启动的定时任务有关,这个任务可能是EXP或RMAN备份,可能由于大量的IO或对控制文件的锁定,导致了这个错误的产生。

Posted in BUG | Tagged , | Leave a comment

ORA-600(ksxprqfre3)错误

客户的9208环境下的RAC出现这个错误。
详细错误信息为:

Errors IN file /u01/product/admin/RAC/bdump/rac1_ora_708706.trc:
ORA-00600: internal error code, arguments: [ksxprqfre3], [0x110320F28], [], [], [], [], [], []
ORA-10382: parallel query server interrupt (reset)
.
.
.
Errors IN file /u01/product/admin/RAC/bdump/rac1_p002_1212438.trc:
ORA-00600: internal error code, arguments: [ksxprqfre3], [0x110319308], [], [], [], [], [], []

错误信息包括两类,一类是ora客户进程出现了错误,但是ORA-10382错误和平行查询有关,另一类直接发生在平行进程上。
查询metalink发现,问题和Bug 6521632描述的十分类似,同样是9208的RAC环境,问题同样发生在并行执行过程中。另外比较重要的一点,当前环境和Bug中描述的一样在CLUSTER之间使用了UDP,而这很可能就是导致问题的原因:

cluster interconnect IPC version:Oracle UDP/IP
IPC Vendor 1 proto 2 Version 1.0

由于Oracle已经不再支持9208环境,因此对于这种未完全确认的bug也不会有进一步的支持,因此除了升级到新版本外,避免并行查询可能是更合理的避免错误方式。此外,这个错误的发生频率不高,且对数据库运行影响不大,完全可以暂时忽略。

Posted in BUG | Tagged , , | Leave a comment

系统统计信息的保存位置

从9i开始,Oracle就提供了收集系统统计信息的能力,不过直到11.2的文档中,Oracle都从来没有提过统计信息可以从哪个视图中获取。
虽然Oracle提供了一个GET_SYSTEM_STATS的功能,但是使用非常不方便,不如直接查看视图那么简单直观。
既然Oracle自己没有说明,可以通过TRACE的方式来获取Oracle将统计信息存储到哪个基表中:

SQL> ALTER SESSION SET EVENTS '10046 TRACE NAME CONTEXT FOREVER, LEVEL 12';
会话已更改。
SQL> EXEC DBMS_STATS.GATHER_SYSTEM_STATS
PL/SQL 过程已成功完成。
SQL> ALTER SESSION SET EVENTS '10046 TRACE NAME CONTEXT OFF';
会话已更改。
SQL> SELECT SPID 
  2  FROM V$SESSION S, V$PROCESS P
  3  WHERE S.PADDR = P.ADDR
  4  AND SID IN 
  5  (SELECT SID
  6  FROM V$MYSTAT
  7  WHERE ROWNUM = 1);
SPID
------------
17816

查看对应的TRACE信息:

*** 2011-10-10 08:32:39.008
*** ACTION NAME:() 2011-10-10 08:32:38.847
*** MODULE NAME:(SQL*Plus) 2011-10-10 08:32:38.847
*** SERVICE NAME:(SYS$USERS) 2011-10-10 08:32:38.847
*** SESSION ID:(144.3) 2011-10-10 08:32:38.847
WAIT #9: nam='SQL*Net message to client' ela= 4 driver id=1111838976 #bytes=1 p3=0 obj#=358 tim=68276628744
WAIT #9: nam='SQL*Net message from client' ela= 271 driver id=1111838976 #bytes=1 p3=0 obj#=358 tim=68276853358
=====================
.
.
.
=====================
PARSING IN CURSOR #7 len=44 dep=0 uid=57 oct=47 lid=57 tim=68330942448 hv=3102768059 ad='242ca1cc'
BEGIN DBMS_STATS.GATHER_SYSTEM_STATS; END;
END OF STMT
PARSE #7:c=0,e=2137,p=0,cr=0,cu=0,mis=1,r=0,dep=0,og=1,tim=68330942444
BINDS #7:
=====================
PARSING IN CURSOR #8 len=80 dep=1 uid=0 oct=3 lid=0 tim=68331143167 hv=374192777 ad='2849d9a0'
SELECT /*+ rule */ VALUE FROM V$PARAMETER WHERE UPPER(NAME) = 'STATISTICS_LEVEL'
END OF STMT
PARSE #8:c=0,e=21803,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=3,tim=68331143162
BINDS #8:
EXEC #8:c=15601,e=33726,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=3,tim=68331248401
FETCH #8:c=0,e=8267,p=0,cr=0,cu=0,mis=0,r=1,dep=1,og=3,tim=68331272092
=====================
.
.
.
=====================
PARSING IN CURSOR #10 len=56 dep=1 uid=57 oct=7 lid=57 tim=68354281222 hv=311428414 ad='2853a860'
DELETE FROM SYS.AUX_STATS$ WHERE SNAME = 'SYSSTATS_INFO'
END OF STMT
PARSE #10:c=31200,e=117964,p=12,cr=157,cu=0,mis=1,r=0,dep=1,og=1,tim=68354281219
BINDS #10:
EXEC #10:c=0,e=24330,p=0,cr=1,cu=8,mis=0,r=4,dep=1,og=1,tim=68354305618
=====================
PARSING IN CURSOR #9 len=73 dep=1 uid=57 oct=2 lid=57 tim=68354305942 hv=454001916 ad='2853a638'
INSERT INTO SYS.AUX_STATS$ VALUES ('SYSSTATS_INFO', 'STATUS', NULL, :B1 )
END OF STMT
PARSE #9:c=0,e=187,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=1,tim=68354305940
BINDS #9:
kkscoacd
Bind#0
oacdty=01 mxl=32(30) mxlc=00 mal=00 scl=00 pre=00
oacflg=13 fl2=206001 frm=01 csi=852 siz=32 off=0
kxsbbbfp=0565dbe4 bln=32 avl=09 flg=09
VALUE="COMPLETED"
WAIT #9: nam='db file sequential read' ela= 18196 file#=1 block#=4537 blocks=1 obj#=572 tim=68354324705
EXEC #9:c=0,e=19012,p=1,cr=1,cu=3,mis=1,r=1,dep=1,og=1,tim=68354325026
=====================
PARSING IN CURSOR #11 len=73 dep=1 uid=57 oct=2 lid=57 tim=68354325455 hv=3821777836 ad='2853a410'
INSERT INTO SYS.AUX_STATS$ VALUES ('SYSSTATS_INFO', 'DSTART', NULL, :B1 )
END OF STMT
PARSE #11:c=0,e=304,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=1,tim=68354325452
BINDS #11:
kkscoacd
Bind#0
oacdty=01 mxl=32(30) mxlc=00 mal=00 scl=00 pre=00
oacflg=13 fl2=206001 frm=01 csi=852 siz=32 off=0
kxsbbbfp=0565dc04 bln=32 avl=16 flg=09
VALUE="10-10-2011 08:33"
EXEC #11:c=0,e=657,p=0,cr=1,cu=3,mis=1,r=1,dep=1,og=1,tim=68354326236
=====================
PARSING IN CURSOR #12 len=72 dep=1 uid=57 oct=2 lid=57 tim=68354326455 hv=912777027 ad='2853a1e8'
INSERT INTO SYS.AUX_STATS$ VALUES ('SYSSTATS_INFO', 'DSTOP', NULL, :B1 )
END OF STMT
PARSE #12:c=0,e=148,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=1,tim=68354326452
BINDS #12:
kkscoacd
Bind#0
oacdty=01 mxl=32(30) mxlc=00 mal=00 scl=00 pre=00
oacflg=13 fl2=206001 frm=01 csi=852 siz=32 off=0
kxsbbbfp=0565dc24 bln=32 avl=16 flg=09
VALUE="10-10-2011 08:33"
EXEC #12:c=0,e=419,p=0,cr=1,cu=3,mis=1,r=1,dep=1,og=1,tim=68354326949
=====================
.
.
.
=====================
PARSING IN CURSOR #26 len=6 dep=1 uid=57 oct=44 lid=57 tim=68354566045 hv=255718823 ad='0'
COMMIT
END OF STMT
PARSE #26:c=0,e=11,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=0,tim=68354566043
XCTEND rlbk=0, rd_only=0
EXEC #26:c=0,e=178,p=0,cr=0,cu=1,mis=0,r=0,dep=1,og=0,tim=68354566282
EXEC #7:c=530404,e=23458716,p=42,cr=578,cu=75,mis=0,r=1,dep=0,og=1,tim=68354566323
WAIT #7: nam='log file sync' ela= 855 buffer#=1807 sync scn=1998734 p3=0 obj#=-1 tim=68354567211
WAIT #7: nam='SQL*Net message to client' ela= 1 driver id=1111838976 #bytes=1 p3=0 obj#=-1 tim=68354567508
WAIT #7: nam='SQL*Net message from client' ela= 333 driver id=1111838976 #bytes=1 p3=0 obj#=-1 tim=68354567875
=====================
PARSING IN CURSOR #27 len=52 dep=0 uid=57 oct=47 lid=57 tim=68354567976 hv=1029988163 ad='28480080'
BEGIN DBMS_OUTPUT.GET_LINES(:LINES, :NUMLINES); END;
END OF STMT
PARSE #27:c=0,e=29,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,tim=68354567974
BINDS #27:
kkscoacd
Bind#0
oacdty=123 mxl=4000(4000) mxlc=00 mal=00 scl=00 pre=00
oacflg=00 fl2=1000000 frm=00 csi=00 siz=4000 off=0
kxsbbbfp=056e555c bln=4000 avl=00 flg=15
Bind#1
oacdty=02 mxl=22(22) mxlc=00 mal=00 scl=00 pre=00
oacflg=01 fl2=1000000 frm=00 csi=00 siz=24 off=0
kxsbbbfp=0ab6bde8 bln=22 avl=22 flg=05
VALUE=###
An invalid NUMBER has been seen.Memory contents are :
Dump OF memory FROM 0x0AB6BDE8 TO 0x0AB6BDFE
AB6BDE0 000010C1 00000000 [........]
AB6BDF0 00000000 00000000 00000000 00000000 [................]
WAIT #27: nam='SQL*Net message to client' ela= 1 driver id=1111838976 #bytes=1 p3=0 obj#=-1 tim=68354568397
EXEC #27:c=0,e=416,p=0,cr=3,cu=0,mis=0,r=1,dep=0,og=1,tim=68354568514
*** 2011-10-10 08:34:55.278
WAIT #27: nam='SQL*Net message from client' ela= 58488791 driver id=1111838976 #bytes=1 p3=0 obj#=-1 tim=68413057389
=====================
PARSING IN CURSOR #25 len=55 dep=0 uid=57 oct=42 lid=57 tim=68413057883 hv=524428051 ad='0'
ALTER SESSION SET EVENTS '10046 TRACE NAME CONTEXT OFF'
END OF STMT
PARSE #25:c=0,e=132,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=0,tim=68413057880
EXEC #25:c=0,e=163,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=0,tim=68413058129

可以从TRACE结果很容易的得到,Oracle将统计信息放入了SYS.AUX_STATS$表中。
可惜的是,Oracle似乎并没有提供一个视图给所有的用户进行访问,因此只有权限用户才能直接访问这个内部表:

SQL> SELECT * FROM SYS.AUX_STATS$;
SNAME                PNAME                PVAL1 PVAL2
-------------------- --------------- ---------- ------------------------
SYSSTATS_INFO        STATUS                     COMPLETED
SYSSTATS_INFO        DSTART                     09-10-2011 08:33
SYSSTATS_INFO        DSTOP                      09-10-2011 08:33
SYSSTATS_INFO        FLAGS                    1
SYSSTATS_MAIN        CPUSPEEDNW        1555.208
SYSSTATS_MAIN        IOSEEKTIM           11.312
SYSSTATS_MAIN        IOTFRSPEED            4096
SYSSTATS_MAIN        SREADTIM
SYSSTATS_MAIN        MREADTIM
SYSSTATS_MAIN        CPUSPEED
SYSSTATS_MAIN        MBRC
SYSSTATS_MAIN        MAXTHR
SYSSTATS_MAIN        SLAVETHR
已选择13行。

而如果使用DBMS_STATS.GET_SYSTEM_STATS则显然要麻烦得多:

SQL> SET SERVEROUT ON SIZE 100000
SQL> DECLARE
  2   V_STATUS VARCHAR2(30);
  3   V_DSTART DATE;
  4   V_DSTOP DATE;
  5   V_PVALUE NUMBER;
  6  BEGIN
  7   DBMS_STATS.GET_SYSTEM_STATS(V_STATUS, V_DSTART, V_DSTOP, 'cpuspeednw', V_PVALUE);
  8   DBMS_OUTPUT.PUT_LINE(RPAD('STATUS', 31) || RPAD('START', 24) || RPAD('STOP', 24) || 'VALUE');
  9   DBMS_OUTPUT.PUT_LINE(RPAD(V_STATUS, 31) || RPAD(TO_CHAR(V_DSTART, 'YYYY-MM-DD HH24:MI:SS'), 24)
 10    || RPAD(TO_CHAR(V_DSTOP, 'YYYY-MM-DD HH24:MI:SS'), 24) || V_PVALUE);
 11  END;
 12  /  
STATUS                         START                   STOP                    VALUE
COMPLETED                      2011-10-09 08:33:00     2011-10-09 08:33:00     1555.208
PL/SQL 过程已成功完成。
Posted in ORACLE | Tagged , , | Leave a comment

ORA-600(kdsgrp1)错误

在一个客户的数据库告警日志中,频繁出现ORA-600(kdsgrp1)错误。
详细错误信息为:

/oracle/admin/tjsb/udump/tjsb1_ora_1701138.trc
Oracle DATABASE 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
WITH the Partitioning, REAL Application Clusters, DATA Mining AND REAL Application Testing options
ORACLE_HOME = /oracle/product/10.2.0/db_1
System name:    AIX
Node name:      p570c
Release:        3
Version:        5
Machine:        00C971744C00
Instance name: tjsb1
Redo thread mounted BY this instance: 1
Oracle process NUMBER: 883
Unix process pid: 1701138, image: oracle@p570c
*** 2011-07-26 09:59:41.111
*** ACTION NAME:(abc) 2011-07-26 09:59:41.105
*** MODULE NAME:(archive.exe) 2011-07-26 09:59:41.105
*** SERVICE NAME:(tjsb) 2011-07-26 09:59:41.105
*** SESSION ID:(2466.64103) 2011-07-26 09:59:41.105
            ROW 0766ce34.a continuation at
            file# 29 block# 2543156 slot 11 NOT found
**************************************************
KDSTABN_GET: 0 ..... ntab: 1
curSlot: 11 ..... nrows: 110
**************************************************
*** 2011-07-26 09:59:41.111
ksedmp: internal OR fatal error
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], []
CURRENT SQL statement FOR this SESSION:
SELECT DISTINCT KC01.AAC001 AAC001 FROM AC04, AC02, KC01 WHERE KC01.AAB001 = :B2 AND KC01.AKC021 IN ('11', '12') AND AC02.AAE140 = '
3' AND AC02.AAC031 = '1' AND AC02.AAC001 = KC01.AAC001 AND AC04.AAC001 = KC01.AAC001 AND AC04.AAE030 <= TO_DATE(:B1 , 'yyyymm') AND 
(AC04.AAE031 >= TO_DATE(:B1 , 'yyyymm') OR AC04.AAE031 IS NULL) AND NVL(AC04.AKC010, 0) > 0 AND NOT EXISTS (SELECT 1 FROM AC01 WHERE
 AAB001 = :B2 AND NVL(AAC033, '0') = '3' AND AAC001 = KC01.AAC001) AND NOT EXISTS (SELECT 1 FROM DUAL WHERE :B3 <= (SELECT NVL(SUM(C
KC010), 0) FROM KC33 WHERE AAC001 = KC01.AAC001 AND AAE001 = TO_NUMBER(SUBSTR(:B1 , 1, 4)))) UNION SELECT DISTINCT KC01.AAC001 AAC00
1 FROM AC02, KC01 WHERE KC01.AAB001 = :B2 AND KC01.AKC021 IN ('21', '22', '23') AND AC02.AAE140 = '3' AND AC02.AAC031 = '1' AND AC02
.AAC001 = KC01.AAC001 AND NOT EXISTS (SELECT 1 FROM AC01 WHERE AAB001 = :B2 AND NVL(AAC033, '0') > '0' AND AAC001 = KC01.AAC001) AND
 NOT EXISTS (SELECT 1 FROM KC33 WHERE AAC001 = KC01.AAC001 AND AAE002 = :B1 ) AND (NOT EXISTS (SELECT 1 FROM DUAL WHERE :B4 <= (SELE
CT NVL(SUM(CKC010), 0) FROM KC33 WHERE AAC001 = KC01.AAC001 AND AAE001 = TO_NUMBER(SUBSTR(:B1 , 1, 4))) AND NVL(TO_CHAR(KC01.AIC162,
 'yyyymm'),'199001') < SUBSTR(:B1 , 1, 4) || '01')) AND (NOT EXISTS (SELECT 1 FROM DUAL WHERE :B3 <= (SELECT NVL(SUM(CKC010), 0) FRO
M KC33 WHERE AAC001 = KC01.AAC001 AND AAE001 = TO_NUMBER(SUBSTR(:B1 , 1, 4))) AND NVL(TO_CHAR(KC01.AIC162, 'yyyymm'),'199001') >= SU
BSTR(:B1 , 1, 4) || '01'))
----- PL/SQL Call Stack -----
  object      line  object
  handle    NUMBER  name
70000078a8a6960      8456  package body TJSI.PKG_A_JJZJ
70000078bd54dc0         1  anonymous block
----- Call Stack Trace -----
calling              CALL     entry                argument VALUES IN hex      
location             TYPE     point                (? means dubious VALUE)     
-------------------- -------- -------------------- ----------------------------
ksedst+001c          bl       ksedst1              000000000 ? 000000000 ?
ksedmp+0290          bl       ksedst               104522370 ?
ksfdmp+0018          bl       03F2C1FC             
kgerinv+00dc         bl       _ptrgl               
kgeasnmierr+004c     bl       kgerinv              000000000 ? 000000001 ?
                                                   10D00000000 ? 000000000 ?
                                                   110648E58 ?
kdsgrp+0460          bl       01F92FB8             
kdsfbr+0268          bl       kdsgrp               000000001 ? 1105E2988 ?
                                                   000000000 ?
qertbFetchByRowID+0  bl       kdsfbr               FFFFFFFFFFF1FA0 ? 000000000 ?
9d0                                                1040DC78C ? 000000000 ?
                                                   1105FC720 ? 1101FB888 ?
                                                   1105FCF40 ? 000000000 ?
rwsfcd+0054          bl       _ptrgl               
qerflFetchOutside+0  bl       03F2BB90             
09c                                                
rwsfcd+0054          bl       _ptrgl               
qeruaFetch+013c      bl       03F2BB90             
qersoFetch+0110      bl       01F92FBC             
opifch2+141c         bl       01F92FBC             
opifch+003c          bl       opifch2              000000001 ? 11062BDC0 ?
                                                   FFFFFFFFFFF36D0 ?
opipls+21a4          bl       opifch               FFFFFFFFFFF3858 ? 110173648 ?
                                                   FFFFFFFFFFF3870 ?
opiodr+0ae0          bl       _ptrgl               
rpidrus+01bc         bl       opiodr               6686B0E7F0 ? 684B93430 ?
                                                   FFFFFFFFFFF67E0 ? D86B0E750 ?
skgmstack+00c8       bl       _ptrgl               
rpidru+0088          bl       skgmstack            110173648 ? FFFFFFFFFFF5830 ?
                                                   000000002 ? 000000000 ?
                                                   FFFFFFFFFFF5FA8 ?
rpiswu2+034c         bl       _ptrgl               
rpidrv+095c          bl       rpiswu2              7000007837E2BD8 ?
                                                   3A00000000 ? 10451BB80 ?
                                                   000000000 ? 110196C78 ?
                                                   3A10196C60 ?
                                                   FFFFFFFFFFF6008 ? 000000000 ?
psddr0+02bc          bl       01F96D24             
psdnal+01d0          bl       psddr0               D013BC57C ? 6610573D78 ?
                                                   FFFFFFFFFFF67E0 ?
                                                   3000000000 ?
pevm_BFTCHC+03b4     bl       _ptrgl               
pfrinstr_FTCHC+0138  bl       pevm_BFTCHC          101479974 ? 000000000 ?
                                                   000000010 ?
pfrrun_no_tool+005c  bl       _ptrgl               
pfrrun+1014          bl       pfrrun_no_tool       FFFFFFFFFFF6B20 ?
                                                   70000078BD54DC0 ? 3000009BB ?
plsql_run+06b4       bl       pfrrun               110469588 ?
peicnt+0224          bl       plsql_run            110469588 ? 10001102381F8 ?
                                                   000000000 ?
kkxexe+0250          bl       peicnt               FFFFFFFFFFF7E38 ? 110469588 ?
opiexe+2ef8          bl       kkxexe               11064CEA0 ?
kpoal8+0edc          bl       opiexe               FFFFFFFFFFFB454 ?
                                                   FFFFFFFFFFFB180 ?
                                                   FFFFFFFFFFF9628 ?
opiodr+0ae0          bl       _ptrgl               
ttcpip+1020          bl       _ptrgl               
opitsk+1124          bl       01F95258             
opiino+0990          bl       opitsk               0FFFFD490 ? 000000000 ?
opiodr+0ae0          bl       _ptrgl               
opidrv+0484          bl       01F947C0             
sou2o+0090           bl       opidrv               3C02D9855C ? 44065F000 ?
                                                   FFFFFFFFFFFF390 ?
opimai_real+01bc     bl       01F92214             
main+0098            bl       opimai_real          000000000 ? 000000000 ?
__start+0098         bl       main                 000000000 ? 000000000 ?

查询了一下metalink,发现错误号为kdsgrp1的已知600错误有20多个之多。当前的版本是10204,而这些已知bug又有80%都可能在这个版本上发生。不过仔细看一下错误的说明,这些错误有几分相似之处,多半都和表或索引的逻辑损坏有关。
在这个trace文件中,有明确的标识损坏之处:

            ROW 0766ce34.a continuation at
            file# 29 block# 2543156 slot 11 NOT found

显然这里出现了一个逻辑坏块,那么其实不用再去关心当前的问题属于哪个已知的bug,只要解决了这个坏块,就解决了这个ORA-600的错误。
有了FILE和BLOCK的序号,可以通过DUMP的手段,也可以查询DBA_EXTENTS视图,来确认坏块的内容。
如果坏块属于索引,那么可以通过删除索引然后重建的方式,很容易的解决问题。
如果坏块属于表,那么坏块中的部分行信息可能丢失。如果找不到出现错误之前的备份,可以通过DBMS_REPAIR包进行修复,或者设置alter session set events ‘10231 trace name context forever, level 10’,然后导出或CREATE TABLE AS SELECT将坏块意外的数据进行重建。
可以参考metalink文档ID 468883.1,获取更详细的信息。

Posted in BUG | Tagged , , , | Leave a comment

ORA-7445(qeshIHInsertPRowInBkt)错误

客户数据库告警日志出现了这个错误信息。
错误信息如下:

Fri Mar 18 09:52:26 2011
 Errors IN file /oracle/db/admin/B1EDB/udump/b1edb_ora_20076.trc:
 ORA-07445: exception encountered: core dump [qeshIHInsertPRowInBkt()+48] [SIGSEGV] [Address NOT mapped TO object] [0x000B2B7B0] [] []

对应的TRACE文件信息为:

/oracle/db/admin/B1EDB/udump/b1edb_ora_20076.trc
 Oracle DATABASE 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
 WITH the Partitioning, OLAP, DATA Mining AND REAL Application Testing options
 ORACLE_HOME = /oracle/db
 System name: HP-UX
 Node name: EDB
 Release: B.11.23
 Version: U
 Machine: ia64
 Instance name: B1EDB
 Redo thread mounted BY this instance: 1
 Oracle process NUMBER: 36
 Unix process pid: 20076, image: oracle@EDB
 
*** 2011-03-18 09:52:26.338
 *** ACTION NAME:() 2011-03-18 09:52:26.261
 *** MODULE NAME:(busobj.exe) 2011-03-18 09:52:26.261
 *** SERVICE NAME:(B1EDB) 2011-03-18 09:52:26.261
 *** SESSION ID:(501.23160) 2011-03-18 09:52:26.261
 Exception signal: 11 (SIGSEGV), code: 1 (Address NOT mapped TO object), addr: 0xb2b7b0, PC: [0x4000000002d51390, qeshIHInsertPRowInBkt()+48]
   r1: 60000000000b5e18       r20:                1       br5:                0
   r2: c000000040e04000       r21: 9fffffffab91bd50       br6: c000000000369bd0
   r3:             f04b       r22: 9fffffffbbc4a01f       br7: c000000000218e60
   r4:                0       r23:             1fd6        ip: 4000000002d51390
   r5: c000000000000408       r24: 9fffffffbf251110      iipa:                0
   r6: c000000000043810       r25:                8       cfm:             4b1e
   r7: 9fffffffbf7f8de8       r26:                7        um:               1a
   r8: 9fffffffab91bd69       r27: 9fffffffab91bd50       rsc:               1f
   r9:           1fffff       r28:                4       bsp: 9fffffffbf800600
  r10:             1fd2       r29: 9fffffffbf250100  bspstore: 9fffffffbf800600
  r11: 9fffffffab91bd88       r30:             1fdc      rnat: c0000002d4fd7b88
  r12: 9fffffffffff5e20       r31: 9fffffffab91bd00       ccv:                0
  r13: 9fffffffbf455430      NaTs:                0      unat:                0
  r14: 9fffffffab91bd88       PRs:             c84f      fpsr:    9804c8e76433f
  r15:             1f9f       br0: 4000000002cb5b70       pfs: c00000000000038e
  r16: 9fffffffbbc4a023       br1: c000000000218e60        lc:                0
  r17: 9fffffffab91bd50       br2:                0        ec:                0
  r18: 9fffffffbbc4a056       br3:                0       isr: 9fffffffbf800600
  r19:             1fd2       br4:                0       ifa:                0
 Reason code: 0008
 *** 2011-03-18 09:52:26.393
 ksedmp: internal OR fatal error
 ORA-07445: exception encountered: core dump [qeshIHInsertPRowInBkt()+48] [SIGSEGV] [Address NOT mapped TO object] [0x000B2B7B0] [] []
 CURRENT SQL statement FOR this SESSION:
 SELECT    
  EDS_LOT_RUN.EQPID, 
  SUM((to_date(EDS_LOT_RUN.TRACK_OUT_TIME, 'YYYYMMDD HH24MISS')  - to_date(EDS_LOT_RUN.TRACK_IN_TIME ,'YYYYMMDD HH24MISS')) * 24*60), 
  substr(EDS_LOT_EVENT_HIST.EVENT_TIME,1,8) ||' '|| substr(EDS_LOT_EVENT_HIST.EVENT_TIME,9,6)
 FROM 
  EDS_LOT_RUN, 
  EDS_LOT_EVENT_HIST, 
  EDS_LOT 
WHERE 
  ( EDS_LOT.LINE=EDS_LOT_EVENT_HIST.LINE AND 
EDS_LOT.FACTORY = EDS_LOT_EVENT_HIST.FACTORY AND 
EDS_LOT.LOTID= EDS_LOT_EVENT_HIST.LOTID  ) 
  AND  ( EDS_LOT.LINE=EDS_LOT_RUN.LINE AND  EDS_LOT.LOTID=EDS_LOT_RUN.LOTID  ) 
  AND  ( EDS_LOT_RUN.EVENT = 'TRACKOUT'  ) 
GROUP BY 
  EDS_LOT_RUN.EQPID,  
  substr(EDS_LOT_EVENT_HIST.EVENT_TIME,1,8) ||' '|| substr(EDS_LOT_EVENT_HIST.EVENT_TIME,9,6) 
----- Call Stack Trace -----
 calling              CALL     entry                argument VALUES IN hex      
location             TYPE     point                (? means dubious VALUE)     
-------------------- -------- -------------------- ----------------------------
 ksedst()+64          CALL     ksedst1()            000000001 ? 000000001 ?
 ksedmp()+2176        CALL     ksedst()             000000001 ?
                                                    C000000000000C9F ?
                                                    400000000404CC40 ?
                                                    000000000 ? 000000000 ?
                                                    000000000 ?
 ssexhd()+1264        CALL     ksedmp()             000000003 ?
                                                    9FFFFFFFFFFECBA0 ?
                                                    60000000000B5E18 ?
                                                    9FFFFFFFFFFED170 ?
                                                    C000000000000B9F ?
                                                    400000000696A380 ?
                                                    00000B6CD ?
                                                    60000000000C2ED0 ?
 <kernel>             CALL     ssexhd()             C00000033F4BF750 ?
                                                    60000000000C4DC0 ?
                                                    C000000040E0B7C0 ?
                                                    60000000000B5E18 ?
 qeshIHInsertPRowInB  CALL     <kernel>             9FFFFFFFFFFF1000 ?
 kt()+48                                            000000000 ? 000000000 ?
                                                    000000000 ? 000000000 ?
                                                    000000000 ? 000000000 ?
                                                    000000000 ?
 qeshIHInsertPRow()+  CALL     qeshIHInsertPRowInB  9FFFFFFFBF331430 ?
 80                            kt()                 9FFFFFFFBF256438 ?
                                                    05C1656F6 ? 0001656F6 ?
                                                    000B2B7B0 ?
                                                    9FFFFFFFAB91BD68 ?
                                                    000000000 ?
 qeshHDGrow()+1200    CALL     qeshIHInsertPRow()   9FFFFFFFBF331430 ?
                                                    9FFFFFFFBF256438 ?
                                                    05C1656F6 ?
                                                    9FFFFFFFAB91BD68 ?
                                                    000000000 ?
                                                    C00000000000091A ?
                                                    4000000005DEF540 ?
                                                    000000000 ?
 $cold_qeshPTInsertU  CALL     qeshHDGrow()         9FFFFFFFBF3307B8 ?
 RowForGBY()+256                                    9FFFFFFFBF251518 ?
                                                    9FFFFFFFFFFF5E48 ?
                                                    9FFFFFFFFFFF5E50 ?
                                                    60000000000B5E18 ?
                                                    000000000 ?
                                                    C000000000000B1D ?
 qeshLoadRowForGBY()  CALL     $cold_qeshPTInsertU  9FFFFFFFFFFF5FE0 ?
 +544                          RowForGBY()          9FFFFFFFFFFF6040 ?
                                                    9FFFFFFFBF251518 ?
                                                    9FFFFFFFFFFF5FE0 ?
                                                    9FFFFFFFBF3307B8 ?
                                                    9FFFFFFFBF251548 ?
                                                    000030E3C ?
                                                    9FFFFFFF881971E0 ?
 qerghRowP()+992      CALL     qeshLoadRowForGBY()  9FFFFFFFBF3307B8 ?
                                                    9FFFFFFFBF3CB860 ?
                                                    000000005 ?
                                                    60000000000B5E18 ?
                                                    C000000000000EA5 ?
                                                    4000000002EEBE20 ?
                                                    9FFFFFFFFFFF6040 ?
                                                    9FFFFFFFBF330850 ?
 qerhjWalkHashBucket  CALL     qerghRowP()          C0000002D4071220 ?
 2()+1632                                           000007FFF ?
                                                    60000000000B5E18 ?
                                                    C000000000001026 ?
                                                    4000000007E45220 ?
                                                    00000814B ?
 qerhjFetchPhase2()+  CALL     qerhjWalkHashBucket  C0000002D40714F8 ?
 1536                          2()                  4000000001C184B0 ?
                                                    C0000002D4071220 ?
                                                    000007FFF ?
                                                    9FFFFFFFBF3CB490 ?
                                                    9FFFFFFFFFFF6180 ?
                                                    000000000 ?
 $cold_qerhjFetch()+  CALL     qerhjFetchPhase2()   C0000002D40714F8 ?
 4080                                               4000000001C184B0 ?
                                                    C0000002D4071220 ?
                                                    000007FFF ?
                                                    60000000000A5B28 ?
                                                    C000000000000F24 ?
 qerghFetch()+528     CALL     $cold_qerhjFetch()   C0000002D40714F8 ?
                                                    4000000001C184B0 ?
                                                    C0000002D4071220 ?
                                                    000007FFF ?
                                                    60000000000B5E18 ?
 opifch2()+9632       CALL     qerghFetch()         C0000002D4071220 ?
                                                    4000000001C47AD0 ?
                                                    9FFFFFFFFFFF6550 ?
                                                    0000000C9 ?
                                                    60000000000B5E18 ?
                                                    C000000000001F46 ?
                                                    4000000002EB3640 ?
                                                    000018375 ?
 kpoal8()+6240        CALL     opifch2()            9FFFFFFFFFFF7510 ?
                                                    4000000002AE5700 ?
                                                    00001E80B ?
                                                    9FFFFFFFFFFF63C0 ?
                                                    60000000000B5E18 ?
                                                    C0000000000015B3 ?
                                                    60000000000314F0 ?
                                                    040002D91 ?
 opiodr()+2128        CALL     kpoal8()             9FFFFFFFFFFF7C40 ?
                                                    C000000000001530 ?
                                                    000000000 ?
                                                    9FFFFFFFFFFF7560 ?
                                                    60000000000B5E18 ?
                                                    9FFFFFFFBF3CCE40 ?
 ttcpip()+3088        CALL     opiodr()             00000005E ? 000000017 ?
                                                    4000000001B5F640 ?
                                                    0000046B0 ?
                                                    9FFFFFFFFFFF7C50 ?
                                                    60000000000B5E18 ?
                                                    9FFFFFFFFFFF9DC0 ?
                                                    C000000000001430 ?
 opitsk()+2336        CALL     ttcpip()             600000000003D0F0 ?
                                                    9FFFFFFFFFFFA278 ?
                                                    9FFFFFFFFFFFA320 ?
                                                    0000001AF ?
                                                    9FFFFFFFFFFFA490 ?
                                                    9FFFFFFFFFFFA284 ?
                                                    4000000001C45750 ?
                                                    000000000 ?
 opiino()+1840        CALL     opitsk()             000000000 ? 000000000 ?
                                                    60000000000B5E18 ?
                                                    4000000002851570 ?
                                                    0000180CD ?
                                                    4000000001B5F658 ?
 opiodr()+2128        CALL     opiino()             00000003C ?
                                                    9FFFFFFFFFFFCCE0 ?
                                                    9FFFFFFFFFFFF480 ?
                                                    9FFFFFFFFFFFC1A0 ?
                                                    60000000000B5E18 ?
                                                    C000000000001530 ?
 opidrv()+1088        CALL     opiodr()             00000003C ? 000000004 ?
                                                    4000000001B5F0F0 ?
                                                    0000046B0 ?
                                                    9FFFFFFFFFFFCCF0 ?
                                                    60000000000B5E18 ?
 sou2o()+336          CALL     opidrv()             00000003C ?
                                                    9FFFFFFFFFFFF480 ?
                                                    60000000000C2790 ?
 opimai_real()+224    CALL     sou2o()              9FFFFFFFFFFFF4A0 ?
                                                    00000003C ? 000000004 ?
                                                    9FFFFFFFFFFFF480 ?
 main()+368           CALL     opimai_real()        000000000 ?
                                                    9FFFFFFFFFFFF4D0 ?
 main_opd_entry()+80  CALL     main()               000000002 ?
                                                    9FFFFFFFFFFFF978 ?
                                                    60000000000B5E18 ?
                                                    C000000000000004 ?
 .
 .
 .
 ============
 Plan TABLE
 ============
 ------------------------------------------+--------------------------------+---------------+
 | Id| Operation               | Name              |ROWS |Bytes|Cost |TIME    |Pstart|Pstop |
 ------------------------------------------+--------------------------------+---------------+
 | 0 | SELECT STATEMENT        |                   |     |     | 922K|        |      |      |
 | 1 |  HASH GROUP BY          |                   |  25M|2876M| 922K|03:09:50|      |      |
 | 2 |   HASH JOIN             |                   |  25M|2876M| 259K|00:53:59|      |      |
 | 3 |    HASH JOIN            |                   |2734K| 216M|  61K|00:13:32|      |      |
 | 4 |     PARTITION RANGE ALL |                   | 940K|  19M| 6167|00:01:15 | 1    | 16   |
 | 5 |      TABLE ACCESS FULL  | EDS_LOT           | 940K|  19M| 6167|00:01:15 | 1    | 16   |
 | 6 |     PARTITION RANGE ALL |                   |2734K| 160M|  44K|00:09:05 | 1    | 16   |
 | 7 |      TABLE ACCESS FULL  | EDS_LOT_RUN       |2734K| 160M|  44K|00:09:05 | 1    | 16   |
 | 8 |    PARTITION RANGE ALL  |                   |  17M| 624M| 145K|00:30:41 | 1    | 16   |
 | 9 |     TABLE ACCESS FULL   | EDS_LOT_EVENT_HIST|  17M| 624M| 145K|00:30:41 | 1    | 16   |
 -------------------------------------------+-------------------------------+---------------+

根据TRACE信息和执行计划分析,这个问题是MOS文档ID 6818207.8描述的Bug 6818207 – Intermittent Wrong Results/Dump (qeshIHInsertPRowInBkt) using hash GROUP BY of large dataset。

从执行计划不难看出,HASH GROUP BY要处理的数据量很大,预计运行时间也很长,所有因素都有bug相关描述相符,这个bug在11.2.0.2中被FIXED,在当前版本中,可以通过设置”_gby_hash_aggregation_enabled”隐患函数为FALSE,来禁止HASH GROUP BY执行计划,从而避免错误的产生。

如果错误发生频率很低,也可以通过优化SQL的方式解决或者简单忽略之。

Posted in BUG | Tagged , , , | Leave a comment