ORA-7445(dbgrlWriteAlertDetail_int)和ORA-4030导致实例崩溃

客户11.2.0.2 RAC for Solaris 10 sparc单实例出现ORA-7445和ORA-4030操作,导致实例崩溃。
这个错误比较严重:

2012-05-04 02:02:26.403000 +08:00
Archived Log entry 949 added FOR thread 1 SEQUENCE 518 ID 0x70a64e83 dest 1:
Archived Log entry 950 added FOR thread 1 SEQUENCE 519 ID 0x70a64e83 dest 1:
Thread 1 advanced TO log SEQUENCE 521 (after internal thread enable)
Thread 2 opened at log SEQUENCE 441
CURRENT log# 4 seq# 441 mem# 0: /orcldata1/orcl/redo04.log
CURRENT log# 4 seq# 441 mem# 1: /orcldata2/orcl/redo04.log
Successful OPEN OF redo thread 2
MTTR advisory IS disabled because FAST_START_MTTR_TARGET IS NOT SET
SMON: enabling cache recovery
ARC3: Archival started
ARC0: STARTING ARCH PROCESSES COMPLETE
2012-05-04 02:02:27.570000 +08:00
[374] Successfully onlined Undo Tablespace 5.
Undo initialization finished serial:0 START:903149023 END:903149480 diff:457 (4 seconds)
Verifying file header compatibility FOR 11g tablespace encryption..
Verifying 11g file header compatibility FOR tablespace encryption completed
Redo thread 1 internally disabled at seq 521 (CKPT)
SMON: enabling tx recovery
DATABASE Characterset IS ZHS16GBK
Archived Log entry 951 added FOR thread 1 SEQUENCE 520 ID 0x70a64e83 dest 1:
ARC3: Archiving disabled thread 1 SEQUENCE 521
Archived Log entry 952 added FOR thread 1 SEQUENCE 521 ID 0x70a64e83 dest 1:
No Resource Manager plan active
minact-scn: Inst 2 IS now the master inc#:2 mmon proc-id:326 STATUS:0x7
minact-scn STATUS: grec-scn:0x0000.00000000 gmin-scn:0x0000.00000000 gcalc-scn:0x0000.00000000
2012-05-04 02:02:29.806000 +08:00
Starting background process GTX0
GTX0 started WITH pid=72, OS id=660
Starting background process RCBG
RCBG started WITH pid=73, OS id=662
replication_dependency_tracking turned off (no async multimaster replication found)
Thread 2 advanced TO log SEQUENCE 442 (LGWR switch)
CURRENT log# 3 seq# 442 mem# 0: /orcldata1/orcl/redo03.log
CURRENT log# 3 seq# 442 mem# 1: /orcldata2/orcl/redo03.log
Archived Log entry 953 added FOR thread 2 SEQUENCE 441 ID 0x70a64e83 dest 1:
2012-05-04 02:02:31.733000 +08:00
Starting background process QMNC
2012-05-04 02:02:37.268000 +08:00
Exception [TYPE: SIGSEGV, Address NOT mapped TO object] [ADDR:0xFFFFFFFF7FFF5EF8] [PC:0x108068724, dbgrlWriteAlertDetail_int()+132] [flags: 0x0, COUNT: 1]
Exception [TYPE: SIGSEGV, Address NOT mapped TO object] [ADDR:0xFFFFFFFF7FFF5FF4] [PC:0xFFFFFFFF7BD00F34, _memset()+52] [flags: 0x0, COUNT: 1]
2012-05-04 02:02:38.594000 +08:00
Errors IN file /opt/oracle/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_m000_666.trc (incident=156594):
ORA-07445: exception encountered: core dump [dbgrlWriteAlertDetail_int()+132] [SIGSEGV] [ADDR:0xFFFFFFFF7FFF5EF8] [PC:0x108068724] [Address NOT mapped TO obj
ect] []
ORA-04030: OUT OF process memory WHEN trying TO allocate 67108896 bytes (qesmmCheckPgaL,qesmmCheckPgaLimit:mem)
Incident details IN: /opt/oracle/app/oracle/diag/rdbms/orcl/orcl2/incident/incdir_156594/orcl2_m000_666_i156594.trc
USE ADRCI OR Support Workbench TO package the incident.
See Note 411.1 at My Oracle Support FOR error AND packaging details.
Errors IN file /opt/oracle/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_psp0_266.trc (incident=156026):
ORA-07445: exception encountered: core dump [_memset()+52] [SIGSEGV] [ADDR:0xFFFFFFFF7FFF5FF4] [PC:0xFFFFFFFF7BD00F34] [Address NOT mapped TO object] []
Incident details IN: /opt/oracle/app/oracle/diag/rdbms/orcl/orcl2/incident/incdir_156026/orcl2_psp0_266_i156026.trc
USE ADRCI OR Support Workbench TO package the incident.
See Note 411.1 at My Oracle Support FOR error AND packaging details.
2012-05-04 02:02:49.732000 +08:00
Exception [TYPE: SIGSEGV, Address NOT mapped TO object] [ADDR:0xFFFFFFFF7FFF2000] [PC:0x107D04724, dbgemdGetCallStackWFlag()+100] [flags: 0x0, COUNT: 1]
Errors IN file /opt/oracle/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_dia0_282.trc:
ORA-07445: exception encountered: core dump [dbgemdGetCallStackWFlag()+100] [SIGSEGV] [ADDR:0xFFFFFFFF7FFF2000] [PC:0x107D04724] [Address NOT mapped TO objec
t] []
ORA-04030: OUT OF process memory WHEN trying TO allocate 816 bytes (ksdhngmemctx_h,ksdhng:enod)
2012-05-04 02:02:55.200000 +08:00
Exception [TYPE: SIGSEGV, Address NOT mapped TO object] [ADDR:0xFFFFFFFF7FFF2000] [PC:0x107D04724, dbgemdGetCallStackWFlag()+100] [flags: 0x0, COUNT: 1]
Errors IN file /opt/oracle/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_diag_274.trc:
ORA-07445: exception encountered: core dump [dbgemdGetCallStackWFlag()+100] [SIGSEGV] [ADDR:0xFFFFFFFF7FFF2000] [PC:0x107D04724] [Address NOT mapped TO objec
t] []
ORA-04030: OUT OF process memory WHEN trying TO allocate 32128 bytes (pga heap,grpsvc msg)
2012-05-04 02:02:58.724000 +08:00
USE ADRCI OR Support Workbench TO package the incident.
See Note 411.1 at My Oracle Support FOR error AND packaging details.
USE ADRCI OR Support Workbench TO package the incident.
See Note 411.1 at My Oracle Support FOR error AND packaging details.
2012-05-04 02:04:01.919000 +08:00
PMON (ospid: 264): terminating the instance due TO error 490
2012-05-04 02:04:05.129000 +08:00
Instance TERMINATED BY PMON, pid = 264

数据库的启动操作还没完成,就碰到ORA-7445 [dbgrlWriteAlertDetail_int]错误,随后是ORA-04030错误,接着是ORA-07445[_memset]错误,最后是ORA-7445[dbgemdGetCallStackWFlag]错误。而这一系列的错误出现,最终导致了PMON结束了数据库实例。
从错误信息上看,和内存分配有关,但是数据库刚启动,怎么会连67M的内存都无法分配呢,查询MOS发现,原来是SWAP空间耗尽导致的,详细描述可以参考Instance crash ORA-7445 [_memset()+120] and ORA-4030 (QERHJ hash-joi,kllcqas:kllsltba) [ID 1071033.1]
检查系统的日志信息message:

May 4 02:02:14 orcl2 Had[5187]: [ID 702911 daemon.notice] VCS CRITICAL V-16-1-50086 Swap usage ON orcl2 IS 97%
May 4 02:02:33 orcl2 genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap SPACE TO grow stack FOR pid 666 (oracle)
May 4 02:02:33 orcl2 genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap SPACE TO grow stack FOR pid 674 (oracle)
May 4 02:02:33 orcl2 genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap SPACE TO grow stack FOR pid 668 (oracle)
May 4 02:02:34 orcl2 genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap SPACE TO grow stack FOR pid 266 (oracle)
May 4 02:02:35 orcl2 genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap SPACE TO grow stack FOR pid 666 (oracle)
May 4 02:02:35 orcl2 genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap SPACE TO grow stack FOR pid 668 (oracle)
May 4 02:02:35 orcl2 genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap SPACE TO grow stack FOR pid 674 (oracle)
May 4 02:02:35 orcl2 genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap SPACE TO grow stack FOR pid 266 (oracle)
May 4 02:02:36 orcl2 genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap SPACE TO grow stack FOR pid 666 (oracle)
May 4 02:02:36 orcl2 genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap SPACE TO grow stack FOR pid 674 (oracle)
May 4 02:02:36 orcl2 genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap SPACE TO grow stack FOR pid 668 (oracle)
May 4 02:02:36 orcl2 genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap SPACE TO grow stack FOR pid 266 (oracle)
May 4 02:02:37 orcl2 genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap SPACE TO grow stack FOR pid 666 (oracle)
May 4 02:02:37 orcl2 genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap SPACE TO grow stack FOR pid 668 (oracle)

果然找到大量的SWAP空间不足的告警,对于Solaris系统而言,清理/tmp空间,然后重启数据库,问题不再出现。

Posted in BUG | Tagged , , , , , , , , | Leave a comment

ORA-600(ksnpost:ksnigb)错误

客户10.2.0.5 RAC for HP-UX ia64出现ORA-600(ksnpost:ksnigb)的错误。
错误信息如下:

Wed Apr 25 16:32:21 UTC 2012
Errors IN file /oracle/app/admin/orcl/bdump/orcl2_ora_17014.trc:
ORA-00600: Message 600 NOT found; No message file FOR product=RDBMS, facility=ORA; arguments: [ksnpost:ksnigb]
Wed Apr 25 16:32:55 UTC 2012
Trace dumping IS performing id=[cdmp_20120425163255]
Wed Apr 25 16:32:56 UTC 2012
Errors IN file /oracle/app/admin/orcl/bdump/orcl2_ora_5511.trc:
ORA-12012: 自动执行作业 8938 出错
ORA-12801: 并行查询服务器 ORA-12801: 并行查询服务器 P122, instance orcldb2:orcl2 (2) 中发出错误信号
ORA-00600: Message 600 NOT found; No message file FOR product=RDBMS, facility=ORA; arguments: [ksnpost:ksnigb]
ORA-06512: 在 "SYS.DBMS_STATS", line 21275
中发出错误信号
Wed Apr 25 16:35:41 UTC 2012
Errors IN file /oracle/app/admin/orcl/bdump/orcl2_ora_19966.trc:
ORA-27041: unable TO OPEN file
HPUX-ia64 Error: 23: File TABLE overflow
Additional information: 44
Additional information: 65598
Additional information: 2734688
Wed Apr 25 16:35:41 UTC 2012
Errors IN file /oracle/app/admin/orcl/bdump/orcl2_ora_19966.trc:
ORA-27041: unable TO OPEN file
HPUX-ia64 Error: 23: File TABLE overflow
Additional information: 44
Additional information: 65598
Additional information: 2734688
Wed Apr 25 16:35:41 UTC 2012
Errors IN file /oracle/app/admin/orcl/bdump/orcl2_ora_19966.trc:
ORA-27041: unable TO OPEN file
HPUX-ia64 Error: 23: File TABLE overflow
Additional information: 44
Additional information: 65598
Additional information: 2734688

详细TRACE信息为:

orcldb2>more /oracle/app/admin/orcl/bdump/orcl2_ora_17014.trc
Dump file /oracle/app/admin/orcl/bdump/orcl2_ora_17014.trc
Oracle DATABASE 10g Enterprise Edition Release 10.2.0.5.0 - 64bit Production
WITH the Partitioning, REAL Application Clusters, OLAP, DATA Mining
AND REAL Application Testing options
ORACLE_HOME = /oracle/app/product/10.2.0/db_1
System name:    HP-UX
Node name:      orcldb2
Release:        B.11.31
Version:        U
Machine:        ia64
Instance name: orcl2
Redo thread mounted BY this instance: 2
Oracle process NUMBER: 0
Unix process pid: 17014, image: oracle@orcldb2
Ioctl ASYNC_CONFIG error, errno = 1
*** SERVICE NAME:(SYS$USERS) 2012-04-25 16:32:21.472
*** SESSION ID:(377.37550) 2012-04-25 16:32:21.472
*** 2012-04-25 16:32:21.472
ksedmp: internal OR fatal error
ORA-00600: Message 600 NOT found; No message file FOR product=RDBMS, facility=ORA; arguments: [ksnpost:ksnigb]
CURRENT SQL statement FOR this SESSION:
SELECT /*+ parallel_index(t,"IDX_PRODUCT_ID",96) dbms_stats cursor_sharing_exact use_weak_name_resl dynamic_sampling(0) no_monitoring no_expand index_ffs(t,"
IDX_PRODUCT_ID") */ COUNT(*) AS nrw,COUNT(DISTINCT sys_op_lbid(82769,'L',t.rowid)) AS nlb,COUNT(DISTINCT "PRODUCT_ID") AS ndk,sys_op_countchg(substrb(t.rowid
,1,15),1) AS clf FROM "ORCL_UNICOM"."MESSAGE_LOG" sample block (   .4377057550,1)  t WHERE ("PRODUCT_ID" IS NOT NULL) AND (TBL$OR$IDX$PART$NUM("ORCL_UNICOM"."M
ESSAGE_LOG",0,4,0,"ROWID") = :objn)
----- Call Stack Trace -----
calling              CALL     entry                argument VALUES IN hex      
location             TYPE     point                (? means dubious VALUE)     
-------------------- -------- -------------------- ----------------------------
40000000040333a0     CALL     4000000004035b60     000000000 ? 000000001 ?
4000000004032b40     CALL     4000000004033360     000000000 ?
                                                   C000000000000D20 ?
                                                   4000000004032B40 ?
                                                   000000000 ? 000000000 ?
                                                   000000000 ?
400000000407abb0     CALL     40000000040322c0     000000003 ?
                                                   9FFFFFFFFFFEFC10 ?
                                                   60000000000BA268 ?
                                                   9FFFFFFFFFFF01E0 ?
                                                   C000000000000999 ?
                                                   400000000407ABB0 ?
40000000098c33d0     CALL     400000000407ab40     9FFFFFFFFFFF0770 ?
                                                   000000003 ?
                                                   9FFFFFFFFFFF01F0 ?
                                                   60000000000BA268 ?
                                                   C000000000000612 ?
                                                   40000000098C33D0 ?
40000000098c41c0     CALL     40000000098c32a0     60000000000318D0 ?
                                                   4000000001ADC4E0 ?
                                                   6000000000032988 ?
                                                   4000000001ADC4E0 ?
                                                   9FFFFFFFFFFF07A0 ?
4000000004030400     CALL     40000000098c4180     60000000000318D0 ?
                                                   9FFFFFFFFD364638 ?
                                                   9FFFFFFFFD364648 ?
                                                   6000000000032D00 ?
                                                   9FFFFFFFFFFF07A0 ?
4000000002071430     CALL     4000000004030350     4000000000F67410 ?
                                                   000000000 ?
                                                   60000000000C6C98 ?
40000000030c7ce0     CALL     40000000020713d0     C000001FF2838160 ?
                                                   60000000000BA268 ?
                                                   C000000000000797 ?
                                                   40000000030C7CE0 ?
                                                   9FFFFFFFFFFF07DC ?
                                                   9FFFFFFFFFFF07E0 ?
                                                   9FFFFFFFFFFF07D0 ?
                                                   000000001 ?
40000000030c8940     CALL     40000000030c7a50     C0000020351ED5A0 ?

查询MOS,发现和问题文档Get Various ORA-600 Error Followed By “HPUX-ia64 Error: 24: Too many open files” [ID 1061351.1]描述得比较接近。虽然文档描述的问题是HPUX-ia64 Error: 24,而当前的错误是HPUX-ia64 Error: 23: File table overflow,但是二者的本质是一样的。
从错误信息上看65598已经超过了设置值65536,针对这个问题,可以通过调整内核参数的方式进行解决,可以参考文档How to Increase the Number of Open Files on HP-UX [ID 1264834.1]。
而对于当前的情况,问题发生在并行执行上,并行度开到96显然过大,将其降至并行度16应该就可以避免该错误。

Posted in BUG | Tagged , , , , , | Leave a comment

在线索引重建被取消导致大量ORA-600(kghstack_underflow_internal_3)错误

客户在比较繁忙的时刻执行了索引的REBUILD操作,导致大量会话被锁,最终对ONLINE REBUILD操作执行了取消操作,引发了一系列更严重的ORA-600错误。
登录数据库后,发现两个节点上ORACLE_BASE所在目录都已经100%占用。数据库无法正常通过/ AS SYSDBA方式登录。
查询告警日志发现大量的ORA-600和ORA-7445错误:

Tue May 08 21:20:26 EAT 2012
Errors IN file /oracle/app/admin/orcl/udump/orcl2_ora_1555.trc:
ORA-07445: exception encountered: core dump [0000000000000000] [SIGSEGV] [Invalid permissions FOR mapped object] [0x000000000] [] []
Tue May 08 21:20:27 EAT 2012
Errors IN file /oracle/app/admin/orcl/udump/orcl2_ora_3891.trc:
ORA-00600: internal error code, arguments: [kghstack_underflow_internal_3], [0x60000000003002F0], [keybuf], [], [], [], [], []
Tue May 08 21:20:27 EAT 2012
Errors IN file /oracle/app/admin/orcl/udump/orcl2_ora_26190.trc:
ORA-07445: exception encountered: core dump [0000000000000000] [SIGSEGV] [Invalid permissions FOR mapped object] [0x000000000] [] []
Tue May 08 21:20:27 EAT 2012
Errors IN file /oracle/app/admin/orcl/udump/orcl2_ora_2873.trc:
ORA-07445: exception encountered: core dump [0000000000000000] [SIGSEGV] [Invalid permissions FOR mapped object] [0x000000000] [] []
Tue May 08 21:20:27 EAT 2012
Errors IN file /oracle/app/admin/orcl/udump/orcl2_ora_4518.trc:
ORA-00600: internal error code, arguments: [kghstack_underflow_internal_3], [0x60000000003002F0], [keybuf], [], [], [], [], []
Tue May 08 21:20:27 EAT 2012
Errors IN file /oracle/app/admin/orcl/udump/orcl2_ora_22469.trc:
ORA-00600: internal error code, arguments: [kghstack_underflow_internal_3], [0x60000000003002F0], [keybuf], [], [], [], [], []
Tue May 08 21:20:28 EAT 2012
Errors IN file /oracle/app/admin/orcl/udump/orcl2_ora_26440.trc:
ORA-07445: exception encountered: core dump [0000000000000000] [SIGSEGV] [Invalid permissions FOR mapped object] [0x000000000] [] []
Tue May 08 21:20:29 EAT 2012
Errors IN file /oracle/app/admin/orcl/udump/orcl2_ora_762.trc:
ORA-07445: exception encountered: core dump [0000000000000000] [SIGSEGV] [Invalid permissions FOR mapped object] [0x000000000] [] []
Tue May 08 21:20:29 EAT 2012
Errors IN file /oracle/app/admin/orcl/udump/orcl2_ora_26106.trc:
ORA-07445: exception encountered: core dump [0000000000000000] [SIGSEGV] [Invalid permissions FOR mapped object] [0x000000000] [] []
Tue May 08 21:20:30 EAT 2012
Errors IN file /oracle/app/admin/orcl/udump/orcl2_ora_1597.trc:
ORA-07445: 出现异常错误: 核心转储 [0000000000000000] [SIGSEGV] [Invalid permissions FOR mapped object] [0x000000000] [] []
Tue May 08 21:20:30 EAT 2012
Errors IN file /oracle/app/admin/orcl/udump/orcl2_ora_856.trc:
ORA-07445: exception encountered: core dump [0000000000000000] [SIGSEGV] [Invalid permissions FOR mapped object] [0x000000000] [] []
.
.
.

检查对应的TRACE文件,发现导致错误语句执行的是TABLE_A表的插入:

ksedmp: internal OR fatal error
ORA-00600: 内部错误代码, 参数: [kghstack_underflow_internal_3], [0x60000000003002F0], [keybuf], [], [], [], [], []
CURRENT SQL statement FOR this SESSION:
INSERT INTO TABLE_A (O_ID, P_ID, S_ID, F_ITEM, F_NAME, T_MON, D_MON, I_MON, P_STATE, P_TIME, R1, R2) VALUES (:B10 , SUBSTR(:B10 , LENGTH(:B10 ) - 1, 2), :B9 , :B8 , :B7 , :B6 , :B5 , :B4 , :B3 , NULL, :B2 , :B1 )
----- PL/SQL Call Stack -----
  object      line  object
  handle    NUMBER  name
c00000203c2cc550       119  package body U1.P_O_I
c00000203b788200       288  PROCEDURE U1.U_B_O_I
c00000203719b8d0         1  anonymous block
----- Call Stack Trace -----
calling              CALL     entry                argument VALUES IN hex      
location             TYPE     point                (? means dubious VALUE)     
-------------------- -------- -------------------- ----------------------------
ksedst()+64          CALL     ksedst1()            000000000 ? 000000001 ?
ksedmp()+2176        CALL     ksedst()             000000000 ?
                                                   C000000000000D20 ?
                                                   4000000004037940 ?
                                                   000000000 ? 000000000 ?
                                                   000000000 ?
ksfdmp()+112         CALL     ksedmp()             000000003 ?
                                                   9FFFFFFFFFFF1230 ?
                                                   60000000000BA290 ?
                                                   9FFFFFFFFFFF1800 ?
                                                   C000000000000999 ?
                                                   400000000407F9B0 ?
kgerinv()+304        CALL     ksfdmp()             9FFFFFFFFFFF1D90 ?
                                                   000000003 ?
                                                   9FFFFFFFFFFF1810 ?
                                                   60000000000BA290 ?
                                                   C000000000000612 ?
                                                   40000000098C38B0 ?
kgeasnmierr()+144    CALL     kgerinv()            60000000000318D0 ?
                                                   4000000001AD98A0 ?
                                                   6000000000032988 ?
                                                   4000000001AD98A0 ?
                                                   9FFFFFFFFFFF1DD0 ?
$cold_kghstack_unde  CALL     kgeasnmierr()        60000000000318D0 ?
rflow_internal()+36                                9FFFFFFFBF3B1168 ?
8                                                  9FFFFFFFBF3B1178 ?
                                                   6000000000032D00 ?
                                                   000000002 ?
                                                   60000000003002F0 ?
                                                   000000001 ? 000000006 ?
kghstack_free()+208  CALL     $cold_kghstack_unde  60000000000318D0 ?
                              rflow_internal()     60000000003002F0 ?
                                                   60000000000BA290 ?
                                                   C000000000000B1D ?
                                                   4000000002F7A510 ?
                                                   00002C87B ?
                                                   6000000000031A10 ?
ksmfrs()+48          CALL     kghstack_free()      60000000000318D0 ?
                                                   60000000003002F0 ?
rpiswu2()+1312       CALL     ksmfrs()             60000000003002F0 ?
                                                   C000000000001026 ?
                                                   4000000002F78960 ?
                                                   000000000 ? 000000000 ?
                                                   000000000 ?
rpidrv()+2352        CALL     rpiswu2()            9FFFFFFFFFFF2AF0 ?
                                                   4000000002F7AE60 ?
                                                   00002F833 ?
                                                   60000000000A7D20 ?
                                                   9FFFFFFFFFFF1E20 ?
                                                   C000000000001ABD ?
                                                   4000000000F14558 ?
                                                   60000000000C6CF0 ?
psddr0()+864         CALL     rpidrv()             000000018 ? 000000066 ?
                                                   9FFFFFFFFFFF3700 ?
                                                   000000038 ?
                                                   9FFFFFFFFFFF2B20 ?
                                                   60000000000BA290 ?
psdnal()+736         CALL     psddr0()             000000018 ? 000000066 ?
                                                   9FFFFFFFFFFF3700 ?
                                                   000000030 ?
pevm_EXECC()+832     CALL     psdnal()             9FFFFFFFFFFF54D0 ?
                                                   C00000203489A9F8 ?
                                                   C0000000000011AA ?
                                                   9FFFFFFFBE832840 ?
                                                   C00000203C2CC550 ?
                                                   400000000313C770 ?
                                                   000026035 ?
pfrinstr_EXECC()+16  CALL     pevm_EXECC()         9FFFFFFFBE83D1D0 ?
0                                                  9FFFFFFFBE8328B0 ?
                                                   000000020 ?
pfrrun_no_tool()+19  CALL     pfrinstr_EXECC()     9FFFFFFFBE83D1D0 ?
2                                                  C000001DA198D61C ?
                                                   9FFFFFFFBE83D238 ?
pfrrun()+1376        CALL     pfrrun_no_tool()     9FFFFFFFBE83D1D0 ?
                                                   000002001 ?
                                                   9FFFFFFFBE83D238 ?
                                                   60000000000BA290 ?
                                                   C000000000000A1C ?
                                                   4000000003163040 ?
                                                   9FFFFFFFBE83D620 ?
                                                   9FFFFFFFBE83D298 ?
plsql_run()+1328     CALL     pfrrun()             9FFFFFFFFFFF3788 ?
                                                   9FFFFFFFFFFF3770 ?
                                                   60000000000BA290 ?
                                                   9FFFFFFFFFFF4370 ?
                                                   9FFFFFFFFFFF4370 ?
                                                   C000000000000E23 ?
                                                   4000000002C34D70 ?
peicnt()+560         CALL     plsql_run()          9FFFFFFFFFFF4380 ?
                                                   000000001 ? 000000000 ?
                                                   9FFFFFFFFFFF4380 ?
                                                   60000000000BA290 ?
                                                   9FFFFFFFFFFF4E90 ?
kkxexe()+1008        CALL     peicnt()             9FFFFFFFFFFF54D0 ?
                                                   9FFFFFFFBE83D1D0 ?
                                                   9FFFFFFFFFFF4EA0 ?
                                                   60000000000BA290 ?
                                                   9FFFFFFFFFFF5420 ?
                                                   C000000000000A1C ?
                                                   000000020 ?
                                                   9FFFFFFFFFFF4F00 ?
opiexe()+8016        CALL     kkxexe()             0000051F0 ?
.
.
.

这个SQL语句是正常的业务操作,而这种简单的INSERT都会导致ORA-600和ORA-7445错误,说明插入的表或索引本身存在问题。
检查表和索引的状态,发现索引部分分区状态异常:

CREATE INDEX "U1"."IDX_F_2" 
ON "U1"."TABLE_A" ("S_ID")
PCTFREE 10 INITRANS 2 MAXTRANS 255 NOLOGGING
STORAGE(
BUFFER_POOL DEFAULT) LOCAL
(PARTITION "P1"
PCTFREE 10 INITRANS 2 MAXTRANS 255 NOLOGGING
STORAGE(INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT)
TABLESPACE "TBS7" ,
PARTITION "P2"
PCTFREE 10 INITRANS 2 MAXTRANS 255 NOLOGGING
STORAGE(INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT)
TABLESPACE "TBS8" ,
PARTITION "P3"
PCTFREE 10 INITRANS 2 MAXTRANS 255 NOLOGGING
STORAGE(INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT)
TABLESPACE "TBS9" ,
PARTITION "P4"
PCTFREE 10 INITRANS 2 MAXTRANS 255 NOLOGGING
STORAGE(INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT)
TABLESPACE "TBS10" ,
PARTITION "P5"
PCTFREE 10 INITRANS 2 MAXTRANS 255 NOLOGGING
STORAGE(INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT)
TABLESPACE "TBS1" ,
PARTITION "P6"
PCTFREE 10 INITRANS 2 MAXTRANS 255 NOLOGGING
STORAGE(INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT)
TABLESPACE "TBS2" ,
PARTITION "P7"
PCTFREE 10 INITRANS 2 MAXTRANS 255 NOLOGGING
STORAGE(INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT)
TABLESPACE "TBS3" ,
PARTITION "P8"
PCTFREE 10 INITRANS 2 MAXTRANS 255 NOLOGGING
STORAGE(INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT)
TABLESPACE "TBS4" ,
PARTITION "P9"
PCTFREE 10 INITRANS 2 MAXTRANS 255 NOLOGGING
STORAGE(INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT)
TABLESPACE "TBS5" ,
PARTITION "P10"
PCTFREE 10 INITRANS 2 MAXTRANS 255 NOLOGGING
STORAGE(INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT)
TABLESPACE "TBS6" )
ALTER INDEX "U1"."IDX_F_2" MODIFY PARTITION "P6" UNUSABLE
ALTER INDEX "U1"."IDX_F_2" MODIFY PARTITION "P7" UNUSABLE

和客户沟通后,确认今天问题发生之前有人对索引进行维护操作,操作语句包括:

ALTER INDEX u1.IDX_F_2 rebuild partition p6 parallel
ALTER INDEX U1.IDX_T_2 rebuild partition P9 nologging parallel online
ALTER INDEX U1.IDX_T_1 rebuild partition P3 nologging parallel online
ALTER INDEX U1.IDX_F_2 rebuild partition P1 nologging parallel online
ALTER INDEX U1.IDX_T_4 rebuild partition P1 nologging parallel online
ALTER INDEX U1.IDX_F_4 rebuild partition P5 nologging parallel online

在对索引进行维护之后,数据库中出现了大量的enq: TM – contention等待,TM锁出现的原和是索引IDX_F_2的PARTITION P6重建语句没有添加ONLINE参数有直接关系。随后大量会话被阻塞,而整个索引重建操作被人为中止,其中包括正在运行的ONLINE REBUILD操作,而ONLINE REBUILD操作被中止是十分危险的,很容易导致bug的产生,而当前就导致了ORA-600和ORA-7445的产生。
由于客户着急解决问题,因此对数据库进行了重启。重启后,ORA-600错误信息不再出现。但是前台应用报错,索引失效。

SELECT 'ALTER INDEX U1.' || INDEX_NAME || ' REBUILD PARTITION ' || PARTITION_NAME || ';'
FROM DBA_IND_PARTITIONS
WHERE INDEX_OWNER = 'U1'
AND STATUS != 'USABLE';

获取所有失效的索引重建语句,将索引重新编译后,数据库恢复正常。
在Oracle的MOS上没有找到与之最相关的bug信息,反而是找到了一个相关性很高的补丁信息,而对应的bug信息并没有公开:补丁程序13737888: ONLINE DDL:ORA-600[KGHSTACK_UNDERFLOW_INTERNAL_3], [0X2B7F4E1E7678], [KEYBUF]。

Posted in ORACLE | Tagged , , , , , , , | Leave a comment

RAC环境启动单实例报错ORA-1105

客户的4节点RAC环境,其中一个节点实例出现故障,发现无法正常启动。
检查CLUSTER和告警日志信息,发现节点1心跳超时,被踢出集群。服务器重新启动后,数据库实例没有自动启动。
告警日志信息为:

Mon Apr 16 03:42:39 2012
Thread 1 advanced TO log SEQUENCE 22348 (LGWR switch)
CURRENT log# 16 seq# 22348 mem# 0: +DATA/orcl/onlinelog/group_16.291.766326571
CURRENT log# 16 seq# 22348 mem# 1: +DATA/orcl/onlinelog/group_16.293.766330969
Mon Apr 16 15:02:58 2012
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Interface TYPE 1 eth1 10.0.0.0 configured FROM OCR FOR USE AS a cluster interconnect
Interface TYPE 1 eth0 192.168.1.0 configured FROM OCR FOR USE AS a public interface
Picked latch-free SCN scheme 3
Autotune OF undo retention IS turned ON.
LICENSE_MAX_USERS = 0
SYS auditing IS disabled
ksdpec: called FOR event 13740 prior TO event GROUP initialization
Starting up ORACLE RDBMS Version: 10.2.0.4.0.
System parameters WITH non-DEFAULT VALUES:
processes = 1500
sessions = 1655
sga_max_size = 19830669312
pre_page_sga = FALSE
lock_sga = FALSE
__shared_pool_size = 3674210304
__large_pool_size = 16777216
__java_pool_size = 16777216
__streams_pool_size = 33554432
spfile = +DATA/orcl/spfileorcl.ora
sga_target = 19830669312
control_files = +DATA/orcl/controlfile/CURRENT.274.720740395
db_block_size = 8192
__db_cache_size = 16072572928
compatible = 10.2.0.3.0
log_archive_dest_1 = LOCATION=+DATA/
log_archive_format = %t_%s_%r.dbf
db_file_multiblock_read_count= 16
cluster_database = TRUE
cluster_database_instances= 4
db_create_file_dest = +DATA
_gc_affinity_time = 0
_gc_affinity_limit = 10000000
_gc_affinity_minimum = 10000000
thread = 1
instance_number = 1
undo_management = AUTO
undo_tablespace = UNDOTBS1
remote_login_passwordfile= EXCLUSIVE
db_domain =
dispatchers = (PROTOCOL=TCP) (SERVICE=orclXDB)
local_listener = (ADDRESS = (PROTOCOL = TCP)(HOST = 192.168.1.21)(PORT = 1521))
remote_listener = LISTENERS_ORCL
job_queue_processes = 20
cursor_sharing = FORCE
background_dump_dest = /u01/app/oracle/admin/orcl/bdump
user_dump_dest = /u01/app/oracle/admin/orcl/udump
core_dump_dest = /u01/app/oracle/admin/orcl/cdump
audit_file_dest = /u01/app/oracle/admin/orcl/adump
db_name = orcl
open_cursors = 1000
pga_aggregate_target = 5872025600
Cluster communication IS configured TO USE the following interface(s) FOR this instance
10.0.0.11
Mon Apr 16 15:02:59 2012
cluster interconnect IPC version:Oracle UDP/IP (generic)
IPC Vendor 1 proto 2
PMON started WITH pid=2, OS id=20588
DIAG started WITH pid=3, OS id=20590
PSP0 started WITH pid=4, OS id=20592
LMON started WITH pid=5, OS id=20594
LMD0 started WITH pid=6, OS id=20596
LMS0 started WITH pid=7, OS id=20603
LMS1 started WITH pid=8, OS id=20607
LMS2 started WITH pid=9, OS id=20611
LMS3 started WITH pid=10, OS id=20615
MMAN started WITH pid=11, OS id=20619
DBW0 started WITH pid=12, OS id=20621
DBW1 started WITH pid=13, OS id=20623
LGWR started WITH pid=14, OS id=20625
CKPT started WITH pid=15, OS id=20627
SMON started WITH pid=16, OS id=20629
RECO started WITH pid=17, OS id=20631
CJQ0 started WITH pid=18, OS id=20633
MMON started WITH pid=19, OS id=20635
Mon Apr 16 15:03:00 2012
starting up 1 dispatcher(s) FOR network address '(ADDRESS=(PARTIAL=YES)(PROTOCOL=TCP))'...
MMNL started WITH pid=20, OS id=20637
Mon Apr 16 15:03:00 2012
starting up 1 shared server(s) ...
Mon Apr 16 15:03:02 2012
lmon registered WITH NM - instance id 1 (internal mem no 0)
Mon Apr 16 15:03:05 2012
Reconfiguration started (OLD inc 0, NEW inc 30)
List OF nodes:
0 1 2 3
Global Resource Directory frozen
* allocate DOMAIN 0, invalid = TRUE
Communication channels reestablished
* DOMAIN 0 valid according TO instance 3
* DOMAIN 0 valid = 1 according TO instance 1
Mon Apr 16 15:03:05 2012
Master broadcasted resource hash VALUE bitmaps
Non-LOCAL Process blocks cleaned OUT
Mon Apr 16 15:03:05 2012
LMS 0: 0 GCS shadows cancelled, 0 closed
Mon Apr 16 15:03:05 2012
LMS 1: 0 GCS shadows cancelled, 0 closed
Mon Apr 16 15:03:05 2012
LMS 2: 0 GCS shadows cancelled, 0 closed
Mon Apr 16 15:03:05 2012
LMS 3: 0 GCS shadows cancelled, 0 closed
SET master node info
Submitted ALL remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
ALL grantable enqueues GRANTED
Mon Apr 16 15:03:07 2012
LMS 2: 0 GCS shadows traversed, 0 replayed
Mon Apr 16 15:03:07 2012
LMS 1: 0 GCS shadows traversed, 0 replayed
Mon Apr 16 15:03:07 2012
LMS 3: 0 GCS shadows traversed, 0 replayed
Mon Apr 16 15:03:07 2012
LMS 0: 0 GCS shadows traversed, 0 replayed
Mon Apr 16 15:03:07 2012
Submitted ALL GCS remote-cache requests
Post SMON TO START 1st pass IR
Fix WRITE IN gcs resources
Reconfiguration complete
LCK0 started WITH pid=23, OS id=20699
Mon Apr 16 15:03:12 2012
ALTER DATABASE MOUNT
Mon Apr 16 15:03:12 2012
Starting background process ASMB
ASMB started WITH pid=25, OS id=20710
Starting background process RBAL
RBAL started WITH pid=26, OS id=20714
Mon Apr 16 15:03:17 2012
SUCCESS: diskgroup DATA was mounted
Mon Apr 16 15:03:21 2012
Setting recovery target incarnation TO 2
Mon Apr 16 15:03:21 2012
SUCCESS: diskgroup DATA was dismounted
Mon Apr 16 15:03:21 2012
ORA-1105 signalled during: ALTER DATABASE MOUNT...

这个ORA-1105错误只是说明当前实例的某些参数设置和RAC其他实例设置的不符,并不能说明导致错误的真正原因。
为了找到问题只有手工启动实例:

[oracle@rac1 ~]$ sqlplus / AS sysdba
SQL*Plus: Release 10.2.0.4.0 - Production ON 星期一 416 17:04:54 2012
Copyright (c) 1982, 2007, Oracle. ALL Rights Reserved.
Connected TO an idle instance.
SQL> STARTUP MOUNT
ORACLE instance started.
Total System Global Area 1.9831E+10 bytes
Fixed SIZE 2119216 bytes
Variable SIZE 3741321680 bytes
DATABASE Buffers 1.6073E+10 bytes
Redo Buffers 14655488 bytes
ORA-01105: mount IS incompatible WITH mounts BY other instances
ORA-01606: gc_files_to_locks NOT identical TO that OF another mounted instance

通过手工执行,可以了解具体导致错误产生的原因。不过gc_files_to_locks并没有设置为不同的值:

SQL> SHOW parameter gc_files_to_locks
NAME                TYPE                VALUE
------------------- ------------------- -----------------
gc_files_to_locks   string
SQL> SELECT sid, name, VALUE FROM v$spparameter WHERE name = 'gc_files_to_locks'; 
SID        NAME                           VALUE
---------- ------------------------------ --------------------------------
*          gc_files_to_locks

不过导致问题产生的确实与GC设置有关,问题并非是gc_files_to_locks参数导致,而是SPFILE中设置的_gc_affinity_time参数。这个参数是静态参数,只有重启后才能生效,而在SPFILE中设置后,会导致重启的实例1生效了该参数,因此和没有重启过的其他实例产生了不兼容。
解决方法有两个,一个是重启所有的节点,另外一个是去掉SPFILE中这个参数的设置:

SQL> ALTER system reset "_gc_affinity_time" scope = spfile sid = '*';
System altered.
SQL> shutdown immediate
ORA-01507: DATABASE NOT mounted
 
ORACLE instance shut down.
SQL> startup mount
ORACLE instance started.
Total System Global Area 1.9831E+10 bytes
Fixed SIZE 2119216 bytes
Variable SIZE 3741321680 bytes
DATABASE Buffers 1.6073E+10 bytes
Redo Buffers 14655488 bytes
DATABASE mounted.
SQL> ALTER DATABASE OPEN;
DATABASE altered.
Posted in ORACLE | Tagged , , , , | Leave a comment

10G开始Oracle区分物化视图和表

在9i以前,很多功能都是不区分表和物化视图的区别的,到了10g以后。很多功能会将表和物化视图区分对待。
原本通用的COMMENT ON TABLE语句,对物化视图不再有效,必须要使用COMMENT ON MATERIALIZED VIEW语句代替。

SQL> SELECT * FROM V$VERSION;
BANNER
----------------------------------------------------------------
Oracle DATABASE 10g Enterprise Edition Release 10.2.0.4.0 - 64bi
PL/SQL Release 10.2.0.4.0 - Production
CORE 10.2.0.4.0 Production
TNS FOR Linux: Version 10.2.0.4.0 - Production
NLSRTL Version 10.2.0.4.0 - Production
SQL> CREATE TABLE T_BASE (ID NUMBER, TYPE VARCHAR2(30), NUM NUMBER);
TABLE created.
SQL> CREATE MATERIALIZED VIEW LOG ON T_BASE
2 WITH ROWID, SEQUENCE (TYPE, NUM)
3 INCLUDING NEW VALUES;
Materialized VIEW log created.
SQL> CREATE MATERIALIZED VIEW MV_BASE
2 REFRESH FAST ENABLE QUERY REWRITE AS
3 SELECT TYPE, SUM(NUM) SUM_NUM, COUNT(NUM) CNT_NUM, COUNT(*) CNT
4 FROM T_BASE
5 GROUP BY TYPE;
Materialized VIEW created.
SQL> COMMENT ON TABLE MV_BASE IS 'COMMENT ON A MATERIALIZED VIEW ';
COMMENT ON TABLE MV_BASE IS 'COMMENT ON A MATERIALIZED VIEW '
*
ERROR at line 1:
ORA-12098: cannot comment ON the materialized VIEW
 
SQL> COMMENT ON MATERIALIZED VIEW MV_BASE IS 'COMMENT ON A MATERIALIZED VIEW ';
Comment created.
SQL> COL COMMENTS FOR A60
SQL> SELECT * FROM USER_MVIEW_COMMENTS;
MVIEW_NAME                     COMMENTS
------------------------------ ------------------------------------------------------------
MV_BASE                        COMMENT ON A MATERIALIZED VIEW

其实不只是COMMENT发生了变化,关于物化视图的执行计划Oracle也对其进行细化,将物化视图的扫描和全表扫描区分开:

SQL> SET AUTOT ON EXP
SQL> SELECT COUNT(*) FROM MV_BASE;
  COUNT(*)
----------
         0
Execution Plan
----------------------------------------------------------
Plan hash VALUE: 3034976462
-------------------------------------------------------------------------
| Id  | Operation             | Name    | ROWS  | Cost (%CPU)| TIME     |
-------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |         |     1 |     2   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE       |         |     1 |            |          |
|   2 |   MAT_VIEW ACCESS FULL| MV_BASE |     1 |     2   (0)| 00:00:01 |
-------------------------------------------------------------------------
Note
-----
   - dynamic sampling used FOR this statement
SQL> SELECT /*+ REWRITE */ TYPE, COUNT(*) FROM T_BASE GROUP BY TYPE;
no ROWS selected
Execution Plan
----------------------------------------------------------
Plan hash VALUE: 1008429399
----------------------------------------------------------------------------------------
| Id  | Operation                    | Name    | ROWS  | Bytes | Cost (%CPU)| TIME     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |         |     1 |    30 |     2   (0)| 00:00:01 |
|   1 |  MAT_VIEW REWRITE ACCESS FULL| MV_BASE |     1 |    30 |     2   (0)| 00:00:01 |
----------------------------------------------------------------------------------------
Note
-----
   - dynamic sampling used FOR this statement

在9i以前,很难从执行计划中区分扫描的是表还是物化视图,但是现在一目了然了。
总的来说,这种改进还是很有意义的,用户可以更清楚的了解处理的对象到底是表还是物化视图。

Posted in ORACLE | Tagged , , , , | Leave a comment

ORA-600(qertbFetchByRowID)错误

以前也碰到过这个错误,不过由于无法直接连接数据库,没有办法深入分析。这次虽然同样没有办法连接数据库,但是获取了更详细的信息,基本确定导致问题的原因。
错误信息如下:

Wed May 09 10:41:32 2012
Errors IN file /oracle/admin/orcl/udump/orcl_ora_423466.trc:
ORA-00600: internal error code, arguments: [qertbFetchByRowID], [], [], [], [], [], [], []

其实从错误信息不难分析,导致问题的原因和索引有关,错误信息明确指出了问题发生在通过ROWID信息读取表时,而通过ROWID读取是索引扫描的标志。
一般来说这种问题是索引和表上数据的逻辑损坏造成的,如果问题出现在索引上,那么最简单的办法莫过于删除重建索引,或者使用ONLINE方式重建索引。
将索引删除后,问题果然不再出现,但是尝试重建索引时,引发了另一个ORA-600错误:

SQL> analyze TABLE tab_a validate STRUCTURE cascade online;
analyze TABLE tab_a validate STRUCTURE cascade online
ORA-01499: TABLE/INDEX CROSS reference failure - see trace file
SQL> ALTER INDEX ind_a rebuild online parallel 4;
ALTER INDEX ind_a rebuild online parallel 4
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], []

告警日志中信息为:

Errors IN file /oracle/admin/orcl/udump/orcl_ora_618990.trc:
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], []

详细TRACE中信息为:

*** 2012-05-09 13:52:41.518
ksedmp: internal OR fatal error
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], []
CURRENT SQL statement FOR this SESSION:
ALTER INDEX ind_a rebuild online parallel 4
----- Call Stack Trace -----
calling              CALL     entry                argument VALUES IN hex      
location             TYPE     point                (? means dubious VALUE)     
-------------------- -------- -------------------- ----------------------------
ksedst+001c          bl       ksedst1              00000FFFF ? 000000000 ?
ksedmp+0290          bl       ksedst               104A557A8 ?
ksfdmp+0018          bl       03F4CD24             
kgerinv+00dc         bl       _ptrgl               
kgeasnmierr+004c     bl       kgerinv              70000002F81807C ? 0000000B8 ?
                                                   1F400000000 ? 110546478 ?
                                                   110546068 ?
kdsgrp+0460          bl       01FB9490             
kdsgnp+0444          bl       kdsgrp               70000002F8184ED ? 000000000 ?
                                                   110546070 ?
kdstf0011101km+0c24  bl       03F4A5C0             
kdsttgr+5244         bl       kdstf0011101km       110546068 ? 000010266 ?
                                                   70000022DCBB510 ?
                                                   FFFFFFFFFFF54F8 ? 0FFFFFFFF ?
                                                   000001FE8 ? 000000000 ?
                                                   000000000 ?
qertbFetch+09b8      bl       kdsttgr              110546068 ?
                                                   4842208000000001 ?
                                                   804000007BFB0 ? 000000000 ?
                                                   000000000 ? 000000000 ?
                                                   000000000 ? C4FB00000000 ?
qergiFetch+02a4      bl       03F49A98             
qertqoFetch+01f0     bl       01FB9488             
qertqiFetch+0158     bl       01FB9488             
qersoFetch+0110      bl       03F49A98             
qerliFetch+0188      bl       03F49A98             
qertqoFetch+01f0     bl       01FB9488             
qerpxFetch+0210      bl       01FB9488             
kdicrws+16e8         bl       01FB9488             
kdicdrv+09b8         bl       kdicrws              70000022DFE4DC0 ?
                                                   70000022DFE5500 ?
                                                   70000022DFE5500 ?
opiexe+28c4          bl       kdicdrv              B00000009 ? 100000001 ?
                                                   11043C340 ?
opiosq0+19f0         bl       opiexe               FFFFFFFFFFF8E70 ?
                                                   2824424042420820 ?
                                                   FFFFFFFFFFF8F30 ?
kpooprx+0168         bl       opiosq0              33C0D81A0 ? 1101957F8 ?
                                                   1100098E0 ? A4000110199810 ?
kpoal8+0400          bl       kpooprx              FFFFFFFFFFFB784 ?
                                                   FFFFFFFFFFFB508 ?
                                                   3B0000003B ? 100000001 ?
                                                   000000000 ? A40000000000A4 ?
                                                   000000000 ? 1103A0018 ?
opiodr+0ae0          bl       _ptrgl               
ttcpip+1020          bl       _ptrgl               
opitsk+1124          bl       ttcpip               1100CAB28 ? 000000000 ?
                                                   000000001 ? FFFFFFFFFFFB8B0 ?
                                                   000000001 ? 1100DFE78 ?
                                                   FFFFFFFFFFFB830 ? 11046F998 ?
opiino+0990          bl       opitsk               0FFFFD7C0 ? 000000000 ?
opiodr+0ae0          bl       _ptrgl               
opidrv+0484          bl       01FBA8D8             
sou2o+0090           bl       opidrv               3C02ABCD9C ? 4A039F928 ?
                                                   FFFFFFFFFFFF6C0 ?
opimai_real+01bc     bl       01FB868C             
main+0098            bl       opimai_real          000000000 ? 000000000 ?
__start+0090         bl       main                 000000000 ? 000000000 ?
--------------------- Binary Stack Dump ---------------------

可以看到,数据的损坏出在表块上,因此重建索引会导致错误。如果问题出现在表块上,那么除了利用DBMS_REPAIR包之类的进行修复外,还可以通过简单的PL/SQL将正常的数据导出。由于数据存在坏块,直接CREATE TABLE AS SELECT采用直接路径方式,会绕过数据的校验,因此错误数据仍然会插入。而EXP和EXPDP在导出的时候同样会报错,因此只有通过PL/SQL单条记录读取,如果正常就插入到目标表,如果出现读取异常,那么通过异常判断来过滤该错误。
ORA-600[kdsgrp1]错误:http://yangtingkun.itpub.net/post/468/524144

Posted in BUG | Tagged , , , , | Leave a comment

安装ORACLE时在Linux上设置内核参数的含义

前两天看到一篇Redhat官方的Oracle安装文档,对于Linux内核参数的修改描述的非常清晰。
安装Oracle之前,除了检查操作系统的硬件和软件是否满足安装需要之外,一个重点就是修改内核参数,其中最主要的是和内存相关的参数设置。
SHMMAX参数:Linux进程可以分配的单独共享内存段的最大值。一般设置为内存总大小的一半。这个值的设置应该大于SGA_MAX_TARGET或MEMORY_MAX_TARGET的值,因此对于安装Oracle数据库的系统,shmmax的值应该比内存的二分之一大一些。

# grep MemTotal /proc/meminfo
# cat /proc/sys/kernel/shmmax

上面的命令是检查系统内存的大小,以及当前shmmax的设置。

# echo 21474836480 > /proc/sys/kernel/shmmax
# sysctl -w kernel.shmmax=21474836480
# echo "kernel.shmmax=21474836480" >> /etc/sysctl.conf

这是设置shmmax参数的几种方法,这三种方式都可以将shmmax设置为20G。这个参数的修改可以不重启数据库。个人推荐使用第二种sysctl命令的方式。采用第三种方式需要执行sysctl –t操作或重启,但是为了确保下次重启后设置值仍然生效,第三种方式是必不可少的。前两种方式类似alter system set scope = memory,而第三种方式则类似alter system set scope = spfile。
SHMMNI参数:设置系统级最大共享内存段数量。Oracle10g推荐最小值为4096,可以适当比4096增加一些。

# cat /proc/sys/kernel/shmmni
# echo 4096 > /proc/sys/kernel/shmmni
# sysctl -w kernel.shmmni=4096
# echo "kernel.shmmni=4096" >> /etc/sysctl.conf

检查和设置方法如上,这和shmmax的修改方式没有区别,不在赘述。
SHMALL参数:设置共享内存总页数。这个值太小有可能导致数据库启动报错。很多人调整系统内核参数的时候只关注SHMMAX参数,而忽略了SHMALL参数的设置。这个值推荐设置为物理内存大小除以分页大小。

# getconf PAGE_SIZE

通过getconf获取分页的大小,用来计算SHMALL的合理设置值:

SQL> SELECT 32*1024*1024*1024/4096 FROM dual;
32*1024*1024*1024/4096
----------------------
               8388608

对于32G的内存,4K分页大小的系统而言,SHMALL的值应该设置为8388608。

# cat /proc/sys/kernel/shmall
# echo 8388608 > /proc/sys/kernel/shmall
# sysctl -w kernel.shmall=8388608
# echo " kernel.shmall=8388608" >> /etc/sysctl.conf

查询和设置方法如上。
信号灯semaphores是进程或线程间访问共享内存时提供同步的计数器。
SEMMSL参数:设置每个信号灯组中信号灯最大数量,推荐的最小值是250。对于系统中存在大量并发连接的系统,推荐将这个值设置为PROCESSES初始化参数加10。
SEMMNI参数:设置系统中信号灯组的最大数量。Oracle10g和11g的推荐值为142。
SEMMNS参数:设置系统中信号灯的最大数量。操作系统在分配信号灯时不会超过LEAST(SEMMNS,SEMMSL*SEMMNI)。事实上,如果SEMMNS的值超过了SEMMSL*SEMMNI是非法的,因此推荐SEMMNS的值就设置为SEMMSL*SEMMNI。Oracle推荐SEMMNS的设置不小于32000,假如数据库的PROCESSES参数设置为600,则SEMMNS的设置应为:

SQL> SELECT (600+10)*142 FROM dual;
(600+10)*142
------------
       86620

SEMOPM参数:设置每次系统调用可以同时执行的最大信号灯操作的数量。由于一个信号灯组最多拥有SEMMSL个信号灯,因此有推荐将SEMOPM设置为SEMMSL的值。Oracle验证的10.2和11.1的SEMOPM的配置为100。
通过下面的命令可以检查信号灯相关配置:

# cat /proc/sys/kernel/sem
250 32000 100 128

对应的4个值从左到右分别为SEMMSL、SEMMNS、SEMOPM和SEMMNI。修改方法为:

# echo 610 86620 100 142 > /proc/sys/kernel/sem
# sysctl -w kernel.sem="610 86620 100 142"
# echo "kernel.sem=610 86620 100 142" >> /etc/sysctl.conf
Posted in ORACLE | Tagged , , , , , , , , | Leave a comment

9i出现内存分配错误

在9.2.0.8 RAC环境中,告警日志频繁出现内存分配错误的信息。
错误信息如下:

Wed May 2 10:20:43 2012
Trace dumping IS performing id=[cdmp_20120502102043]
There are 1444 memory allocation errors FOR object-level stat
IN the LAST 15 minutes
.
.
.
There are 1605 memory allocation errors FOR object-level stat
IN the LAST 15 minutes

在另外一个节点上则出现了ORA-4031错误:

Sun Apr 29 22:30:52 2012
Errors IN file /oracle/admin/orcl/bdump/orcl2_j000_1204632.trc:
ORA-12012: error ON auto EXECUTE OF job 527
ORA-04031: unable TO allocate 4224 bytes OF shared memory ("shared pool","unknown object","sga heap(2,0)","object level s")
ORA-06512: at "PERFSTAT.STATSPACK", line 2479
ORA-06512: at "PERFSTAT.STATSPACK", line 91
ORA-06512: at line 1
Sun Apr 29 23:03:14 2012
Errors IN file /oracle/admin/orcl/bdump/orcl2_j001_1528536.trc:
ORA-12012: error ON auto EXECUTE OF job 527
ORA-04031: unable TO allocate 4256 bytes OF shared memory ("shared pool","unknown object","sga heap(2,0)","obj stat memor")
ORA-06512: at "PERFSTAT.STATSPACK", line 2479
ORA-06512: at "PERFSTAT.STATSPACK", line 91
ORA-06512: at line 1
There IS 1 memory allocation error FOR object-level stat
IN the LAST 15 minutes
There are 66 memory allocation errors FOR object-level stat
IN the LAST 15 minutes
There are 8 memory allocation errors FOR object-level stat
IN the LAST 15 minutes
Mon Apr 30 00:07:38 2012
Errors IN file /oracle/admin/orcl/bdump/orcl2_j001_942754.trc:
ORA-12012: error ON auto EXECUTE OF job 527
ORA-04031: unable TO allocate 4256 bytes OF shared memory ("shared pool","unknown object","sga heap(2,0)","obj stat memor")
ORA-06512: at "PERFSTAT.STATSPACK", line 2479
ORA-06512: at "PERFSTAT.STATSPACK", line 91
ORA-06512: at line 1
There are 354 memory allocation errors FOR object-level stat
IN the LAST 15 minutes
There are 653 memory allocation errors FOR object-level stat
IN the LAST 15 minutes
Mon Apr 30 02:07:45 2012
Thread 2 advanced TO log SEQUENCE 71531
CURRENT log# 5 seq# 71531 mem# 0: /dev/rlv_redo22_128m
Mon Apr 30 02:07:45 2012
ARC0: Evaluating archive log 4 thread 2 SEQUENCE 71530
ARC0: Beginning TO archive log 4 thread 2 SEQUENCE 71530
Creating archive destination LOG_ARCHIVE_DEST_1: '/archive2/2_71530.dbf'
ARC0: Completed archiving log 4 thread 2 SEQUENCE 71530
There are 41 memory allocation errors FOR object-level stat
IN the LAST 15 minutes

ORA-4031错误比较常见,简单的说就是内存不足所致。从上面的错误信息看,内存不足发生在共享池上。而Memory allocation errors同样是共享池空间不足的表现,可以参考文章Memory Allocation Errors For Object-Level Stat Appearing in the Alert Log File [ID 757895.1]。
针对这个问题,可以通过设置隐含参数_OBJECT_STATISTICS=FALSE来禁用对象级统计信息,但是最彻底的方法还是增加共享池空间。

Posted in BUG | Tagged , , , | Leave a comment

ORA-600(kgavsd_3)错误

客户9208环境出现ORA-600[kgavsd_3]错误。
错误信息为:

Thu DEC 15 15:47:57 2011
Errors IN file /oracle/admin/orcl/udump/orcl2_ora_659472.trc:
ORA-00600: internal error code, arguments: [kgavsd_3], [0], [], [], [], [], [], []
ORA-00604: error occurred at recursive SQL level 2
ORA-03113: end-of-file ON communication channel
ORA-03113: end-of-file ON communication channel
ORA-06512: at "DB_USER.P_JOB", line 79
ORA-06512: at line 3
ORA-06508: PL/SQL: could NOT find program unit being called
ORA-06512: at line 1
Thu DEC 15 15:47:58 2011
Trace dumping IS performing id=[cdmp_20111215154758]

类似的错误还出现过两次,一次也是ORA-3113错误导致的:

Thu DEC 15 16:49:00 2011
Errors IN file /oracle/admin/orcl/udump/orcl2_ora_906034.trc:
ORA-00600: internal error code, arguments: [kgavsd_3], [0], [], [], [], [], [], []
ORA-00604: error occurred at recursive SQL level 1
ORA-03113: end-of-file ON communication channel
Thu DEC 15 16:49:02 2011
Trace dumping IS performing id=[cdmp_20111215164902]

而另一次是被用户取消所致:

Mon Jan 2 18:21:50 2012
Errors IN file /oracle/admin/orcl/udump/orcl2_ora_630794.trc:
ORA-00600: internal error code, arguments: [kgavsd_3], [0], [], [], [], [], [], []
ORA-00604: error occurred at recursive SQL level 1
ORA-01013: USER requested cancel OF CURRENT operation
Mon Jan 2 18:21:51 2012
Trace dumping IS performing id=[cdmp_20120102182151]

根据这三个错误综合判断,可以判断问题和Ora-600 kgavsd_3 [ID 403575.1]描述的一致,客户使用的多半是SQL DEVELOPER之类的工具,在使用DEBUG调试PL/SQL代码是出现了错误或中止了会话,导致DBMS_PIPE调用中断。
对于这个错误可以安全的将其忽略掉。

Posted in BUG | Tagged , , , , , , | Leave a comment

节点重配置后IPC发送超时问题

客户RAC环境在一个节点重启后,另一个节点出现IPC send timeout信息。
详细错误信息为:

Wed May 2 22:07:00 2012
IPC Send timeout detected.Sender: ospid 20808
Receiver: inst 1 binc 1718095761 ospid 16263
Wed May 2 22:07:02 2012
IPC Send timeout detected.Sender: ospid 6677
Receiver: inst 1 binc 1718095761 ospid 16263
Wed May 2 22:07:09 2012
IPC Send timeout detected.Sender: ospid 16758
Receiver: inst 1 binc 1718096035 ospid 16261
Wed May 2 22:07:13 2012
IPC Send timeout detected.Sender: ospid 8947
Receiver: inst 1 binc 1718095761 ospid 16263
Wed May 2 22:07:13 2012
IPC Send timeout detected.Sender: ospid 6583
Receiver: inst 1 binc 1718095761 ospid 16263
Wed May 2 22:07:31 2012
IPC Send timeout TO 0.0 inc 24 FOR msg TYPE 12 FROM opid 132
Wed May 2 22:07:31 2012
IPC Send timeout detected.Sender: ospid 17068
Receiver: inst 1 binc 1718095761 ospid 16263
Wed May 2 22:07:34 2012
Communications reconfiguration: instance_number 1
Wed May 2 22:07:34 2012
IPC Send timeout TO 0.0 inc 24 FOR msg TYPE 12 FROM opid 154
Wed May 2 22:07:45 2012
IPC Send timeout TO 0.0 inc 24 FOR msg TYPE 12 FROM opid 64
Wed May 2 22:07:45 2012
IPC Send timeout TO 0.0 inc 24 FOR msg TYPE 12 FROM opid 95
Wed May 2 22:07:54 2012
IPC Send timeout detected.Sender: ospid 21078
Receiver: inst 1 binc 1718095761 ospid 16263
Wed May 2 22:07:59 2012
IPC Send timeout TO 0.0 inc 24 FOR msg TYPE 12 FROM opid 24
Wed May 2 22:08:04 2012
Trace dumping IS performing id=[cdmp_20120502220729]
Wed May 2 22:08:24 2012
IPC Send timeout TO 0.0 inc 24 FOR msg TYPE 12 FROM opid 146
Wed May 2 22:08:36 2012
Trace dumping IS performing id=[cdmp_20120502220805]
Wed May 2 22:08:38 2012
Trace dumping IS performing id=[cdmp_20120502220805]
Wed May 2 22:10:55 2012
Evicting instance 1 FROM cluster
Wed May 2 22:11:32 2012
Waiting FOR instances TO leave: 
1

这个信息并不正常,查询MOS后发现,这是一个bug,问题描述可以参考:’IPC Send Timeout Detected’ errors between QMON Processes after RAC reconfiguration [ID 458912.1]。
对于当前的10.2.0.4环境,需要针对Bug 6200820进行PATCH修正,而对于10.2.0.3版本则需要应用Patch 6326889。
在MOS中查到不少类似IPC Timeout的问题,多数都会影响10.2.0.4版本,且大部分都在10.2.0.5中被fixed,因此如果这个问题出现频繁,升级到10.2.0.5也是一个不错的选择。

Posted in BUG | Tagged , , , | Leave a comment