ODA一键式升级

ODA另外一个值得称道的方便之处,就是一键式升级。
数据库的升级本身就是比较麻烦的事情,不谈跨大版本的升级,仅仅是PSR的升级也包括很多的步骤,即使是一个PSU的升级,同样也不是一个简单的工作。如果是RAC环境,那么恭喜你,工作量DOUBLE都不止。
那么对于ODA而言,除了RAC架构之外,还有自身的ILON以及管理工具要维护,因此整个环境的升级一定是一个非常烦琐的工作。而ODA的最大目标就是简化工作,因此整个ODA硬件及软件环境这个非常烦琐的操作被简化为几个命令完成。
当前的ODA管理工具oak版本为2.2.0.0.0,包含的数据库版本为11.2.0.3.2,下面的测试将oak升级到2.3.0.0.0,而数据库的版本相应的升级到11.2.0.3.3。
首先需要从MOS上下载补丁文件:p13982331_23000_Linux-x86-64.zip,这个文件打包了所有ODA升级需要的文件。
上传到/tmp目录后,通过unpack选项进行解压:

[root@odaenmo1 ~]# cd /opt/oracle/oak/bin
[root@odaenmo1 bin]# ./oakcli unpack –package /tmp/p13982331_23000_Linux-x86-64.zip
Unpacking takes a while,  pls wait....
Successfully unpacked the files TO repository.

解压完成后,就可以执行升级PATCH的操作了:

[root@odaenmo1 bin]# ./oakcli UPDATE –patch 2.3.0.0.0

执行完这个操作,Oracle会提示在升级过程中,DB/ASM/CLUSTERWARE都会被停止,选择确认后,Oracle要求确认是否在节点2上同样执行了unpack操作。
输入root的密码后,Oracle开始对HMP、OAK、IPMI和STORAGE进行升级操作。
不过这个升级过程并不会升级GRID和DB,如果要将CLUSTERWARE和DB同样升级到最新的版本,应该在上面的升级结束后执行下面的命令:

[root@odaenmo1 bin]# ./oakcli UPDATE –patch 2.3.0.0.0 --gi
[root@odaenmo1 bin]# ./oakcli UPDATE –patch 2.3.0.0.0 --database

虽然ODA是RAC的架构,从理论上讲,具备了ROLLING UPGRATE的能力来提高系统的可用性,但是ODA的升级并没有考虑这么复杂的功能。
显然Oracle为了操作简化这个最关键的目标,牺牲了升级过程中的系统可用性。整个升级过程包括GRID和DB的升级不到1个小时,这个停机时间对于绝大部分应用而言,还是可以接受的。

Posted in ORACLE | Tagged , , , , , , , , , , | Leave a comment

编译oracle时报错找不到loraolap10

由于OLAP组件的问题,尝试编译oracle,碰到loraolap10找不到的问题。
数据库版本为10.2.0.4 FOR HP-UX,在尝试通过编译oracle可执行文件来关闭OLAP组件时,出现错误,信息为:

ld: cannot find -loraolap10

参考MOS文档Linking Oracle fails with ld: cannot find -loraolap10 [ID 435912.1],导致问题的原因在于libknlopt.a文件中包含了不正确的xsyeolap.so文件。这个问题可能是由于卸载OLAP时没有像预期那样正确的完成,或者是在安装过程中配置环境出现了异常。
可以通过下面的方法来改正这个问题,并重新编译oracle:

$ cd $ORACLE_HOME/rdbms/lib
$ cp libknlopt.a libknlopt.a_save
$ ar d libknlopt.a xsyeolap.o
$ ar cr libknlopt.a xsnoolap.o
$ relink oracle
Posted in ORACLE | Tagged , , , , , | Leave a comment

数据库异机恢复打开报错ORA-36961

客户在异机进行数据库的不完全恢复,用来导出被误删除的表,恢复完成后,打开数据库时出现了ORA-36961错误。
数据库版本为10.2.0.4 for HP-UX,打开数据库前台出现ORA-3113错误:

SQL> startup
Total System Global Area 4294967296 bytes
Fixed SIZE 2073472 bytes
Variable SIZE 805309568 bytes
DATABASE Buffers 3472883712 bytes
Redo Buffers 14700544 bytes
DATABASE mounted.
ORA-03113: end-of-file ON communication channel

在启动过程中,如果前台出现ORA-3113错误,那么后台多半是ORA-600或ORA-7445错误,检查告警日志,却发现出现了ORA-36961错误:

Error IN executing triggers ON DATABASE startup
ksedmp: internal OR fatal error
ORA-00604: error occurred at recursive SQL level 1
ORA-12663: Services required BY client NOT available ON the server
ORA-36961: Oracle OLAP IS NOT available.
ORA-06512: at "SYS.OLAPIHISTORYRETENTION", line 1
ORA-06512: at line 15
Error IN executing triggers ON instance shutdown
ksedmp: internal OR fatal error
ORA-00604: error occurred at recursive SQL level 1
ORA-12663: Services required BY client NOT available ON the server
ORA-36961: Oracle OLAP IS NOT available.
ORA-06512: at "SYS.OLAPIHISTORYRETENTION", line 1
ORA-06512: at line 15

显然这个错误和OLAP组件的不正常有关。由于是异机恢复,很可能新的ORACLE_HOME安装的OLAP组件与原始环境不一致,根据文档’Ora-36961 Oracle OLAP is Not Available’ error During database Startup [ID 1291894.1],需要启用OLAP组件。由于OLAP组件所需要的xsyeolap.o文件缺失,需要从安装介质中获取。首先需要从介质盘的目录中拷贝Disk1/stage/Patches/oracle.rdbms.olap/10.2.0.5.0/1/DataFiles/filegroup1.1.1.jar文件到临时目录,然后抽取文件$ORACLE_HOME/jdk/bin/jar -xvf filegroup1.1.1.jar,最后将文件xsyeolap.o拷贝到ORACLE_HOME/rdbms/lib:

% cd $ORACLE_HOME/rdbms/lib
% make -f ins_rdbms.mk olap_on ioracle

重启数据库既可。

Posted in ORACLE | Tagged , , , , , , , | Leave a comment

ORA-700(kcbl2vfyfh_action)错误

ORA-600和ORA-7445的错误已经司空见惯了,而Oracle的ORA-700错误还是第一次看到,而看到这个错误,还是在ODA环境中看到的。
数据库环境为ODA,在尝试配置FLASH_CACHE:

SQL> SHOW parameter flash_
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
db_flash_cache_file                  string
db_flash_cache_size big              INTEGER     0
db_flashback_retention_target        INTEGER     1440
SQL> ALTER system SET db_flash_cache_size = 60G scope = spfile;
System altered.
SQL> ALTER system SET db_flash_cache_file= '+REDO' scope = spfile;
System altered.

重新启动数据库后,发现了这个错误:

Tue Aug 14 02:37:13 2012
Encounter problem verifying flash cache +REDO/flash/abc.flash. Disable flash cache AND issue an ORA-700 FOR diagnostics
Errors IN file /u01/app/oracle/diag/rdbms/odaenmo/odaenmo1/trace/odaenmo1_gen0_25557.trc (incident=195241):
ORA-00700: 软内部错误, 参数: [kcbl2vfyfh_action], [db_flash_cache_file integrity CHECK failed], [], [], [], [], [], [], [], [], [], []
Incident details IN: /u01/app/oracle/diag/rdbms/odaenmo/odaenmo1/incident/incdir_195241/odaenmo1_gen0_25557_i195241.trc
Tue Aug 14 02:37:14 2012
Tue Aug 14 02:37:14 2012

这是实例1上高级日志的错误,下面是详细的TRACE信息:

Dump file /u01/app/oracle/diag/rdbms/odaenmo/odaenmo1/incident/incdir_195241/odaenmo1_gen0_25557_i195241.trc
Oracle DATABASE 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
WITH the Partitioning, REAL Application Clusters, Automatic Storage Management, OLAP,
DATA Mining AND REAL Application Testing options
ORACLE_HOME = /u01/app/oracle/product/11.2.0.3/dbhome_1
System name:    Linux
Node name:      odaenmo1
Release:        2.6.32-300.11.1.el5uek
Version:        #1 SMP Wed Mar 7 09:55:50 EST 2012
Machine:        x86_64
Instance name: odaenmo1
Redo thread mounted BY this instance: 1
Oracle process NUMBER: 5
Unix process pid: 25557, image: oracle@odaenmo1 (GEN0)
*** 2012-08-14 02:37:13.532
*** SESSION ID:(3131.1) 2012-08-14 02:37:13.532
*** CLIENT ID:() 2012-08-14 02:37:13.532
*** SERVICE NAME:(SYS$BACKGROUND) 2012-08-14 02:37:13.532
*** MODULE NAME:() 2012-08-14 02:37:13.532
*** ACTION NAME:() 2012-08-14 02:37:13.532
Dump continued FROM file: /u01/app/oracle/diag/rdbms/odaenmo/odaenmo1/trace/odaenmo1_gen0_25557.trc
ORA-00700: 软内部错误, 参数: [kcbl2vfyfh_action], [db_flash_cache_file integrity CHECK failed], [], [], [], [], [], [], [], [], [], []
========= Dump FOR incident 195241 (ORA 700 [kcbl2vfyfh_action]) ========
----- Beginning of Customized Incident Dump(s) -----
Flash cache parameter expected: db_unique_name odaenmo, dbid 196211168 inst_id 1, bpid 3, bsz 8192
File header db_unique_name odaenmo, dbid 196211168, inst_id 1, bpid 3, bsz 8192 filetype_str "Oracle RDBMS Flash Cache File"
----- End of Customized Incident Dump(s) -----
*** 2012-08-14 02:37:13.568
dbkedDefDump(): Starting incident DEFAULT dumps (flags=0x2, level=3, mask=0x0)
----- SQL Statement (None) -----
CURRENT SQL information unavailable - no cursor.
----- Call Stack Trace -----
calling              CALL     entry                argument VALUES IN hex      
location             TYPE     point                (? means dubious VALUE)     
-------------------- -------- -------------------- ----------------------------
skdstdst()+36        CALL     kgdsdst()            000000000 ? 000000000 ?
                                                   7FFF491BA3C8 ? 000000001 ?
                                                   000000001 ? 000000002 ?
ksedst1()+98         CALL     skdstdst()           000000000 ? 000000000 ?
                                                   7FFF491BA3C8 ? 000000001 ?
                                                   000000000 ? 000000002 ?
ksedst()+34          CALL     ksedst1()            000000000 ? 000000001 ?
                                                   7FFF491BA3C8 ? 000000001 ?
                                                   000000000 ? 000000002 ?
dbkedDefDump()+2741  CALL     ksedst()             000000000 ? 000000001 ?
                                                   7FFF491BA3C8 ? 000000001 ?
                                                   000000000 ? 000000002 ?
ksedmp()+36          CALL     dbkedDefDump()       000000003 ? 000000002 ?
                                                   7FFF491BA3C8 ? 000000001 ?
                                                   000000000 ? 000000002 ?
ksfdmp()+64          CALL     ksedmp()             000000003 ? 000000002 ?
                                                   7FFF491BA3C8 ? 000000001 ?
                                                   000000000 ? 000000002 ?
dbgexPhaseII()+1764  CALL     ksfdmp()             000000003 ? 000000002 ?
                                                   7FFF491BA3C8 ? 000000001 ?
                                                   000000000 ? 000000002 ?
dbgexExplicitEndInc  CALL     dbgexPhaseII()       7F3C50D2E710 ? 7F3C50D398C0 ?
()+750                                             7FFF491C5F00 ? 000000001 ?
                                                   000000000 ? 000000002 ?
dbgeEndDDEInvocatio  CALL     dbgexExplicitEndInc  7F3C50D2E710 ? 7F3C50D398C0 ?
nImpl()+767                   ()                   7FFF491C5F00 ? 000000001 ?
                                                   000000000 ? 000000002 ?
dbgeEndDDEInvocatio  CALL     dbgeEndDDEInvocatio  7F3C50D2E710 ? 7F3C50D398C0 ?
n()+47                        nImpl()              7FFF491C5F00 ? 000000001 ?
                                                   000000000 ? 000000002 ?
kcbl2vfyfh_action()  CALL     dbgeEndDDEInvocatio  7F3C50D2E710 ? 7F3C50D398C0 ?
+3373                         n()                  7FFF491C5F00 ? 000000001 ?
                                                   000000000 ? 000000002 ?
ksbabs()+771         CALL     kcbl2vfyfh_action()  7FFF491C7738 ? 000000018 ?
                                                   7FFF491C5F00 ? 000000001 ?
                                                   000000000 ? 000000002 ?
ksbrdp()+971         CALL     ksbabs()             7FFF491C7738 ? 000000018 ?
                                                   7FFF491C5F00 ? 000000001 ?
                                                   000000000 ? 000000002 ?
opirip()+618         CALL     ksbrdp()             7FFF491C7738 ? 000000018 ?
                                                   7FFF491C5F00 ? 000000001 ?
                                                   000000000 ? 000000002 ?
opidrv()+598         CALL     opirip()             000000032 ? 000000004 ?
                                                   7FFF491C88F8 ? 000000001 ?
                                                   000000000 ? 000000002 ?
sou2o()+98           CALL     opidrv()             000000032 ? 000000004 ?
                                                   7FFF491C88F8 ? 000000001 ?
                                                   000000000 ? 000000002 ?
opimai_real()+261    CALL     sou2o()              7FFF491C88D0 ? 000000032 ?
                                                   000000004 ? 7FFF491C88F8 ?
                                                   000000000 ? 000000002 ?
ssthrdmain()+252     CALL     opimai_real()        000000000 ? 7FFF491C8AC0 ?
                                                   000000004 ? 7FFF491C88F8 ?
                                                   000000000 ? 000000002 ?
main()+196           CALL     ssthrdmain()         000000003 ? 7FFF491C8AC0 ?
                                                   000000001 ? 000000000 ?
                                                   000000000 ? 000000002 ?
__libc_start_main()  CALL     main()               000000003 ? 7FFF491C8C60 ?
+244                                               000000001 ? 000000000 ?
                                                   000000000 ? 000000002 ?
_start()+36          CALL     __libc_start_main()  000A0B188 ? 000000001 ?
                                                   7FFF491C8C58 ? 000000000 ?
                                                   000000000 ? 000000003 ?
--------------------- Binary Stack Dump ---------------------

前面提到了ORA-600和ORA-7445这种内部错误很常见,而这种软内部错误的ORA-700错误还是第一次看到。
下面是节点2上的告警:

2012-08-14 02:37:13.192000 +08:00
Encounter problem verifying flash cache +REDO/flash/abc.flash. Disable flash cache AND issue an ORA-700 FOR diagnostics
Errors IN file /u01/app/oracle/diag/rdbms/odaenmo/odaenmo2/trace/odaenmo2_gen0_16906.trc (incident=416041):
ORA-00700: 软内部错误, 参数: [kcbl2vfyfh_action], [db_flash_cache_file integrity CHECK failed], [], [], [], [], [], [], [], [], [], []
Incident details IN: /u01/app/oracle/diag/rdbms/odaenmo/odaenmo2/incident/incdir_416041/odaenmo2_gen0_16906_i416041.trc
L2 cache disabled FOR dbwr 1
L2 cache disabled FOR dbwr 2
L2 cache file closed BY dbwr 0
L2 cache disabled FOR dbwr 0
Dumping diagnostic DATA IN directory=[cdmp_20120814023714], requested BY (instance=2, osid=16906 (GEN0)), summary=[incident=416041].
2012-08-14 02:37:18.880000 +08:00
Dumping diagnostic DATA IN directory=[cdmp_20120814023714], requested BY (instance=1, osid=25557 (GEN0)), summary=[incident=195241].
2012-08-14 02:37:55.075000 +08:00
Sweep [inc][416041]: completed
Sweep [inc2][416041]: completed

对应的详细TRACE文件为:

Dump file /u01/app/oracle/diag/rdbms/odaenmo/odaenmo2/incident/incdir_416041/odaenmo2_gen0_16906_i416041.trc
Oracle DATABASE 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
WITH the Partitioning, REAL Application Clusters, Automatic Storage Management, OLAP,
DATA Mining AND REAL Application Testing options
ORACLE_HOME = /u01/app/oracle/product/11.2.0.3/dbhome_1
System name:    Linux
Node name:      odaenmo2
Release:        2.6.32-300.11.1.el5uek
Version:        #1 SMP Wed Mar 7 09:55:50 EST 2012
Machine:        x86_64
Instance name: odaenmo2
Redo thread mounted BY this instance: 2
Oracle process NUMBER: 5
Unix process pid: 16906, image: oracle@odaenmo2 (GEN0)
*** 2012-08-14 02:37:13.249
*** SESSION ID:(3131.1) 2012-08-14 02:37:13.249
*** CLIENT ID:() 2012-08-14 02:37:13.249
*** SERVICE NAME:(SYS$BACKGROUND) 2012-08-14 02:37:13.249
*** MODULE NAME:() 2012-08-14 02:37:13.249
*** ACTION NAME:() 2012-08-14 02:37:13.249
Dump continued FROM file: /u01/app/oracle/diag/rdbms/odaenmo/odaenmo2/trace/odaenmo2_gen0_16906.trc
ORA-00700: 软内部错误, 参数: [kcbl2vfyfh_action], [db_flash_cache_file integrity CHECK failed], [], [], [], [], [], [], [], [], [], []
========= Dump FOR incident 416041 (ORA 700 [kcbl2vfyfh_action]) ========
----- Beginning of Customized Incident Dump(s) -----
Flash cache parameter expected: db_unique_name odaenmo, dbid 196211168 inst_id 2, bpid 3, bsz 8192
File header db_unique_name odaenmo, dbid 196211168, inst_id 1, bpid 3, bsz 8192 filetype_str "Oracle RDBMS Flash Cache File"
----- End of Customized Incident Dump(s) -----
*** 2012-08-14 02:37:13.284
dbkedDefDump(): Starting incident DEFAULT dumps (flags=0x2, level=3, mask=0x0)
----- SQL Statement (None) -----
CURRENT SQL information unavailable - no cursor.
----- Call Stack Trace -----
calling              CALL     entry                argument VALUES IN hex      
location             TYPE     point                (? means dubious VALUE)     
-------------------- -------- -------------------- ----------------------------
skdstdst()+36        CALL     kgdsdst()            000000000 ? 000000000 ?
                                                   7FFF7312DDF8 ? 000000001 ?
                                                   000000001 ? 000000002 ?
ksedst1()+98         CALL     skdstdst()           000000000 ? 000000000 ?
                                                   7FFF7312DDF8 ? 000000001 ?
                                                   000000000 ? 000000002 ?
ksedst()+34          CALL     ksedst1()            000000000 ? 000000001 ?
                                                   7FFF7312DDF8 ? 000000001 ?
                                                   000000000 ? 000000002 ?
dbkedDefDump()+2741  CALL     ksedst()             000000000 ? 000000001 ?
                                                   7FFF7312DDF8 ? 000000001 ?
                                                   000000000 ? 000000002 ?
ksedmp()+36          CALL     dbkedDefDump()       000000003 ? 000000002 ?
                                                   7FFF7312DDF8 ? 000000001 ?
                                                   000000000 ? 000000002 ?
ksfdmp()+64          CALL     ksedmp()             000000003 ? 000000002 ?
                                                   7FFF7312DDF8 ? 000000001 ?
                                                   000000000 ? 000000002 ?
dbgexPhaseII()+1764  CALL     ksfdmp()             000000003 ? 000000002 ?
                                                   7FFF7312DDF8 ? 000000001 ?
                                                   000000000 ? 000000002 ?
dbgexExplicitEndInc  CALL     dbgexPhaseII()       7F6850206710 ? 7F68502118C0 ?
()+750                                             7FFF73139930 ? 000000001 ?
                                                   000000000 ? 000000002 ?
dbgeEndDDEInvocatio  CALL     dbgexExplicitEndInc  7F6850206710 ? 7F68502118C0 ?
nImpl()+767                   ()                   7FFF73139930 ? 000000001 ?
                                                   000000000 ? 000000002 ?
dbgeEndDDEInvocatio  CALL     dbgeEndDDEInvocatio  7F6850206710 ? 7F68502118C0 ?
n()+47                        nImpl()              7FFF73139930 ? 000000001 ?
                                                   000000000 ? 000000002 ?
kcbl2vfyfh_action()  CALL     dbgeEndDDEInvocatio  7F6850206710 ? 7F68502118C0 ?
+3373                         n()                  7FFF73139930 ? 000000001 ?
                                                   000000000 ? 000000002 ?
ksbabs()+771         CALL     kcbl2vfyfh_action()  7FFF7313B168 ? 000000018 ?
                                                   7FFF73139930 ? 000000001 ?
                                                   000000000 ? 000000002 ?
ksbrdp()+971         CALL     ksbabs()             7FFF7313B168 ? 000000018 ?
                                                   7FFF73139930 ? 000000001 ?
                                                   000000000 ? 000000002 ?
opirip()+618         CALL     ksbrdp()             7FFF7313B168 ? 000000018 ?
                                                   7FFF73139930 ? 000000001 ?
                                                   000000000 ? 000000002 ?
opidrv()+598         CALL     opirip()             000000032 ? 000000004 ?
                                                   7FFF7313C328 ? 000000001 ?
                                                   000000000 ? 000000002 ?
sou2o()+98           CALL     opidrv()             000000032 ? 000000004 ?
                                                   7FFF7313C328 ? 000000001 ?
                                                   000000000 ? 000000002 ?
opimai_real()+261    CALL     sou2o()              7FFF7313C300 ? 000000032 ?
                                                   000000004 ? 7FFF7313C328 ?
                                                   000000000 ? 000000002 ?
ssthrdmain()+252     CALL     opimai_real()        000000000 ? 7FFF7313C4F0 ?
                                                   000000004 ? 7FFF7313C328 ?
                                                   000000000 ? 000000002 ?
main()+196           CALL     ssthrdmain()         000000003 ? 7FFF7313C4F0 ?
                                                   000000001 ? 000000000 ?
                                                   000000000 ? 000000002 ?
__libc_start_main()  CALL     main()               000000003 ? 7FFF7313C690 ?
+244                                               000000001 ? 000000000 ?
                                                   000000000 ? 000000002 ?
_start()+36          CALL     __libc_start_main()  000A0B188 ? 000000001 ?
                                                   7FFF7313C688 ? 000000000 ?
                                                   000000000 ? 000000003 ?
--------------------- Binary Stack Dump ---------------------

从实例2上的详细TRACE文件,可以很轻易的分析出导致这个错误的原因。FLASH CACHE参数期望当前是实例2,但是从FLASH CACHE配置的文件读取,发现文件中记录的是实例1,导致这个错误的原因在于FLASH CACHE配置的文件是各个实例独立的,因此不能配置一个共享的文件给两个实例同时使用。
这个问题还可以通过检查FLASH CACHE的统计信息发现:

SQL> SELECT * FROM v$sysstat WHERE name LIKE '%flash cache eviction%';
STATISTIC# NAME                                         CLASS      VALUE    STAT_ID
---------- --------------------------------------- ---------- ---------- ----------
        97 flash cache eviction: invalidated                8          1 3647825193
        98 flash cache eviction: buffer pinned              8          0 3134415242
        99 flash cache eviction: aged OUT                   8          0 3307935113

将FLASH CACHE的配置修改如下:

SQL> SHOW parameter flash
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
db_flash_cache_file                  string      +REDO/flash/abc.flash
db_flash_cache_size                  big INTEGER 60G
db_flashback_retention_target        INTEGER     1440
SQL> ALTER system SET db_flash_cache_size = 30g scope = spfile;
System altered.
SQL> ALTER system SET db_flash_cache_file = '+REDO/flash/odademo1.flash' scope = spfile sid = 'odaenmo1';
System altered.
SQL> ALTER system SET db_flash_cache_file = '+REDO/flash/odademo2.flash' scope = spfile sid = 'odaenmo2';
System altered.

设置正确的FLASH CACHE后,重启数据库,ORA-700的错误不再出现。

Posted in ORACLE | Tagged , , , , , , , , | Leave a comment

云和恩墨一周年年会

时间过得真快,转眼云和恩墨已经一岁了,今天公司全体员工在承德举行了年会。
从加入到恩墨科技到云和恩墨的成立,一直到今天公司的一周年的年会,公司从不到10人,发展到40人的队伍,可以说公司的成长还是非常快的。
快速的发展本身就可能会带来这样或那样的问题,更何况对于一个刚刚成立一年公司。这次年会除了总结去年业绩,制定新的战略目标外的一个主要目的就是希望发挥全体员工的智慧来发现公司制度上、策略上甚至是方向上的各种问题。
感谢所有员工去年一年的努力,也希望新的一年我们可以更加努力,实现个人和公司的共同快速成长。

Posted in NEWS | Leave a comment

ORA-4030、ORA-27300错误和大量ORA-7445错误

客户11.2 RAC环境出现大量的ORA-7445错误、ORA-4030错误以及ORA-27300、ORA-27301和ORA-27302错误。
错误信息为:

Sat DEC 31 02:33:18 2011
Errors IN file /app/diag/rdbms/orcl/orcl1/trace/orcl1_p015_28613.trc:
ORA-07445: exception encountered: core dump [_ndoprnt()+4] [SIGSEGV] [ADDR:0xFFFFFFFF7FFE5FE0] [PC:0xFFFFFFFF7ADA92B0] [Address NOT mapped TO object] []
ORA-04030: OUT OF process memory WHEN trying TO allocate 254488 bytes (QERGH hash-agg,kllcqas:kllsltba)
Errors IN file /app/diag/rdbms/orcl/orcl1/trace/orcl1_p018_28624.trc:
ORA-07445: exception encountered: core dump [_ndoprnt()+4] [SIGSEGV] [ADDR:0xFFFFFFFF7FFE5FE0] [PC:0xFFFFFFFF7ADA92B0] [Address NOT mapped TO object] []
ORA-04030: OUT OF process memory WHEN trying TO allocate 254488 bytes (QERGH hash-agg,kllcqas:kllsltba)
Exception [TYPE: SIGSEGV, Address NOT mapped TO object] [ADDR:0xFFFFFFFF7FFE5FE0] [PC:0xFFFFFFFF7ADA92B0, _ndoprnt()+4] [flags: 0x0, COUNT: 1]Exception [TYPE
: SIGSEGV, Address NOT mapped TO object] [ADDR:0xFFFFFFFF7FFE5EF0] [PC:0xFFFFFFFF7ADA92B0, _ndoprnt()+4] [flags: 0x0, COUNT: 1]
Errors IN file /app/diag/rdbms/orcl/orcl1/trace/orcl1_p019_28626.trc:
ORA-07445: exception encountered: core dump [_ndoprnt()+4] [SIGSEGV] [ADDR:0xFFFFFFFF7FFE5FE0] [PC:0xFFFFFFFF7ADA92B0] [Address NOT mapped TO object] []
ORA-04030: OUT OF process memory WHEN trying TO allocate 254488 bytes (QERGH hash-agg,kllcqas:kllsltba)
.
.
.
Errors IN file /app/diag/rdbms/orcl/orcl1/trace/orcl1_p017_28617.trc:
ORA-07445: exception encountered: core dump [_ndoprnt()+4] [SIGSEGV] [ADDR:0xFFFFFFFF7FFE5EF0] [PC:0xFFFFFFFF7ADA92B0] [Address NOT mapped TO object] []
ORA-04030: OUT OF process memory WHEN trying TO allocate 254488 bytes (QERGH hash-agg,kllcqas:kllsltba)
Sat DEC 31 02:33:21 2011
Errors IN file /app/diag/rdbms/orcl/orcl1/trace/orcl1_ora_26985.trc (incident=385310):
ORA-04030: 在尝试分配 123416 字节 (QERHJ hash-joi,kllcqas:kllsltba) 时进程内存不足
Incident details IN: /app/diag/rdbms/orcl/orcl1/incident/incdir_385310/orcl1_ora_26985_i385310.trc
USE ADRCI OR Support Workbench TO package the incident.
See Note 411.1 at My Oracle Support FOR error AND packaging details.
USE ADRCI OR Support Workbench TO package the incident.
See Note 411.1 at My Oracle Support FOR error AND packaging details.
USE ADRCI OR Support Workbench TO package the incident.
See Note 411.1 at My Oracle Support FOR error AND packaging details.
USE ADRCI OR Support Workbench TO package the incident.
See Note 411.1 at My Oracle Support FOR error AND packaging details.
USE ADRCI OR Support Workbench TO package the incident.
See Note 411.1 at My Oracle Support FOR error AND packaging details.
USE ADRCI OR Support Workbench TO package the incident.
See Note 411.1 at My Oracle Support FOR error AND packaging details.
Sat DEC 31 02:33:24 2011
Exception [TYPE: SIGSEGV, Address NOT mapped TO object] [ADDR:0xFFFFFFFF7FFE7088] [PC:0x10780A9C8, sskgds_init()+8] [flags: 0x0, COUNT: 1]
Errors IN file /app/diag/rdbms/orcl/orcl1/trace/orcl1_p014_28611.trc:
ORA-07445: exception encountered: core dump [sskgds_init()+8] [SIGSEGV] [ADDR:0xFFFFFFFF7FFE7088] [PC:0x10780A9C8] [Address NOT mapped TO object] []
ORA-04030: OUT OF process memory WHEN trying TO allocate 254488 bytes (QERGH hash-agg,kllcqas:kllsltba)
Sat DEC 31 02:33:24 2011
Exception [TYPE: SIGSEGV, Address NOT mapped TO object] [ADDR:0xFFFFFFFF7FFE7088] [PC:0x10780A9C8, sskgds_init()+8] [flags: 0x0, COUNT: 1]
Errors IN file /app/diag/rdbms/orcl/orcl1/trace/orcl1_p023_28634.trc:
ORA-07445: exception encountered: core dump [sskgds_init()+8] [SIGSEGV] [ADDR:0xFFFFFFFF7FFE7088] [PC:0x10780A9C8] [Address NOT mapped TO object] []
ORA-04030: OUT OF process memory WHEN trying TO allocate 254488 bytes (QERGH hash-agg,kllcqas:kllsltba)
Sat DEC 31 02:33:24 2011
Exception [TYPE: SIGSEGV, Address NOT mapped TO object] [ADDR:0xFFFFFFFF7FFE7088] [PC:0x10780A9C8, sskgds_init()+8] [flags: 0x0, COUNT: 1]
Errors IN file /app/diag/rdbms/orcl/orcl1/trace/orcl1_p020_28628.trc:
ORA-07445: exception encountered: core dump [sskgds_init()+8] [SIGSEGV] [ADDR:0xFFFFFFFF7FFE7088] [PC:0x10780A9C8] [Address NOT mapped TO object] []
ORA-04030: OUT OF process memory WHEN trying TO allocate 254488 bytes (QERGH hash-agg,kllcqas:kllsltba)
Sat DEC 31 02:33:24 2011
Exception [TYPE: SIGSEGV, Address NOT mapped TO object] [ADDR:0xFFFFFFFF7FFE6F98] [PC:0x10780A9C8, sskgds_init()+8] [flags: 0x0, COUNT: 1]
Errors IN file /app/diag/rdbms/orcl/orcl1/trace/orcl1_p016_28615.trc:
ORA-07445: exception encountered: core dump [sskgds_init()+8] [SIGSEGV] [ADDR:0xFFFFFFFF7FFE6F98] [PC:0x10780A9C8] [Address NOT mapped TO object] []
ORA-04030: OUT OF process memory WHEN trying TO allocate 254488 bytes (QERGH hash-agg,kllcqas:kllsltba)
.
.
.
Errors IN file /app/diag/rdbms/orcl/orcl1/trace/orcl1_p022_28632.trc:
ORA-07445: exception encountered: core dump [sskgds_init()+8] [SIGSEGV] [ADDR:0xFFFFFFFF7FFE6F98] [PC:0x10780A9C8] [Address NOT mapped TO object] []
ORA-04030: OUT OF process memory WHEN trying TO allocate 254488 bytes (QERGH hash-agg,kllcqas:kllsltba)
Exception [TYPE: SIGSEGV, Address NOT mapped TO object] [ADDR:0xFFFFFFFF7FFC6CF8] [PC:0x107388144, kgdsdst()+36] [flags: 0x0, COUNT: 1]
Errors IN file /app/diag/rdbms/orcl/orcl1/trace/orcl1_ora_26985.trc (incident=385311):
ORA-07445: 出现异常错误: 核心转储 [kgdsdst()+36] [SIGSEGV] [ADDR:0xFFFFFFFF7FFC6CF8] [PC:0x107388144] [Address NOT mapped TO object] []
ORA-04030: 在尝试分配 123416 字节 (QERHJ hash-joi,kllcqas:kllsltba) 时进程内存不足
Incident details IN: /app/diag/rdbms/orcl/orcl1/incident/incdir_385311/orcl1_ora_26985_i385311.trc
USE ADRCI OR Support Workbench TO package the incident.
See Note 411.1 at My Oracle Support FOR error AND packaging details.
USE ADRCI OR Support Workbench TO package the incident.
See Note 411.1 at My Oracle Support FOR error AND packaging details.
USE ADRCI OR Support Workbench TO package the incident.
See Note 411.1 at My Oracle Support FOR error AND packaging details.
USE ADRCI OR Support Workbench TO package the incident.
See Note 411.1 at My Oracle Support FOR error AND packaging details.
USE ADRCI OR Support Workbench TO package the incident.
See Note 411.1 at My Oracle Support FOR error AND packaging details.
USE ADRCI OR Support Workbench TO package the incident.
See Note 411.1 at My Oracle Support FOR error AND packaging details.
USE ADRCI OR Support Workbench TO package the incident.
See Note 411.1 at My Oracle Support FOR error AND packaging details.
USE ADRCI OR Support Workbench TO package the incident.
See Note 411.1 at My Oracle Support FOR error AND packaging details.
Sat DEC 31 02:33:32 2011
Dumping diagnostic DATA IN directory=[cdmp_20111231023332], requested BY (instance=1, osid=26985), summary=[incident=385310].
Sat DEC 31 02:33:37 2011
Process m000 died, see its trace file
Sat DEC 31 02:34:35 2011
Process startup failed, error stack:
Sat DEC 31 02:34:36 2011
Errors IN file /app/diag/rdbms/orcl/orcl1/trace/orcl1_ora_3787.trc (incident=386654):
ORA-04030: 在尝试分配 32792 字节 (QERHJ hash-joi,QERHJ Bit vector) 时进程内存不足
Incident details IN: /app/diag/rdbms/orcl/orcl1/incident/incdir_386654/orcl1_ora_3787_i386654.trc
Errors IN file /app/diag/rdbms/orcl/orcl1/trace/orcl1_psp0_46.trc:
ORA-27300: OS system dependent operation:fork failed WITH STATUS: 12
ORA-27301: OS failure message: NOT enough SPACE
ORA-27302: failure occurred at: skgpspawn3
Sat DEC 31 02:34:38 2011
Process m000 died, see its trace file
USE ADRCI OR Support Workbench TO package the incident.
See Note 411.1 at My Oracle Support FOR error AND packaging details.
Process startup failed, error stack:
Errors IN file /app/diag/rdbms/orcl/orcl1/trace/orcl1_psp0_46.trc:
ORA-27300: OS system dependent operation:fork failed WITH STATUS: 12
ORA-27301: OS failure message: NOT enough SPACE
ORA-27302: failure occurred at: skgpspawn3
Sat DEC 31 02:34:43 2011
Process J000 died, see its trace file
kkjcre1p: unable TO spawn jobq slave process 
Errors IN file /app/diag/rdbms/orcl/orcl1/trace/orcl1_cjq0_2549.trc:
Process startup failed, error stack:
Errors IN file /app/diag/rdbms/orcl/orcl1/trace/orcl1_psp0_46.trc:
ORA-27300: OS system dependent operation:fork failed WITH STATUS: 12
ORA-27301: OS failure message: NOT enough SPACE
ORA-27302: failure occurred at: skgpspawn3
Process J000 died, see its trace file
kkjcre1p: unable TO spawn jobq slave process 
Errors IN file /app/diag/rdbms/orcl/orcl1/trace/orcl1_cjq0_2549.trc:
Sat DEC 31 02:34:47 2011
Exception [TYPE: SIGSEGV, Address NOT mapped TO object] [ADDR:0xFFFFFFFF7FFDE000] [PC:0x1008FD910, ksmpgd_init_ctx()+16] [flags: 0x0, COUNT: 1]
ERROR: Unable TO normalize symbol name FOR the following short stack (at offset 210):
dbgexProcessError()+164<-dbgePostErrorKGE()+1320<-dbkePostKGE_kgsf()+44<-kgeade()+416<-kgerelv()+144<-kgerev()+56<-kgerec5()+32<-ssexhd()+1448<-__sighndlr()+
12<-call_user_handler()+992<-ksmpgd_init_ctx()+16<-_$c1A.ksmpgd_scan_pga_heaps_local_detail()+24<-ksm_4030_dump_internal()+208<-ksmarfg()+912<-kghnospc()+429
2<-kghalf()+1416<-qerhjInitializeManagementComponents()+1688<-qerhjFetch()+996<-rwsfcd()+108<-qerhjFetch()+460<-opifch2()+10836<-kpoal8()+3228<-opiodr()+1428
<-ttcpip()+1056<-opitsk()+1528<-opiino()+1000<-opiodr()+1428<-opidrv()+1100<-sou2o()+92<-opimai_real()+304<-ssthrdmain()+320<-main()+308<-_start()+380
Errors IN file /app/diag/rdbms/orcl/orcl1/trace/orcl1_ora_3787.trc (incident=386655):
ORA-07445: 出现异常错误: 核心转储 [ksmpgd_init_ctx()+16] [SIGSEGV] [ADDR:0xFFFFFFFF7FFDE000] [PC:0x1008FD910] [Address NOT mapped TO object] []
ORA-04030: 在尝试分配 32792 字节 (QERHJ hash-joi,QERHJ Bit vector) 时进程内存不足
Incident details IN: /app/diag/rdbms/orcl/orcl1/incident/incdir_386655/orcl1_ora_3787_i386655.trc
USE ADRCI OR Support Workbench TO package the incident.
See Note 411.1 at My Oracle Support FOR error AND packaging details.
Sat DEC 31 02:34:53 2011
Process startup failed, error stack:
Errors IN file /app/diag/rdbms/orcl/orcl1/trace/orcl1_psp0_46.trc:
ORA-27300: OS system dependent operation:fork failed WITH STATUS: 12
ORA-27301: OS failure message: NOT enough SPACE
ORA-27302: failure occurred at: skgpspawn3
Process J000 died, see its trace file
kkjcre1p: unable TO spawn jobq slave process 
Errors IN file /app/diag/rdbms/orcl/orcl1/trace/orcl1_cjq0_2549.trc:
Process startup failed, error stack:
Errors IN file /app/diag/rdbms/orcl/orcl1/trace/orcl1_psp0_46.trc:
ORA-27300: OS system dependent operation:fork failed WITH STATUS: 12
ORA-27301: OS failure message: NOT enough SPACE
ORA-27302: failure occurred at: skgpspawn3
.
.
.

可以看到,整个过程数据库经历了四个7445错误分别为:ORA-7445[_ndoprnt]、ORA-7445[sskgds_init]、ORA-7445[kgdsdst]和ORA-7445[ksmpgd_init_ctx],而这四个ORA-7445显然都是ORA-4030错误导致的。而导致ORA-4030错误的原因,显然是随后操作系统错误ORA-27300、ORA-27301和ORA-27302错误导致的。
查询MOS发现和文档Database Crashes With ORA-04030 ORA-07445 ORA-27300 ORA-27301 ORA-27302 [ID 580552.1]描述的问题非常一致,而导致问题产生的原因和刚才分析的一样,由于系统SWAP空间不足导致了数据库的内存分配报错。
释放SWAP或/tmp下的空间,可以避免错误的出现。

Posted in BUG | Tagged , , , , , , , , , , , | Leave a comment

ASM实例出现ORA-4031错误导致实例崩溃

数据库的ASM实例出现了ORA-4031错误,导致了数据库实例的崩溃。
详细的错误信息为:

Wed Jul 25 08:03:45 2012
Errors IN file /u01/app/oracle/diag/rdbms/orcl/ORCL1/trace/ORCL1_asmb_34668994.trc (incident=808023):
ORA-04031: unable TO allocate 3432 bytes OF shared memory ("shared pool","unknown object","sga heap(1,0)","ASM file")
Incident details IN: /u01/app/oracle/diag/rdbms/orcl/ORCL1/incident/incdir_808023/ORCL1_asmb_34668994_i808023.trc
Wed Jul 25 08:03:47 2012
Trace dumping IS performing id=[cdmp_20120725080347]
Errors IN file /u01/app/oracle/diag/rdbms/orcl/ORCL1/trace/ORCL1_asmb_34668994.trc:
ORA-15064: communication failure WITH ASM instance
ORA-04031: unable TO allocate 3432 bytes OF shared memory ("shared pool","unknown object","sga heap(1,0)","ASM file")
ASMB (ospid: 34668994): terminating the instance due TO error 15064
Wed Jul 25 08:03:47 2012
ORA-1092 : opiodr aborting process UNKNOWN ospid (29819058_1)
Wed Jul 25 08:03:48 2012
Errors IN file /u01/app/oracle/diag/rdbms/orcl/ORCL1/trace/ORCL1_q002_25886884.trc:
ORA-15064: communication failure WITH ASM instance
Wed Jul 25 08:03:47 2012
Errors IN file /u01/app/oracle/diag/rdbms/orcl/ORCL1/trace/ORCL1_arc3_36503776.trc:
ORA-15064: communication failure WITH ASM instance
Wed Jul 25 08:03:47 2012
Errors IN file /u01/app/oracle/diag/rdbms/orcl/ORCL1/trace/ORCL1_q001_32374808.trc:
ORA-15064: communication failure WITH ASM instance
Errors IN file /u01/app/oracle/diag/rdbms/orcl/ORCL1/trace/ORCL1_arc3_36503776.trc:
ORA-15064: communication failure WITH ASM instance
Master archival failure: 15064
Wed Jul 25 08:03:48 2012
ORA-1092 : opiodr aborting process UNKNOWN ospid (12058986_1)
Wed Jul 25 08:03:49 2012
ORA-1092 : opitsk aborting process
Wed Jul 25 08:03:49 2012
License high water mark = 334
Instance TERMINATED BY ASMB, pid = 34668994
USER (ospid: 35914002): terminating the instance
Instance TERMINATED BY USER, pid = 35914002

显然数据库实例的出现的崩溃是由于ASM实例的错误导致的。根据Oracle的推荐,在11g中使用MEMORY_TARGET参数的话,那么MEMORY_TARGET至少应该分配256M以上,而且随着DISK GROUP的容量的增长,SHARED_POOL所需要的空间也随着增加。因此对于磁盘组空间比较大的ASM实例而言,采用默认参数作为MEMORY_TARGET显然是不够的。
对于当前环境,只需要简单的调整ASM实例的MEMORY_TARGET参数,将其扩大到500M以上,就可以避免该错误的产生。

Posted in BUG | Tagged , , , , , , , , | Leave a comment

数据泵导出出现ORA-4030错误

客户Oracle 10.2.0.3环境,在导出是碰到ORA-4030错误。
错误信息为:

Processing object TYPE SCHEMA_EXPORT/TABLE/CONSTRAINT/REF_CONSTRAINT
Processing object TYPE SCHEMA_EXPORT/TABLE/TRIGGER
Processing object TYPE SCHEMA_EXPORT/TABLE/STATISTICS/TABLE_STATISTICS
Processing object TYPE SCHEMA_EXPORT/POST_SCHEMA/PROCOBJ
ORA-39014: One OR more workers have prematurely exited.
ORA-39029: worker 1 WITH process name “DW01” prematurely TERMINATED
ORA-31671: Worker process DW01 had an unhandled exception.
ORA-04030: OUT OF process memory WHEN trying TO allocate 4108 bytes (PLS non-lib hp.pdzgM60_Make)
ORA-06512: at “SYS.KUPC$QUEUE_INT”, line 277
ORA-06512: at “SYS.KUPW$WORKER”, line 1311
ORA-04030: OUT OF process memory WHEN trying TO allocate 65036 bytes (kxs-heap-c,KQL tmpbuf) 
ORA-06512: at “SYS.KUPW$WORKER”, line 13249
ORA-06512: at “SYS.KUPW$WORKER”, line 14923
ORA-06512: at “SYS.KUPW$WORKER”, line 6293
ORA-39125: Worker unexpected fatal error IN KUPW$WORKER.DISPATCH_WORK_ITEMS while calling DBMS_METADATA.FETCH_XML_CLOB [PROCOBJ:“USER1”.“PROC_SCENAR2”]
ORA-06512: at “SYS.KUPW$WORKER”, line 6988
ORA-04030: OUT OF process memory WHEN trying TO allocate 4108 bytes (PLS non-lib hp.pdzgM60_Make)
ORA-04030: OUT OF process memory WHEN trying TO allocate 16396 bytes (koh-kghu sessi,pl/SQL vc2)
ORA-06512: at “SYS.KUPF$FILE”, line 2901
ORA-06512: at “SYS.KUPW$WORKER”, line 2301
ORA-04030: OUT OF process memory WHEN trying TO allocate 16396 bytes (koh-kghu sessi,pl/SQL vc2)
ORA-06512: at “SYS.KUPW$WORKER”, line 2
Job “USER1”.“SYS_EXPORT_SCHEMA_01” stopped due TO fatal error at 10:50:29

Oracle在导出数据字典元数据时出现了异常,这显然属于Oracle的bug。查询MOS,发现大量类似的描述,其中以DataPump Export (EXPDP) Terminates With Error ORA-4030 [ID 1368462.1]的描述最为接近。
Oracle给出了三种接近方案,可以设置隐含参数_use_realfree_heap=TRUE和 _realfree_heap_pagesize_hint=262144,这是避免PL/SQL程序导致ORA-4030错误的;另外就是在EXPDP的时候指定EXCLUDE=PROCACT_SYSTEM或EXCLUDE=PROCACT_SCHEMA,对当前的情况而言,这个方法应该是最适合的;最后一个方法是设置操作系统内核参数vm.max_map_count=300000。

Posted in BUG | Tagged , , , , , , , , , , , , , | Leave a comment

11g新增FLASH CACHE功能

在11.2中,Oracle提供了设置FLASH CACHE的功能,使得数据库可以自动使用配置好的FLASH CACHE作为二级缓存。
这个功能其实是EXADATA引入的,这也是EXADATA提高IO性能的又一利器。不过即使不是EXADATA,在11.2中也可以设置该功能,且这个功能的设置并不复杂。
11.2中提供了两个参数来设置FLASH CACHE:

SQL> SHOW parameter flash_cache
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
db_flash_cache_file                  string
db_flash_cache_size                  big INTEGER 0

其中DB_FLASH_CACHE_SIZE用来设置FLASH CACHE的大小,而DB_FLASH_CACHE_FILE设置文件的位置。
在操作系统上将FLASH CACHE挂成裸设备,然后添加到单独的ASM磁盘组中或直接挂载到操作系统上,然后通过DB_FLASH_CACHE_FILE指定ASM或操作系统目录下的文件既可。
需要注意的是,DB_FLASH_CACHE_FILE不像其他参数一样,对于ASM只需要指定磁盘组的名称既可,而必须通过手工的方式在ASM磁盘组上建立对应的目录,在设置参数的过程中,目录并不会自动创建,不过指定的文件名并不需要存在,Oracle会根据DB_FLASH_CACHE_SIZE的大小自动创建文件。

SQL> ALTER system SET db_flash_cache_size = 30g scope = spfile;
System altered.
SQL> ALTER system SET db_flash_cache_file = '+REDO/flash/odademo1.flash' scope = spfile sid = 'odaenmo1';
System altered.
SQL> ALTER system SET db_flash_cache_file = '+REDO/flash/odademo2.flash' scope = spfile sid = 'odaenmo2';
System altered.

需要注意,对于RAC而言,各个节点使用独立的FLASH CACHE文件,因此需要单独设置。

Posted in ORACLE | Tagged , , , | Leave a comment

ORA-600(ksqsgn:join)和ORA-7445(BREAKPOINT)错误

一个Windows环境下的RAC 10.2.0.4在添加节点时,原节点的ASM实例出现这两个错误。
错误信息如下:

Sat Jun 09 11:24:21 2012
 Starting ORACLE instance (normal)
 LICENSE_MAX_SESSION = 0
 LICENSE_SESSIONS_WARNING = 0
 Interface TYPE 1 GB2 192.168.7.0 configured FROM OCR FOR USE AS a cluster interconnect
 Interface TYPE 1 GB1 172.16.7.0 configured FROM OCR FOR USE AS a public interface
 Picked latch-free SCN scheme 3
 USING LOG_ARCHIVE_DEST_1 parameter DEFAULT VALUE AS C:\app\oracle\product\10.2.0\db_1\RDBMS
 Autotune OF undo retention IS turned off. 
LICENSE_MAX_USERS = 0
 SYS auditing IS disabled
 ksdpec: called FOR event 13740 prior TO event GROUP initialization
 Starting up ORACLE RDBMS Version: 10.2.0.4.0.
 System parameters WITH non-DEFAULT VALUES:
 large_pool_size = 12582912
 instance_type = asm
 cluster_database = TRUE
 instance_number = 1
 remote_login_passwordfile= EXCLUSIVE
 background_dump_dest = C:\APP\ORACLE\PRODUCT\10.2.0\ADMIN\+ASM\BDUMP
 user_dump_dest = C:\APP\ORACLE\PRODUCT\10.2.0\ADMIN\+ASM\UDUMP
 core_dump_dest = C:\APP\ORACLE\PRODUCT\10.2.0\ADMIN\+ASM\CDUMP
 ifile = C:\app\oracle\product\10.2.0\admin\+ASM\pfile\init.ora
 asm_diskgroups = DATA
 Cluster communication IS configured TO USE the following interface(s) FOR this instance
 172.16.7.213
 Sat Jun 09 11:24:21 2012
 cluster interconnect IPC version:Oracle 9i Winsock2 TCP/IP IPC
 IPC Vendor 0 proto 0
 Version 0.0
 PMON started WITH pid=2, OS id=6744
 DIAG started WITH pid=3, OS id=1504
 PSP0 started WITH pid=4, OS id=4216
 LMON started WITH pid=5, OS id=5304
 LMD0 started WITH pid=6, OS id=4164
 LMS0 started WITH pid=7, OS id=4132
 MMAN started WITH pid=8, OS id=5868
 DBW0 started WITH pid=9, OS id=5412
 LGWR started WITH pid=10, OS id=4212
 CKPT started WITH pid=11, OS id=7132
 SMON started WITH pid=12, OS id=6192
 RBAL started WITH pid=13, OS id=5924
 GMON started WITH pid=14, OS id=6900
 Sat Jun 09 11:24:47 2012
 lmon registered WITH NM - instance id 1 (internal mem no 0)
 Sat Jun 09 11:26:24 2012
 Error: KGXGN polling error (15)
 Sat Jun 09 11:26:24 2012
 Errors IN file c:\app\oracle\product\10.2.0\admin\+asm\bdump\+asm1_lmon_5304.trc:
 ORA-29702: error occurred IN Cluster GROUP Service operation
 
LMON: terminating instance due TO error 29702
 Sat Jun 09 11:26:24 2012
 Errors IN file c:\app\oracle\product\10.2.0\admin\+asm\udump\+asm1_ora_3564.trc:
 ORA-00600: internal error code, arguments: [ksqsgn:JOIN], [error IN lmon process], [32], [], [], [], [], []
 
Sat Jun 09 11:26:24 2012
 Errors IN file c:\app\oracle\product\10.2.0\admin\+asm\bdump\+asm1_gmon_6900.trc:
 ORA-29702: error occurred IN Cluster GROUP Service operation
 
Sat Jun 09 11:26:24 2012
 Errors IN file c:\app\oracle\product\10.2.0\admin\+asm\bdump\+asm1_smon_6192.trc:
 ORA-29702: error occurred IN Cluster GROUP Service operation
 
Sat Jun 09 11:26:24 2012
 Errors IN file c:\app\oracle\product\10.2.0\admin\+asm\bdump\+asm1_ckpt_7132.trc:
 ORA-29702: error occurred IN Cluster GROUP Service operation
 
Sat Jun 09 11:26:24 2012
 Errors IN file c:\app\oracle\product\10.2.0\admin\+asm\bdump\+asm1_lgwr_4212.trc:
 ORA-29702: error occurred IN Cluster GROUP Service operation
 
Sat Jun 09 11:26:24 2012
 Errors IN file c:\app\oracle\product\10.2.0\admin\+asm\bdump\+asm1_dbw0_5412.trc:
 ORA-29702: error occurred IN Cluster GROUP Service operation
 
Sat Jun 09 11:26:25 2012
 Errors IN file c:\app\oracle\product\10.2.0\admin\+asm\bdump\+asm1_lmd0_4164.trc:
 ORA-29702: error occurred IN Cluster GROUP Service operation
 
Sat Jun 09 11:26:25 2012
 Errors IN file c:\app\oracle\product\10.2.0\admin\+asm\bdump\+asm1_lms0_4132.trc:
 ORA-29702: error occurred IN Cluster GROUP Service operation
 
Sat Jun 09 11:26:25 2012
 Errors IN file c:\app\oracle\product\10.2.0\admin\+asm\bdump\+asm1_mman_5868.trc:
 ORA-29702: error occurred IN Cluster GROUP Service operation
 
Sat Jun 09 11:26:25 2012
 Errors IN file c:\app\oracle\product\10.2.0\admin\+asm\bdump\+asm1_psp0_4216.trc:
 ORA-29702: error occurred IN Cluster GROUP Service operation
 
Sat Jun 09 11:26:25 2012
 Errors IN file c:\app\oracle\product\10.2.0\admin\+asm\bdump\+asm1_pmon_6744.trc:
 ORA-29702: error occurred IN Cluster GROUP Service operation
 
Sat Jun 09 11:26:25 2012
 Errors IN file c:\app\oracle\product\10.2.0\admin\+asm\bdump\+asm1_rbal_5924.trc:
 ORA-29702: error occurred IN Cluster GROUP Service operation
 
Sat Jun 09 11:26:38 2012
 Errors IN file c:\app\oracle\product\10.2.0\admin\+asm\bdump\+asm1_diag_1504.trc:
 ORA-07445: exception encountered: core dump [BREAKPOINT] [unable_to_trans_pc] [PC:0x774F4EA0] [] [] []
 
Sat Jun 09 11:26:39 2012
 Instance TERMINATED BY LMON, pid = 5304

可以看到除了上述的ORA-600[ksqsgn:join]和ORA-7445[BREAKPOINT]错误外,还有一些明显的错误信息,包括ORA-29702、Error: KGXGN polling error (15)和Instance terminated by LMON等。

首先分析ORA-600错误,这个错误在metalink上都没有记录,虽然Windows环境下的RAC并不多见,但是如果这个错误在MOS中一个相关的记录都找不到,这种可能性并不太大。导致这种情况发生的原因一般不是当前版本或操作系统比较特殊,就是犯了一些低级的错误导致了一些平时很难发生的错误。如果根据错误函数名称和报错的第二参数,这个错误应该和之前的ORA-29702错误有直接的关系。

再看ORA-7445错误,这个错误还真找到一个相关度很高的bug:ORA-07445: Exception Encountered: Core Dump [Breakpoint] [Unable_to_trans_pc] [ID 1214796.1]。不过这个问题已经在Windows下的10.2.0.2 PATCH 9中进行了修复,理论上讲在10.2.0.4中不应该出现这个错误。

不过从7445错误的日志文件中找到了下列的信息:

*** 2012-06-09 11:24:47.376
 IPCConnect: unable TO CONNECT TO addr [+AS : 3420 : 2720 : 2314930], err 258
 IPCGetRequestInfo: failed a request rqh(0xd234bc0), TYPE(1), STATUS(2), bytes(0)
 Target port [node=1] IS no longer valid
 kjzgmapbcast:error encounter WHEN broadcasting

而前面分析了导致ORA-600错误的第一个ORA-29702错误的TRACE文件中,同样包含了相似的内容:

*** 2012-06-09 11:24:47.345
 IPCConnect: unable TO CONNECT TO addr [+AS : 3420 : 3212 : 2314930], err 258
 IPCGetRequestInfo: failed a request rqh(0x126b2da0), TYPE(1), STATUS(2), bytes(0)
 kjfcpiora: published my fusion master weight 5418
 kjfcpiora: publish my flogb 9
 kjxggpoll: CHANGE poll TIME TO 50 ms

这说明节点间的IPCS通信出现了异常。

再次检查上面的告警信息,发现了问题的根源,在启动ASM实例之前,Oracle根据CLUSTER的配置确定了192地址段为CLUSTER INTERCONNECT,而172地址段为PUBLIC INTERFACE,但是在实例启动后,确将172地址段作为CLUSTER通信的地址段,并最终导致了ORA-29702错误以及Error: KGXGN polling error (15)和CLUSTER通信超时有关的错误。

Windows上更改CLUSTER和PUBLIC网卡的配置需要重启生效,系统重启后,确认PRIVATE IP和PUBLIC IP设置无误后,启动CLUSTER和DB,问题消失。

Posted in ORACLE | Tagged , , , , , , , , , , , | Leave a comment