首页 > 数据库开发 > 一次oracle数据库数据块损坏的处理过程

一次oracle数据库数据块损坏的处理过程

1.1    报错ORA-600 [kcratr_nab_less_than_odr],不能启动

1,现象描述:服务器存储断电,导致数据库down机,再次尝试启动数据库,数据库不能正常启动,数据库报错如下

[sql] view plaincopy在CODE上查看代码片派生到我的代码片

  1. SQL> alter database open;
  2. alter database open
  3. *
  4. ERROR at line 1:
  5. ORA-00600: internal error code, arguments:[kcratr_nab_less_than_odr], [1], [2022], [523240], [523468], [], [], [], [],[], [], []

2,查看alert日志,数据库启动时,已开始日志恢复,但恢复时报错ORA-600,除了生成一个trc文件外,没有更详细的信息

[sql] view plaincopy在CODE上查看代码片派生到我的代码片

  1. Thu Feb 06 23:13:19 2014
  2. ALTER DATABASE OPEN
  3. Beginning crash recovery of 1 threads
  4.  parallel recovery started with 3 processes
  5. Started redo scan
  6. Completed redo scan
  7.  read1580 KB redo, 211 data blocks need recovery
  8. Errors in file/app/oracle/diag/rdbms/epm/epm1/trace/epm1_ora_4679.trc  (incident=375883):
  9. ORA-00600: internal error code, arguments:[kcratr_nab_less_than_odr], [1], [2022], [523240], [523468], [], [], [], [],[], [], []
  10. Incident details in:/app/oracle/diag/rdbms/epm/epm1/incident/incdir_375883/epm1_ora_4679_i375883.trc
  11. Aborting crash recovery due to error 600
  12. Errors in file/app/oracle/diag/rdbms/epm/epm1/trace/epm1_ora_4679.trc:
  13. ORA-00600: internal error code, arguments:[kcratr_nab_less_than_odr], [1], [2022], [523240], [523468], [], [], [], [],[], [], []
  14. Errors in file/app/oracle/diag/rdbms/epm/epm1/trace/epm1_ora_4679.trc:
  15. ORA-00600: internal error code, arguments:[kcratr_nab_less_than_odr], [1], [2022], [523240], [523468], [], [], [], [],[], [], []
  16. ORA-600 signalled during: ALTER DATABASEOPEN...
  17. Trace dumping is performingid=[cdmp_20140206231320]

3,查看trc文件,通过WARNING可以看出,数据库恢复时认为seq2022(对应日志文件)中的记录,应该恢复到523240,但是恢复到523468却被迫停止了。应该是控制文件和日志文件记录不完全一致导致,注意这几个数据正好对应到kcratr_nab_less_than_odr的几个参数:[kcratr_nab_less_than_odr],[1], [2022], [523240], [523468]

 

[plain] view plaincopy在CODE上查看代码片派生到我的代码片

  1. *** 2014-02-06 23:37:11.872
  2. Successfully allocated 3 recovery slaves
  3. Using 45 overflow buffers per recoveryslave
  4. Thread 1 checkpoint: logseq 2022, block 2,scn 9187726619852
  5.  cache-low rba: logseq 2022, block 520079
  6.    on-disk rba: logseq 2022, block 523468, scn 9187726672777
  7.  start recovery at logseq 2022, block 520079, scn 0
  8. *** 2014-02-06 23:37:11.890
  9. Started writing zeroblks thread 1 seq 2022blocks 523240-523247
  10. ----------------------------------------------
  11. WARNING! Crash recovery of thread 1 seq2022 is
  12. ending at redo block 523240 but should nothave ended before
  13. redo block 523468
  14. Incident 375884 created, dump file:/app/oracle/diag/rdbms/epm/epm1/incident/incdir_375884/epm1_ora_4971_i375884.trc
  15. ORA-00600: internal error code, arguments:[kcratr_nab_less_than_odr], [1], [2022], [523240], [523468], [], [], [], [],[], [], []
  16. ORA-00600: internal error code, arguments:[kcratr_nab_less_than_odr], [1], [2022], [523240], [523468], [], [], [], [],[], [], []
  17. ORA-00600: internal error code, arguments:[kcratr_nab_less_than_odr], [1], [2022], [523240], [523468], [], [], [], [],[], [], []

 

4,查询metalink,已有相关案例,主要是由于存储突然断电,导致向在线日志文件中写日志失败。

This Problem is caused by Storage Problemof the Database Files. The Subsystem (eg. SAN) crashed while the Database wasopen. The Database then crashed since the Database Files were not accessibleanymore. This caused a lost Write into the Online RedoLogs and so InstanceRecovery is not possible and raising the ORA-600.

官网中提供的解决办法是通过备份的控制文件恢复数据库,然后resetlog启动,当然如果没有备份,也可以通过参考文件中重建控制文件的方法。

[plain] view plaincopy在CODE上查看代码片派生到我的代码片

  1. 1. If you could restore your StorageEnvironment and the Online RedoLogs from the Time of the crash you can try amanual Recovery followed by a RESETLOGS:
  2.  SQL> startup mount;
  3.  SQL> recover database until cancel usingbackup controlfile;
  4.  -> manually provide Online RedoLogcontaining the last (current) Sequence when asked, eg.
  5.  ORA-00279: change 100000 generated atxx/xx/xxxx xx:xx:xx needed for thread 1
  6. ORA-00289: suggestion :
  7. /flash_recovery/archivelog/xxxx_xx_xx/o1_mf_1_100_%u_.arc
  8. ORA-00280: change 100000 for thread 1 is insequence #100
  9. Specify log: {<RET>=suggested |filename | AUTO | CANCEL}
  10. /ora/oradata/dbtest/redo04_1.rdo
  11. Log applied.
  12. Media recovery complete.
  13. SQL> alter database open resetlogs;

 

5,通过metalink上的解决方法,根据提示输入在线日志文件名,数据库却再次报错ORA-00600 [2662] [2139]: internal errorcode, arguments: [2662], [2139], [791626372], [2139],

[plain] view plaincopy在CODE上查看代码片派生到我的代码片

  1. [oracle@localhost oradata]$ sqlplus"/as sysdba"
  2.  SQL*Plus: Release 11.2.0.1.0 Production onThu Feb 6 17:53:47 2014
  3.  Copyright (c) 1982, 2009, Oracle.  All rights reserved.
  4. Connected to an idle instance.
  5. SQL> startup mount;
  6. ORACLE instance started.
  7.  Total System Global Area 6714322944 bytes
  8. Fixed Size                  2226056 bytes
  9. Variable Size            5033166968 bytes
  10. Database Buffers         1660944384 bytes
  11. Redo Buffers               17985536 bytes
  12. Database mounted.
  13. SQL> recover database until cancel usingbackup controlfile;
  14. ORA-00279: change 9187726668895 generatedat 02/05/2014 01:00:04 needed for
  15. thread 1
  16. ORA-00289: suggestion :/app/archive_log/1_2022_804560942.dbf
  17. ORA-00280: change 9187726668895 for thread1 is in sequence #2022
  18. Specify log: {<RET>=suggested |filename | AUTO | CANCEL}
  19. /app/oracle/oradata/epm/redo03.log
  20. Log applied.
  21. Media recovery complete.
  22. SQL> alter database open resetlogs;
  23. alter database open resetlogs
  24. *
  25. ERROR at line 1:
  26. ORA-00603: ORACLE server session terminatedby fatal error
  27. ORA-00600: internal error code, arguments:[2662], [2139], [791626372], [2139],
  28. [791626610], [12583120], [], [], [], [],[], []
  29. ORA-00600: internal error code, arguments:[2662], [2139], [791626371], [2139],
  30. [791626610], [12583120], [], [], [], [],[], []
  31. ORA-01092: ORACLE instance terminated.Disconnection forced
  32. ORA-00600: internal error code, arguments:[2662], [2139], [791626368], [2139],
  33. [791626610], [12583120], [], [], [], [],[], []
  34. Process ID: 9658
  35. Session ID: 1705 Serial number: 5

 

6,再次查看alert日志,报错基本和上述相同,并提示,生成了一个trc文件,网上查询,确定为resetlogs仍然存在数据库坏块

[html] view plaincopy在CODE上查看代码片派生到我的代码片

  1. [oracle@localhost trace]$ tail -falert_epm1.log
  2. ORA-00600: internal error code, arguments:[2662], [2139], [791626371], [2139], [791626610], [12583120], [], [], [], [],[], []
  3. ORA-01092: ORACLE instance terminated.Disconnection forced
  4. ORA-00600: internal error code, arguments:[2662], [2139], [791626368], [2139], [791626610], [12583120], [], [], [], [],[], []
  5. Incident details in:/app/oracle/diag/rdbms/epm/epm1/incident/incdir_411735/epm1_ora_9658_i411735.trc
  6. Errors in file/app/oracle/diag/rdbms/epm/epm1/incident/incdir_411735/epm1_ora_9658_i411735.trc:
  7. ORA-00603: ORACLE server session terminatedby fatal error
  8. ORA-00600: internal error code, arguments:[2662], [2139], [791626372], [2139], [791626610], [12583120], [], [], [], [],[], []
  9. ORA-00600: internal error code, arguments:[2662], [2139], [791626371], [2139], [791626610], [12583120], [], [], [], [],[], []
  10. ORA-01092: ORACLE instance terminated.Disconnection forced
  11. ORA-00600: internal error code, arguments:[2662], [2139], [791626368], [2139], [791626610], [12583120], [], [], [], [],[], []

7.重启启动数据库,数据库已能正常启动,但启动后很快就down掉,

[plain] view plaincopy在CODE上查看代码片派生到我的代码片

  1. SQL> startup mount;
  2. ORACLE instance started.
  3. Total System Global Area 6714322944 bytes
  4. Fixed Size                  2226056 bytes
  5. Variable Size            5033166968 bytes
  6. Database Buffers         1660944384 bytes
  7. RedoBuffers               17985536 bytes
  8. Database mounted.
  9. SQL> alter database open;
  10. Database altered.
  11. SQL> select * from v$instance;
  12. select * from v$instance
  13. *
  14. ERROR at line 1:
  15. ORA-03135: connection lost contact
  16. Process ID: 10171
  17. Session ID: 1705 Serial number: 5

 

1.2    启动后down,报错ORA-600 [4194]

8,查看日志,日志报错,ORA-01595,ora-600 [4194]

Block recovery completed at rba 2.137.16,scn 2139.791646684

Errors in file/app/oracle/diag/rdbms/epm/epm1/trace/epm1_smon_10046.trc:

ORA-01595: error freeing extent (49) ofrollback segment (6))

ORA-00600: internal error code, arguments:[4194], [], [], [], [], [], [], [], [], [], [], []

 

9,继续查询metalink,关于ora-600[4194],文章ID 1428786.1有详细讲解,原因是由于断电或者硬件故障,数据库实例恢复rollback时报错,

 

[plain] view plaincopy在CODE上查看代码片派生到我的代码片

  1. Symptoms
  2. The following error is occurring in thealert.log right before the database crashes.
  3. ORA-00600: internal error code, arguments:[4194], [#], [#], [], [], [], [], []
  4. This error indicates that a mismatch hasbeen detected between redo records and rollback (undo) records.
  5. ARGUMENTS:
  6. Arg [a] - Maximum Undo record number inUndo block
  7. Arg [b] - Undo record number from Redoblock
  8.  Since we are adding a new undo record toour undo block, we would expect that the new record number is equal to themaximum record number in the undo block plus one. Before Oracle can add a newundo record to the undo block it validates that this is correct. If thisvalidation fails, then an ORA-600 [4194] will be triggered.
  9. Changes
  10. This issue generally occurs when there is apower outage or hardware failure that initially crashes the database. Onstartup, the database does the normal roll forward (redo) and then rollback(undo), this is where the error is generated on the rollback.

 

10,metalink的解决方案是重建undo表空间,基本思路是先把undo设置为manual管理方式,重建undo,然后重启库即可

 

SQL> Create pfile='/tmp/corrupt.ora'from spfile ;

vi /tmp/corrupt.ora

*.Undo_management=Manual

启动数据库到mount状态

[plain] view plaincopy在CODE上查看代码片派生到我的代码片

  1. SQL> Startup mountpfile='/tmp/corrupt.ora'
  2. ORACLE instance started.
  3. Total System Global Area 6714322944 bytes
  4. Fixed Size                  2226056 bytes
  5. Variable Size            5033166968 bytes
  6. Database Buffers         1660944384 bytes
  7. Redo Buffers               17985536 bytes
  8. Database mounted.
  9. SQL> Show parameter undo
  10. NAME                                 TYPE        VALUE
  11. ----------------------------------------------- ------------------------------
  12. undo_management                      string      MANUAL
  13. undo_retention                       integer     10800
  14. undo_tablespace                      string      UNDOTBS1
  15. SQL> Alter database open ;
  16. Database altered.
  17. SQL> Create rollback segment r01 ;
  18. Rollback segment created.
  19. SQL> Alter rollback segment r01 online ;
  20.  Rollback segment altered.
  21. SQL> Create undo tablespace undotbs_2datafile '/app/oracle/oradata/epm/undotbs_2.dbf' size 200M ;
  22. Tablespace created.
  23.  SQL> alter system set undo_tablespace ='undotbs_2' scope=spfile;
  24.  System altered.

再次重启数据库,此时也能正确启动

[plain] view plaincopy在CODE上查看代码片派生到我的代码片

  1. SQL> startup
  2. ORACLE instance started.
  3.  Total System Global Area 6714322944 bytes
  4. Fixed Size                  2226056 bytes
  5. Variable Size            5033166968 bytes
  6. Database Buffers         1660944384 bytes
  7. Redo Buffers               17985536 bytes
  8. Database mounted.
  9. Database opened.

1.3    数据库坏块的处理

11,此时,查看数据库日志,有数据库坏块方面的报错,虽然只有只报出一个数据库坏块,实际上可能存在更多的坏块

[plain] view plaincopy在CODE上查看代码片派生到我的代码片

  1. Thu Feb 06 18:52:57 2014
  2. Errors in file/app/oracle/diag/rdbms/epm/epm1/trace/epm1_j000_13117.trc  (incident=460023):
  3. ORA-01578: ORACLE data block corrupted(file # 2, block # 31061)
  4. ORA-01110: data file 2:'/app/oracle/oradata/epm/sysaux01.dbf'
  5. Errors in file /app/oracle/diag/rdbms/epm/epm1/trace/epm1_j000_13117.trc  (incident=460024):
  6. ORA-01578: ORACLE data block corrupted(file # 2, block # 31061)
  7. ORA-01110: data file 2:'/app/oracle/oradata/epm/sysaux01.dbf'
  8. Errors in file/app/oracle/diag/rdbms/epm/epm1/trace/epm1_j000_13117.trc  (incident=460025):
  9. ORA-01578: ORACLE data block corrupted(file # , block # )
  10. ORA-01578: ORACLE data block corrupted(file # 2, block # 31061)
  11. ORA-01110: data file 2:'/app/oracle/oradata/epm/sysaux01.dbf

12,验证数据库坏块,常用的有两种方法,两种方法本质上是一致的,本文采用的是rman命令

1),rman命令

[plain] view plaincopy在CODE上查看代码片派生到我的代码片

  1. run {
  2. allocate channel d1 type disk;
  3. allocate channel d2 type disk;
  4. backup validate check logical database;
  5. }

 

验证某个文件

[plain] view plaincopy在CODE上查看代码片派生到我的代码片

  1. RMAN> backup validate datafile 4;
  2. Starting backup at 06-FEB-14
  3. using channel ORA_DISK_1
  4. channel ORA_DISK_1: starting full datafilebackup set
  5. channel ORA_DISK_1: specifying datafile(s)in backup set
  6. input datafile file number=00004name=/app/oracle/oradata/epm/users01.dbf
  7. channel ORA_DISK_1: backup set complete,elapsed time: 00:04:25
  8. List of Datafiles
  9. =================
  10. File Status Marked Corrupt Empty BlocksBlocks Examined High SCN
  11. ---- ------ -------------- --------------------------- ----------
  12. 4   FAILED 0              46754        477122          9187726907487
  13.  File Name: /app/oracle/oradata/epm/users01.dbf
  14.  Block Type Blocks Failing Blocks Processed
  15.  ---------- -------------- ----------------
  16.  Data       0              292182
  17.  Index      0              45604
  18. <span style="color:#FF0000;">
  19.  Other      2              92580     </span>
  20. validate found one or more corrupt blocks
  21. See trace file/app/oracle/diag/rdbms/epm/epm1/trace/epm1_ora_13757.trc for details
  22. Finished backup at 06-FEB-14

 

2),dbv命令

验证某个数据库文件

[plain] view plaincopy在CODE上查看代码片派生到我的代码片

  1. dbvFILE=/app/oracle/oradata/epm/users01.dbf
  2. Page 439168 is influx - most likely media corrupt
  3. Corrupt block relative dba: 0x0106b380(file 4, block 439168)
  4. Fractured block found during dbv:
  5. Data in bad block:
  6.  type: 32 format: 2 rdba: 0x0106b380
  7.  lastchange scn: 0x085b.2ef4cb1d seq: 0xe flg: 0x04
  8.  spare1: 0x0 spare2: 0x0 spare3: 0x0
  9.  consistency value in tail: 0x72a7201b
  10.  check value in block header: 0xfd1d
  11.  computed block checksum: 0xb9af
  12. Page 439296 is influx - most likely mediacorrupt
  13. Corrupt block relative dba: 0x0106b400(file 4, block 439296)
  14. Fractured block found during dbv:
  15. Data in bad block:
  16.  type: 32 format: 2 rdba: 0x0106b400
  17.  lastchange scn: 0x085b.2ef4cb24 seq: 0x19 flg: 0x04
  18.  spare1: 0x0 spare2: 0x0 spare3: 0x0
  19.  consistency value in tail: 0x72ac2025
  20.  check value in block header: 0xfd1f
  21.  computed block checksum: 0xb9b4
  22. DBVERIFY - Verification complete
  23. Total Pages Examined         : 477120
  24. Total Pages Processed (Data) : 292182
  25. Total Pages Failing   (Data) : 0
  26. Total Pages Processed (Index): 45604
  27. Total Pages Failing   (Index): 0
  28. Total Pages Processed (Other): 92578
  29. Total Pages Processed (Seg)  : 0
  30. Total Pages Failing   (Seg) : 0
  31. Total Pages Empty            : 46754
  32. <span style="color:#FF0000;">
  33. Total Pages Marked Corrupt   : 2
  34. </span>
  35. Total Pages Influx           : 2
  36. Total Pages Encrypted        : 0
  37. Highest block SCN            : 791817065 (2139.791817065)

 

13,此时,全库逻辑导出,保护好现有的数据

expdp userid=\"/ as sysdba\"full=y dumpfile=epm20140206_3.dmp directory=dpdata1  LOGFILE=epm20140206_3.log  PARALLEL=3

14,通过rman命令,共检验出22个数据库坏块

[plain] view plaincopy在CODE上查看代码片派生到我的代码片

  1. SQL> select * fromv$database_block_corruption;
  2.      FILE#     BLOCK#     BLOCKS CORRUPTION_CHANGE# CORRUPTIO
  3. ---------- ---------- ---------------------------- ---------
  4.         4     439296          1                  0 FRACTURED
  5.         4     439168          1                  0 FRACTURED
  6.         8    2610431          1                  0 FRACTURED
  7.         8    2547178          1                  0 FRACTURED
  8.         8    2547114          1                  0 FRACTURED
  9.         8    2547050          1                  0 FRACTURED
  10.         8    2546986          1                  0 FRACTURED
  11.         8    2546922          1                  0 FRACTURED
  12.         8    2546890          1                  0 FRACTURED
  13.         8    2546858          1                  0 FRACTURED
  14.         8    2546826          1                  0 FRACTURED
  15.       FILE#    BLOCK#     BLOCKSCORRUPTION_CHANGE# CORRUPTIO
  16. ---------- ---------- ---------------------------- ---------
  17.         8    2546794          1                  0 FRACTURED
  18.         8    2546762          1                  0 FRACTURED
  19.         8    2546730          1                  0 FRACTURED
  20.         8    2546698          1                  0 FRACTURED
  21.         8    2459433          1                  0 FRACTURED
  22.         8    2459305          1                  0 FRACTURED
  23.         8    1596687          1                  0 FRACTURED
  24.         9     876808          1                  0 FRACTURED
  25.         9     662038          1                  0 FRACTURED
  26.         9     345491          1                  0 FRACTURED
  27.         9     281617          1                  0 FRACTURED

 

15,进一步定位到数据库对象

查看坏块对应的对象

 

[plain] view plaincopy在CODE上查看代码片派生到我的代码片

  1. Selecttablespace_name,segment_type,owner,segment_name From dba_extents Wherefile_id=2 and 31061 between block_id and block_id+blocks-1;

 

 

根据坏块生产查看对象的sql

 

[plain] view plaincopy在CODE上查看代码片派生到我的代码片

  1. select 'Select tablespace_name,segment_type,owner,segment_nameFrom dba_extents Where file_id=' || FILE# || ' and ' || BLOCK# ||  ' between block_id and block_id+blocks-1;'from v$database_block_corruption;

 

16,对于索引坏块,处理比较简单,直接rebuild即可,如果索引坏块较多,可以选择重建所有索引

重建索引

SQL> alter INDEX EPM.IDX_QRTZ_T_NEXT_FIRE_TIMErebuild online;

 

对于表比较复杂

1),可以通过备份恢复数据块,本例由于备份有问题,未成功

[plain] view plaincopy在CODE上查看代码片派生到我的代码片

  1. RMAN> blockrecover datafile 2 block31061 from backupset;
  2. Starting recover at 06-FEB-14
  3. using channel ORA_DISK_1
  4. channel ORA_DISK_1: restoring block(s)
  5. channel ORA_DISK_1: specifying block(s) torestore from backup set
  6. restoring blocks of datafile 00002
  7. channel ORA_DISK_1: reading from backuppiece /app/backup/rman_bak/rman.c0_20140122_1629_1_EPM.bak
  8. channel ORA_DISK_1: ORA-19870: error whilerestoring backup piece /app/backup/rman_bak/rman.c0_20140122_1629_1_EPM.bak
  9. ORA-19501: read error on file"/app/backup/rman_bak/rman.c0_20140122_1629_1_EPM.bak", block number154496 (block size=8192)
  10. ORA-27061: waiting for async I/Os failed
  11. Linux-x86_64 Error: 5: Input/output error
  12. Additional information: -1
  13. Additional information: 1048576

 

2),如果没有备份,则可以通过如下方法跳过坏块,当然会丢失部分数据

[plain] view plaincopy在CODE上查看代码片派生到我的代码片

  1. SQL> ALTER SESSION SET EVENTS
  2.  2  '10231 TRACE NAME CONTEXTFOREVER, LEVEL 10';
  3.  Session altered.
  4.  SQL> create tableEPM.REQ_RESPSB_SUPPLIER_BAK as select * from EPM.REQ_RESPSB_SUPPLIER;
  5.  Table created.
  6.  SQL> select count(*) fromEPM.REQ_RESPSB_SUPPLIER_BAK;
  7.   COUNT(*)
  8. ----------
  9.    188786

 

17,至此,数据库启动成功,业务也基本恢复正常了

 

参考文档

metalink:ORA-600 [kcratr_nab_less_than_odr]during Instance Recovery after Database Crash (Doc ID 1299564.1)

metalink:Step by step to resolve ORA-6004194 4193 4197 on database crash (Doc ID 1428786.1)

http://www.eygle.com/archives/2010/05/ora-00600_kcratr1_lostwrt.html

http://www.xifenfei.com/2347.html

http://www.askmaclean.com/archives/ora-6004194%E9%94%99%E8%AF%AF%E4%B8%80%E4%BE%8B.html


本文固定链接: http://www.devba.com/index.php/archives/4819.html | 开发吧

报歉!评论已关闭.