前面文章提到,这周末帮一个客户测试报错场景:
客户通过duplicate生产备库的方式创建cascade备库。
发现每次都会遇到两个文件报错,ORA-17628: Oracle error 19505错误,且每一次跑,报错文件不一样。 现在想帮客户验证,这属于是正常现象还是bug。
One data file is not using OMF name while the rest of the data files are using OMF name.
怀疑是对应数据文件名字的问题,但后来排除了此原因。
因为比对报错数据文件和正常的数据文件并无特殊之处,且每次报错文件并不一样。
另外一篇文档是:
RMAN-03009: failure of backup command on ORA_DISK_1 channel at 03/28/2011 19:17:43
ORA-17628: Oracle error 19505 returned by remote Oracle server
这篇文档比较全,把已知问题枚举了下,可作为参考。
得到的启示是,不但要看duplicate的日志,还要去看auxiliary数据库的alert日志,有可能会有更详细的日志描述。
另外我重新审视,增加关键字,还匹配到更精确的一篇文档:
结论是说目标路径有重复文件,这...其实上篇文档也有提到。
另外,1401333.1也说了,
这个错误其实是很通用的一个错误号,如果已知问题都不能匹配,那么就需要更多的诊断来找到确切的原因。
也就是又强调了下,我们还要去看比如auxiliary数据库的alert日志等详细信息。
下面按客户要求,想测试当只有部分几个文件报错,是否可以单独备份过去?
这其实算workaround的范畴,我们通常不建议这样来解决一个实际问题。
但客户实在是不想反复重跑这个duplicate,因为库太大了,老这么重跑时间上确实有些伤不起...
所以本着客户关怀角度,也就协助测试下这个可能性:因为这种问题确实不好直接回答是或否,只能构造场景实际测试下:
首先从一级备库duplicate到二级备库会报错:
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of Duplicate Db command at 05/14/2023 23:54:15
RMAN-05501: aborting duplication of target database
RMAN-05531: a mounted database cannot be duplicated while datafiles are fuzzy
先改为从主库duplicate到二级备库,重新跑脚本。
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of Duplicate Db command at 05/14/2023 23:56:57
RMAN-05501: aborting duplication of target database
RMAN-03015: error occurred in stored script Memory Script
RMAN-03009: failure of backup command on prmy1 channel at 05/14/2023 23:56:57
ORA-17628: Oracle error 19505 returned by remote Oracle server
看看!这里有同样的错误号,但是并不是复现问题,因为这个错误号有太多的可能性。
此时正好按上面文档习得的经验,去查看二级备库(auxiliary)的alert日志有没有更详细的信息:
Sun May 14 23:56:56 2023
Errors in file /u01/app/oracle/diag/rdbms/jycas/jingyu/trace/jingyu_ora_6028.trc:
ORA-19505: failed to identify file "/u01/oradata/jingyu/control01.ctl"
ORA-27040: file create error, unable to create file
Linux-x86_64 Error: 2: No such file or directory
Additional information: 1
Sun May 14 23:57:41 2023
destination database instance is 'started' not 'mounted'
果然,是因为一个简单的文件创建失败的错误,应该是对应的文件夹不存在,直接创建即可:
[oracle@db11gcas ~]$ cd /u01/oradata/jingyu
-bash: cd: /u01/oradata/jingyu: No such file or directory
[oracle@db11gcas ~]$ cd /u01/oradata
[oracle@db11gcas oradata]$ ls -ld
drwxr-xr-x. 2 oracle oinstall 6 May 14 23:49 .
[oracle@db11gcas oradata]$ mkdir jingyu
[oracle@db11gcas oradata]$ cd /u01/oradata/jingyu
[oracle@db11gcas jingyu]$
创建对应的文件夹后再次跑脚本成功。
但是为了构造客户的问题,我这里将成功的文件故意删除掉1个。
...
channel prmy1: starting datafile copy
input datafile file number=00004 name=/u01/oradata/jingyu/users01.dbf
output file name=/u01/oradata/JYCAS/datafile/o1_mf_dbs_i_ji_1r1s5uap_.dbf tag=TAG20230515T000118
channel prmy2: datafile copy complete, elapsed time: 00:00:01
output file name=/u01/oradata/JYCAS/datafile/o1_mf_t2_1s1s5uap_.dbf tag=TAG20230515T000118
channel prmy3: datafile copy complete, elapsed time: 00:00:01
output file name=/u01/oradata/JYCAS/datafile/o1_mf_users_1t1s5uaq_.dbf tag=TAG20230515T000118
channel prmy1: datafile copy complete, elapsed time: 00:00:01
Finished backup at 2023-05-15 00:01:31
...
就删除上面这个4号数据文件吧!删除之后肯定数据库是无法open的。
/u01/oradata/JYCAS/datafile/o1_mf_users_1t1s5uaq_.dbf
[oracle@db11gcas datafile]$ rm o1_mf_users_1t1s5uaq_.dbf
SQL> alter database open;
alter database open
*
ERROR at line 1:
ORA-10458: standby database requires recovery
ORA-01157: cannot identify/lock data file 4 - see DBWR trace file
ORA-01110: data file 4: '/u01/oradata/JYCAS/datafile/o1_mf_users_1t1s5uaq_.dbf'
而且,这个时候开启mrp进程,也是不成功的:
SQL> recover managed standby database disconnect;
Media recovery complete.
SQL> !ps -ef|grep mrp
oracle 7079 6943 0 00:15 pts/1 00:00:00 /bin/bash -c ps -ef|grep mrp
oracle 7081 7079 0 00:15 pts/1 00:00:00 grep mrp
没有成功启动mrp进程,去看alert日志,清楚的告诉你是因为4号文件找不到:
Mon May 15 00:15:04 2023
ALTER DATABASE RECOVER managed standby database disconnect
Attempt to start background Managed Standby Recovery process (jingyu)
Mon May 15 00:15:04 2023
MRP0 started with pid=26, OS id=7056
MRP0: Background Managed Standby Recovery process started (jingyu)
started logmerger process
Mon May 15 00:15:09 2023
Managed Standby Recovery not using Real Time Apply
Mon May 15 00:15:09 2023
Errors in file /u01/app/oracle/diag/rdbms/jycas/jingyu/trace/jingyu_dbw0_5520.trc:
ORA-01157: cannot identify/lock data file 4 - see DBWR trace file
ORA-01110: data file 4: '/u01/oradata/JYCAS/datafile/o1_mf_users_1t1s5uaq_.dbf'
ORA-27037: unable to obtain file status
Linux-x86_64 Error: 2: No such file or directory
Additional information: 3
MRP0: Background Media Recovery terminated with error 1110
Errors in file /u01/app/oracle/diag/rdbms/jycas/jingyu/trace/jingyu_pr00_7062.trc:
ORA-01110: data file 4: '/u01/oradata/JYCAS/datafile/o1_mf_users_1t1s5uaq_.dbf'
ORA-01157: cannot identify/lock data file 4 - see DBWR trace file
ORA-01110: data file 4: '/u01/oradata/JYCAS/datafile/o1_mf_users_1t1s5uaq_.dbf'
Slave exiting with ORA-1110 exception
Errors in file /u01/app/oracle/diag/rdbms/jycas/jingyu/trace/jingyu_pr00_7062.trc:
ORA-01110: data file 4: '/u01/oradata/JYCAS/datafile/o1_mf_users_1t1s5uaq_.dbf'
ORA-01157: cannot identify/lock data file 4 - see DBWR trace file
ORA-01110: data file 4: '/u01/oradata/JYCAS/datafile/o1_mf_users_1t1s5uaq_.dbf'
Completed: ALTER DATABASE RECOVER managed standby database disconnect
Recovery Slave PR00 previously exited with exception 1110
MRP0: Background Media Recovery process shutdown (jingyu)
此时尝试去一级备库备份这个4号数据文件,然后传输到二级备库:
RMAN> backup datafile 4 format '/tmp/dbfile4.bak';
Starting backup at 2023-05-15 00:14:27
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=12 device type=DISK
channel ORA_DISK_1: starting full datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
input datafile file number=00004 name=/u01/oradata/JYADG/datafile/o1_mf_users_0r1s27bm_.dbf
channel ORA_DISK_1: starting piece 1 at 2023-05-15 00:14:27
channel ORA_DISK_1: finished piece 1 at 2023-05-15 00:14:28
piece handle=/tmp/dbfile4.bak tag=TAG20230515T001427 comment=NONE
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:01
Finished backup at 2023-05-15 00:14:28
[oracle@db11gadg ~]$ scp /tmp/dbfile4.bak 192.168.1.20:/tmp
oracle@192.168.1.20's password:
dbfile4.bak 100% 1344KB 80.2MB/s 00:00
[oracle@db11gadg ~]$
然后在二级备库恢复这个缺失的4号数据文件:
RMAN> list backup of datafile 4;
using target database control file instead of recovery catalog
specification does not match any backup in the repository
RMAN> catalog start with '/tmp/dbfile4.bak';
searching for all files that match the pattern /tmp/dbfile4.bak
List of Files Unknown to the Database
=====================================
File Name: /tmp/dbfile4.bak
Do you really want to catalog the above files (enter YES or NO)? yes
cataloging files...
cataloging done
List of Cataloged Files
=======================
File Name: /tmp/dbfile4.bak
RMAN> restore datafile 4;
Starting restore at 2023-05-15 00:22:12
using channel ORA_DISK_1
channel ORA_DISK_1: starting datafile backup set restore
channel ORA_DISK_1: specifying datafile(s) to restore from backup set
channel ORA_DISK_1: restoring datafile 00004 to /u01/oradata/JYCAS/datafile/o1_mf_users_1t1s5uaq_.dbf
channel ORA_DISK_1: reading from backup piece /tmp/dbfile4.bak
channel ORA_DISK_1: piece handle=/tmp/dbfile4.bak tag=TAG20230515T001427
channel ORA_DISK_1: restored backup piece 1
channel ORA_DISK_1: restore complete, elapsed time: 00:00:01
Finished restore at 2023-05-15 00:22:14
RMAN>
此时查看alert日志,也是正常恢复:
Mon May 15 00:22:13 2023
Full restore complete of datafile 4 /u01/oradata/JYCAS/datafile/o1_mf_users_l622onyr_.dbf. Elapsed time: 0:00:00
checkpoint is 5941162
last deallocation scn is 3
需要注意下因为使用OMF,恢复的4号数据文件跟之前的名字不一样了!但是不影响。
现在再尝试启动mrp进程恢复就OK了。
SQL> select file#, CHECKPOINT_CHANGE#, STATUS from v$datafile;
FILE# CHECKPOINT_CHANGE# STATUS
---------- ------------------ -------
1 5941162 SYSTEM
2 5941162 ONLINE
3 5941162 ONLINE
4 5941162 ONLINE
5 5941162 ONLINE
6 5941162 ONLINE
7 5541098 ONLINE
8 5541108 ONLINE
8 rows selected.
SQL> recover managed standby database disconnect;
Media recovery complete.
SQL> !ps -ef|grep mrp
oracle 7687 1 0 00:26 ? 00:00:00 ora_mrp0_jingyu
oracle 7713 7606 0 00:27 pts/1 00:00:00 /bin/bash -c ps -ef|grep mrp
oracle 7715 7713 0 00:27 pts/1 00:00:00 grep mrp
SQL> select file#, CHECKPOINT_CHANGE#, STATUS from v$datafile;
FILE# CHECKPOINT_CHANGE# STATUS
---------- ------------------ -------
1 5942030 SYSTEM
2 5942030 ONLINE
3 5942030 ONLINE
4 5942030 ONLINE
5 5942030 ONLINE
6 5942030 ONLINE
7 5541098 ONLINE
8 5541108 ONLINE
8 rows selected.
SQL> @dg
NAME VALUE UNIT TIME_COMPUTED DATUM_TIME
---------------------- ---------------------- ------------------------------ ------------------------------ ------------------------------
transport lag +00 00:17:22 day(2) to second(0) interval 05/15/2023 00:28:48 05/15/2023 00:27:53
apply lag day(2) to second(0) interval 05/15/2023 00:28:48
apply finish time day(2) to second(3) interval 05/15/2023 00:28:48
estimated startup time 9 second 05/15/2023 00:28:48
此时主库切换一下日志,再次查询级联备库状态:
SQL> @dg
NAME VALUE UNIT TIME_COMPUTED DATUM_TIME
---------------------- ---------------------- ------------------------------ ------------------------------ ------------------------------
transport lag +00 00:00:00 day(2) to second(0) interval 05/15/2023 00:29:14 05/15/2023 00:29:05
apply lag +00 00:00:00 day(2) to second(0) interval 05/15/2023 00:29:14 05/15/2023 00:29:05
apply finish time day(2) to second(3) interval 05/15/2023 00:29:14
estimated startup time 9 second 05/15/2023 00:29:14
SQL> select file#, CHECKPOINT_CHANGE#, STATUS from v$datafile;
FILE# CHECKPOINT_CHANGE# STATUS
---------- ------------------ -------
1 5943460 SYSTEM
2 5943460 ONLINE
3 5943460 ONLINE
4 5943460 ONLINE
5 5943460 ONLINE
6 5943460 ONLINE
7 5541098 ONLINE
8 5541108 ONLINE
8 rows selected.
SQL> @dg
NAME VALUE UNIT TIME_COMPUTED DATUM_TIME
---------------------- ---------------------- ------------------------------ ------------------------------ ------------------------------
transport lag +00 00:00:48 day(2) to second(0) interval 05/15/2023 00:29:56 05/15/2023 00:29:54
apply lag +00 00:00:48 day(2) to second(0) interval 05/15/2023 00:29:56 05/15/2023 00:29:54
apply finish time day(2) to second(3) interval 05/15/2023 00:29:56
estimated startup time 9 second 05/15/2023 00:29:56
说明这个恢复缺失数据文件再开启mrp进程的方案可行。
但出于从根本解决问题的初衷,我们还是会建议找到具体原因重新跑的方式来恢复。