3节点rac,主机重启后,ogg更换节点挖掘,传输进程出现如下报错。
2022-02-08 09:52:11 ERROR OGG-01705 Oracle GoldenGate Capture for Oracle, p1_pd.prm: Input checkpoint position
45970013
for input trail file './dirdat/pd016425' is greater than the size of the file (
45941215
). Please consult Oracle Knowledge Management Doc ID 1138409.1. for instructions.
在goldengate中,传输进程和挖掘进程在交换数据的时候传输进程是从cache区域中去抓取数据并传送到目标端,而不是等到真的写到磁盘后(可以提高性能)。当意外down机时系统来不及将cache中的内容写到磁盘,出现了传输进程新建的检查点是基于cache中的信息更新的,而trail文件的大小实际上是要比检查点写的RBA小的。
当下次启动时,进程进行恢复动作,并将比文件大的一部分内容写到了下一个trail文件中(挖掘进程启动的时候会etrollover)。所以,有两种方法可以回复进程。一种是将进程的检查点跳到下一个trail的指定RBA开始传输,一种是进行重新初始化。排查过程如下:
1.查看ggerr.log日志报错信息如下:
ERROR OGG-01705 Oracle GoldenGate Capture for Oracle, p1_pd.prm: Input checkpoint position 45970013 for input trail file './dirdat/pd016425' is greater than the size of the file (45941215). Please consult Oracle Knowledge Management Doc ID 1138409.1. for instructions.
2.备份进程当前检查点位置:
EXTRACT P1_PD Last Started 2022-02-08 09:52 Status ABENDED
Checkpoint Lag 08:46:00 (updated 00:00:53 ago)
Log Read Checkpoint File ./dirdat/pd016425
2022-02-08 01:06:11.000000 RBA
45970013
3.确认trail文件大小(通常RBA的位置已经超过了trail大小):
-rw-r--r-- 1 ogguser dba
45941215 Feb 8 08:56 pd016425
-rw-r----- 1 ogguser oinstall 102065502 Feb 8 08:56 pd016426
根据原理,如果需要手工恢复,则需要知道如下信息:
seqno为X的这个trail比RBA小的这部分内容在X+1这个Trail中的具体位置
根据官方文档我们有三种方法来获取这个具体位置。
方法一
通过公式:进程新RBA=报错信息的过大RBA( 变量A)+新trail中的第一条记录的RBA( 变量B)-进程失败trail中包含下个trail第一条记录事务的起始RBA( 变量C)
1.查看报错信息中(或者INFO查看进程当前RBA)为: 变量A;
EXTRACT P1_PD Last Started 2022-02-08 09:52 Status ABENDED
Checkpoint Lag 08:46:00 (updated 00:00:53 ago)
Log Read Checkpoint File ./dirdat/pd016425
2022-02-08 01:06:11.000000 RBA
45970013(变量A)
或者:
2022-02-08 09:52:11 ERROR OGG-01705 Oracle GoldenGate Capture for Oracle, p1_pd.prm: Input checkpoint position
45970013 for input trail file './dirdat/pd016425' is greater than the size of the file (45941215). Please consult Oracle Knowledge Management Doc ID 1138409.1. for instructions.
2.使用Logdump工具找到下一个trail中的第一条记录: 变量B;
Logdump 1 >open ./dirdat/pd016426
Current LogTrail is /oggdata/ogg12c/dirdat/pd016426
Logdump 2 >ghdr on
Logdump 3 >n
2022/02/08 08:56:01.354.224 FileHeader Len 1707 RBA 0
Name: *FileHeader*
3000 02f9 3000 0008 4747 0d0a 544c 0a0d 3100 0002 | 0...0...GG..TL..1...
0004 3200 0004 2000 0000 3300 0008 02f2 fdaa a308 | ..2... ...3.........
a1f0 3400 0024 0022 7572 693a 6463 7064 6462 3031 | ..4..$."uri:dcpddb01
3a3a 6f67 6764 6174 613a 6f67 6731 3263 3a45 315f | ::oggdata:ogg12c:E1_
5044 3600 0013 0011 2e2f 6469 7264 6174 2f70 6430 | PD6....../dirdat/pd0
3136 3432 3637 0000 0101 3800 0004 0000 402a 3900 | 164267....8.....@*9.
0008 0000 0000 0615 655e 3a00 0081 0130 0000 0000 | ........e^:....0....
Logdump 4 >n
___________________________________________________________________
Hdr-Ind : E (x45) Partition : . (x00)
UndoFlag : . (x00) BeforeAfter: A (x41)
RecLength : 0 (x0000) IO Time : 2022/02/08 08:56:01.320.554
IOType : 150 (x96) OrigNode : 0 (x00)
TransInd : . (x03) FormatType : R (x52)
SyskeyLen : 0 (x00) Incomplete : . (x00)
AuditRBA : 0 AuditPos : 0
Continued : N (x00) RecCount : 0 (x00)
2022/02/08 08:56:01.320.554 RestartAbend Len 0 RBA 1715
Name:
After Image: Partition 0 G s
Logdump 5 >n
___________________________________________________________________
Hdr-Ind : E (x45) Partition : . (x04)
UndoFlag : . (x00) BeforeAfter: A (x41)
RecLength : 43 (x002b) IO Time : 2022/02/08 01:06:07.000.000
IOType : 15 (x0f) OrigNode : 255 (xff)
TransInd : . (x03) FormatType : R (x52)
SyskeyLen : 0 (x00) Incomplete : . (x00)
AuditRBA : 44481 AuditPos : 37992464
Continued : N (x00) RecCount : 1 (x01)
2022/02/08 01:06:07.000.000 FieldComp Len 43 RBA
1775
Name:
TOPS5A.SYS_JOBRUNINFOS
After Image: Partition 4 G s
0000 000e 0000 000a 4a6f 6273 4368 616e 6765 0001 | ........JobsChange..
0015 0000 3230 3232 2d30 322d 3038 3a30 313a 3036 | ....2022-02-08:01:06
3a30 36 | :06
注:如果新trail没有第一条记录,则依次往后找,直到找到有记录的trail为止。
3.找到新trail事务在旧trail中的起始位置:变量C:
1).通过步骤2,我们找到了新trail的第一条记录RBA为1775.而这条记录对应的AuditRBA=44481, AuditPos= 37992464 这表明此事务在Oracle数据库中的44481号redo的 RBA: 37992464处。另外还需要记录此记录的表名。
2).在旧的trail中查找以上事务的起始点:
Logdump 6 >open ./dirdat/pd016425
Current LogTrail is /oggdata/ogg12c/dirdat/pd016425
Logdump 7 >ghdr on
Logdump 8 >filter include AuditRBA 44481
Logdump 9 >filter include filename TOPS5A.SYS_JOBRUNINFOS
Logdump 10 >filter match all
Logdump 11 >n
--下面一段为官网摘抄,不是本次具体数值,本次C值为方法二获得
找到记录:
Hdr-Ind : E (x45) Partition : . (x04)
UndoFlag : . (x00) BeforeAfter: A (x41)
RecLength : 114 (x0072) IO Time : 2011/11/28 05:54:08.000.919
IOType : 15 (x0f) OrigNode : 255 (xff)
TransInd : . (x00) FormatType : R (x52)
SyskeyLen : 0 (x00) Incomplete : . (x00)
AuditRBA : 14735 AuditPos : 272667152 ------------ same as the one in step 2
Continued : N (x00) RecCount : 1 (x01)
2011/11/28 05:54:08.000.919 FieldComp Len 114 RBA 160284255 -->我们要找的RBA
Name: SCHEMA.XXXX
After Image: Partition 4 G b
0000 000c 0000 0008 3130 3532 3030 3839 0001 000c | ........10520089....
0000 0008 3130 3534 3639 3532 0002 0008 0000 0004 | ....10546952........
3633 3737 0003 0010 0000 000c 4461 7669 732c 204d | 6377........Jacob, M
696c 6573 0004 0015 0000 3230 3039 2d30 342d 3235 | iles......2009-05-25
3a30 323a 3338 3a35 3900 0500 1500 0032 3031 312d | :02:38:59......2011-
3131 2d32 383a 3035 3a35 343a 3037 | 11-28:05:54:07
3).获得此条记录的RBA为(变量C): 160284255
4.计算进程应该在新trial的哪个位置:
应该在新trail的RBA=160374765+1537-160284255 =92047
5.基于新的RBA启动进程:
Ggsci> alter rep < rep name>, extseqno 27506, extrba 92047.
方法二
基于公式:进程新RBA=报错信息的过大RBA( 变量A)+新trail中的第一条记录的RBA( 变量B)-旧Trail的文件大小( 变量C)
1.获取变量A:同方法一中办法
得到变量A:
45970013
2.获取变量B:同方法一中办法
得到变量B:
1775
3.获取变量C:文件系统查看:
OGG-01705 Oracle GoldenGate Capture for Oracle, p1_pd.prm: Input checkpoint position 45970013 for input trail file './dirdat/pd016425' is greater than the size of the file (
45941215,变量C). Please consult Oracle Knowledge Management Doc ID 1138409.1. for instructions.
4.计算新进程RBA:
45970013+1775-45941215=30573
5.基于新的RBA启动:
alter P1_PD extseqno 16426 extrba 30573
方法三
通过以上两个方法我们发现在相同的案例中使用不同的方法得到的结果也可能不一样,也就是说不一定有效。(如果数据量不大最好重新初始化)。
此方法直接参照MOS
This option should only be used if option 1 and option 2 failed.
Please refer the bug- 13955891
A live example which explains this scenario:
1) GGSCI (OGGORACLE)1> info rep01prd, detail
REPLICAT REP01PRD Last Started 2012-04-11 12:05 Status ABENDED
Checkpoint Lag 00:00:00 (updated 08:24:18 ago)
Log Read Checkpoint File /opt/GoldenGateV11/dirdat/rp002289
2012-04-11 04:38:15.998847 RBA 499003921
2. sh>ls -tlr dirdat
-rw-rw-rw- 1 oggadm dba 499999882 Apr 11 01:52 /opt/GoldenGateV11/dirdat/rp002288
-rw-rw-rw- 1 oggadm dba 498003983 Apr 11 06:15 /opt/GoldenGateV11/dirdat/rp002289 << where the checkpoint is pointing
-rw-rw-rw- 1 oggadm dba 499999920 Apr 11 06:16 /opt/GoldenGateV11/dirdat/rp002290 << the next available trail file
Here the actual trail file size, End of File (EOF), is 498003983 where the replicat read checkpoint rba is at 499003921
The formula to find the correct RBA is
New datapump / Replicat RBA = Reader's too-big checkpoint RBA(A) - Repaired record RBA in existing trail (B)+ First record RBA in the new trail file (after the restart abend)(c)
Let us call
Reader's too-big checkpoint RBA -- A
Repaired record RBA in existing trail -- B
First record RBA in the new trail file (after the restart abend) -- C
We know the Reader's too-big checkpoint RBA (A) =
499003921( from the info output) -- step 1
We could get the Repaired record RBA in existing trail-- B -- step 2
----------------------------------------------------------
Logdump 979 >open rp002289
Current LogTrail is C:\GG\V11111\rp002289
Logdump 980 >ghdr on
Logdump 981 >count
LogTrail C:\GG\V11111\rp002289 has
2352689 records
Total Data Bytes 272734228
Avg Bytes/Record 115
Delete 14237
Insert 2192433
FieldComp 146014
GGSPKUpdate 1
LargeObject 3
Repaired Records 1
Others 1
Before Images 14237
After Images 2338451
Average of 265671 Transactions
Bytes/Trans ..... 1451
Records/Trans ... 8
Files/Trans ..... 0
Logdump 982 >skip
2352688 -- has 2352689 records -1
Logdump 983 >n
___________________________________________________________________
Hdr-Ind : E (x45) Partition : . (x04)
UndoFlag : . (x00) BeforeAfter: A (x41)
RecLength : 251 (x00fb) IO Time : 2012/04/11 05:38:08.999.658
IOType : 5 (x05) OrigNode : 255 (xff)
TransInd : . (x03) FormatType : R (x52)
SyskeyLen : 0 (x00) Incomplete : . (x00)
AuditRBA : 336307 AuditPos : 89724944
Continued : N (x00) RecCount : 1 (x01)
2012/04/11 05:38:08.999.658 Insert Len 251 RBA 498003230
Name: PRODDTA.F0911T
After Image: Partition 4 G s
0000 0004 0000 4841 0001 000a 0000 0006 3739 3436 | ......HA........7946
3536 0002 0007 0000 3030 3333 3000 0300 0a00 0000 | 56......00330.......
0000 0000 01b5 e600 0400 0400 0041 4100 0500 0500 | .............AA.....
0000 0138 0006 0004 0000 2020 0007 0003 0000 2000 | ...8...... ...... .
0800 0300 0020 0009 0003 0000 2000 0a00 0300 0020 | ..... ...... ......
000b 0003 0000 2000 0c00 0300 0020 000d 0003 0000 | ...... ...... ......
2000 0e00 0300 0020 000f 0003 0000 2000 1000 0300 | ...... ...... .....
Logdump 984 >n
Repaired Len 369 RBA
498003606
4800 0031 4504 0041 00fb 05ff 02f1 e397 7402 50ea | H..1E..A........t.P.
0000 0000 0559 2010 0005 21b3 0352 0000 0001 5052 | .....Y ...!..R....PR
4f44 4454 412e 4630 3931 3154 0044 0000 fb00 0000 | ODDTA.F0911T.D......
0400 0048 4100 0100 0a00 0000 0637 3934 3635 3600 | ...HA........794656.
0200 0700 0030 3033 3330 0003 000a 0000 0000 0000 | .....00330..........
0001 b5e6 0004 0004 0000 4141 0005 0005 0000 0001 | ..........AA........
3900 0600 0400 0020 2000 0700 0300 0020 0008 0003 | 9...... ...... ....
0000 2000 0900 0300 0020 000a 0003 0000 2000 0b00 | .. ...... ...... ...
0300 0020 000c 0003 0000 2000 0d00 0300 0020 000e | ... ...... ...... ..
0003 0000 2000 0f00 0300 0020 0010 0003 0000 2000 | .... ...... ...... .
1100 0300 0020 0012 000e 0000 2020 2020 2020 2020 | ..... ......
2020 2020 0013 0003 0000 2000 1400 0e00 0020 2020 | ...... ......
2020 2020 2020 2020 2000 1500 0300 0020 0016 000e | ...... ....
0000 2020 2020 2020 2020 2020 2020 0017 0003 0000 | .. ......
2000 1800 0e00 0020 2020 2020 2020 2020 2020 2000 | ...... .
1900 0500 0000 0130 5400 0039 5200 0014 4141 4142 | .......0T..9R...AAAB
3936 4143 5441 4147 3156 5541 417a 0001 4c00 000b | 96ACTAAG1VUAAz..L...
3731 3331 3831 3530 3739 3636 0000 0e31 312e 0000 | 713181507966...11...
0000 0000 0000 0000 00 | .........
Logdump 985 >
-----------------------------------------------
Here we could get the RBA for (B) as 498003606
-----------------------------------------------------------------------------------------------------
First record RBA in the new trail file (after the restart abend)(c) --- Step 3
-----------------------------------------------------------------------------------------------------
Logdump 985 >open rp002290
Current LogTrail is C:\GG\V11111\rp002290
Logdump 986 >n
2012/04/11 07:15:49.756.239 FileHeader Len 1307 RBA 0
Name: *FileHeader*
3000 01f3 3000 0008 4747 0d0a 544c 0a0d 3100 0002 | 0...0...GG..TL..1...
0002 3200 0004 2000 0000 3300 0008 02f1 e398 d156 | ..2... ...3........V
5d4f 3400 001e 001c 7572 693a 7468 6562 653a 3a6f | ]O4.....uri:thebe::o
7074 3a47 6f6c 6465 6e47 6174 6556 3131 3500 0044 | pt:GoldenGateV115..D
3500 001e 001c 7572 693a 7468 6562 653a 3a6f 7074 | 5.....uri:thebe::opt
3a47 6f6c 6465 6e47 6174 6556 3131 3500 001e 001c | :GoldenGateV115.....
7572 693a 7468 6562 653a 3a6f 7074 3a47 6f6c 6465 | uri:thebe::opt:Golde
Logdump 987 >n
___________________________________________________________________
Hdr-Ind : E (x45) Partition : . (x00)
UndoFlag : . (x00) BeforeAfter: A (x41)
RecLength : 0 (x0000) IO Time : 2012/04/11 07:15:49.762.966
IOType : 150 (x96) OrigNode : 0 (x00)
TransInd : . (x03) FormatType : R (x52)
SyskeyLen : 0 (x00) Incomplete : . (x00)
AuditRBA : 0 AuditPos : 0
Continued : N (x00) RecCount : 0 (x00)
2012/04/11 07:15:49.762.966 RestartAbend Len 0 RBA 1315
Name:
After Image: Partition 0 G s
Logdump 988 >n
___________________________________________________________________
Hdr-Ind : E (x45) Partition : . (x04)
UndoFlag : . (x00) BeforeAfter: A (x41)
RecLength : 251 (x00fb) IO Time : 2012/04/11 05:38:09.014.430
IOType : 5 (x05) OrigNode : 255 (xff)
TransInd : . (x03) FormatType : R (x52)
SyskeyLen : 0 (x00) Incomplete : . (x00)
AuditRBA : 336307 AuditPos : 89726992
Continued : N (x00) RecCount : 1 (x01)
2012/04/11 05:38:09.014.430 Insert Len 251 RBA
1378
Name: PRODDTA.F0911T
After Image: Partition 4 G s
0000 0004 0000 4841 0001 000a 0000 0006 3739 3436 | ......HA........7946
3536 0002 0007 0000 3030 3333 3000 0300 0a00 0000 | 56......00330.......
0000 0000 01b5 e600 0400 0400 0041 4100 0500 0500 | .............AA.....
0000 0139 0006 0004 0000 2020 0007 0003 0000 2000 | ...9...... ...... .
0800 0300 0020 0009 0003 0000 2000 0a00 0300 0020 | ..... ...... ......