Sunday, December 18, 2016

Managed Recovery Stopped Applying Archives on Standby Database After Enabling Apply Delay on The Primary Archive Process

I had a requirement to set the archive apply delay feature between the primary and the standby DB, where the archives will be sent immediately to the standby site but will get applied in a delay of one hour. Such feature enables you to eliminate any erroneous update get executed on the primary from being applied immediately on the standby DB,which gives the DBA some time to think whether to fix the logical corruption or failing over to the standby DB.

Back to my story, On the primary DB, I've set the delay on the archival process that shift the archives to the standby DB to 60 minutes:

SQL> alter system set log_archive_dest_6='service="testprds"','LGWR ASYNC NOAFFIRM delay=60 optional compression=enable max_failure=0 max_connections=1 reopen=300 db_unique_name="testprds" net_timeout=90','valid_for=(all_logfiles,primary_role)';

Restart the archival process: [To let the changes take effect]
SQL> alter system set log_archive_dest_state_6=DEFER;
SQL> alter system set log_archive_dest_state_6=ENABLE;

Restart the recover on the standby which was already started in nodelay mode to start it in delayed mode:
SQL> recover managed standby database cancel;
SQL> recover managed standby database parallel 16 DISCONNECT;

Now I can see in the standby alertlog archives are getting received immediately on the standby and acknowledged with 60 minutes delay for apply:

Archived Log entry 378319 added for thread 1 sequence 681301 ID 0x6a29c0d6 dest 1:
ARCs: Archive log thread 1 sequence 681301 available in 60 minute(s)

Up to here, so far so good, it works as expected, but the horror didn't start yet :-) 
After 1 hour. there is no archives applied on the standby DB.
After 2 hours. Same situation, non of archives applied since I modified the archive process with delay option ! some thing went wrong.

Looks easy! not a rocket science, stop/start the recovery process with NODELAY option and the recovery will start overriding the delay:

Media recovery complete.

ORA-01153: an incompatible media recovery is active

WHAT?! It was just successfully cancelled !

Hopelessly I tried the following troubleshooting techniques, but non of them work:
- Repeating same above commands didn't help. 
- No MRP process was found on OS side to kill.
- Altered the delay parameter back to 0 and restart (defer/enable) the archive process on the primary DB didn't help.
- Killing ora_arc* processes on the standby didn't help.
- Deleting the archives that were marked with 60 minutes delay on standby, and recopy those archives again from primary site did help as well.

Now, it's time to think logically. This error means that although you've cancelled the media recovery there is still another media recovery processes is running didn't get cancelled when you executed "recover managed standby database cancel" command. Is this can happen? Yes it can happen !

So, we need to look into all recovery processes and their statuses:

SQL> select PROCESS,PID,STATUS from v$managed_standby;

-------------   ----------  ------------
RFS            1292        IDLE
RFS            1095        IDLE
MR(fg)          19509       WAIT_FOR_LOG

Yes !, it's the one showing WAIT_FOR_LOG status.

Killed it from Linux shell using its PID:

# kill -9 19509

Now, starting media recovery on standby DB with NODELAY option:

Media recovery complete.

Phew, finally the managed recovery is running!

Someone may ask, why you didn't just restart the standby DB instance and it will force clear any stale recovery process?
The answer, imagine you are using your standby DB in Active Data Guard mode to support read only/reporting applications, restarting the standby database will impact the availability of those applications.

No comments: