Monday, 27 July 2009

Oracle Replication Broken jobs and how to fix them...


This article is related to Replication broken jobs. It will demonstrate how to build the broken jobs automatically and logging of the information. The auto build of the broken job is a great thing, which should be implemented for every replication setup. If there are many replication setups, then it can give the real benefit. And the DBA / Replication Administrator can see the logs to have an idea of the broken jobs. This article concentrates on the Multi Master Advanced Replication setup.


Replication makes the data available on more than one location. It offers redundancy at the database level. Replication works over long distances and WANS. It provides a copy of the database at other location. Some of the reasons to go for Replication are Availability, performance, disconnected computing, network load reduction,

The Replication is a complex environment, and it is difficult to maintain, because of its complexity. Normally the data which transfers from one database to another database can be out (no sync) with the original copy, the problem can be because of the broken jobs. In Multi Master Advanced Replication the data is pushed with the help of job queue. If the connectivity between the Replicated Databases becomes broken, then the job will try to push the data 16 times. And after 16 times the status of that particular job will be marked as “Y”, which means that job has broken. Now the transactions will start getting queued till the job is build again. The longer the time between making the broken job, the longer the queue, with transactions. So if the transactions start getting stuck, the data will not reach to the destination, and both the copies will start to become out of sync. Now imagine if there are many Replication setups, the network connectivity problem may arise any time, and no DBA / Replication Administrator can monitor the Replication setups all the times.


The steps mentioned here should be run on the Test Replication Setup first.

The broken jobs can be made either manual which is truly painful. (Only those can imagine who are handling the Replication Setups). The other option is to make the broken jobs automatically. In this section we will demonstrate how we can build the broken jobs automatically.


1- Login to the Database

Log in to the Test Database with the REPADMIN User.

1- Build the Log Table.

The log table can be used to log every make of the broken job. So after some time, the DBA or Replication Administrator can see the log tables, and can get an idea how frequently the jobs are breaking.

Name Null? Type

---------------------------- -------- ------------





2- Build the procedure to make the Broken Jobs

This procedure will check the broken job, and if there is any broken job, it will make it and will log the information as well.




/* Procedure to make the broken jobs in replication

This procedure will check the broekn jobs and if any

job is broken, then it will make them again


cursor test_broken_jobs is

select job,broken,failures

from dba_jobs

where upper(what) like '%DBMS_DEFER_SYS%' and broken='Y';

err_num number;

err_msg varchar2(150);


for broken_job in test_broken_jobs



/* dbms_job.broken(broken_job.job,FALSE,SYSDATE+10/86400); */

insert into broken_jobs values (broken_job.job,broken_job.broken,

broken_job.failures,to_char(sysdate,'dd mon yyyy hh24:mi,ss'));


/*; */



err_num := SQLCODE;

err_msg := SUBSTR(SQLERRM, 1, 100);

INSERT INTO job_errors VALUES (err_num,

err_msg,to_char(sysdate,'dd mon yyyy hh24:mi,ss'));



end loop;


3- Submit the job in the Database.

The UNBREAK_JOBS_TEST can be submitting with the help of DBMS_JOB to make it auto run.

SQL> variable jobno number;

SQL> begin





The job will run after each 90 seconds i.e. 3 minutes, this can be set to the particular requirement of the Replication Setup.

4- Test the UNBREAK_JOBS_TEST procedure.

Now the replication job can be broken by use of DBMS_JOB.BROKEN(JOB_NUMBER,TRUE);

Its broken status can be checked by the SQL statement:

SQL> select job,broken,failures

from dba_jobs

where upper(what) like '%DBMS_DEFER_SYS%'


And after wait of 90 seconds, again run the query the above job should be made by that UNBREAK_JOBS_TEST procedure.


By testing this thoroughly, this can be implemented on the Production Replication setups. It can make the life of DBA and Replication Administrators easy.

No comments: