Tuesday, November 15, 2005


On-line Disaster-Recovery Procedure for a Crashed IBM Informix Enterprise Replication Server 9.2(3,4)



Assume that g_cp, g_fw, g_tp and g_sc are Enterprise Replication (ER) servers participating in Update-Anywhere Enterprise Replication. All of them are based on Informix Dynamic Server 9.2UC2. Hardware platform: RS/6000, AIX 4.3 (4 CPUs, 2 GB of RAM).
server_xx is the host name for the corresponding ER server g_xx and matches the entry
in the SQLHOSTS file on each of the ER servers. At a certain point in time ER server g_cp crashes (we presume that Informix Dynamic Server on server_cp is still alive; otherwise the first delete is not necessary). Then issue on server_cp:


informix@server_cp$ cdr delete server g_cp
Then issue on server_tp (or on any other surviving ER server, g_fw or g_sc):
informix@server_tp$ cdr delete server g_cp -connect server_tp




The first command removes the ER server from the local global catalog
and removes the ER connections to the other hosts.
The second command removes the ER server from all other syscdr
databases, i.e. from all other ER servers in the system.
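
Before redefining g_cp it can be worth confirming that the surviving servers no longer list it. The helper below is only a sketch: group_listed is a hypothetical name, and it assumes that the listing fed to it (for example the output of cdr list server) carries the server group name in the first column.

```sh
#!/bin/sh
# group_listed GROUP
# Reads a server listing on stdin and returns 0 if GROUP appears as the
# first whitespace-separated field of any line, 1 otherwise.
# Assumption: the group name is the first column of the listing.
group_listed() {
    awk -v g="$1" '$1 == g { found = 1 } END { exit !found }'
}

# Possible use on server_tp (not executed here):
#   if cdr list server | group_listed g_cp; then
#       echo "g_cp is still defined, delete has not propagated yet"
#   fi
```

The function only inspects text, so it can be tried against a saved listing before it is pointed at a live server.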

When server_cp is ready to go, define it again as an ER server:-


informix@server_cp$ cdr def server -connect server_cp -I -S g_tp g_cp -A $INFORMIXDIR/ats -R $INFORMIXDIR/ris

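-I option initializes the server for Enterprise Replication;
-S option names the server (here g_tp) with which the global catalog is synchronized;
-A option defines the Aborted Transaction Spooling (ATS) directory;
-R option defines the Row Information Spooling (RIS) directory.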
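
The two spooling directories have to exist and be writable by user informix before the server is defined. A minimal sketch of that check follows; ensure_spool_dirs is a hypothetical helper name, and the ats and ris subdirectories are the ones used in the command above.

```sh
#!/bin/sh
# ensure_spool_dirs [BASE]
# Creates BASE/ats and BASE/ris if they are missing and verifies that the
# current user can write to them. BASE defaults to $INFORMIXDIR.
ensure_spool_dirs() {
    base=${1:-$INFORMIXDIR}
    for d in "$base/ats" "$base/ris"; do
        mkdir -p "$d" || return 1
        if [ ! -w "$d" ]; then
            echo "$d is not writable" >&2
            return 1
        fi
    done
}

# Possible use, as user informix, before running cdr def server:
#   ensure_spool_dirs "$INFORMIXDIR" || exit 1
```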




Declare the replicates on g_cp by running the script change_repl.ksh:-


#!/usr/bin/ksh
# Add g_cp as a participant (-a) to the replicate of every table in table_list.
for TABLE in `cat table_list`
do
    cdr change replicate -a repl_${TABLE} sitesdata@server_cp:informix.${TABLE} "select * from ${TABLE}"
    if [ $? -eq 0 ]; then
        echo "repl_${TABLE} updated OK"
    else
        echo "repl_${TABLE} update failed"
        exit 1
    fi
done




Start the replicates on g_cp by running the script start_repl.ksh:-


#!/usr/bin/ksh
# Start the replicate of every table in table_list on g_cp.
for TABLE in `cat table_list`
do
    cdr start replicate repl_${TABLE} g_cp
    if [ $? -eq 0 ]; then
        echo "repl_${TABLE} started OK"
    else
        echo "repl_${TABLE} start failed"
        exit 1
    fi
done
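
The scripts in this procedure all share the same loop over table_list. As a sketch only, that loop could be factored into one helper; for_each_table is a hypothetical name, and skipping blank lines and lines starting with # is an addition that the original table_list format does not require.

```sh
#!/bin/sh
# for_each_table LIST CMD [ARGS...]
# Runs "CMD ARGS... TABLE" for every table name in LIST and stops at the
# first failure. Blank lines and lines starting with '#' are skipped.
for_each_table() {
    list=$1
    shift
    while read -r table || [ -n "$table" ]; do
        case $table in
            ''|'#'*) continue ;;
        esac
        # stdin is redirected so the command cannot swallow the list
        if "$@" "$table" </dev/null; then
            echo "${table} OK"
        else
            echo "${table} failed" >&2
            return 1
        fi
    done < "$list"
}

# Possible use for the start step (not executed here):
#   start_one() { cdr start replicate "repl_$1" g_cp; }
#   for_each_table table_list start_one || exit 1
```

Reading the list with while read avoids word splitting surprises from `cat table_list`, and redirecting stdin keeps a command that reads input from consuming the remaining table names.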




Suspend all ER servers by running the command:-


informix@server_cp$ cdr suspend server g_cp g_tp g_fw g_sc




From this moment on, all transactions are queued but not replicated.
Run the script onpunload.ksh on server_tp (g_tp):-


informix@server_tp$ nohup onpunload.ksh>unload.log 2>&1 &

#!/usr/bin/ksh
#####################################################################
# onpunload.ksh
# Invokes the onpload utility to unload data on any running ER server.
# For each TABLE value in the table_list file, an unload job named
# unload_${TABLE} has already been created in the HPL environment and
# stored in the onpload database. The Autogenerate Load Components
# panel configured the output file as /dataserver/unload/${TABLE}.dat,
# the target database as "sitesdata" and the table as ${TABLE}.
#####################################################################
for TABLE in `cat table_list`
do
    onpload -p sites -j unload_${TABLE} -fu
    if [ $? -eq 0 ]; then
        echo "${TABLE} unloaded OK"
    else
        echo "${TABLE} unload failed"
        exit 1
    fi
done
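
Once the unload has finished, a quick sanity check that every table produced a data file can save a failed load later. The following is a sketch under the assumption that each unload job writes ${TABLE}.dat into a single directory, as described in the script header; check_unloads is a hypothetical helper name.

```sh
#!/bin/sh
# check_unloads LIST DIR
# Reports every table in LIST whose DIR/<table>.dat file is missing or
# empty. Returns 0 only if all files exist and are non-empty.
check_unloads() {
    list=$1
    dir=$2
    missing=0
    while read -r table || [ -n "$table" ]; do
        [ -n "$table" ] || continue
        if [ ! -s "$dir/$table.dat" ]; then
            echo "missing or empty: $dir/$table.dat" >&2
            missing=1
        fi
    done < "$list"
    return "$missing"
}

# Possible use on server_tp (not executed here):
#   check_unloads table_list /dataserver/unload || exit 1
```

A table that is legitimately empty would also be reported, so the output is a hint to look at, not a hard error.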




Compress all unloaded ASCII files, create a tarball and transfer it to
/dataserver/load on server_cp. Untar the tarball and uncompress the *.Z files
in the /dataserver/load directory on server_cp,
then run the script to load the data.
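
The packaging steps described above could be sketched as follows. The function names pack_unloads and unpack_unloads, the tarball path and the COMPRESSOR and UNCOMPRESSOR variables are illustrative assumptions. The transfer tool is not named in the procedure, so that step is left as a comment.

```sh
#!/bin/sh
# Tools used for compression; compress/uncompress produce the *.Z files
# mentioned in the text. Both can be overridden from the environment.
COMPRESSOR=${COMPRESSOR:-compress}
UNCOMPRESSOR=${UNCOMPRESSOR:-uncompress}

# pack_unloads SRC_DIR TARBALL
# Compresses every *.dat file in SRC_DIR and archives the directory.
# TARBALL must be an absolute path outside SRC_DIR.
pack_unloads() {
    ( cd "$1" && $COMPRESSOR -f ./*.dat && tar -cf "$2" . )
}

# unpack_unloads TARBALL DEST_DIR
# Extracts TARBALL (absolute path) into DEST_DIR and uncompresses the files.
unpack_unloads() {
    ( cd "$2" && tar -xf "$1" && $UNCOMPRESSOR -f ./*.dat.* )
}

# Possible use (not executed here):
#   on server_tp:  pack_unloads /dataserver/unload /tmp/unload.tar
#   transfer /tmp/unload.tar to server_cp with whatever tool the site uses
#   on server_cp:  unpack_unloads /tmp/unload.tar /dataserver/load
```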


informix@server_cp$ nohup onpload.ksh>load.log 2>&1 &

#!/usr/bin/ksh
###############################################################################
# onpload.ksh
# Invokes the onpload utility to load data on the ER server that is
# to be synchronized.
# For each TABLE value in the table_list file, a load job named
# load_${TABLE} has already been created with the "Deluxe without
# replication" menu option in the High Performance Loader environment
# and stored in the onpload database. This option has been available
# since HPL v 9.2. The Autogenerate Load Components panel configured
# the input file as /dataserver/load/${TABLE}.dat, the target
# database as "sitesdata" and the table as ${TABLE}.
###############################################################################
for TABLE in `cat table_list`
do
    onpload -p sites -j load_${TABLE} -fcl
    if [ $? -eq 0 ]; then
        echo "${TABLE} loaded OK"
    else
        echo "${TABLE} load failed"
        exit 1
    fi
done




To start replicating the queued transactions, run the command:-

   
informix@server_tp$ cdr resume server g_cp g_tp g_fw g_sc




From this point on the system passes through an extremely dangerous phase.
The transaction rate can be very high on each of the ER servers involved
in Update-Anywhere Enterprise Replication. The sizes of the send and receive dbspaces,
as well as the total length of the logical logs, should be tuned very carefully to accommodate
the OLTP hit. LTXHWM must be more than 2*LTHWM, i.e. a long transaction should be
rolled back before LTXHWM is reached. DDR threads may start and finish the
catch-up phase several times on g_cp (especially), g_fw, g_tp and g_sc.
This behavior is normal during synchronization.
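
Before resuming replication it helps to know what the long-transaction and logical-log settings currently are on each server. The sketch below reads a parameter from an onconfig-style file, where each setting is a line of the form "NAME value". onconfig_value is a hypothetical helper name, and the default path $INFORMIXDIR/etc/$ONCONFIG is the usual location of that file.

```sh
#!/bin/sh
# onconfig_value PARAM [FILE]
# Prints the value of the first line in FILE whose first field is PARAM.
# FILE defaults to $INFORMIXDIR/etc/$ONCONFIG. Comment lines are ignored
# because their first field is '#' or starts with '#'.
onconfig_value() {
    file=${2:-$INFORMIXDIR/etc/$ONCONFIG}
    awk -v p="$1" '$1 == p { print $2; exit }' "$file"
}

# Possible use on each ER server (not executed here):
#   echo "LTXHWM   = $(onconfig_value LTXHWM)"
#   echo "LOGFILES = $(onconfig_value LOGFILES)"
#   echo "LOGSIZE  = $(onconfig_value LOGSIZE)"
```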