VTL - How to Failback Failed Server

Asset ID:	1-71-1013440.1
Update Date:	2009-12-02
Keywords:

Solution Type Technical Instruction Sure

Solution 1013440.1 : VTL - How to Failback Failed Server

Related Items


Sun StorageTek VTL Storage Appliance
 Sun StorageTek VTL Plus Storage Appliance

Related Categories


GCS>Sun Microsystems>Storage - Tape>Tape Virtualization

Previously Published As
218808

Description
What is the failback procedure?
How do you manually fail back to a failed server?

Steps to Follow
How to Failback Failed Server:

First, check for the common causes of failover and correct any issues found, contacting Sun Support for assistance if needed:
1. Check for network issues (one of the most common causes of failover).
2. Check for storage connectivity issues (use the VTL Console to verify that access to the disk arrays is good).
3. Check the health of the disk arrays (use SANtricity Recovery Guru to check for errors).
  - In a proper failed state, the GUI shows the failed server in RED and the takeover server in BLACK.
  - If either server shows YELLOW, failover is likely suspended. In that case, place a call to Sun Support.

Check the status of the failed VTL server.

Note: Use the “heartbeat” monitor IP to log in to the failed server, because the “virtual” IP has moved to the other server, which is now servicing the failed server's resources. If you do not know the heartbeat IP, look in the GUI under the Failover Info tab, which lists the IP information for both servers.

Verify that all processes are running, issue:
# vtl status

Check “FailOverStatus” status, issue:
# sms -v

Look for "FailOverStatus" in the output. If the status is "2(READY)", the failed server is ready to be failed back to. Use the GUI to stop the takeover and return to normal operation.
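
The FailOverStatus check could also be scripted. The following is a minimal sketch, not part of the VTL tooling: the function names failover_code and ready_for_failback are hypothetical, and it assumes the sms -v output contains a line of the form "FailOverStatus: 2(READY)", as in the example output at the end of this document.

```shell
# Hypothetical helpers (not shipped with VTL); they only parse text read from stdin.

# Print the numeric FailOverStatus code found in `sms -v` style output.
failover_code() {
    sed -n 's/^FailOverStatus:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Succeed only when the code is 2, which this document describes as READY.
ready_for_failback() {
    [ "$(failover_code)" = "2" ]
}

# Intended use on the failed server (commented out so the sketch has no side effects):
#   sms -v | ready_for_failback && echo "ready for failback"
```

Matching on the numeric code rather than the full string avoids depending on the exact spacing or capitalization of the label that follows it.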

Check IPs on both servers, issue:
# ifconfig -a

Verify that the “virtual” IP (bge1:1) has successfully moved to the surviving server.
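
As a rough illustration, the presence of the bge1:1 alias can be checked by filtering the ifconfig output. This sketch assumes that each interface name starts at the beginning of a line and is followed by a colon or whitespace; has_interface is a hypothetical helper name, not a VTL command.

```shell
# Hypothetical helper: succeed if interface $1 appears at the start of a line
# in `ifconfig -a` style output read from stdin.
has_interface() {
    grep -q "^$1[:[:space:]]"
}

# Intended use on the surviving server (commented out; requires a real host):
#   ifconfig -a | has_interface bge1:1 && echo "bge1:1 is configured here"
```

Running the same check on the failed server should fail, since the alias is expected to exist on only one server at a time.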

Has the failed server been rebooted cleanly?

Depending on the cause of the failover, the failed server may already have been rebooted, but verify that the reboot was clean (review the messages log). If unsure, reboot again to verify a clean reboot (vtl stop, then sync; sync; reboot or init 6).

NOTE: If client interruption is not acceptable, wait for a maintenance window before failing back. Clients sometimes do not fail back successfully and must be rebooted to reconnect to the virtual devices.

If the above checks pass, proceed with failback through the GUI.

From the Console GUI, right-click the active server name and select Failover>Stop Takeover…
- A popup may appear with the message:
WARNING: The primary server is not in a healthy state for failback. If you still want to fail back to the primary server, please type the word YES to proceed. Otherwise, click cancel to exit.

Type YES in box and click OK.

Note: If the GUI reports that it is discovering servers, close the Console and reconnect. This sometimes happens when the virtual IP is switched back to the primary server.

Failback can take a while, depending on the number of resources to fail back; it can take up to 20 minutes to complete.
- If you can log back in to the failed server via the Console GUI, the server name is BLACK (no longer RED), and the failover status is “Normal” (select the failover folder in the right panel to view the status), then failback is complete.
- Also, a message in the Console GUI Event Log will say “Primary Server Restored”.
- Check the VTL processes (vtl status), failover status (sms -v), and IPs (ifconfig -a) again.

If failback through the Console GUI does not work, failback can be done from the command line by issuing the following commands:

From the active server (using its heartbeat IP), stop the failover module:
# vtl stop fm

From the failed server, verify that the failed server has taken back control (you may have to issue the command many times until it returns "1(UP)"):
# sms -v

Once the failed server is verified, start the failover module from the active server:
# vtl start fm

Check failover status from the GUI. It should be “Normal”.
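
The repeated sms -v check in the command-line procedure above can be wrapped in a polling loop. This is only a sketch under stated assumptions: wait_for_up is a hypothetical function name, the attempt count and delay are arbitrary illustrative values, and the match assumes a status line of the form "FailOverStatus: 1(UP)" (optionally with a space before the parenthesis).

```shell
# Hypothetical helper: run a status command repeatedly until it reports 1(UP).
#   $1 = maximum number of attempts
#   $2 = seconds to sleep between attempts
#   remaining arguments = command that prints `sms -v` style output
wait_for_up() {
    attempts=$1
    delay=$2
    shift 2
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # Succeed as soon as the FailOverStatus line shows code 1 with label UP.
        if "$@" 2>/dev/null | grep -q '^FailOverStatus:[[:space:]]*1[[:space:]]*(UP)'; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Intended use on the failed server (commented out; values are illustrative):
#   wait_for_up 60 20 sms -v && echo "failed server is UP again"
```

Passing the status command as arguments keeps the loop independent of how sms is invoked, and bounding the number of attempts means a failback that never completes ends with a non-zero exit status instead of looping indefinitely.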

If failback does not complete, or if RCA is required:
- Collect Xrays (both nodes) and open a case with Sun Support.

================================================================

Example of sms -v output:

After a failover in which the secondary server took over for the primary server, log in to the primary (using the heartbeat IP) and check FailOverStatus. “2(READY)” indicates that the problems are resolved and the server is ready for failback.

[root@failedvtlnode]# sms -v

Last Update by SM: Sun Apr 20 16:31:50 2008
Last Access by RPC: Sun Apr 20 16:31:50 2008

FailOverStatus: 2(READY)

Status of IPStor Server (Transport) : OK
Status of IPStor Server (Application) : OK
Status of IPStor Authentication Module : OK
Status of IPStor Logger Module : OK
Status of IPStor Communication Module : OK
Status of IPStor Self-Monitor Module : OK
Status of IPStor NAS Modules: OK(0)
Status of IPStor Fsnupd Module: OK
Status of IPStor ISCSI Module: OK
Status of IPStor BMR Module: OK( 0)
Status of FC Link Down : OK
Status of Network Connection: OK
Status of force up: 0
Broadcast Arp : NO
Number of reported failed devices : 0
NAS health check : NO
XML Files Modified : NO
IPStor Failover Debug Level : 0
IPStor Self-Monitor Debug Level : 0

Do We Need To Reboot Machine(SM): NO

Do We Need To Reboot Machine(FM): NO

Nas Started: NO

During normal operating status:

[root@activevtlnode]# sms -v

Last Update by SM: Sun Apr 20 16:31:50 2008
Last Access by RPC: Sun Apr 20 16:31:50 2008

FailOverStatus: 1(UP)
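
When reviewing output like the examples above, it can help to list only the "Status of ..." lines that do not report OK. The sketch below is illustrative: failed_modules is a hypothetical helper name, and because this document only shows healthy output, the assumption that an unhealthy module reports a non-numeric value other than OK is a guess rather than documented behavior.

```shell
# Hypothetical helper: print "Status of ..." lines whose value is neither OK
# nor a plain number (numeric fields such as "force up: 0" are skipped).
failed_modules() {
    awk '/^Status of / {
        value = $0
        sub(/^.*:[ \t]*/, "", value)   # keep only the text after the last colon
        if (value !~ /^OK/ && value !~ /^[0-9]+$/) print
    }'
}

# Intended use (commented out; requires a real VTL node):
#   sms -v | failed_modules
```

Empty output means every status line reported OK or a plain number.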

Product
Sun StorageTek Virtual Tape Library Storage Appliance
Sun StorageTek Virtual Tape Library Plus Storage Appliance 1.0
Sun StorageTek Virtual Tape Library Plus Storage Appliance 2.0

VTL, Failover, Failback
Previously Published As
STKKB68135

Change History
Updated for currency...
