Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-72-1310404.1
Update Date:2011-05-11
Keywords:

Solution Type  Problem Resolution Sure

Solution  1310404.1 :   VTL - Failover and failback issue due to timeout  


Related Items
  • Sun StorageTek VTL Plus Storage Appliance
Related Categories
  • GCS>Sun Microsystems>Storage - Tape>Tape Virtualization


Problem with VTL failback due to timeout.

In this Document
  Symptoms
  Changes
  Cause
  Solution


Created from <SR 3-2558289527>

Applies to:

Sun StorageTek VTL Plus Storage Appliance - Version: 2.0 - Build 1656 and later   [Release: 2.0 and later ]
Information in this document applies to any platform.

Symptoms

VTL node1 failed over to node2, but node1 did not become Ready for failback.

Node2 then panicked, causing a complete outage.

This may have been a mutual failover situation.

Changes

The issue occurred after the VTL Get Well Plan (GWP) was applied, but the GWP does not appear to have caused these events.

Cause


When both nodes boot simultaneously, a VTL node can start taking over its partner node before it has finished loading its own resources.

Solution


This issue is resolved by introducing a delay in the startup sequence, which allows the self-monitoring module to finish loading resources before the node takes over its partner.

To add the delay, edit the "ipstorfm.sh" script in /usr/local/vt/bin and add the two lines marked "added line" in the code segment below:

...
# check if ipstorfm is running already
# if it is, return with an error
APID=`$IS_BIN/pidof ipstorfm`
NUM_P=`echo $APID | awk 'BEGIN{} {print NF}'`
if [ $NUM_P -ne 0 ]
then
      RET=1
else
      logger -p daemon.notice Sleeping 500 seconds before starting FM.     # <<< added line >>>
      sleep 500                                                            # <<< added line >>>
      $IS_BIN/ipstorfm $2&
      sleep 1
      APID=`$IS_BIN/pidof ipstorfm`
      NUM_P=`echo $APID | awk 'BEGIN{} {print NF}'`
      if [ $NUM_P -eq 0 ]
      then
            RET=1
      else
            RET=0
      fi
fi
...
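
After editing, it may be useful to confirm that the delay sits before the line that starts ipstorfm. The following is a minimal sketch only: check_fm_delay is a hypothetical helper and is not part of the VTL software. It assumes the script contains the "sleep 500" and "$IS_BIN/ipstorfm $2&" lines exactly as shown in the code segment above.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with VTL: checks that a "sleep 500"
# line appears before the line that launches ipstorfm.
check_fm_delay() {
      script="$1"
      if [ ! -r "$script" ]
      then
            echo "cannot read $script" >&2
            return 2
      fi
      # d = first line number with the delay, s = first line that starts FM.
      # awk exits 0 only when both lines exist and the delay comes first.
      if awk '
            /^[ \t]*sleep 500/ && !d { d = NR }
            /IS_BIN\/ipstorfm \$2/ && !s { s = NR }
            END { exit !(d && s && d < s) }
      ' "$script"
      then
            echo "delay found before FM start"
            return 0
      else
            echo "delay missing or placed after FM start"
            return 1
      fi
}
```

Example use on each node: check_fm_delay /usr/local/vt/bin/ipstorfm.sh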


NOTE: In a failover configuration, it is also recommended to reboot the nodes one at a time rather than simultaneously, which also avoids this situation.

Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.