Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-72-1002033.1
Update Date:2010-09-16
Keywords:

Solution Type  Problem Resolution Sure

Solution  1002033.1 :   Sun Fire[TM] v1280, E2900, 3800, 4800, 4810, 6800, E4900, E6900, and Netra 1280, 1290 Server: How to Recover from a Hung System Controller  


Related Items
  • Sun Fire E6900 Server
  • Sun Fire 3800 Server
  • Sun Fire 6800 Server
  • Sun Netra 1280 Server
  • Sun Fire E4900 Server
  • Sun Fire 4800 Server
  • Sun Fire E2900 Server
  • Sun Fire 4810 Server
Related Categories
  • GCS>Sun Microsystems>Servers>Midrange V and Netra Servers
  • GCS>Sun Microsystems>Servers>Midrange Servers

PreviouslyPublishedAs
202844


Symptoms
When a system controller (SC) is hung, try a few steps before pressing the Reset button on the SC.


Resolution
Try the following steps:
1) Connect to the "hung" SC, either by Telnet or directly through its serial port (for example, with tip), enter the platform shell, and run the "reboot" command.

2) If the "reboot" command does not work, or you cannot enter anything, log in to the spare SC and try to force a failover by using the "setfailover force" command.

  • This step is not available on Sun Fire[TM] v1280, E2900, and Netra 1280, 1290 servers.
  • This step will probably not work if the primary SC is completely hung. 
If this step does work, it will reboot the hung SC and make the spare SC the primary SC.

3) If failover does not complete, the LAST RESORT is to use the Reset button on the SC.

BEFORE YOU PRESS THIS BUTTON, you must bring down the domains. This is critical because pressing the Reset button while the domains are up and running can crash them. See <Document: 1004364.1> for details.
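
The escalation order described in steps 1 to 3 can be summarized in a short sketch. This is only an illustration of the decision flow in this article, not a Sun tool: the function name recovery_steps, the model strings, and the domains_down flag are all hypothetical.

```python
# Models on which the "setfailover force" step is not available (per this article).
# The exact strings used to name the models here are an assumption.
NO_FAILOVER_MODELS = {"v1280", "E2900", "Netra 1280", "Netra 1290"}


def recovery_steps(model, domains_down):
    """Return the ordered recovery steps for a hung SC on the given model."""
    steps = ['Run "reboot" from the platform shell of the hung SC']
    if model not in NO_FAILOVER_MODELS:
        steps.append('Run "setfailover force" from the spare SC')
    if not domains_down:
        # The domains must be down before the Reset button is pressed.
        steps.append("Bring down the domains")
    steps.append("Press the Reset button on the SC (last resort)")
    return steps


if __name__ == "__main__":
    for number, step in enumerate(recovery_steps("6800", domains_down=False), 1):
        print(number, step)
```

Each step is attempted only if the previous one fails; the list gives the order, not a script to run unattended.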

NOTE: Make sure that the connection settings on the SC are correct.

Open a tip session to the serial port of the SC:

6800a-sc0:SC>  showplatform -p network

The system controller is configured to be on a network.

Network settings: static
Hostname: 6800a-sc0
IP Address: 129.156.xx.xx
Netmask: 255.255.255.0
Gateway: 129.156.xx.1
DNS Domain: UK.Sun.COM
Primary DNS Server: 129.156.xx.xx
Secondary DNS Server: 129.156.xx.xx
***Connection type: none   <----- No remote access enabled
Idle connection timeout : No timeout
Sun Fire Link Enabled: no
*** This shows remote access via telnet or ssh is not enabled.
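
If the output above has been captured to a text file or log, the "Connection type" value can be picked out with a few lines of Python. This is a minimal sketch that assumes the output format shown in this article; the function name remote_access_type is hypothetical, and the "***" and "<-----" markers it strips are annotations added in this article, not part of the real command output.

```python
def remote_access_type(showplatform_output):
    """Return the 'Connection type' value from captured output, or None if absent."""
    for line in showplatform_output.splitlines():
        # Drop the article's leading "***" highlight and any trailing "<-----" note.
        text = line.lstrip("* ").split("<-")[0]
        key, sep, value = text.partition(":")
        if sep and key.strip().lower() == "connection type":
            return value.strip()
    return None


if __name__ == "__main__":
    sample = "Hostname: 6800a-sc0\nConnection type: none\nSun Fire Link Enabled: no"
    if remote_access_type(sample) == "none":
        print("Remote access via telnet or ssh is not enabled")
```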

Run the command below to change the connection type:

6800a-sc0:SC> setupplatform -p network

Network Configuration

Is the system controller on a network? [yes]:
Use DHCP or static network settings? [static]:
Hostname [6800a-sc0]:
IP Address [129.156.xx.xx]:
Netmask [255.255.255.0]:
Gateway [129.156.xx.1]:
DNS Domain [UK.Sun.COM]:
Primary DNS Server [129.156.xx.xx]:
Secondary DNS Server [129.156.xx.xx]:
**To enable remote access to the system controller, select "ssh" or "telnet".
**Connection type (ssh, telnet, none) [telnet]:
Idle connection timeout (in minutes; 0 means no timeout) [0]:
Enable Sun Fire Link? [no]: 

To enable remote access to the system controller, select either:

* ssh
* telnet

The SC must be rebooted for changes to the above network settings to take effect.



Product
Sun Fire 6800 Server
Sun Fire 4810 Server
Sun Fire 4800 Server
Sun Fire 3800 Server
Sun Fire v1280 Server
Sun Fire E2900 Server
Sun Fire E4900 Server
Sun Fire E6900 Server
Netra 1280 Server
Netra 1290 Server

Internal Comments
If the force option does not initiate a failover, and the customer or field personnel are remote from the system and therefore unable to press the Reset button on the hung system controller, there is a risky, undocumented alternative for "waking up" the hung SC.

Execute the setfailover override command from the spare SC.

Note: This option is not listed in the output of the setfailover -h command.

Note: As of firmware 5.19.0, this option is available only in service or engineering mode (see Bug ID 4703904).


The override option ignores whatever the status of the system controller is supposed to be and tells the spare to become primary. It pays no attention to the fact that the other SC could still be primary.


Warning: Use this procedure with caution and only as a last resort, because it could crash running domains.
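
The firmware note above amounts to a simple version comparison. The sketch below assumes that firmware versions are dotted numeric strings such as "5.19.0" and that the restriction applies to 5.19.0 and every later release; the function names are hypothetical.

```python
def parse_version(version):
    """Convert a dotted firmware version such as '5.19.0' to a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def override_requires_restricted_mode(firmware_version):
    """True if 'setfailover override' needs service or engineering mode."""
    # Assumption: the 5.19.0 restriction also holds for all later versions.
    return parse_version(firmware_version) >= (5, 19, 0)


if __name__ == "__main__":
    for version in ("5.18.2", "5.19.0"):
        print(version, override_requires_restricted_mode(version))
```

Comparing integer tuples rather than raw strings avoids ordering mistakes, such as "5.9.0" sorting after "5.19.0".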


Example (firmware prior to 5.19.0):


 kremlin-sc1:sc> setfailover override
 SC: SSC1

 Spare System Controller

 SC Failover: disabled

 This will abruptly interrupt operations on the other System Controller.

 This System Controller will become the main System Controller.
 Do you want to continue? [no] yes

 SC Failover did not complete.

 The system controllers may not be synchronized.

 Failover can be done forcefully but may crash domain(s).

 Do you want to force failover to continue? [no] yes

 kremlin-sc1:sc>
 

Example (firmware 5.19.0):


 fort-sc0:sc> setfailover override

 override: is not a valid argument

 Usage: setfailover [-y|-n] off|on|force

        setfailover -h

 fort-sc0:sc>

 fort-sc0:sc> engineering


 fort-sc0:sc[engineering]> setfailover override

 Spare System Controller

 SC Failover: disabled

 Clock failover disabled.

 This will abruptly interrupt operations on this System Controller.

 This System Controller will become the spare System Controller.

 Do you want to continue? [no]

 fort-sc0:sc[engineering]>
 

Keywords: SunFire, 3800, 4800, 4810, 6800, reset, system controller, failover
Previously Published As
75973

Change History
Date: 2009-11-23
User Name: Josh Freeman
Action: Refreshed
Comment: Refreshed the article per ESG Content Team effort.
Date: 2006-08-29
User Name: 97961
Action: Approved
Comment: - Converted to STM formatting for better readability
- Made simple sentence/grammatical corrections
Version: 3
Date: 2006-08-29
User Name: 97961
Action: Accept
Comment:
Version: 0

Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.