Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-71-1321263.1
Update Date:2011-05-19

Solution Type  Technical Instruction Sure

Solution  1321263.1 :   Sun Enterprise[TM] 10000: How To Recover from a Domain Hang Condition  


Related Items
  • Sun Enterprise 10000 Server
Related Categories
  • GCS>Sun Microsystems>Servers>High-End Servers




In this Document
  Goal
  Solution


Applies to:

Sun Enterprise 10000 Server - Version: Not Applicable to Not Applicable - Release: N/A to N/A
Information in this document applies to any platform.

Goal

This document provides a step-by-step process for recovering an E10000 domain from a hang condition.

Solution

If a domain hangs, work through the following steps, in sequence, to force a core dump and/or recover the domain. Before starting, use the domain_switch command to set the SUNW_HOSTNAME environment variable to the name of the problem domain.
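Because every later command acts on whatever domain SUNW_HOSTNAME names, it can help to confirm the variable before issuing anything disruptive. The following is a minimal sketch, not part of the SSP software: the function name check_target_domain is hypothetical, and it is written for a POSIX-style shell, which is an assumption because the SSP login shell may be csh.

```sh
# Hypothetical helper (not an SSP command): succeed only when
# SUNW_HOSTNAME is set and names the domain we intend to work on.
check_target_domain() {
    expected="$1"
    if [ -z "${SUNW_HOSTNAME:-}" ]; then
        echo "SUNW_HOSTNAME is not set; run domain_switch first" >&2
        return 1
    fi
    if [ "$SUNW_HOSTNAME" != "$expected" ]; then
        echo "SUNW_HOSTNAME is '$SUNW_HOSTNAME', expected '$expected'" >&2
        return 1
    fi
    return 0
}
```

A call such as check_target_domain <domain> before Step 1 would stop the procedure early if the environment still points at a different domain.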

Step  Action(s) and Notes

1     ssp% hostinfo -h
      ssp% ping <domain>
      ssp% ps -ef | grep bringup
      ssp% ps -ef | grep hpost

      Notes: This establishes the domain state.

2     ssp% hostint

      Notes: Wait at least 5 minutes for the panic to complete. Do not assume console activity is working. Spot check the machine state by repeating Step 1.

3     ssp% hostint -p <alternate cpu>

      Notes: Preferably, the <alternate cpu> is on a different system board than the bootproc. At a minimum, the alternate should use a different BBSRAM (bootproc+2).

      Wait at least 5 minutes for the panic to complete. Do not assume console activity is working. Spot check the machine state by repeating Step 1.

4     ssp% sigbcmd panic

      Notes: Wait at least 10 minutes for the panic to complete. Do not assume console activity is working. Spot check the machine state by repeating Step 1.

5     ssp% sigbcmd -p <alternate cpu> panic

      Notes: Preferably, the <alternate cpu> is on a different system board than the bootproc. At a minimum, the alternate should use a different BBSRAM (bootproc+2).

      Wait at least 10 minutes for the panic to complete. Do not assume console activity is working. Spot check the machine state by repeating Step 1.

6     ssp% sigbcmd -I panic

      Notes: Wait at least 10 minutes for the panic to complete. Do not assume console activity is working. Spot check the machine state by repeating Step 1.

7     ssp% sigbcmd -I -p <alternate cpu> panic

      Notes: Preferably, the <alternate cpu> is on a different system board than the bootproc. At a minimum, the alternate should use a different BBSRAM (bootproc+2).

      Wait at least 10 minutes for the panic to complete. Do not assume console activity is working. Spot check the machine state by repeating Step 1.

8     ssp% sigbcmd obp

      Notes: If the OBP ok> prompt is reached, explicitly execute:

          ok> ctrace
          ok> .registers
          ok> .locals
          ok> sync

      Capture all of the screen output and provide it along with the panic dump generated by the OBP sync.

9     ssp% sigbcmd -p <alternate cpu> obp

      Notes: Preferably, the <alternate cpu> is on a different system board than the bootproc. At a minimum, the alternate should use a different BBSRAM (bootproc+2).

      If the OBP ok> prompt is reached, explicitly execute:

          ok> ctrace
          ok> .registers
          ok> .locals
          ok> sync

      Capture all of the screen output and provide it along with the panic dump generated by the OBP sync.

10    ssp% bringup -f -l64

      Notes: Force the bringup only if all other attempts fail. At a minimum, the level needs to be 24 to test CPU operation.

For any of these steps to produce a usable dump, savecore must be enabled, and the primary swap partition/dump device must be large enough for the core file to be saved.
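The guidance on choosing an <alternate cpu> in the steps above can be sketched as simple arithmetic. This is an illustration only: the function names are hypothetical, the "bootproc+2" rule is taken directly from the Notes, and the board calculation assumes CPUs are numbered with four per system board (board = cpu / 4). The sketch does not check that the candidate CPU exists or belongs to the domain, and the bootproc value must be supplied by the operator.

```sh
# Assumption: four CPUs per system board, so board number = cpu / 4.
cpu_board() {
    echo $(( $1 / 4 ))
}

# Minimum acceptable alternate per the Notes: bootproc + 2.
min_alternate_cpu() {
    echo $(( $1 + 2 ))
}

# Succeeds when the two CPUs sit on different system boards
# (the preferred case), under the four-per-board assumption above.
on_different_board() {
    [ "$(cpu_board "$1")" -ne "$(cpu_board "$2")" ]
}
```

For example, with a bootproc of 4, min_alternate_cpu 4 prints 6, which is the minimum choice but still on the same board under this numbering; on_different_board 4 8 succeeds, so CPU 8 would be the preferable alternate if it is available in the domain.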

Why wait so long between steps?

Five to ten minutes is a reasonable wait, even with a little buffer added. Too often, people execute one command, wait a few seconds, and conclude that it is not working when, in fact, it very well might be. At a minimum, re-executing the hostinfo and ps commands from Step 1 slows the process down and gives a given command time to start and show progress. (Domains that are pingable but hung are rare, but they do exist.) Bottom line: proceed with some caution and execute with a plan, but do not hurry.
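One way to enforce the wait is to script it so that the Step 1 checks are repeated at a fixed interval until the full time has passed. The following is a minimal sketch for a POSIX-style shell. The function names spot_check and observe are hypothetical; hostinfo, ping, and ps are the Step 1 commands; and the bracketed grep patterns are a common idiom to keep grep from matching its own process entry.

```sh
# Hypothetical wrapper around the Step 1 commands.
spot_check() {
    if [ -z "${SUNW_HOSTNAME:-}" ]; then
        echo "SUNW_HOSTNAME is not set; run domain_switch first" >&2
        return 1
    fi
    echo "== hostinfo =="
    hostinfo -h
    echo "== ping $SUNW_HOSTNAME =="
    ping "$SUNW_HOSTNAME"
    echo "== bringup/hpost processes =="
    ps -ef | grep '[b]ringup' || echo "no bringup process found"
    ps -ef | grep '[h]post' || echo "no hpost process found"
    return 0
}

# observe TOTAL_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Runs COMMAND immediately, then after every interval until
# TOTAL_SECONDS have elapsed, including one final run at the end.
observe() {
    total="$1"
    interval="$2"
    shift 2
    # A zero interval would never advance the clock, so refuse it.
    [ "$interval" -ge 1 ] || return 1
    elapsed=0
    while [ "$elapsed" -lt "$total" ]; do
        "$@"
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    "$@"
}
```

With these definitions, observe 300 60 spot_check would cover the 5-minute wait after a hostint, and observe 600 60 spot_check the 10-minute wait after a sigbcmd panic. The operator still has to read the output and judge whether the panic is progressing.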

Allowing a few extra minutes now may be what it takes to minimize domain interruptions in the future, because it helps ensure that whatever information is gathered can be used to identify the problem. This is known to be an unpopular position in the "heat of the moment", but extraordinary circumstances require extraordinary action. Remember, this is not a mainstream situation, and every action necessary must be taken to ensure that data is collected while it still can be.


Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.