Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-72-1010335.1
Update Date: 2011-05-10
Keywords:

Solution Type  Problem Resolution Sure

Solution  1010335.1 :   Sun Fire[TM] 12K/15K/E20K/E25K: Identifying and recovering from a domain hang  


Related Items
  • Sun Fire E25K Server
  • Sun Fire E20K Server
  • Sun Fire 12K Server
  • Sun Fire 15K Server
Related Categories
  • GCS>Sun Microsystems>Servers>High-End Servers

Previously Published As
214177
Identifying and recovering from a hung domain

Applies to:

Sun Fire 12K Server
Sun Fire 15K Server
Sun Fire E20K Server
Sun Fire E25K Server
All Platforms

Symptoms

Domain is no longer responding

Changes

No specific change is required for a domain to hang.

Cause

Root cause to be determined once the described procedure for data collection has been completed successfully.

Solution

Several tools can help determine whether a domain is hung. If a domain does not respond to the following commands issued from the System Controller, that is a good indication that the domain is hung.


sc0:sms-svc:1> ping <domain_name>
sc0:sms-svc:2> telnet <domain_name>
sc0:sms-svc:3> console -d <domain_id>
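
The checks above are interactive. As a minimal sketch of how the network check might be scripted, the helper below wraps a probe command and reports the result. The names PROBE, domain_responds and report_domain are hypothetical and not part of SMS; the default assumes a Solaris-style ping that takes a host name and exits with a nonzero status when there is no answer. A domain that fails this check should still be confirmed with telnet and console before it is treated as hung.

```shell
# Hypothetical helper, not an SMS command.
# PROBE is the command used to test reachability; the host name is appended.
PROBE=${PROBE:-ping}

domain_responds() {
    # $1 = domain host name; succeeds only if the probe exits with status 0
    $PROBE "$1" >/dev/null 2>&1
}

report_domain() {
    # $1 = domain host name; prints a one-line verdict
    if domain_responds "$1"; then
        echo "$1: responding"
    else
        echo "$1: no response, possibly hung"
    fi
}

# Example (commented out so that sourcing this file has no side effects):
# report_domain <domain_name>
```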


Now that you have established the likelihood that the domain is hung, the following steps can be used to return the domain to a 'Running Solaris' state.

1. From the System Controller, connect to the domain through the console command.

sc0:sms-svc:1> console -d <domain_id>

Even though there won't be any response or activity, we can still send a break sequence (~#) that will drop the OS to the OK prompt, effectively a Stop-A. Once at the OK prompt the sync command will try to generate a system dump file and reboot the domain.

~#
Type  'go' to resume
{0} ok sync

If you are connecting to the System Controller via SSH (Secure Shell), the break sequence will be intercepted by SSH and the system will not drop to the OK prompt. To avoid this, either tell SSH not to intercept the sequence by preceding it with another tilde (i.e., ~~#), or change the SSH escape character to something other than ~ using the -e option when starting ssh.
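
As an illustration of the second option, the sketch below starts the SSH session with a different escape character so that ~# reaches the console unmodified. The function name sc_login, the SSH override variable and the choice of % as the escape character are assumptions made for this example; sc0 and sms-svc are taken from the prompts shown in this document.

```shell
# Hypothetical wrapper: log in to the System Controller with an SSH escape
# character other than "~", so that "~#" is passed through to the console.
SSH=${SSH:-ssh}

sc_login() {
    # $1 = System Controller host name
    # $2 = escape character to use instead of "~" (defaults to "%")
    $SSH -e "${2:-%}" "sms-svc@$1"
}

# Example (commented out so that sourcing this file opens no connection):
# sc_login sc0
```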

2. If a break sequence at the console is insufficient to regain control of the domain, the reset command can be attempted from the System Controller. This is hard on the OS and will more than likely require an fsck of any non-logging file systems before the domain can boot into multiuser mode.


sc0:sms-svc:2> reset -x -d <domain_id>

Note: Ensure that you use the -x option. Without it, no crash dump will be possible.

The result of this command depends on the setting of the error-reset-recovery variable in the OBP.

If error-reset-recovery=boot, the domain reboots and no core file is captured.

If error-reset-recovery=none, the domain will drop to the OK prompt.

If error-reset-recovery=sync, the domain will panic.

Unfortunately, the default for this option is boot, so it is possible that on the first occurrence of the hang, we may not be able to capture a panic.

It may take several seconds for the OK prompt to appear after issuing this command.  Once at the OBP, be sure to type the sync command so that a core file might be generated.
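
The three outcomes listed above can be summarized in a small lookup. This is only a sketch: reset_outcome is a hypothetical helper, and it assumes that its input is either the bare value or the name=value form that the Solaris eeprom command prints for an OBP variable.

```shell
# Hypothetical helper: describe what "reset -x" is expected to do for a
# given error-reset-recovery setting, as listed in this document.
reset_outcome() {
    # $1 = setting, either bare ("sync") or as "error-reset-recovery=sync"
    case "${1#error-reset-recovery=}" in
        boot) echo "domain reboots, no core file is captured" ;;
        none) echo "domain drops to the OK prompt, type sync there" ;;
        sync) echo "domain panics" ;;
        *)    echo "unrecognized setting: $1" ;;
    esac
}

# Example, run as root on the domain before a hang occurs (commented out):
# reset_outcome "$(eeprom error-reset-recovery)"
```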

3. Finally, try the setkeyswitch.


sc0:sms-svc:1> setkeyswitch -d <domain_id> off
sc0:sms-svc:2> setkeyswitch -d <domain_id> on


This should recover the domain. However, it prevents a core file from being generated.


Updated by the ESG Knowledge Content Team

Advanced SF15k Hang Debugging Procedure By Daniel Ellison, PTS-HSG-Americas

Sometimes it is useful to investigate the cause of a hang. The following advanced procedure describes the information that can be pulled out of the hardware state on a Sun Fire[TM] 12K/15K/E20K/E25K system using both OBP and REDX. In this procedure, the 'redx' commands 'xir' and 'bbxir' take the place of the SMS command 'reset(1M)'.

Prerequisites
- OBP environment must have error-reset-recovery set to "none".
- /etc/system on domain side must have set nopanicdebug = 1.
- You must have a console window open on the domain.
- Domain must have a defined dump device (dumpadm).
- Domain must have savecore enabled. It is enabled by default for the Solaris[TM] 8 Operating Environment. Check dumpadm and/or /etc/rc2.d/S75savecore.
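
Two of these prerequisites can be checked on the domain before a hang occurs. The sketch below is an example under stated assumptions, not a supported tool: the function names are hypothetical, the /etc/system pattern accepts optional whitespace around the equals sign and needs a grep that supports POSIX character classes, and the dumpadm check assumes that the report contains a line of the form "Savecore enabled: yes".

```shell
# Hypothetical prerequisite checks for the domain side.

has_nopanicdebug() {
    # $1 = path to the system file (normally /etc/system)
    # Succeeds if an uncommented "set nopanicdebug = 1" line is present.
    grep '^[[:space:]]*set[[:space:]][[:space:]]*nopanicdebug[[:space:]]*=[[:space:]]*1[[:space:]]*$' "$1" >/dev/null
}

savecore_enabled() {
    # Reads dumpadm output on standard input.
    grep 'Savecore enabled: yes' >/dev/null
}

# Example, run as root on the domain (commented out):
# has_nopanicdebug /etc/system || echo "nopanicdebug is not set"
# dumpadm | savecore_enabled   || echo "savecore is not enabled"
```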

When the domain hangs, perform the following on the SC:

1. On the SC, change to the domain's post directory:

cd /var/opt/SUNWSMS/adm/<domain_letter>/post

Run the following script as user sms-svc. The script prompts for the processor to which the XIR should be sent. Run it once per processor.

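#!/bin/ksh
# Send an XIR to one processor through redx and log the processor state.
# Run once per processor as user sms-svc.
print "Enter expander number [0-17]:\t\c"
read E
print "Enter slot number [0 or 1]:\t\c"
read S
print "Enter proc number [0-3]:\t\c"
read P
# Timestamp used in the log file name (yymmdd.HHMMSS)
RDATE=`/bin/date +%y%m%d.%H%M%S`
# Feed the commands below to redx and capture its output in the log file
redx -c <<EOF >xirdump.$RDATE.log
port $E $S $P
shproc
xir
bbxir
shproc
lo
EOF
#####
#END#
#####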


NOTE:
This is the basic script.
A much fancier one could be generated using this as a basis.
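
As one example of such an extension, the input could be validated before anything is sent to redx. The sketch below checks the ranges given in the prompts of the basic script (expander 0-17, slot 0 or 1, proc 0-3); in_range and valid_port are hypothetical names.

```shell
# Hypothetical input validation for the expander/slot/proc prompts.

in_range() {
    # $1 = value, $2 = minimum, $3 = maximum; rejects empty or non-numeric input
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

valid_port() {
    # $1 = expander, $2 = slot, $3 = proc
    in_range "$1" 0 17 && in_range "$2" 0 1 && in_range "$3" 0 3
}

# Example use before invoking redx (commented out):
# valid_port "$E" "$S" "$P" || { echo "invalid expander/slot/proc" >&2; exit 1; }
```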

2. In the console window, the domain should notice the XIR and drop to OBP.
DSMD should also report XIR being detected in the domain messages file.

Example for proc 2:

# ERROR: error-reset-cleanup: Externally Initiated Reset has  occured. Externally Initiated Reset
{2} ok go


NOTE:
The OBP prompt tells you, in HEX, what proc dropped to OBP due to the xir. If this does NOT happen, pick a different proc and try the above script again. Otherwise, issue the following commands at OBP and save all data.

<ok> .locals
<ok> .registers
<ok> .cpu-afsr
<ok> cpu-afar@ .
<ok> .trap-registers
<ok> ctrace
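
Because the number in the ok prompt is hexadecimal, it can help to convert it to decimal before comparing it with other output. A minimal sketch, assuming the prompt has the form {N} ok as in the example above; prompt_cpu is a hypothetical helper.

```shell
# Hypothetical helper: extract the hexadecimal number from an OBP prompt
# and print it in decimal.
prompt_cpu() {
    # $1 = prompt text such as "{1a} ok"
    hex=${1#\{}        # drop the leading "{"
    hex=${hex%%\}*}    # drop everything from "}" onward
    echo $((0x$hex))   # hexadecimal to decimal
}

# Example (commented out):
# prompt_cpu '{1a} ok'
```

For instance, a prompt of {1a} ok converts to 26.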


3. From the redx shproc output, you can find what address the proc stopped at; for example, PC[63:6],2b'0 = 00000000 1003E84_.

You can display where exactly this is using the 'dis' Forth word.

What you type here will be recorded in the domain's console log file on the system controller.

Examples:

1003E840 dis
1003E848 dis
10029FE8 dis
1003E898 dis
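
The shproc value appears to carry only the upper PC bits. Reading PC[63:6],2b'0 as bits 63 through 6 followed by two zero bits, the low six bits are not reported, so the actual PC lies somewhere in a 64-byte window. Under that assumption (an interpretation of the notation, not a documented redx behavior), and assuming the upper 32 bits are zero as in the example, the hypothetical helper below prints the window for the displayed digits.

```shell
# Hypothetical helper: print the 64-byte address window implied by a
# truncated PC value from shproc.
pc_window() {
    # $1 = hex digits of the low word without the trailing "_", e.g. 1003E84
    base=$((0x${1}0))                         # append the missing nibble as 0
    printf '%X-%X\n' "$base" $((base + 63))   # first and last byte of the window
}

# Example (commented out):
# pc_window 1003E84
```

For the example value this prints 1003E840-1003E87F, which is consistent with starting the disassembly at 1003E840.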


You can also look up the PC value from the 'bbxir' output that is listed for trap type 0x03 in the tl list and disassemble that address as well. For example, IF YOU SEE THIS:


tl: 1
tt tstate tpc tnpc
0x03 0x0080001600 00000000.101C1740 00000000.101C1744

THEN YOU TYPE THIS:

101C1740 dis

This will tell you where the CPU was sitting when it received the XIR (XIR trap type = 3).
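
The same lookup can be done mechanically once the bbxir output has been saved to a log file. This sketch assumes that the trap-level rows have the four columns shown in the example (tt, tstate, tpc, tnpc) and that the upper 32 bits of tpc are zero; xir_dis_cmds is a hypothetical name. On Solaris, nawk may be needed in place of awk.

```shell
# Hypothetical helper: turn saved bbxir output into "dis" commands.
xir_dis_cmds() {
    # Reads bbxir output on standard input and prints a "dis" command for
    # each row whose trap type is 0x03 (XIR).
    awk '$1 == "0x03" && NF >= 4 {
        pc = $3
        sub(/^[0-9A-Fa-f]*\./, "", pc)   # keep only the digits after the dot
        print pc " dis"
    }'
}

# Example (commented out):
# xir_dis_cmds < xirdump.<timestamp>.log
```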

4. Type 'sync' at OBP to attempt to get a core.

5. On the System Controller, run setkeyswitch -d <domain_id> off.

6. Run setkeyswitch -d <domain_id> on to recover the domain.

7. Run Sun[TM] Explorer on the system controller and domain.

Together with the core file, these collections provide the relevant data for analysis. Make sure the core file is sent along with the Sun Explorer data collections.
                       
hung, hang, recover, 12K, 15K, 20K, 25K

Previously Published As 48138

Change History
Date: 2007-10-09
User Name: 29589
Action: Reassign
Comment: IBIS migration work.
I am republishing this document so that updates are not lost. This document MUST be moved out of draft, or the published document will not migrate. This update was requested 10 months ago, yet was never completed. Will need to be done post migration if still necessary.
Version: 0

Date: 2006-12-22
User Name: 28723
Action: Rejected
Comment: The change for reset is fine.
Please expand on the break character and who "eats" the ~ needed for a break.
I'm currently too busy to experiment myself, please cover ssh, telnet, rlogin, and combinations of them. Some need an extra ~ to successfully send the break, some don't.
Feel free to assign to me for review when you're done
Version: 0




Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.