Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-75-1003281.1
Update Date:2011-05-18
Keywords:

Solution Type  Troubleshooting Sure

Solution  1003281.1 :   Sun Enterprise[TM] 10000: Recovering a hung domain  


Related Items
  • Sun Enterprise 10000 Server
Related Categories
  • GCS>Sun Microsystems>Servers>High-End Servers

PreviouslyPublishedAs
204556


Applies to:

Sun Enterprise 10000 Server
All Platforms
***Checked for relevance on 06-May-2011***

Purpose

Steps to follow to recover a hung Enterprise[TM] 10000 domain.

The aim of this document is to provide guidelines to recover a hung domain in a short period of time while maximising the chances of root cause analysis.

Last Review Date

May 5, 2011

Instructions for the Reader

A Troubleshooting Guide is provided to assist in debugging a specific issue. When possible, diagnostic tools are included in the document to assist in troubleshooting.

Troubleshooting Details

--- Symptoms ---

The first step is to confirm that the domain is hung. It is possible that the domain itself is in fact running, but the end-user has the impression of a hang because the application they are using has become unresponsive.

Check whether the domain responds to ping, rlogin and telnet. If a session can be opened on the domain, use UNIX commands such as ps -elf and df -k to determine why the domain appears to be hung.

Attempt a console connection through a netcon session on the SSP. Note any console output (disk errors, NFS mounts not responding, out of per-user processes, out of memory, not enough swap space to fork, etc.) and check the netcon log for the same messages. If prompted, log in as root so that a network-mounted home directory is not required. Try to ping an IP address external to the system (e.g. the default router). Run df -k to look for a hung filesystem or NFS mount, and ps -elf to look for runaway or respawning processes. Consult Document 1001950.1 for other forensics.

--- Solution ---

If it is still possible to control the domain but the source of the problem cannot be found, a live savecore(1M) of the domain may be a viable option. Because the system is still running, a live savecore (savecore -L) can only be used when dump and swap do not share a common device. If a dedicated dump partition is not available, see Document 1004803.1 for information on using dumpadm to assign a dedicated dump file. The output of the following ps command must be captured immediately before running savecore and submitted to Sun along with the crash dump files for analysis.
# /usr/bin/ps -e -o uid,pid,ppid,pri,nice,addr,vsz,wchan,time,fname
Note: Live savecore is possible on Solaris[TM] 7 and later.
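
The capture order described above (ps listing first, live dump immediately afterwards) can be sketched as a small shell function. This is a minimal sketch, not a Sun-supplied script: capture_live_dump, PS_CMD, SAVECORE_CMD and the output file name are hypothetical names introduced for illustration, and it assumes a dedicated dump device has already been configured.

```shell
#!/bin/sh
# Sketch only: run as root on the domain.
# PS_CMD and SAVECORE_CMD are hypothetical override variables; by default
# they point at the commands named in the text.
PS_CMD=${PS_CMD:-/usr/bin/ps}
SAVECORE_CMD=${SAVECORE_CMD:-savecore}

# capture_live_dump OUT_DIR
# Saves the ps listing in OUT_DIR, then takes a live crash dump.
capture_live_dump() {
    out_dir=$1
    ps_file="$out_dir/ps-before-savecore.txt"

    # Process listing requested in the text, taken immediately before the dump.
    "$PS_CMD" -e -o uid,pid,ppid,pri,nice,addr,vsz,wchan,time,fname \
        > "$ps_file" || return 1

    # -L dumps the running system; dump and swap must not share a device.
    "$SAVECORE_CMD" -L || return 1

    # Report where the ps listing was written.
    echo "$ps_file"
}
```

A call such as capture_live_dump /var/tmp would leave the ps listing in /var/tmp; the crash dump files themselves are written wherever savecore is configured to put them.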

 

If the domain is not accessible from the netcon session:

On the SSP:

Check whether the host is already resetting. If it is, monitor progress from the latest post log.

ssp% ps -ef | grep hpost
ssp% tail -20f $SSPLOGGER//post/post.XXXX.XXXX.log

Check platform/domain messages for current messages.

ssp% tail -20f $SSPLOGGER/messages
ssp% tail -20f $SSPLOGGER/<domain>/messages

Other commands to be run to check status are:

  • check_host - responds that host is UP or host is DOWN
  • hostinfo -h - reports the state of each processor
  • hostinfo -S - reports the processors in the domain; the heartbeat number should increment over time.
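
The heartbeat check can be made explicit by comparing two readings taken some time apart. The sketch below is illustrative only: heartbeat_state is a hypothetical helper, the two values are assumed to have been read by hand from successive hostinfo -S runs, and the numbers in the usage lines are made up. It does not parse hostinfo output.

```shell
#!/bin/sh
# heartbeat_state OLD NEW
# Hypothetical helper: compares two heartbeat readings noted by the operator.
# Prints "advancing" if the second reading is larger, otherwise "stalled".
heartbeat_state() {
    if [ "$2" -gt "$1" ]; then
        echo "advancing"
    else
        echo "stalled"
    fi
}

# Illustrative values only, not real hostinfo output.
heartbeat_state 1041 1041   # prints "stalled"
heartbeat_state 1041 1107   # prints "advancing"
```

A stalled reading on its own is only a hint; it should be weighed together with the check_host result and the platform and domain messages.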

These are the steps to recover a hung domain; they are also described in the hostint(1M) man page.

  1. From a netcon session on the domain:
    Send a break sequence.
    ~#
    <#Y> ok ctrace
    <#Y> ok .registers
    <#Y> ok .locals

    (Ensure the output from these commands is saved)

    <#Y> ok sync

    The sync command forces the system to reference an illegal location, causing a "panic: ZERO". This writes an image of system memory (a panic dump) to the dump device. When the domain is next booted, a crash dump is saved that can be sent to Sun, along with the output of the first three commands, for further analysis.

  2. Execute hostint. If this step is successful, the domain panics, dumps and then reboots. Allow 5 minutes before checking the domain state again.

  3. Execute hostreset. If this step is successful, you can then initiate a bringup(1M). hostreset creates a file named hostresetdump-MM.DD.HH:MM in $SSPVAR/adm/, which can be sent to Sun[TM] for further analysis.

  4. If all the above steps fail, execute bringup(1M) with the -f ("force") option. This step is a last resort: it clears all hardware and software state, so it will no longer be possible to investigate the cause of the hang.



Additional Information
Netcon commands:
~? Show status and communication path
~= Toggle - Switch to JTAG from network and network to JTAG
~. Exit out of netcon
~# L1A/Stop-A - Break to obp
~@ Get write permission for current netcon session
~* Exclusive netcon, Kill all other netcon sessions


The following steps may be followed by a Sun engineer; they include additional steps and undocumented switches. Use them if there is sufficient time to attempt recovery of the domain without a hardware reset (hostreset or bringup). After each step, wait 5 minutes before checking the domain state to see whether the command has succeeded:

1-2. Steps 1 and 2 from above
3. hostint -p X
4. sigbcmd panic
5. sigbcmd -p X panic
6. sigbcmd -I panic
7. sigbcmd -I -p X panic
8. sigbcmd obp
9. sigbcmd -p X obp
If either step 8 or step 9 breaks into the obp, execute:
    <#Y> ok ctrace
    <#Y> ok .registers
    <#Y> ok .locals
    <#Y> ok sync
10. hostreset
11. bringup -f -l <level: 24=min, 64=max>
The -p option selects a processor. X is a processor in the hung domain other than the boot processor; do not choose a processor from another active domain. To get the board numbers, run:
ssp:domain% domain_status
DOMAIN   TYPE                    PLATFORM   OS    SYSBDS
sun      Ultra-Enterprise-10000  test       2.8   0 2 4

Example of processor numbers from a domain of boards 0 2 4:
Processors on board 0 : 0 1 2 3
Processors on board 2 : 8 9 10 11
Processors on board 4 : 16 17 18 19
The first processor on a board is the board number x 4, and each board in this example carries four consecutive processors,
e.g. 2 x 4 = 8 is the first processor on board 2.
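
The numbering rule above (first processor = board number x 4, four processors per board as in the example) can be checked with a few lines of shell. This is a sketch: procs_on_board is a hypothetical helper name, not an SSP command, and the board list is taken from the SYSBDS column of the domain_status example.

```shell
#!/bin/sh
# procs_on_board BOARD
# Hypothetical helper: prints the four processor numbers on a system board,
# assuming first processor = board number x 4.
procs_on_board() {
    board=$1
    first=$((board * 4))
    echo "$first $((first + 1)) $((first + 2)) $((first + 3))"
}

# Boards 0 2 4, as in the domain_status example.
for b in 0 2 4; do
    echo "Processors on board $b : $(procs_on_board "$b")"
done
```

When choosing X for the -p option, pick one of the numbers printed for a board in the hung domain, avoiding the boot processor.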
Internal reference for live savecore caveats: Document 1004608.1

Keywords: E10K, hung, domain, hostreset, hostint, Starfire

Previously Published As 70566



Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.