Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-72-1008719.1
Update Date: 2011-04-21
Keywords:

Solution Type  Problem Resolution Sure

Solution  1008719.1 :   Resolving Hardhang problems on Ultra Servers  


Related Items
  • Sun Enterprise 4500 Server
  • Sun Fire E6900 Server
  • Sun Fire V480 Server
  • Sun Enterprise 5500 Server
  • Sun Fire E25K Server
  • Sun Fire 6800 Server
  • Sun Enterprise 3500 Server
  • Sun Fire 3800 Server
  • Sun Fire V890 Server
  • Sun Enterprise 6500 Server
  • Sun Fire E4900 Server
  • Sun Fire 4800 Server
  • Sun Fire V880 Server
  • Sun Fire 15K Server
  • Sun Fire E2900 Server
  • Sun Fire V1280 Server
  • Sun Fire V490 Server
Related Categories
  • GCS>Sun Microsystems>Operating Systems>Solaris Kernel

PreviouslyPublishedAs
211973


Applies to:

Sun Fire 4800 Server
Sun Fire 6800 Server
Sun Fire V480 Server
Sun Fire V490 Server
Sun Fire V880 Server
All Platforms

Symptoms

No terminal responds, the console does not respond, the system does not
answer ping or telnet, Stop-A does not break to OBP, and "send break" from a
tip line does not break into OBP. If all of the above have been tried and
none of them breaks out of the hang, the system is hard hung. Without a core
file to analyze, it is almost impossible for support to determine the cause
of the hang.
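
The checklist above can be expressed as a small script for recording the
outcome of each attempt. This is a minimal sketch, not a Sun-provided tool:
the probe names, the tcp_responds() helper, and the use of the telnet port
(23) as the network probe are all assumptions made for illustration.

```python
import socket

# Hypothetical probe names, one per symptom listed above.
PROBES = ("terminals", "console", "network", "stop_a", "send_break")


def tcp_responds(host, port=23, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds (telnet by default)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def is_hard_hang(responded):
    """responded maps each probe name to True if that attempt got a response.

    The system is treated as hard hung only when every probe was tried
    and none of them produced a response.
    """
    missing = [p for p in PROBES if p not in responded]
    if missing:
        raise ValueError("probes not yet tried: " + ", ".join(missing))
    return not any(responded[p] for p in PROBES)
```

The network entry can be filled in with tcp_responds("hostname"); the other
entries have to be recorded by hand after trying each break method.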

Cause

These steps neither provide a final solution nor detect the cause of the
hardhang, but they help in obtaining a core file with which to analyze the
problem. In all the cases listed below, once you are in OBP, type "sync" to
get the core dump. If the system was booted with kadb, do some initial
analysis first and then type $q to enter OBP.

NOTE:
This document was written specifically for sun4u architecture systems. While many of these instructions apply to other architectures, some do not. XIR is only available on Ultra Enterprise systems.

Solution

Options
-------
1. Enable Deadman
2. Set Breakpoint
3. Install Hardhang Kernel
4. XIR


1. Enable Deadman
-----------------
Enable the deadman timer as described in <Document 1004530.1> KERNEL: How to
enable deadman kernel code.

2. Set Breakpoint
-----------------
The system must have been booted with kadb. After the system comes up, get
into kadb (Stop-A/"send break") and set a breakpoint in system_high_handler().
This function is invoked only on level 15 interrupts and is associated with
fan failures and system board detection.
To set the breakpoint in kadb:

kadb: (type return)
kadb[0]: system_high_handler:b
kadb[0]: :c

When the system hardhangs again, follow the procedure described in the
section "Generating a Level 15 Interrupt".

Pros: Will succeed in some instances where 'snooping' does not.
Cons: Requires reboot if kadb not enabled.
      Requires a free system board slot.
      Cannot break a hang caused by a device other than the CPU seizing a system bus.
      Will fail if level 15 interrupts have been masked out.


3. Install Hardhang Kernel
--------------------------
A special kernel needs to be built and installed at the customer site.
Additionally, the breakpoint in system_high_handler() should be set through
kadb (see the above section "Set Breakpoint").

The system is now set up to break out of the hang. Should the system
hardhang, follow the procedure described in the section "Generating a Level
15 Interrupt".

Pros: Will succeed even if all the interrupts are masked.
Cons: Requires a custom kernel.
      Will fail if all the CPUs have PSTATE_IE = 0.
      Requires a free system board slot.
      Cannot break a hang caused by a device other than the CPU seizing a system bus.


4. XIR
------
This is the last resort in case interrupts have been disabled. XIR is
non-maskable and will definitely break the system out of the hang.
Unfortunately, this method also clears memory, so a core dump cannot be
taken. It does, however, provide some information about the CPU state at the
time of the hang.

The remote Externally Initiated Reset (XIR) command, although limited in
its current form, can be used to aid software debugging of hung systems.
Currently XIR stores the following information for each CPU:

TL (Trap Level)
TT (Trap Type)
TPC (Trap Program Counter)
TNPC (Trap Next Program Counter)
TSTATE (Trap State Register)
This information is then retrieved by typing .xir-state-all at the OBP prompt.
(You may need to Stop-A/"send break" the machine to stop it from rebooting
in order to issue this command.)

There are two methods for initiating the XIR:

Method 1:
Press the XIR switch on the clock board, which is at the rear of the E4000
(the FE handbook notes the location of the XIR switch). To the right of the
XIR switch is the POR switch; DO NOT press it, as it will cycle power.
When XIR is pressed, the system will come to the "ok" prompt; if it does not
do so immediately, wait until it does. This method is easier than entering
the key sequence noted in Method 2.

Method 2:
Press Return key (twice)
Press ~ key (once, possibly twice)
Press Control-Shift-X keys (together)

This key sequence should reboot the system. At this point, you will need to
do a Stop-A/"send break" to get to the "ok" prompt.

Once the system is at the OBP prompt, get the CPU state info:
ok .xir-state-all
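
The TSTATE value reported for each CPU packs several fields, including the
PSTATE that was in effect when the trap was taken, so it shows whether the
CPU had interrupts enabled (PSTATE_IE) at the time of the hang. The sketch
below unpacks a TSTATE value. It assumes the SPARC V9 TSTATE layout (CCR in
bits 39:32, ASI in 31:24, PSTATE in 19:8, CWP in 4:0) and the V9 positions
of the IE and PRIV bits; verify these against the manual for the CPU in
question. The function name and the sample value are made up for
illustration and do not come from .xir-state-all output.

```python
# Sketch only: field positions assume the SPARC V9 TSTATE layout
# (CCR 39:32, ASI 31:24, PSTATE 19:8, CWP 4:0).

PSTATE_IE = 0x002    # interrupt-enable bit within PSTATE (V9 bit 1)
PSTATE_PRIV = 0x004  # privileged-mode bit within PSTATE (V9 bit 2)


def decode_tstate(tstate):
    """Split a 64-bit TSTATE value into its fields and two PSTATE flags."""
    pstate = (tstate >> 8) & 0xFFF
    return {
        "ccr": (tstate >> 32) & 0xFF,
        "asi": (tstate >> 24) & 0xFF,
        "pstate": pstate,
        "cwp": tstate & 0x1F,
        "ie": bool(pstate & PSTATE_IE),      # False means interrupts were disabled
        "priv": bool(pstate & PSTATE_PRIV),  # True means privileged (kernel) mode
    }


if __name__ == "__main__":
    sample = 0x4480001603  # made-up value, not taken from a real system
    for name, value in decode_tstate(sample).items():
        shown = hex(value) if not isinstance(value, bool) else value
        print(f"{name:7} {shown}")
```

If every CPU shows ie as False, that is consistent with the PSTATE_IE = 0
case noted under "Install Hardhang Kernel".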

Pros: Will break out of the hang.
Cons: Will not be able to get a useful core file.
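
The Pros and Cons of options 2 to 4 can be summarized as a small lookup.
This is an illustrative sketch only: the HangScenario fields and the
candidate_methods() function are hypothetical names, and the conditions of a
real hang are usually not known in advance. Treating PSTATE_IE = 0 on all
CPUs as also defeating the plain breakpoint method is an inference (no
interrupt can be taken in that state), not a statement from this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HangScenario:
    kadb_booted: bool           # system was booted under kadb
    free_board_slot: bool       # a slot is free for board insertion
    hardhang_kernel: bool       # the special hardhang kernel is installed
    level15_masked: bool        # level 15 interrupts are masked out
    all_cpus_ie_clear: bool     # every CPU has PSTATE_IE = 0
    bus_seized_by_device: bool  # a non-CPU device has seized a system bus


def candidate_methods(s):
    """Return the methods that, per the Pros/Cons lists, could break the hang."""
    methods = []
    # Both level 15 based methods need kadb and a free slot, and neither
    # can break a hang caused by a non-CPU device seizing a system bus.
    can_force_level15 = (
        s.kadb_booted
        and s.free_board_slot
        and not s.bus_seized_by_device
        and not s.all_cpus_ie_clear
    )
    if can_force_level15 and not s.level15_masked:
        methods.append("breakpoint")
    if can_force_level15 and s.hardhang_kernel:
        methods.append("hardhang kernel")
    # XIR breaks out of the hang regardless, but yields no useful core file.
    # Its availability depends on the platform (see the NOTE under Cause).
    methods.append("XIR")
    return methods
```

For example, a scenario with level 15 interrupts masked and the hardhang
kernel installed leaves "hardhang kernel" and "XIR" as the candidates.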


Generating a Level 15 Interrupt
-------------------------------
On a sun4u architecture system, a level 15 interrupt is generated when a
system board is inserted. This interrupt is also generated by a fan failure,
on both the sun4u and sun4d architectures, but since the fans are not easily
accessible, board insertion is the method described here. If, however, the
system in question is a sun4d, then disconnecting a fan will be the only
method available for generating a Level 15 interrupt.

When the system hangs, insert a system board into a free slot. This will
generate a level 15 interrupt, which should trigger the breakpoint in kadb.
Once in kadb, debugger commands can be run to examine the current state of
the system. Of particular interest are:

$r          dump the registers
$c          dump the current stack backtrace
freemem/D   see how much memory is free

When kadb debugging is complete, attempt to take a core dump by doing:

kadb[0]: $q
ok sync

WARNING:
If a non-forced level 15 interrupt should occur on the system while the
breakpoint is set or the debug kernel is in place, then the system will
break to the OBP/kadb prompt. The system cannot be used until control is
returned to the kernel, by typing "go" at the OBP, or :c at the kadb prompt.

Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.