Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-71-1008923.1
Update Date:2010-09-01
Keywords:

Solution Type: Technical Instruction

Solution  1008923.1 :   Sun Enterprise[TM] 3x00-6x00 servers: Data collection advice for unplanned system reboots  


Related Items
  • Sun Enterprise 4500 Server
  • Sun Enterprise 5500 Server
  • Sun Enterprise 3500 Server
  • Sun Enterprise 6500 Server
  • Sun Enterprise 4000 Server
  • Sun Enterprise 6000 Server
  • Sun Enterprise 3000 Server
  • Sun Enterprise 5000 Server
Related Categories
  • GCS>Sun Microsystems>Servers>Midrange Servers

Previously Published As
212278


Applies to:

Sun Enterprise 3500 Server
Sun Enterprise 4500 Server
Sun Enterprise 5500 Server
Sun Enterprise 6500 Server
Sun Enterprise 3000 Server - Version: Not Applicable and later    [Release: NA and later]
All Platforms

Goal

The goal of this document is to suggest what data should be collected when a Sun Enterprise 3x00, 4x00, 5x00, or 6x00 system encounters an unplanned system reboot.  "Unplanned" covers the various names one might use for a system suddenly resetting (a crash, panic, reset, or some other name), although each of these names has a distinct technical meaning.  The point of this article is to set expectations about the data required to determine the root cause of the event.  It does not describe the actual resolution to the event - just how and what data must be collected so that a support engineer can perform the diagnosis.
Background:
Sun Enterprise 3x00-6x00 servers may experience unplanned reboots for different reasons.  Basic failure analysis methodology might entail transferring large core files for analysis, but in some cases this is unnecessary.  While this document will NOT completely eliminate the need for core file analysis, the goal is to reduce diagnosis time for failures where the logged error messages are sufficient and transfer of large core files is unnecessary.

Solution

What to do after a system encounters an "Unplanned Reboot":

First, assuming the system has recovered, look in the /var/adm/messages file and try to determine what type of system event was encountered.  On this type of platform, a few events cause the majority of unplanned reboots:

  1. Normal reboots
  2. Power failures
  3. Fatal Reset (or Fatal Error) event
  4. Solaris Panics (also known as crashes)

Normal Reboots

In /var/adm/messages, a "normal reboot" will simply show messaging that indicates that a system is being rebooted. 
  • You might see a message saying "Rebooted by root", but not always. 
  • If you are lucky, you can look at the system's console (assuming you have access to console logs), see who logged in prior to the reboot, and simply ask them whether they rebooted the system (rebooting requires root access). 
  • The messages file should not contain any messaging such as "panic" or "System booting after fatal error FATAL".
  • Usually a normal reboot will not require any fsck execution (but this is NOT a hard rule).
It is remarkable how often reported "unplanned reboots" turn out to be normal reboots executed by an administrator who did not tell his or her colleagues that the system needed to be rebooted.  So, if you see no error messages and suspect this may be a normal reboot, ask around and see if anyone rebooted the system.  You might be surprised.
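To check for an administrator-initiated reboot without relying on anyone's memory, the wtmp records are the quickest source. The grep helper below is only a sketch (the function name is an assumption); `last` is the standard tool.

```shell
# find_reboot_marker - hypothetical helper: pull any "Rebooted by" lines
# out of a saved messages file to see whether a reboot was deliberate.
find_reboot_marker() {
    grep "Rebooted by" "$1"
}

# On a live system, cross-check the wtmp records:
#   last reboot | head     # recent reboot times
#   last | head            # who was logged in around those times
```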

If no errors are seen and you cannot validate that anyone purposely rebooted the system, please collect Explorer data from the system, and console logs if possible.  Provide this data to support and an attempt will be made to identify what happened. 
If a core file was generated at the time of this reboot, the issue is a panic, so proceed to that section below.

Power Failures

A power failure usually leaves very little traces of any event taking place in the messages file. 
  • If the system normally logs frequent messages, you will notice the messaging stop suddenly.  The file simply ends at a certain point, with no error messages at all.
  • If the system is recovered (powered on again) power on messages will be the very next messaging you should expect to see in /var/adm/messages.
  • It would be normal in a power failure situation for file systems to require fsck and a fair amount of work to recover the system (due to the sudden loss of power in the midst of operations). 
  • If more than one machine is affected, the source of the unplanned reboot is usually obvious, because the simultaneous outage can be observed by administrators.
When you are unsure of the source of the failure, but suspect it to be power related and only a single system is affected, please collect Explorer data from the system and console logs if possible.  Provide this data to support and an attempt will be made to identify what happened. 
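The sudden stop in the log can be located mechanically: find the most recent boot banner (the "SunOS Release" line Solaris prints at boot) and look at what was logged just before it. The helper below is a sketch under that assumption; its name is not a real tool.

```shell
# last_before_boot - hypothetical helper: print the last few messages
# logged before the most recent boot banner in a messages file.  In a
# power failure this is where the log simply stops, with no errors.
last_before_boot() {
    msgfile="$1"
    # line number of the most recent boot banner
    n=$(grep -n "SunOS Release" "$msgfile" | tail -1 | cut -d: -f1)
    if [ -z "$n" ]; then
        echo "no boot banner found" >&2
        return 1
    fi
    head -n $((n - 1)) "$msgfile" | tail -5
}
```

If those last lines are routine messages with no errors, and then the next thing in the file is power-on messaging, a power event is the likely cause.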

Fatal Reset (aka Fatal Error) event

You can easily validate if a system has rebooted due to a Fatal Reset event because the /var/adm/messages file will show the following message:
     System booting after fatal error FATAL
A Fatal Reset or Fatal Error event is a hardware fault that affects system integrity.  Fatal Resets will not generate Solaris core files, and error analysis will depend primarily upon the messages captured from the server's system console.

Data Requirements to diagnose a Fatal Reset event are:
  • Console Logs
  • Output from /usr/platform/sun4u/sbin/prtdiag -v
Fatal Resets can be identified by looking at console log output for Fatal Reset messages, followed by the system performing a Power On Self Test (POST). An example of console output from a Fatal Reset is as follows:
       Fatal Reset
0,0> FATAL ERROR
0,0> At time of error: System software was running.
0,0> Diagnosis: Board 15, backplane pins, board connector pins, AC
0,0> Log Date: September 10 2:0:9:34 GMT 2001
0,0> RESET INFO for IO Type 4 board in slot 15
0,0> AC ESR 00002000.00000000 FTA_PERR
0,0> DC[0] 00
0,0> DC[1] 00
0,0> DC[2] 00
0,0> DC[3] 00
0,0> DC[4] 00
0,0> DC[5] 00
0,0> DC[6] 00
0,0> DC[7] 00
0,0> FHC CSR 00040000 LOC_FATAL
0,0> FHC RCSR 02000000 FATAL
If console output is not available, obtain an Explorer from the system in question.  It may be possible (with luck) to diagnose the event using this data, but it is not guaranteed.  For this reason, configuring a console loghost is not a suggestion; it is a necessity.
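The two data requirements above can be gathered as follows. The console log file name is site-specific (an assumption here), and the grep helper is just a sketch for pulling the Fatal Reset block out of a captured console log.

```shell
# On the server, capture prtdiag output (standard path on this platform):
#   /usr/platform/sun4u/sbin/prtdiag -v > /var/tmp/prtdiag.out

# extract_fatal - hypothetical helper: show the Fatal Reset block
# (banner, FATAL ERROR line, and the "0,0>" POST/diagnosis lines)
# from a captured console log file.
extract_fatal() {
    grep -E "Fatal Reset|FATAL ERROR|^0,0>" "$1"
}
```

Send both the prtdiag output and the extracted console messages to support along with the Explorer.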

Solaris Panic (crash, core dump, etc)

If a reboot was the result of a panic, some diagnostic determination regarding the nature of the panic can be made using the messages available in /var/adm/messages.  While a full analysis of the core file is always preferred, Solaris panics that are the result of multi-bit ECC hardware errors usually leave messages which are sufficient to provide a diagnosis with a reasonable level of certainty.  In these situations it is not always necessary to provide support with the core file.
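Before deciding whether the core file can be skipped, it is worth confirming whether the panic actually saved one. On Solaris 7 and later, `dumpadm` shows the configured savecore directory (by default /var/crash/`uname -n`). The check below is a sketch; the helper name is an assumption.

```shell
# has_saved_core - hypothetical helper: report whether a savecore
# directory contains a unix.N / vmcore.N pair from a panic.
has_saved_core() {
    if ls "$1"/vmcore.* >/dev/null 2>&1; then
        echo "core present"
    else
        echo "no core"
    fi
}

# On the server:
#   dumpadm                          # shows dump device and savecore dir
#   has_saved_core /var/crash/`uname -n`
```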

To determine if the source of the error is hardware ECC related, look for one of the following errors in the /var/adm/messages file or console log:
  • EDP Event - Ecache Data Parity Event
  • WP Event - Writeback Data Parity Error
  • CP Event - Copyout Data Parity Error
  • UE Event - Uncorrectable Memory Error
  • BERR Event - Bus Error
  • CE Event - Correctable Memory Error
These acronyms appear in messages similar to the following:
WARNING: [AFT1] EDP event on CPU1 Instruction access at TL=0, errID 0x0000ad88.6cd9989f
AFSR 0x00000000.80408000 AFAR 0x00000000.0f0c8080
AFSR.PSYND 0x8000(Score 95) AFSR.ETS 0x00 FAULT_PC 0x780b481c
UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0000 UDBL.ESYND 0x00
If you do see error messaging similar to the above, the support engineer usually needs only an Explorer data file to proceed with the diagnosis; the core file is usually not required.
If you do not see an error similar to the above message, or you are unable to confirm this for some reason, the core file will need to be analyzed to determine the reason for the Solaris panic.

Failure determinations based on /var/adm/messages alone can only be made when one of the above acronyms appears in the messages file.

Internal Only Information on memory errors above:
For failures that indicate a memory error event such as those listed above,
the system provides limited interpretation of the failure which can aid in the
diagnosis of the suspect component.
For each error, each component indicated in the error is assigned a "Score"
value between 5 and 95. The higher the score, the higher the probability that
the part indicated is at fault. A part which is implicated with a "Score" of 95
should be considered the primary candidate for replacement, unless multiple
parts are assigned a "Score" of 95. In the example above, CPU1 was the only
part assigned a "(Score 95)".

Best Practices - the short version:
  • Mirrored E-Cache (Sombra): swap after the first failure.
  • Unmirrored E-Cache: swap after the second failure.
Cediag and Findaft enforce these rules; follow their recommendations.

Internal Reference:
http://ittdev.east.sun.com/TechTalk/Fatal/

Previously Published As
50348


Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.