Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-75-1008001.1
Update Date: 2010-01-26
Keywords:

Solution Type  Troubleshooting Sure

Solution  1008001.1 :   Sun Blade[TM] 8000 Chassis Monitoring Module Fault Diagnosis  


Related Items
  • Sun Blade X8400 Server Module
Related Categories
  • GCS>Sun Microsystems>Servers>Blade Servers

Previously Published As
211036


Description
The Sun Blade[TM] 8000 Chassis Monitoring Module generates a fault message when a particular chassis error condition results in a fault diagnosis from the CMM ILOM.

This document is designed to work in conjunction with the fault message to determine the correct course of action. This document will be updated as the CMM firmware is updated.

The fault message generated by the CMM contains the URL of this document.

Message ID  Severity  Synopsis
15.3.1      Major     Chassis High Software Shutdown
15.3.2      Minor     Chassis High Temperature Warning
15.3.3      Minor     Chassis Low Temperature Warning
15.3.4      Major     Chassis Low Software Shutdown
15.4.7.2    Major     Blade CPU Heatsink Over-temperature
15.7.1      Minor     Power Supply not seeing AC power
15.7.2      Major     48V over-voltage
15.7.3      Major     Power Supply Fan Failure
15.7.4      Major     Power Supply Over Temperature
15.7.5      Major     Power Supply 48V output is less than 0.1 ohms
15.7.6      Major     Power Supply has detected an over-current condition on the 48V output

To find the article for a fault, search this document for the Message ID given in the error message.
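
The message list above can also be held as a small lookup structure, so that a script can map a Message ID taken from a CMM fault message to its severity and synopsis. The following is a minimal Python sketch and is not part of the CMM firmware or ILOM; the names FAULT_ARTICLES and lookup_fault are hypothetical, and the entries are copied from the list above.

```python
# Illustrative lookup of the fault articles listed in this document.
# FAULT_ARTICLES and lookup_fault are hypothetical names, not a Sun API.
FAULT_ARTICLES = {
    "15.3.1": ("Major", "Chassis High Software Shutdown"),
    "15.3.2": ("Minor", "Chassis High Temperature Warning"),
    "15.3.3": ("Minor", "Chassis Low Temperature Warning"),
    "15.3.4": ("Major", "Chassis Low Software Shutdown"),
    "15.4.7.2": ("Major", "Blade CPU Heatsink Over-temperature"),
    "15.7.1": ("Minor", "Power Supply not seeing AC power"),
    "15.7.2": ("Major", "48V over-voltage"),
    "15.7.3": ("Major", "Power Supply Fan Failure"),
    "15.7.4": ("Major", "Power Supply Over Temperature"),
    "15.7.5": ("Major", "Power Supply 48V output is less than 0.1 ohms"),
    "15.7.6": ("Major", "Power Supply has detected an over-current "
                        "condition on the 48V output"),
}


def lookup_fault(message_id):
    """Return (severity, synopsis) for a Message ID, or None if it is not listed."""
    # Tolerate surrounding whitespace and a leading '#' in the pasted ID.
    return FAULT_ARTICLES.get(message_id.strip().lstrip("#"))


if __name__ == "__main__":
    print(lookup_fault("15.7.3"))
```

An ID that is not in the list returns None, which a caller can treat as "no article available for this firmware message".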



Steps to Follow
Article for Message ID: 15.3.1


Chassis High Software Shutdown

Type
Fault

Severity
Major

Description
Chassis ambient temperature is out of range.

Automated Response
The CMMs will communicate with all Blade SPs to force a shutdown.

Impact
The system chassis and all Blades have shut down. The CMMs continue to run.

Suggested Action for System Administrator
Return the room ambient temperature to the operational range (< 43°C).

Details
When the CMM Chassis Ambient sensor is at or above the Chassis High Software Shutdown threshold, the CMM will communicate with each blade's SP and force a shutdown, followed by a full chassis power-off. Blade SPs should attempt a graceful shutdown first. If the graceful shutdown has not completed after 2 minutes, the SP will force an ungraceful shutdown with a 4-second ACPI button press. CMM software should also display appropriate fault information on the web and CLI interfaces, and send IPMI and SNMP alerts to indicate the fault. Note that there is no way to power off the CMMs, so they will continue to run as long as the PSUs are connected to AC mains. It is expected that most blades will have already shut down under their own thermal policies by the time the Chassis High Software Shutdown threshold is reached, but in case they have not, the SC shall communicate with all blade SPs and attempt a graceful shutdown of each blade before powering off all the chassis main PSUs. Also note that the chassis Service Required indicators (front and rear) should already have been illuminated in response to the Chassis High Warning threshold being reached, as described below.
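
The shutdown ordering described above (graceful attempt, 2-minute limit, forced power-off, then chassis power-off) can be sketched as a small simulation. This is an illustration under stated assumptions, not the CMM or SP implementation: the Blade class, its seconds_to_halt field, and both functions are hypothetical, and no real ILOM or IPMI call is used.

```python
# Sketch of the blade shutdown sequence described in this article.
# Blade, shut_down_blade and chassis_shutdown are hypothetical names.
GRACEFUL_TIMEOUT_S = 120  # 2 minutes allowed for a graceful shutdown


class Blade:
    def __init__(self, name, seconds_to_halt=None, powered=True):
        self.name = name
        # Time the OS needs to halt; None models an OS that never halts.
        self.seconds_to_halt = seconds_to_halt
        self.powered = powered


def shut_down_blade(blade):
    """Return how the blade ended up off: 'already off', 'graceful' or 'forced'."""
    if not blade.powered:
        # The blade's own thermal policy has already shut it down.
        return "already off"
    graceful = (blade.seconds_to_halt is not None
                and blade.seconds_to_halt <= GRACEFUL_TIMEOUT_S)
    # If the graceful attempt did not finish in time, the SP forces the
    # blade off (the article describes a 4-second ACPI button press).
    blade.powered = False
    return "graceful" if graceful else "forced"


def chassis_shutdown(blades):
    """Shut down every blade first; chassis PSU power-off would follow."""
    results = {blade.name: shut_down_blade(blade) for blade in blades}
    # The CMMs themselves keep running while AC is connected.
    return results


if __name__ == "__main__":
    demo = [Blade("BL0", 30), Blade("BL1", None), Blade("BL2", powered=False)]
    print(chassis_shutdown(demo))
```

The simulation uses a stored duration instead of real waiting, which keeps the ordering visible without modelling timers.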

Article for Message ID: 15.3.2

Chassis High Temperature Warning

Type
Fault

Severity
Minor

Description
The CMM has detected that the room ambient temperature is above the normal operating range.

Automated Response
Warning messages are being sent indicating that the room ambient temperature is above the normal operating range.

Impact
System operation has not been impacted yet, but it may be if the room ambient temperature is not restored to the normal operating range.

Suggested Action for System Administrator
Return the room ambient temperature to the normal operating range (< 36°C).

Details
When the CMM Room Ambient Sensor is at or above the Chassis High Warning threshold, the CMM software will illuminate the chassis front and rear Service Required indicators, display appropriate warnings on the CMM CLI and web interfaces, and send appropriate SNMP and IPMI alerts to indicate the fault. No other action is needed for this fault.

Article for Message ID: 15.3.3

Chassis Low Temperature Warning

Type
Fault

Severity
Minor

Description
The CMM has detected that the room ambient temperature is below the normal operating range.

Automated Response
Warning messages are being sent indicating that the room ambient temperature is below the normal operating range.

Impact
System operation has not been impacted yet, but it may be if the room ambient temperature is not restored to the normal operating range.

Suggested Action for System Administrator
Return the room ambient temperature to the normal operating range (> 4°C).

Details
When the CMM Room Ambient Sensor is at or below the Chassis Low Warning threshold, the CMM software will illuminate the chassis front and rear Service Required indicators, display appropriate warnings on the CMM CLI and web interfaces, and send appropriate SNMP and IPMI alerts to indicate the fault. No other action is needed for this fault.

Article for Message ID: 15.3.4

Chassis Low Software Shutdown

Type
Fault

Severity
Major

Description
Chassis ambient temperature is out of range.

Automated Response
The CMMs will communicate with all Blade SPs to force a shutdown.

Impact
The system chassis and all Blades have shut down. The CMMs continue to run.

Suggested Action for System Administrator
Return the room ambient temperature to the operational range (> -2°C).

Details
When the CMM Chassis Ambient sensor is at or below the Chassis Low Software Shutdown threshold, the CMM will communicate with each blade's SP and force a shutdown, followed by a full chassis power-off. Blade SPs should attempt a graceful shutdown first. If the graceful shutdown has not completed after 2 minutes, the SP will force an ungraceful shutdown with a 4-second ACPI button press. CMM software should also display appropriate fault information on the web and CLI interfaces, and send IPMI and SNMP alerts to indicate the fault. Note that there is no way to power off the CMMs, so they will continue to run as long as the PSUs are connected to AC mains. It is expected that most blades will have already shut down under their own thermal policies by the time the Chassis Low Software Shutdown threshold is reached, but in case they have not, the SC shall communicate with all blade SPs and attempt a graceful shutdown of each blade before powering off all the chassis main PSUs. Also note that the chassis Service Required indicators (front and rear) should already have been illuminated in response to the Chassis Low Warning threshold being reached, as described above.
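
The four ambient-temperature articles (15.3.1 to 15.3.4) can be summarized as one classification step. The sketch below is illustrative only. It assumes that the thresholds equal the values quoted in the Suggested Action lines (43, 36, 4 and -2 degrees C); the actual firmware thresholds are not stated in this document, and the function name classify_ambient is hypothetical.

```python
# Assumed thresholds in degrees C, inferred from the Suggested Action lines.
HIGH_SHUTDOWN_C = 43
HIGH_WARNING_C = 36
LOW_WARNING_C = 4
LOW_SHUTDOWN_C = -2


def classify_ambient(temp_c):
    """Return the Message ID raised for an ambient reading, or None if normal."""
    # Shutdown thresholds are checked before warnings because they are
    # the more severe condition on each side of the range.
    if temp_c >= HIGH_SHUTDOWN_C:
        return "15.3.1"  # Chassis High Software Shutdown (Major)
    if temp_c >= HIGH_WARNING_C:
        return "15.3.2"  # Chassis High Temperature Warning (Minor)
    if temp_c <= LOW_SHUTDOWN_C:
        return "15.3.4"  # Chassis Low Software Shutdown (Major)
    if temp_c <= LOW_WARNING_C:
        return "15.3.3"  # Chassis Low Temperature Warning (Minor)
    return None


if __name__ == "__main__":
    for reading in (45, 38, 22, 3, -5):
        print(reading, classify_ambient(reading))
```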

Article for Message ID: 15.4.7.2

Blade CPU Heatsink Over-temperature

Type
Fault

Severity
Major

Description
The Blade SP has determined that a CPU temperature is abnormally high.

Automated Response
The Blade "Service Required" LED has been illuminated by the Blade SP.

Impact
The CPU cannot be relied on to execute code reliably, and the Blade has been powered down.

Suggested Action for System Administrator
The Blade must be serviced. This will require the CPU heatsink to be re-seated.

Details
The CPU temperature has exceeded normal operating limits. The most likely cause is that the Blade CPU heatsink interface has been compromised; that is, the physical connection between the CPU and the heatsink has lost integrity. The heatsink must be removed, the mating surfaces of both the CPU and the heatsink cleaned, thermal grease re-applied, and the heatsink carefully re-attached to the CPU as per the instructions in the Sun Blade 8000 Modular System Installation Guide. (check reference)

Article for Message ID: 15.7.1

Power Supply not seeing AC power

Type
Fault

Severity
Minor

Description
The CMM has determined that AC input power is not present.

Automated Response
The chassis service LED has been lit by the CMM to indicate the fault. The AC OK LED on the affected power supply should be turned off by the power supply.

Impact
Redundancy has been lost. There may not be enough power available for all components in the chassis.

Suggested Action for System Administrator
Check that the power cord is plugged into the power supply and into the socket. Check that power is being supplied to the chassis, and check the power in the external environment.

Details
The lack of AC power to a PSU is considered a fault condition external to the chassis. Like external ambient air temperature faults, the "AC Not Present" fault is automatically repaired when AC power is restored to the affected Power Supply. When the problem is corrected, the power supply will detect that AC power has been restored. The CMM will then detect AC OK and automatically repair the fault.
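
The automatic repair described above can be pictured as a fault set that follows the AC OK signal of each power supply. This is a minimal sketch, not the CMM fault manager: the function update_ac_fault, the fault-set representation, and PSU names such as "PS0" are assumptions made for illustration.

```python
AC_NOT_PRESENT = "15.7.1"  # Message ID of the AC power article


def update_ac_fault(active_faults, psu_id, ac_ok):
    """Post or repair the AC Not Present fault for one PSU.

    active_faults is a set of (psu_id, message_id) pairs and is updated
    in place; it is also returned for convenience.
    """
    fault = (psu_id, AC_NOT_PRESENT)
    if ac_ok:
        # AC OK detected again: the fault is repaired without operator action.
        active_faults.discard(fault)
    else:
        active_faults.add(fault)
    return active_faults


if __name__ == "__main__":
    faults = set()
    update_ac_fault(faults, "PS0", ac_ok=False)
    print(faults)   # fault posted while AC is absent
    update_ac_fault(faults, "PS0", ac_ok=True)
    print(faults)   # fault cleared once AC returns
```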

Article for Message ID: 15.7.2

48V over-voltage

Type
Fault

Severity
Major

Description
The Power Supply has detected an over-voltage condition on its 48V output stage.

Automated Response
The Power Supply has illuminated the "Service Required" indicator.

Impact
The Power Supply has disabled the 48V output and powered itself down.

Suggested Action for System Administrator
If the Power Supply Service Required LED remains lit, this Power Supply must be replaced.

Details
The Power Supply has detected an over-voltage condition on its 48V output stage, illuminated its "Service Required" indicator, extinguished the DCOK indicator, and powered down. The CMM will attempt to reset the Power Supply and start it. If the internal condition persists, the "Service Required" indicator will remain lit and the Power Supply will have to be replaced.

Article for Message ID: 15.7.3

Power Supply Fan Failure

Type
Fault

Severity
Major

Description
The Power Supply has detected that its fan is running at zero or near-zero RPM.

Automated Response
The Power Supply "Service Required" LED has been illuminated by the Power Supply.

Impact
The Power Supply will power off, light its Service Required indicator, and extinguish its DCOK indicator.

Suggested Action for System Administrator
If the Power Supply Service Required LED remains lit, this Power Supply must be replaced.

Details
The Power Supply has detected that its fan is at or near zero RPM and has powered itself off. The fault may be spontaneously cleared by the CMM, but if the condition persists, the Power Supply will remain off, the Service Required indicator will be illuminated, and the Power Supply must be replaced.

Article for Message ID: 15.7.4

Power Supply Over Temperature

Type
Fault

Severity
Major

Description
The Power Supply has detected an over-temperature condition on its internal sensor.

Automated Response
The Power Supply has powered off its 48V output and illuminated its "Service Required" LED. How the condition is diagnosed depends on the ambient air temperature. If the ambient air temperature is within normal operating limits, the Power Supply over-temperature condition is diagnosed as a Power Supply fault. If the ambient air temperature is not within normal operating limits, the Power Supply over-temperature condition should be ignored.

Impact
The Power Supply has attempted to clear the fault and resume operation. If this condition is a short circuit in the Power Supply, the Supply will fail and power off.

Suggested Action for System Administrator
If the Power Supply Service Required LED remains lit, this Power Supply must be replaced.

Details
The PSU design incorporates thermal protection to prevent damage due to overheating. If there is an over-temperature condition internal to the PSU, defined as 45°C for more than 5 seconds measured at an internal sensor, the PSU will power off its 48V output and inform the CMM of the over-temperature condition via the RS-485 interface. If the ambient air temperature is within normal operating limits, the PSU over-temperature condition should be diagnosed as a PSU fault; if the ambient air temperature is over the 35°C limit, it should be ignored. Note that a PSU fan failure, which could otherwise cause an over-temperature condition, is separately diagnosed. Similarly, a chassis over-temperature condition is separately diagnosed as well. In the over-temperature condition, the PSU disables its 48V output, illuminates its Service Required indicator, and extinguishes its DCOK indicator. The PSU also remembers that it is in the ON state even though the 48V rail is OFF. This fault condition is self-clearing: the PSU will extinguish its Service Required indicator and re-enable its 48V output when the temperature drops below 45°C for more than 5 seconds.
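
The trip, self-clear, and diagnosis rules in this article can be sketched as follows. This is a simplified model under stated assumptions, not the PSU or CMM firmware: the class PsuThermalProtection and the function diagnose_overtemp are hypothetical, the trip is assumed to apply at or above 45 degrees C, and each reading is assumed to represent the whole interval dt_s since the previous reading.

```python
TRIP_C = 45.0           # internal sensor threshold from the article
HOLD_S = 5.0            # the condition must hold for more than 5 seconds
AMBIENT_LIMIT_C = 35.0  # ambient limit quoted in the article


class PsuThermalProtection:
    """Hypothetical model of the self-clearing over-temperature trip."""

    def __init__(self):
        self.output_enabled = True  # state of the 48V output
        self._held_s = 0.0          # how long the opposing condition has held

    def update(self, temp_c, dt_s):
        """Feed one reading covering dt_s seconds; return the output state."""
        # With the output on, time the hot condition; with the output off,
        # time the cool condition that lets the fault clear itself.
        if self.output_enabled:
            opposing = temp_c >= TRIP_C
        else:
            opposing = temp_c < TRIP_C
        self._held_s = self._held_s + dt_s if opposing else 0.0
        if self._held_s > HOLD_S:
            self.output_enabled = not self.output_enabled
            self._held_s = 0.0
        return self.output_enabled


def diagnose_overtemp(ambient_c):
    """Interpret a PSU over-temperature report using the ambient reading."""
    return "ignore" if ambient_c > AMBIENT_LIMIT_C else "PSU fault"


if __name__ == "__main__":
    psu = PsuThermalProtection()
    print([psu.update(50.0, 1.0) for _ in range(6)])  # output trips off
    print([psu.update(40.0, 1.0) for _ in range(6)])  # output comes back
    print(diagnose_overtemp(25.0))
```

Using the same hold time in both directions gives simple hysteresis in time, so a reading that hovers around the threshold does not toggle the output on every sample.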

Article for Message ID: 15.7.5

Power Supply 48V output is less than 0.1 ohms

Type
Fault

Severity
Major

Description
The Power Supply has detected an under-ohmage condition (less than 0.1 ohms) on its 48V output.

Automated Response
The Power Supply Service Required indicator is illuminated and the DCOK indicator is turned off.

Impact
The Power Supply has attempted to clear the fault and resume operation. If this condition persists in the Power Supply, the Supply will fail and power off.

Suggested Action for System Administrator
If the Power Supply Service Required LED remains lit, this Power Supply must be replaced.

Details
The Power Supply has detected an under-ohmage condition on its 48V output. The fault may spontaneously clear. However, if this is a short-circuit condition within the Power Supply, the Power Supply will fail and power down. The Power Supply must then be replaced.

Article for Message ID: 15.7.6

Power Supply has detected an over-current condition on the 48V output

Type
Fault

Severity
Major

Description
The Power Supply has detected an over-current condition on its 48V output.

Automated Response
The Power Supply Service Required indicator has been illuminated by the Power Supply and the DCOK indicator has been extinguished.

Impact
The Power Supply has stopped delivering power but remains on.

Suggested Action for System Administrator
The Power Supply has drawn too much current, probably due to a short circuit in or around the Power Supply. Try re-seating the Power Supply. Look for any damage on the Power Supply connector and the Midplane connector. If the Power Supply Service Required LED remains lit, this Power Supply must be replaced.

Details
The Power Supply over-current protection has been invoked because the 48V output current has remained above 68 Amperes for more than five (5) seconds. The likely cause of the over-current condition is a short circuit in or around the Power Supply. Look for any damage on the Power Supply connector and the Midplane connector. If the Power Supply Service Required LED remains lit, this Power Supply must be replaced.
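
The trip condition stated above (current above 68 Amperes for more than 5 seconds) can be checked against a series of timestamped readings. The sketch below is only an illustration of that rule. The function overcurrent_tripped and the (time, current) sample format are assumptions, and the actual PSU protection circuit is not described in this document.

```python
LIMIT_A = 68.0  # over-current limit from the article, in Amperes
HOLD_S = 5.0    # the current must stay above the limit for more than 5 s


def overcurrent_tripped(samples):
    """Return True if the current stayed above LIMIT_A for more than HOLD_S.

    samples is an iterable of (time_s, current_a) pairs in time order.
    """
    start = None  # time at which the current first rose above the limit
    for time_s, current_a in samples:
        if current_a > LIMIT_A:
            if start is None:
                start = time_s
            if time_s - start > HOLD_S:
                return True
        else:
            # Any reading at or below the limit restarts the timer.
            start = None
    return False


if __name__ == "__main__":
    sustained = [(0, 70.0), (2, 71.0), (4, 70.5), (6, 72.0)]
    brief = [(0, 70.0), (3, 60.0), (4, 70.0), (8, 70.0)]
    print(overcurrent_tripped(sustained), overcurrent_tripped(brief))
```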



Product
Sun Blade 8000
Sun Blade x8400 Server Module

Internal Comments
For the internal use of Sun employees.

This has been temporarily filed under the Sun Fire[TM] X4600 (G4) server until an Andromeda server (Sun Blade 8000 Modular System and Sun Fire[TM] X8400 Server Module blade) category is available.


This document will be linked to via the URL entry in the Fault Management message from the CMM. Additional fault diagnosis messages will be added to this document by PTS or Engineering as additional error condition diagnosis engines are developed and made available via CMM firmware updates.


The document was originally authored by Fred Waible (PTS Serviceability).


Blade, 8000, Andromeda, CMM, X8400, Fire, fault, error, diagnosis
Previously Published As
85878

Change History
Date: 2006-09-19
User Name: 7058
Action: Update Canceled
Comment: *** Restored Published Content *** Removing working copy because I simply added a metatag.
Version: 0
Date: 2006-09-19
User Name: 7058
Action: Update Started
Comment: Trying to figure out why SSH isn't picking this doc up.
Version: 0
Product_uuid
5b5f4739-c9c2-11da-857a-080020a9ed93|Sun Blade 8000
05ce1a70-c0f3-11da-857a-080020a9ed93|Sun Blade x8400 Server Module

Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.