Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-72-1019667.1
Update Date:2011-04-25
Keywords:

Solution Type  Problem Resolution Sure

Solution  1019667.1 :   Sun Fire[TM] Server System Board (SB) voltage errors.  


Related Items
  • Sun Fire E2900 Server
  • Sun Fire 6800 Server
  • Sun Fire 3800 Server
  • Sun Netra 1280 Server
  • Sun Fire V1280 Server
  • Sun Fire 4800 Server
  • Sun Netra 1290 Server
  • Sun Fire E4900 Server
  • Sun Fire E6900 Server
  • Sun Fire 4810 Server
Related Categories
  • GCS>Sun Microsystems>Servers>Midrange Servers
  • GCS>Sun Microsystems>Servers>Midrange V and Netra Servers

PreviouslyPublishedAs
243326


Applies to:

Sun Fire 4800 Server - Version: Not Applicable to Not Applicable   [Release: N/A to N/A]
Sun Fire E2900 Server - Version: Not Applicable to Not Applicable   [Release: N/A to N/A]
Sun Fire 3800 Server - Version: Not Applicable to Not Applicable   [Release: N/A to N/A]
Sun Fire 6800 Server - Version: Not Applicable to Not Applicable   [Release: N/A to N/A]
Sun Fire E6900 Server - Version: Not Applicable to Not Applicable   [Release: N/A to N/A]
All Platforms

Symptoms

This document describes how to identify and resolve Sun Fire[TM] Server System Board (SB) voltage errors.  The servers included in this product family are as follows:
  • Serengeti -  Sun Fire[TM] 3800, 4800, 4810, 6800, E4900, E6900
  • Lw8 - Sun Fire[TM] v1280, E2900 and Netra[TM] 1280, 1290
This document does not address I/O Board (IB) voltage errors.  See Document 1017844.1 if you have similar errors associated with an IB.

Error Messages:
Look for these "Key Indicators" of a voltage issue in error messaging in the System Controller's (SC) log files (showlogs) or console:
Path broken between CBH and SDC
Device voltage problem: /N0/SB#
Attempt to power up /N0/SB# failed
/N0/SB#, sensor status, outside acceptable limits

(where # is the board number)

Some examples of those "Key Indicators" from actual failure messaging found in showlogs files:

Fri Sep 26 11:45:46 sc lom: [ID 360430 local0.error] Device voltage problem: /N0/SB0 abnormal state for device: Board 0 3.3 VDC 0 Value: 0.37 Volts DC
Fri Sep 26 11:45:46 sc lom: [ID 322610 local0.notice] CPU Board V3 at /N0/SB0 Device poll caused: sun.serengeti.FailedHwException: (SdcAsic)Asic.getTemp: Path broken between CBH and SDC: SB0.sdc.10 (12000010)
Fri Sep 26 11:45:46 sc lom: [ID 336982 local0.notice] Device will not be polled
Fri Sep 26 11:45:46 sc lom: [ID 664082 local0.notice] CPU Board V3 at /N0/SB0 Device poll caused: sun.serengeti.FailedHwException: (ArAsic)Asic.getTemp: Path broken between CBH and SDC: SB0.ar.10 (12080010)
Fri Sep 26 11:45:46 sc lom: [ID 336982 local0.notice] Device will not be polled


Sat Sep 27 06:16:24 sc lom: [ID 395834 local0.error] Attempt to power up /N0/SB0 failed: /N0/SB0 3.3V DC failed, observed: 0.15 volts
Sat Sep 27 06:16:25 sc lom: [ID 503827 local0.error] sun.serengeti.HpuFailedException: CPU Board V3 at /N0/SB0
Sat Sep 27 06:16:25 sc lom: [ID 889337 local0.notice] sun.serengeti.CommException
Sat Sep 27 06:16:29 sc lom: [ID 304509 local0.error] No usable Cpu board in domain.


Wed Oct 01 21:56:10 sc lom: [ID 390680 local0.notice] CPU Board V3 at /N0/SB0 Device poll caused: sun.serengeti.HpuFailedException: CpuVoltageA2D.getOutputVoltage: sun.serengeti.CommException: I2cComm.readCmd:  Path broken between CBH and SDC: SB0.sbbc1.regs.c0 (102000c0)
Wed Oct 01 21:56:10 sc lom: [ID 336982 local0.notice] Device will not be polled
Wed Oct 01 21:56:10 sc lom: [ID 120592 local0.notice] /N0/SB0, sensor status, outside acceptable limits (7,1,0x207000d00070000)
All examples above showed SB0, but the board in question could be any SB in the system and the errors would generally be similar. 
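
As a rough illustration, the key indicators can be searched for in saved showlogs output with a short script. This is only a sketch: the function name find_voltage_indicators and the indicator labels are hypothetical, the patterns are copied from the messages quoted in this article, and it assumes the log has been saved as plain text.

```python
import re

# Patterns derived from the "Key Indicators" quoted in this article.
# The captured group is the board number (the '#' in the article).
KEY_INDICATORS = {
    "path_broken": re.compile(r"Path broken between CBH and SDC: SB(\d+)\."),
    "voltage_problem": re.compile(r"Device voltage problem: /N0/SB(\d+)"),
    "power_up_failed": re.compile(r"Attempt to power up /N0/SB(\d+) failed"),
    "sensor_limits": re.compile(
        r"/N0/SB(\d+), sensor status, outside acceptable limits"
    ),
}

# Sample lines taken from the showlogs examples in this article.
SAMPLE_LOG = """\
Fri Sep 26 11:45:46 sc lom: [ID 360430 local0.error] Device voltage problem: /N0/SB0 abnormal state for device: Board 0 3.3 VDC 0 Value: 0.37 Volts DC
Fri Sep 26 11:45:46 sc lom: [ID 322610 local0.notice] CPU Board V3 at /N0/SB0 Device poll caused: sun.serengeti.FailedHwException: (SdcAsic)Asic.getTemp: Path broken between CBH and SDC: SB0.sdc.10 (12000010)
Fri Sep 26 11:45:46 sc lom: [ID 336982 local0.notice] Device will not be polled
Sat Sep 27 06:16:24 sc lom: [ID 395834 local0.error] Attempt to power up /N0/SB0 failed: /N0/SB0 3.3V DC failed, observed: 0.15 volts
Wed Oct 01 21:56:10 sc lom: [ID 120592 local0.notice] /N0/SB0, sensor status, outside acceptable limits (7,1,0x207000d00070000)
"""


def find_voltage_indicators(log_text):
    """Return {board_number: set of indicator labels} found in log_text."""
    hits = {}
    for line in log_text.splitlines():
        for label, pattern in KEY_INDICATORS.items():
            match = pattern.search(line)
            if match:
                hits.setdefault(int(match.group(1)), set()).add(label)
    return hits


if __name__ == "__main__":
    for board, labels in sorted(find_voltage_indicators(SAMPLE_LOG).items()):
        print(f"SB{board}: {', '.join(sorted(labels))}")
```

A match only suggests that the board deserves a closer look; the diagnosis still has to be confirmed through a service request.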

The showenvironment command may also show an "ERROR LOW" status for the SB and a 3.3 VDC sensor value of 0.xx (in other words, less than the LoWarn value).
  • This output is extremely useful for diagnosing power failure events when showlogs data has scrolled off the error buffer, or when the other symptoms described in this article are missing from the error logs for any other reason.
  • The example below shows an ERROR LOW status for SB0.  SB2 is included to show what "normal" values look like by comparison:
lom> showenvironment -v
Slot     Device   Sensor     Min   LoWarn  Value  HiWarn  Max   Units     Age    Status
-------  -------  ---------  ----  ------  -----  ------  ----  --------  -----  -----------------
***** Results truncated for this example *****
/N0/SB0  Board 0  3.3 VDC 0  2.97  3.13    0.49   3.47    3.63  Volts DC  5 min  *** ERROR LOW ***
***** Results truncated for this example *****
/N0/SB2  Board 0  1.5 VDC 0  1.35  1.42    1.51   1.58    1.65  Volts DC  9 sec  OK
/N0/SB2  Board 0  3.3 VDC 0  2.97  3.13    3.27   3.47    3.63  Volts DC  9 sec  OK
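
The comparison of Value against LoWarn can be sketched in a few lines of Python. This is a hedged illustration only: the function name sensors_below_lowarn and the row pattern are assumptions based on the three voltage rows shown in the example above, and other row types in real showenvironment output (temperatures, fans) are simply skipped.

```python
import re

# Matches voltage rows as they appear in the example above:
# slot, device, sensor name, then Min, LoWarn, Value, HiWarn, Max, "Volts DC".
ROW = re.compile(
    r"^(?P<slot>/N0/\S+)\s+(?P<device>Board\s+\d+)\s+(?P<sensor>.+?)\s+"
    r"(?P<min>-?\d+\.\d+)\s+(?P<lowarn>-?\d+\.\d+)\s+(?P<value>-?\d+\.\d+)\s+"
    r"(?P<hiwarn>-?\d+\.\d+)\s+(?P<max>-?\d+\.\d+)\s+Volts\s+DC"
)

# Rows copied from the showenvironment example in this article.
SAMPLE_OUTPUT = """\
/N0/SB0 Board 0 3.3 VDC 0 2.97 3.13 0.49 3.47 3.63 Volts DC 5 min *** ERROR LOW ***
/N0/SB2 Board 0 1.5 VDC 0 1.35 1.42 1.51 1.58 1.65 Volts DC 9 sec OK
/N0/SB2 Board 0 3.3 VDC 0 2.97 3.13 3.27 3.47 3.63 Volts DC 9 sec OK
"""


def sensors_below_lowarn(output):
    """Return (slot, sensor, value, lowarn) for each voltage row below LoWarn."""
    flagged = []
    for line in output.splitlines():
        match = ROW.match(line.strip())
        if match is None:
            continue  # not a voltage row in the expected layout
        value = float(match["value"])
        lowarn = float(match["lowarn"])
        if value < lowarn:
            flagged.append((match["slot"], match["sensor"], value, lowarn))
    return flagged


if __name__ == "__main__":
    for slot, sensor, value, lowarn in sensors_below_lowarn(SAMPLE_OUTPUT):
        print(f"{slot} {sensor}: {value} V is below LoWarn {lowarn} V")
```

Comparing the numbers directly, rather than relying on the Status text alone, keeps the check usable when the captured output has been truncated or reformatted.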

Expected Behavior:
When an SB encounters a voltage error before its domain is booted or in operation, the domain will fail at one of the following points: POST, domain or board poweron, a keyswitch operation, or boot.  If the domain is already in operation, it will crash when an SB encounters a voltage issue.

If the domain crashes, showlogs data might indicate a variety of Parity Error events, such as an Address Parity Error, Parity Bidi Event, L2CheckError Event, or others.  The most important thing to note is that when a domain crash is accompanied by one of the key indicators listed above, the root cause is likely to be a voltage issue that caused the Parity Error event, not the other way around.  See the Additional Information section of this article for an example.

Cause

If you encounter the errors described previously, you need to contact Oracle Support Services and open a service request.

Resolution:
An error of this nature is caused by a defective power supply located directly on the System Board (called the D150).  This is a factory-repaired component.

The resolution is to replace the System Board.

Solution

Customers should do the following:

A service request is required to schedule the SB replacement.  Customers should contact Support Services and create a service request to resolve this problem.
  • To speed up resolution, make sure to have the appropriate data available for the support engineer:
    • Sun Fire v1280, E2900, or Netra 1280, 1290 require 1280extended Explorer; 
      See Document: 1019066.1 for details
    • Sun Fire 3800-E6900 systems require scextended Explorer; See Document: 1019066.1 for details
  • Mention this knowledge article to speed up the resolution process. 
  • A Sun Field Engineer will be dispatched to take care of the SB replacement after the error is confirmed per normal process.
  • The Main SC should be rebooted following the board replacement.
    • This is to avoid the chance of encountering Sun CR 6777187 (symptoms similar to CR 6300392) observed from some ScApp 5.20.3 (or greater) installations.
NOTE: For Sun Fire v1280, E2900, Netra 1280, or 1290, also implement the Additional Cooling Action advice detailed in Sun Alert 1021703.1.

Internal Support Services instructions are listed in the Internal Section of this article.

Additional Information

Parity Errors Caused by Voltage Errors
As described above, SB voltage errors can cause Parity Error events if a domain is operational when an SB fails.  The root cause is the SB voltage issue.  This is important to understand because a board that suddenly disappears due to a voltage problem can trigger an address or data parity error (or any number of other strange-looking error events).  A parity (or data, or l2check, or other) error does not cause a voltage problem - a voltage problem DOES cause all of the above.

The following is an example of just one type of error you might see:
Wed Oct 01 21:55:56 sc lom: [ID 385625 local0.error]
/RP0/ar0:> SafariPortError0[0x200] : 0x00008005
                 AdrPErr [00:00] : 0x1 Address parity error
                      FE [15:15] : 0x1
                 QUnfErr [02:02] : 0x1 Queue underflow error

Wed Oct 01 21:55:57 sc lom: [ID 197878 local0.error]
Wed Oct 01 21:55:58 sc lom: [ID 841584 local0.error] [AD] Event: N1290.ASIC.AR.ADR_PERR.104a3000
     CSN:  DomainID: A ADInfo: 1.SCAPP.20.6
     Time: Wed Oct 01 21:57:50 PDT 2008
     FRU-List-Count: 2; FRU-PN: 5411384; FRU-SN: 006232; FRU-LOC: /N0/RP0
                        FRU-PN: 5406679; FRU-SN: 005914; FRU-LOC: /N0/SB0
     Recommended-Action: Service action required

Wed Oct 01 21:56:10 sc lom: [ID 390680 local0.notice] CPU Board V3 at /N0/SB0 Device poll caused: sun.serengeti.HpuFailedException: CpuVoltageA2D.getOutputVoltage: sun.serengeti.CommException: I2cComm.readCmd:  Path broken between CBH and SDC: SB0.sbbc1.regs.c0 (102000c0)
Wed Oct 01 21:56:10 sc lom: [ID 336982 local0.notice] Device will not be polled
Wed Oct 01 21:56:10 sc lom: [ID 120592 local0.notice] /N0/SB0, sensor status, outside acceptable limits (7,1,0x207000d00070000)


Collaborate with the Oracle Support Engineer if you need further explanation of this concept.  Not every component implicated in the error should be replaced.  Replace only the SB that suffered the voltage error, and reset the CHS status of the other components, which are perfectly sound.  Their parity errors were a natural response to the SB that lost power suddenly "disappearing" from the configuration.
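
The replace-versus-reset reasoning can be expressed as a small sketch. It is illustrative only: the function name triage is hypothetical, the patterns are taken from the messages quoted in this article, and the result is a starting point for discussion with the support engineer rather than a diagnosis.

```python
import re

# FRU locations listed in an [AD] event, for example "FRU-LOC: /N0/RP0".
FRU_LOC = re.compile(r"FRU-LOC:\s*(/N0/\S+)")

# Voltage indicators that name a specific System Board.
VOLTAGE = re.compile(
    r"Device voltage problem: (/N0/SB\d+)"
    r"|Attempt to power up (/N0/SB\d+) failed"
    r"|(/N0/SB\d+), sensor status, outside acceptable limits"
)

# Lines copied from the example in this article.
SAMPLE_LOG = """\
     FRU-List-Count: 2; FRU-PN: 5411384; FRU-SN: 006232; FRU-LOC: /N0/RP0
                        FRU-PN: 5406679; FRU-SN: 005914; FRU-LOC: /N0/SB0
Wed Oct 01 21:56:10 sc lom: [ID 120592 local0.notice] /N0/SB0, sensor status, outside acceptable limits (7,1,0x207000d00070000)
"""


def triage(log_text):
    """Split implicated components into (replace, reset_chs) lists.

    Boards with a voltage indicator are candidates for replacement; every
    other FRU named in the event is a candidate for a CHS status reset.
    """
    implicated = set(FRU_LOC.findall(log_text))
    voltage_boards = set()
    for match in VOLTAGE.finditer(log_text):
        # Exactly one of the three alternatives captured the board path.
        voltage_boards.add(next(group for group in match.groups() if group))
    return sorted(voltage_boards), sorted(implicated - voltage_boards)


if __name__ == "__main__":
    replace, reset_chs = triage(SAMPLE_LOG)
    print("Replace:", replace)
    print("Reset CHS status:", reset_chs)
```

For the sample lines, the board with the voltage indicator is separated from the repeater board that was only implicated by the parity event, which mirrors the guidance above.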


Internal Comments
INTERNAL SUPPORT ENGINEERS SHOULD DO THE FOLLOWING PRIOR TO FIELD DISPATCH:

1. Confirm the errors match what was described in the Symptoms section of this article.

2. Schedule to have the implicated System Board replaced.

3. For Sun Fire v1280, E2900, Netra 1280, or 1290 systems, make sure the customer implements the Cooling Action Advice documented in Field Action Bulletin 1021064.1 or Sun Alert 1021703.1 (contract reference) on all v1280, E2900, n1280, or n1290 systems in their environment.

4. Reset the CHS status of all other components back to 'ok'.  Instructions are in 1004879.1.

5. Execute POST testing on the configuration (after SB replacement and CHS status resets).  Reboot the Main SC to avoid any chance of encountering Sun CR 6777187 (symptoms similar to CR 6300392) observed from some ScApp 5.20.3 (or greater) installations.

6. Collaborate with the next level of technical support if errors persist or there are any questions with this recommendation.

Additional Information:
- For I/O Board (IB) voltage errors, see 1017844.1
- If encountering difficulty with the replacement action and specifically when trying to issue a "setkeyswitch on", see 1011267.1.
 

Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.