Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-72-1017844.1
Update Date:2011-02-03
Keywords:

Solution Type  Problem Resolution Sure

Solution  1017844.1 :   Sun Fire[TM] MidRange Server I/O Board (IB) power supply failures.  


Related Items
  • Sun Fire 4810 Server
  • Sun Fire 3800 Server
  • Sun Netra 1290 Server
  • Sun Fire 6800 Server
  • Sun Fire E6900 Server
  • Sun Fire 4800 Server
  • Sun Fire E2900 Server
  • Sun Fire V1280 Server
  • Sun Fire E4900 Server
  • Sun Netra 1280 Server
Related Categories
  • GCS>Sun Microsystems>Servers>Midrange Servers
  • GCS>Sun Microsystems>Servers>Midrange V and Netra Servers

PreviouslyPublishedAs
229081


Applies to:

Sun Fire 3800 Server - Version: Not Applicable and later    [Release: N/A and later]
Sun Fire 4800 Server - Version: Not Applicable and later    [Release: N/A and later]
Sun Fire 4810 Server - Version: Not Applicable and later    [Release: N/A and later]
Sun Fire 6800 Server - Version: Not Applicable and later    [Release: N/A and later]
Sun Fire E4900 Server - Version: Not Applicable and later    [Release: N/A and later]
All Platforms

Symptoms

This document pertains to I/O Board (IB) power failures in Sun Fire[TM] servers.

These boards can fail in scenarios similar to the following (note that the reported voltages, and the IB location, will differ from case to case):

Main-SC:SC> poweron ib9
Dec 07 12:13:20 Main-SC Platform.SC: Attempt to power up /N0/IB9 failed: /N0/IB9 1.5V DC failed, observed: 0.0 volts /N0/IB9 3.3V DC failed, observed: 0.58 volts /N0/IB9: powered on
showenvironment may report ERROR LOW for the affected device and should also report how low the voltage is. For example, see IB9 below:
sc% showenvironment -v
Slot     Device   Sensor     Min   LoWarn  Value  HiWarn  Max   Units     Age    Status
-------  -------  ---------  ----  ------  -----  ------  ----  --------  -----  ------
***** Results truncated for this example *****
/N0/IB7  Board 0  1.5 VDC 0  1.35  1.42    1.49   1.57    1.65  Volts DC  8 sec  OK
/N0/IB7  Board 0  3.3 VDC 0  2.97  3.13    3.31   3.46    3.63  Volts DC  8 sec  OK
***** Results truncated for this example *****
/N0/IB9  Board 0  1.5 VDC 0  1.35  1.42    0.0    1.57    1.65  Volts DC  8 sec  *** ERROR LOW ***
/N0/IB9  Board 0  3.3 VDC 0  2.97  3.13    0.58   3.46    3.63  Volts DC  8 sec  *** ERROR LOW ***
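
When a saved showenvironment -v capture is long, it can help to pick out the failing sensors automatically. The following Python sketch is only an illustration, not a Sun or Oracle tool. It assumes each data row follows the layout of the truncated sample above (slot, device and sensor label, then Min, LoWarn, Value, HiWarn, Max, units, age, status); real output may contain rows this pattern does not cover. The names ROW and out_of_range_sensors are hypothetical.

```python
import re

NUM = r"-?\d+(?:\.\d+)?"

# Assumed row layout, inferred from the truncated sample in this article:
# slot, device + sensor label, Min, LoWarn, Value, HiWarn, Max, units, age, status.
# The label is matched greedily so that digits inside it ("Board 0 1.5 VDC 0")
# are not mistaken for the five limit/value columns.
ROW = re.compile(
    rf"^(?P<slot>/\S+)\s+(?P<label>.+)\s+"
    rf"(?P<min>{NUM})\s+(?P<lowarn>{NUM})\s+(?P<value>{NUM})\s+"
    rf"(?P<hiwarn>{NUM})\s+(?P<max>{NUM})\s+"
    rf"(?P<units>[A-Za-z][A-Za-z ]*?)\s+(?P<age>\d+\s+sec)\s+(?P<status>.+)$"
)


def out_of_range_sensors(lines):
    """Return (slot, label, value, min, max) for rows whose Value is outside Min..Max."""
    flagged = []
    for line in lines:
        match = ROW.match(line.strip())
        if match is None:
            continue  # header, separator, or "Results truncated" line
        value = float(match["value"])
        low, high = float(match["min"]), float(match["max"])
        if value < low or value > high:
            label = " ".join(match["label"].split())  # collapse column padding
            flagged.append((match["slot"], label, value, low, high))
    return flagged


# Rows copied from the example in this article.
sample = [
    "Slot Device Sensor Min LoWarn Value HiWarn Max Units Age Status",
    "/N0/IB7 Board 0 1.5 VDC 0 1.35 1.42 1.49 1.57 1.65 Volts DC 8 sec OK",
    "/N0/IB7 Board 0 3.3 VDC 0 2.97 3.13 3.31 3.46 3.63 Volts DC 8 sec OK",
    "/N0/IB9 Board 0 1.5 VDC 0 1.35 1.42 0.0 1.57 1.65 Volts DC 8 sec *** ERROR LOW ***",
    "/N0/IB9 Board 0 3.3 VDC 0 2.97 3.13 0.58 3.46 3.63 Volts DC 8 sec *** ERROR LOW ***",
]

for slot, label, value, low, high in out_of_range_sensors(sample):
    print(f"{slot} {label}: {value} is outside {low} to {high}")
```

With the sample rows, only the two /N0/IB9 sensors are reported. The check compares Value against Min and Max rather than trusting the Status text, so it does not depend on the exact wording of the status column.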

Errors seen during operation that result in a domain outage may look like this:

Mar 09 14:12:30 Sunfire Platform.SC: [ID 920508 local0.notice] CPCI I/O Board (F3800) at /N0/IB8 Device poll caused: sun.serengeti.FailedHwException: (SdcAsic)Asic.getTemp: Path broken between CBH and SDC: IB8.sdc.10 (13000010)

or also like this:

Mar 09 14:12:31 Sunfire Platform.SC: [ID 818977 local0.notice] /N0/IB8, sensor status, outside acceptable limits (7,1,0x503080d00050000)

or perhaps like this:

Mar 18 20:27:45 Sunfire Platform.SC: Device voltage problem: /N0/IB8 abnormal state for device: Board 0 1.5 VDC 0 Value: 0.0 Volts DC JtagController.tapWait: sun.serengeti.CommException: Path broken between CBH and SDC: IB8.sdc.b0 (130000b0)

Lastly, an error of this type in POST may appear as follows:

Hardware error occurred during Interconnect testing: Sun.serengeti.HpuFailedException: RepeaterHpu.verifyInterConnect: Slot 8: sun.serengeti.FailedHwException: Asic.getDeviceID: /partition0/domain0/IB8/ar0: sun.serengeti.CommException: Path broken between CBH and SDC: IB8.ar.0 (13080000): PCI I/O Board at /N0/IB8
Mar 18 20:29:01 Sunfire Domain-A.SC: Excluded unusable, unlicensed, failed or disabled board: /N0/IB8
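
Several of the messages above contain the text "Path broken between CBH and SDC" followed by a component tag that begins with the board name (IB8.sdc.10, IB8.sdc.b0, IB8.ar.0). As a rough aid when reviewing a saved log, the Python sketch below counts those alerts per board. It is an illustration only. It assumes the tag always starts with "IB" followed by the slot number, as in the examples here, and the names PATH_BROKEN and count_path_broken are hypothetical.

```python
import re

# The component tag after the colon (for example "IB8.sdc.10") is assumed
# to start with the board name, as in the messages quoted in this article.
PATH_BROKEN = re.compile(r"Path broken between CBH and SDC:\s*(IB\d+)\.")


def count_path_broken(log_lines):
    """Count 'Path broken between CBH and SDC' alerts per I/O board."""
    counts = {}
    for line in log_lines:
        for board in PATH_BROKEN.findall(line):
            counts[board] = counts.get(board, 0) + 1
    return counts


# Fragments of the messages quoted in this article.
log_excerpt = [
    "(SdcAsic)Asic.getTemp: Path broken between CBH and SDC: IB8.sdc.10 (13000010)",
    "/N0/IB8, sensor status, outside acceptable limits (7,1,0x503080d00050000)",
    "sun.serengeti.CommException: Path broken between CBH and SDC: IB8.sdc.b0 (130000b0)",
    "sun.serengeti.CommException: Path broken between CBH and SDC: IB8.ar.0 (13080000): PCI I/O Board at /N0/IB8",
]

for board, hits in sorted(count_path_broken(log_excerpt).items()):
    print(f"{board}: {hits} path-broken alert(s)")
```

A count like this only narrows down which board to look at. The voltage readings from showenvironment are still needed to confirm a power failure.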

Notes:

  • The symptoms described above are not an exhaustive list of the messages related to this issue. The full list of possible errors associated with this issue is suspected to be very long, but the common symptom appears to be the voltage and power related errors shown above (especially the Path broken between CBH and SDC alerts).
  • This document also applies to Sun Fire[TM] V1280, E2900, and Netra 1280 and 1290 systems. They are not specifically listed because they use IB_SSC boards (not IBs).  If this event takes place on one of these servers, the IB_SSC is implicated, but the System Controller (which is integrated) is not reachable, so the errors cannot even be viewed.
  • Contact Oracle Support Services if you are unsure whether this document applies to your particular situation.

Cause

This error indicates a power failure of the I/O Board.

The board needs to be replaced.

Solution

To have the board replaced, customers need to open a Service Request with Oracle Support Services and schedule replacement of the IB or IB_SSC.
  • Please provide the showlogs -v or console log output from the Main SC showing the error messages, as well as showenvironment output that confirms the board's voltage.
  • Also provide showboards command output, which helps the engineer validate the I/O Board type.
  • Refer to this knowledge article so the engineer is quickly able to validate the error messages.

Recommended Action Plan for Sun Support Services Engineers:

1) Validate that the error messages are as described in this article.  If multiple boards show this error at the same time, escalate the issue instead of proceeding.

2) Verify the type of IB or IB_SSC involved (see showboards output which will identify the I/O Board as PCI, PCI+, or PCI-X).

3) Make sure the System Controller (SC) is at ScApp 5.20.3 or higher (a later release is preferred) to avoid CR 6300392 if the configuration includes adjacent domains.  See the "Additional Information" section of this article for details.

4) Dispatch the IB or IB_SSC replacement per normal process.

Additional Information

"Adjacent Domain Issue"

Testing has shown that an I/O Board failure can result in an outage of an adjacent domain if your version of ScApp is below 5.20.3.

  • Adjacent domains are domains that operate within the same partition: Domains A and B are adjacent, as are Domains C and D.
  • This behavior is CR 6300392 and is resolved in ScApp 5.20.3.
    • The failure is reproduced by forcing an I/O Board power failure (which we did in lab testing) and then rebooting the adjacent domain.  The rebooted domain eventually encounters an L2CheckError event (a fatal error thought to be a hardware issue).
    • On its own, the L2CheckError appears to be a legitimate hardware fault and would prompt replacement of a hardware component.

We found that the timing of the adjacent domain reboot is inconsequential. The domain could be rebooted 10 minutes or 10 months following the IB failure and the result would be the same. Fortunately, we confirmed a workaround to avoid this situation (prior to resolving it via ScApp update):

Reboot the Main SC following an IB failure and before the adjacent domain is rebooted.

In summary, if you encounter an IB power failure, have adjacent domains, and are running a version of ScApp lower than 5.20.3, reboot the Main SC proactively to avoid an adjacent domain issue.

The better course is to upgrade ScApp via patch 114527, which avoids the issue altogether.
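
The decision described in this section can be written down as a small check. The Python sketch below is an illustration only. It assumes the ScApp version is available as a plain dotted string such as "5.20.3" (the actual version output on an SC may be formatted differently), and the function names are hypothetical.

```python
def parse_version(text):
    """Turn a dotted version such as '5.20.3' into a tuple of ints for comparison."""
    return tuple(int(part) for part in text.strip().split("."))


# Release that this article says resolves CR 6300392.
FIXED_IN = parse_version("5.20.3")

# Adjacent domain pairs as described in this article (same partition).
ADJACENT = {"A": "B", "B": "A", "C": "D", "D": "C"}


def adjacent_domain(domain):
    """Return the domain that shares a partition with the given domain."""
    return ADJACENT[domain.upper()]


def should_reboot_main_sc(scapp_version, adjacent_domain_in_use):
    """After an IB power failure, is the Main SC reboot workaround needed?

    It applies only when an adjacent domain is configured and ScApp is
    older than the release containing the fix.
    """
    return adjacent_domain_in_use and parse_version(scapp_version) < FIXED_IN


print(should_reboot_main_sc("5.20.2", True))   # older ScApp with an adjacent domain
print(should_reboot_main_sc("5.20.3", True))   # fixed release
```

Versions are compared as tuples of integers rather than as strings so that, for example, 5.20.10 is treated as newer than 5.20.3.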


Internal Information

References
BugID 6401739
The part number (non-FRU) of the D108 power supply is 300-1345.
D108 information:
The main issue discussed within this document is known as the D108 Power Supply failure.
The D108 is the DC-DC converter on the PCI boards for these servers.  If there is any concern
about repeat I/O Board power failures, confirm that the replacement board is at least part
number 540-4616-05 or at least part number 540-4591-04, depending on which part is needed.
Escalate to the next level of technical support if you have questions regarding this document.

Previously Published As 83696

Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.