Sun Fire[TM] 12K/15K/20K/25K: DR not possible when CSB is blacklisted and domain configured with both Centreplane halves active

Asset ID:	1-72-1009533.1
Update Date:	2010-06-18
Keywords:

Solution Type Problem Resolution Sure

Solution 1009533.1 : Sun Fire[TM] 12K/15K/20K/25K: DR not possible when CSB is blacklisted and domain configured with both Centreplane halves active

Applies to:

Sun Fire 12K Server
Sun Fire 15K Server
Sun Fire E20K Server
Sun Fire E25K Server
All Platforms

Symptoms

It has been observed that when a CSB (Centreplane Support Board) loses one of
its redundant power supplies, an attempt to DR (Dynamic Reconfiguration) a
system board into a domain will fail.

This is due to the design of the hpost process. To DR in a system board, hpost
attempts to set the board up with the same Address, Response and Data bus
configuration that the running domain already uses. As the blacklisted CSB
makes that configuration unachievable, hpost fails the SB (System Board).

This is, in fact, not a bug. By design, hpost (Host Power On Self Test) does
not change the current bus configuration as part of a DR operation.

hpost can indeed make bus configuration decisions when the domain is being
started up from cold, just not during a DR. The method to change the bus
configuration on a running domain is 'setbus'.

Solution

Resolution
This failure is characterized by POST failing very early.

The complete POST log below shows this. The only real details it offers are
the ASR blacklist entry written by ESMD (Environmental Status Monitoring
Daemon), which lists "cplane 1" (aka CSB1) as disabled, and the message
"No minimum system left after blacklist file!".

So, there is not a great deal to go on in the POST log, but there are other
places we can look for the data.

First - The hpost log we have already discussed:

========================================================================
# SMI Sun Fire 12/15/20/25K POST log opened Tue Jun 13 00:39:13 2006
# hpost version 1.5 Generic 120648-04 Apr 24 2006 12:10:28
# libxcpost.so v. 1.5 Generic 120648-04 Apr 24 2006 11:48:42
# pid = 9538 level = 16 verbose_level = 20
# SC name: e25k1-sc1. ChHostID: XX00XX00XX00X
# Domain Id = A
# Parent PID = 6081: dxs
# Cmdline: /opt/SUNWSMS/SMS1.5/bin/hpost -dA -H16.0

Significant contents of .postrc (platform)
/etc/opt/SUNWSMS/SMS1.5/config/platform/.postrc:
# ident "@(#)postrc 1.1 01/04/02 SMI"
Reading domain blacklist file /etc/opt/SUNWSMS/config/A/blacklist ...
# ident "@(#)blacklist 1.1 01/04/02 SMI"
Reading platform blacklist file /etc/opt/SUNWSMS/config/platform/blacklist ...
# ident "@(#)blacklist 1.1 01/04/02 SMI"
Reading system ASR blacklist file /etc/opt/SUNWSMS/config/asr/blacklist ...
cplane 1 # ESMD Power Failure 0610.1718.56
SEEPROM probe took 0 seconds.
Reading Component Health Status (CHS) information ...
No minimum system left after blacklist file! Bailing out!
Exitcode = 48: No system after domain, .postrc, blacklist, etc.
POST (level=16, verbose=20, -H16.0) execution time 1:09
# SMI Sun Fire 12/15/20/25K POST log closed Tue Jun 13 00:40:22 2006

========================================================================
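When a POST log is long, the two lines that matter here can be pulled out with
grep. The following is only a minimal sketch: the function name
summarise_post_log is hypothetical, and the log path is whatever file you have
saved the POST log to. It relies only on the line formats visible in the log
above (the "# ESMD" comment on blacklist entries and the "Exitcode" line).

```shell
#!/bin/sh
# Sketch only: summarise_post_log is a hypothetical helper, not an SMS command.
# Usage: summarise_post_log /path/to/saved/post.log
summarise_post_log() {
    log="$1"
    # Blacklist entries made by ESMD carry a "# ESMD ..." comment.
    grep '# ESMD' "$log"
    # hpost states its final verdict on the "Exitcode = N: ..." line.
    grep '^Exitcode' "$log"
}
```

Run against the log shown above, this would print the "cplane 1" blacklist
entry followed by the Exitcode = 48 line.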

Then we have the platform log, located in
/var/opt/SUNWSMS/adm/platform/messages, or in
/<explorer-dir>/sf15k/adm/platform/messages if you are checking an explorer.

In this log, there may be messages that tell us more about previous failures.
In the case that prompted this document, there had been a prior CSB power
supply failure.

========================================================================
Jun 13 00:29:40 2006 e25k1-sc1 esmd[5620]: [2000 2674760785927349 ERR
SysControl.cc 1536] A failure has been detected on redundant PS at
ps1_power_good_l; located on CSB at CS1. SCHEDULE REPLACEMENT of CSB at CS1 as
soon as possible to restore redundancy.
========================================================================
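To find such messages quickly in a large platform log, something like the
following can help. This is a sketch under assumptions: find_ps_failures is a
hypothetical helper name, and it assumes each esmd message sits on a single
line in the real log (it is wrapped above only for display). It matches on the
"SCHEDULE REPLACEMENT of ... as soon" wording shown in the message above.

```shell
#!/bin/sh
# Sketch only: find_ps_failures is a hypothetical helper, not an SMS command.
# Usage: find_ps_failures /var/opt/SUNWSMS/adm/platform/messages
find_ps_failures() {
    # Keep only the component named after "SCHEDULE REPLACEMENT of",
    # then collapse repeated reports of the same component.
    grep 'SCHEDULE REPLACEMENT' "$1" |
        sed 's/.*SCHEDULE REPLACEMENT of \(.*\) as soon.*/\1/' |
        sort -u
}
```

Each output line names one component that ESMD has flagged for replacement.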

Of course, this board has redundant power supplies, so the platform kept
running after this failure. However, as the message notes, we should schedule
a replacement as soon as possible.

The catch is that this failure also causes an entry to be made in the ASR
blacklist, which hpost must obey.

From the Solaris[TM] side, within the domain, all you get from this type of
failure is a somewhat generic error:

# cfgadm -c configure SB13
Jun 23 10:13:13 v4u-15ka-e-epar02 drmach: WARNING: SMS hpost reported
error, see POST log for details
cfgadm: Hardware specific failure: test SB13: SMS hpost reported error,
see POST log for details

Of course, this directs us to check the POST output.

So - We have set the scene, and now know that with the CSB partially failed,
and listed as blacklisted in the ASR blacklist, we can't DR a system board into
the domain.

What is the solution?

The only supported and sensible answer to this question is to replace the CSB
with the failed power supply!

An example process follows. (Note: These processes are covered in great detail
in the 15K and 25K service manuals. This document supplies only the minimum
detail.)

Let's assume that the failed CSB is CSB1, and the main SC is SC1.
We'll assume this config, as it's the hardest to work around.

In essence, we need to get SC0 (the SC in the *good* CSB) to be main, stop
using the failed CSB and then replace it.

- Fail over from SC1 to SC0
    - setfailover on
      Wait for sync to complete.
    - setfailover force
      This fails the SCs over.
- Stop using CSB1
    - setbus -c cs0
      This disables the Address, Data and Response busses for
      all CSB1 supported paths. This means we are ready to
      replace the CSB.
- Halt SC1
- From SC0, power off SC1, SCPER1 and CSB1
    - poweroff sc1 scper1 csb1
- Remove SC1, SCPER1 and CSB1
- Install new CSB1
- Install SCPER1 and SC1 *in that order*
    - SC1 automatically powers on and boots
- setfailover on
  Wait a few minutes for sync.
- Start using all busses again
    - setbus -c cs0,cs1
- Done!

Relief/Workaround
Using setbus, we can work around this issue.
Note: This assumes that CS1 is the failed CSB, and all work is done on the main SC.

- Disable all CSB1 supported busses
    - setbus -c cs0

- Perform the DR operation

- Replace the CSB at the first opportunity! Redundancy in your platform
  depends on having both CSBs working at 100% capacity. See the Resolution above.
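
The workaround above can be written down as a small checklist generator. This
is a sketch, not a supported tool: print_workaround is a hypothetical helper
that only prints the commands and never runs them. It assumes the mirror-image
case (CS0 failed, so only cs1 stays in use) is handled the same way, and it
uses the cfgadm form shown earlier in this document for the DR step.

```shell
#!/bin/sh
# Sketch only: print_workaround is a hypothetical helper. It prints commands
# for an engineer to review; it does not execute setbus or cfgadm.
# $1 = number of the failed CSB (0 or 1), $2 = board to DR in, e.g. SB13.
print_workaround() {
    case "$1" in
        0) good=cs1 ;;   # CS0 failed: keep only the CS1 busses in use
        1) good=cs0 ;;   # CS1 failed: keep only the CS0 busses in use
        *) echo "failed CSB must be 0 or 1" >&2; return 1 ;;
    esac
    echo "main SC: setbus -c $good"
    echo "domain:  cfgadm -c configure $2"
}
```

For the case described in this document, print_workaround 1 SB13 prints the
setbus -c cs0 step followed by the cfgadm step for SB13.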

Additional Information
Apollo Escalation Id: 1-17735996
Radiance Case: 10868708
Radiance Task: 21830653

See also - Technical Instruction <Document 1003308.1> Sun Fire[TM]12K/15K/E20K/E25K: esmd warning; A power failure has been detected on a redundant power supply at ...

Product
Sun Fire E25K Server
Sun Fire E20K Server
Sun Fire 15K Server
Sun Fire 12K Server
Amazon 20/25 Phase 2 Hardware

Keywords
starcat, CSB, setbus, dynamic reconfiguration, DR, SMS, ASR, hpost

Previously Published As
86064
