Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-71-1010757.1
Update Date:2009-09-24
Keywords:

Solution Type  Technical Instruction Sure

Solution  1010757.1 :   Sun Fire[TM] 12K/15K Servers: Voltage Error on CPU Leads to Blacklisting the PROCPAIR  


Related Items
  • Sun Fire 12K Server
  • Sun Fire 15K Server
Related Categories
  • GCS>Sun Microsystems>Servers>High-End Servers

PreviouslyPublishedAs
214857


Description
A voltage problem is detected on a single CPU; ASR (Automatic System
Recovery) blacklists its PROCPAIR; and the domain is reset.  Blacklisting
the PROCPAIR means that two CPUs and their memory are disabled and removed
from the domain configuration.
This document explains why esmd (the Environmental Status Monitoring
Daemon) disables and removes two CPUs as the result of a voltage problem
on a single CPU.  This behavior might seem incorrect, but the recovery
action is exactly as it was designed to be.


Steps to Follow
The following is an example of a voltage fault which might be logged in the
/var/opt/SUNWSMS/SMS/adm/platform/messages file on the System Controller (SC):
Jan 19 03:34:33 2004 s2oc-sc0 esmd[2511]: [1919 216320102467983 ERR
DetectorV.cc609] A low voltage or power supply has been detected on
Core3, located on CPU at SB8. The voltage detected is 0.02v; should be
1.31v to 1.47v. PROCPAIR at SB8/PP1 is being removed from the domain and
powered off. Check all hardware for the cause.
Jan 19 03:34:33 2004 s2oc-sc0 esmd[2511]: [0 216320149848762 NOTICE
SysControl.cc 5296] Component PROCPAIR at SB8/PP1 has been blacklisted
Jan 19 03:34:33 2004 s2oc-sc0 esmd[2511]: [1930 216320225206159 NOTICE
SysControl.cc 6113] PROCPAIR at SB8/PP1 has been powered off: ecode=0
In the preceding error message, Core3 (CPU3) on SB8 has a low voltage.
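When many such entries have to be reviewed, the fields of interest can be pulled out of the message text with a script.  The following is a minimal Python sketch, not an SMS tool: the regular expression and the name parse_low_voltage are assumptions written against the single sample message shown above, and the wording of real esmd messages may differ between SMS releases.

```python
import re

# Hypothetical pattern, derived only from the sample message in this document.
LOW_VOLTAGE = re.compile(
    r"detected on Core(?P<core>\d+), located on CPU at (?P<board>SB\d+)\. "
    r"The voltage detected is (?P<seen>[\d.]+)v; should be "
    r"(?P<low>[\d.]+)v to (?P<high>[\d.]+)v\. "
    r"PROCPAIR at (?P<procpair>SB\d+/PP\d+)"
)


def parse_low_voltage(message):
    """Return the fields of a low-voltage message, or None if it does not match."""
    # The message wraps across lines in the log, so collapse all whitespace first.
    flat = " ".join(message.split())
    match = LOW_VOLTAGE.search(flat)
    if match is None:
        return None
    return {
        "core": int(match.group("core")),
        "board": match.group("board"),
        "procpair": match.group("procpair"),
        "seen": float(match.group("seen")),
        "low": float(match.group("low")),
        "high": float(match.group("high")),
    }


SAMPLE = """Jan 19 03:34:33 2004 s2oc-sc0 esmd[2511]: [1919 216320102467983 ERR
DetectorV.cc609] A low voltage or power supply has been detected on
Core3, located on CPU at SB8. The voltage detected is 0.02v; should be
1.31v to 1.47v. PROCPAIR at SB8/PP1 is being removed from the domain and
powered off. Check all hardware for the cause."""

event = parse_low_voltage(SAMPLE)
print(event)
```

For the sample, the reported voltage is below the lower bound given in the message, which is the condition that esmd flagged.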
***************************************************************
* NOTE:  The voltage tolerances are defined on the SC in the  *
*        /etc/opt/SUNWSMS/SMS/config/esmd_tuning.txt file.    *
*                   DO NOT EDIT THIS FILE!                    *
***************************************************************
In the preceding error message, esmd has blacklisted "Component
PROCPAIR at SB8/PP1."
PP0 = PROCPAIR0 = CPU0 and CPU1
PP1 = PROCPAIR1 = CPU2 and CPU3
The reported voltage fault on only one CPU (CPU3) results in two
CPUs (CPU2 and CPU3) being removed from the domain configuration
through blacklisting.
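The PP0/PP1 mapping above amounts to integer division of the CPU number by two.  The short Python sketch below only illustrates that arithmetic; the function names are hypothetical and are not part of SMS.

```python
CPUS_PER_PROCPAIR = 2  # from the mapping above: PP0 = CPU0/CPU1, PP1 = CPU2/CPU3


def procpair_of(cpu):
    """Return the PROCPAIR number that contains the given CPU number."""
    return cpu // CPUS_PER_PROCPAIR


def cpus_in_procpair(pp):
    """Return the CPU numbers that belong to the given PROCPAIR."""
    first = pp * CPUS_PER_PROCPAIR
    return list(range(first, first + CPUS_PER_PROCPAIR))


def blacklisted_cpus(faulty_cpu):
    """CPUs removed from the domain when the faulty CPU's PROCPAIR is blacklisted."""
    return cpus_in_procpair(procpair_of(faulty_cpu))


print(blacklisted_cpus(3))  # prints [2, 3]
```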
The decision to blacklist the whole PROCPAIR for the failure of a
single CPU is a compromise between differing forces placed upon POST
with regard to availability.  POST (power-on self-test) is the set of
hardware tests executed against components before the domain enters OBP
(OpenBoot PROM).  These tests confirm that the components are sound.
****************************************************************
*              Differing Forces made upon POST                 *
****************************************************************
* FORCE 1   POST needs to be able to exclude faulty components *
*           from the domain configuration so that future       *
*           failures don't occur.                              *
*                                                              *
* FORCE 2   POST should allow as many resources as possible to *
*           be configured into the domain to minimize          *
*           domain impact as much as possible.                 *
****************************************************************
It is important to note that a voltage fault reported by a single CPU
might not actually be a problem limited to that CPU itself.  The same
voltage issue could also be affecting its "related" components,
such as the BBC ASIC, the DCDS ASIC, and so on.  Ultimately, the fault
could be the result of a power distribution issue, representative of a
larger issue on the board.
The actual reason for the voltage fault might not be fully known, and the
number of components that are affected by it might also be unknown.
Because of this "unknown" factor, there are two approaches for dealing
with voltage issues on the board: the Conservative Approach and the
Aggressive Approach.  These two approaches relate directly to the two
POST forces described previously:
************************************************************************
* Conservative:  Disable the entire System Board.  Now, any future     *
*                outage is prevented if "related" components are       *
*                affected by this voltage problem.  FORCE 1.           *
*                                                                      *
*  Aggressive:   Disable only the component reporting the voltage      *
*                problem.  This leaves as many resources as possible   *
*                available to the domain, but there is some risk       *
*                associated with this approach.  FORCE 2.              *
************************************************************************
Arguments for each approach can be made and no one argument is
incorrect.  One customer might believe that disabling the whole System
Board is the best decision; another customer might believe that it is
absolutely unacceptable to lose that many resources.  Neither customer is
incorrect.  A compromise is necessary to meet the needs of both of these
forces.
Here is what esmd does as a compromise between these two POST forces:
*   Force 1 requires that faulty components be excluded from the domain
configuration.  esmd satisfies it by disabling both the CPU that
reports the voltage problem and its PROCPAIR partner.  Isolating the
whole PROCPAIR guards against further incidents caused by a
"CPU-related" component, such as the BBC ASIC or the DCDS ASIC.  The
two CPUs in a PROCPAIR share components such as these ASICs, so the
PROCPAIR is a logical unit to isolate.
*   Force 2 requires that as many resources as possible be configured
into the domain.  esmd satisfies it by allowing the remaining PROCPAIR
into the domain configuration so that the domain can function.  For a
single-board domain, the Conservative Approach would leave the domain
down until service can take place, and the Aggressive Approach would
leave an exposure if a "related" component has a voltage issue of its
own.
The compromise configures the domain with as small a resource impact
as possible while also providing as much error isolation as possible.
It addresses both POST forces, preserving domain availability while
still isolating the fault.  The compromise is not perfect, but it is an
appropriate way to keep faulty components out of the configuration and
prevent future outages, while leaving as many resources as possible
available for domain production until a maintenance window is available
to resolve the issue.
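The three options discussed in this document can be compared side by side.  The Python sketch below is illustrative only and does not reproduce esmd logic: it assumes a System Board with four CPUs (CPU0 through CPU3), as implied by the PP0/PP1 mapping, and the function name disabled_cpus and the approach labels are hypothetical.

```python
BOARD_CPUS = [0, 1, 2, 3]  # assumption: four CPUs per System Board (PP0 and PP1)


def disabled_cpus(faulty_cpu, approach):
    """Return the CPUs taken out of the domain under the given approach."""
    if approach == "conservative":
        # Disable the entire System Board (FORCE 1).
        return list(BOARD_CPUS)
    if approach == "aggressive":
        # Disable only the component reporting the problem (FORCE 2).
        return [faulty_cpu]
    if approach == "compromise":
        # Disable the faulty CPU and its PROCPAIR partner.
        first = (faulty_cpu // 2) * 2
        return [first, first + 1]
    raise ValueError("unknown approach: " + approach)


for approach in ("conservative", "aggressive", "compromise"):
    off = disabled_cpus(3, approach)
    left = [cpu for cpu in BOARD_CPUS if cpu not in off]
    print(f"{approach:12s} disabled={off} remaining={left}")
```

For a fault on CPU3, the compromise disables CPU2 and CPU3 and leaves CPU0 and CPU1 (PP0) available to the domain.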


Product
Sun Fire 15K Server
Sun Fire 12K Server

Internal Comments
This article is a result of KGap request ID 263, a request made to clarify
why the PROCPAIR blacklist behavior exists.

A false over-voltage issue existed in SMS 1.2 and 1.3 software:
Sun Alert 53625 "CPU0/CPU1 May Be Disabled on Sun Fire 12K/15K System
Boards Resulting in Domain Interruption"

12k, 15k, 12K, 15K, esmd, voltage, procpair, PROCPAIR, blacklisted, ASR
Previously Published As
76240

Change History
Date: 2005-09-27
User Name: 18392
Action: Update Canceled
Comment: *** Restored Published Content *** fixed techgroup
Version: 0
Date: 2005-09-27
User Name: 18392
Action: Update Started
Comment: fixing techgroup
Version: 0
Date: 2005-06-27
User Name: 95826
Action: Update Canceled
Comment: *** Restored Published Content *** canceling update as updater is no longer within Sun
Version: 0
Date: 2005-06-27
User Name: 95826
Action: Reassign
Comment: reassigning document as updater is no longer within Sun
Version: 0
Date: 2005-01-09
User Name: 132461
Action: Update Started
Comment: spl/fmt/upd
Version: 0
Date: 2004-05-24
User Name: c8840
Action: Approved
Comment: This document was edited and is now ready for publication.
Version: 0
Date: 2004-05-19
User Name: c8840
Action: Accepted
Comment:
Version: 0
Date: 2004-05-18
User Name: 101037
Action: Approved
Comment: Good doc
Version: 0
Date: 2004-05-18
User Name: 101037
Action: Accepted
Comment:
Version: 0
Date: 2004-05-18
User Name: 103287
Action: Approved
Comment: Please review. This doc was written to help explain why procpair's are disabled for the voltage event of a single cpu. Nothing more than information...

Joshua Freeman, PTS Server ESG
Version: 0
Date: 2004-05-17
User Name: 103287
Action: Created
Comment:
Version: 0
Product_uuid
29e4659c-0a18-11d6-9fa1-e67bbc033df8|Sun Fire 15K Server
077fd4c5-df8f-4320-ad69-7d01603a674d|Sun Fire 12K Server

Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.