Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-73-1000096.1
Update Date:2010-09-02
Keywords:

Solution Type  FAB (standard) Sure

Solution  1000096.1 :   Running POST on one domain may Dstop all other running domains on Sun Fire 15K systems with SMS 1.1  


Related Items
  • Sun Fire 15K Server
Related Categories
  • GCS>Sun Microsystems>Sun FAB>Standard>Controlled Proactive

Previously Published As
200128


Product
Sun Fire 15K Server

Running POST on one domain may Dstop all other running domains (see details below).

Impact

A software error on one domain (such as a heartbeat failure, panic timeout, or error-reset) can cause another domain to Dstop on Sun Fire 15K systems running SMS 1.1.  In practice, the POST run on one domain may Dstop all other running domains.  While the occurrence is rare, the impact is platform-wide.  Depending on domain configuration and applications, downtime can be several hours.  This problem is intermittent and may be related to a domain sync operation on the centerplane (reset of unused ports).

Running POST on one domain means that the power-on self tests are executed on a domain in the system.  POST runs when a domain is first brought online, during a DR attach of a board (not currently supported), or as a recovery action performed by the SMS software to get a domain back up and running after a reboot, panic, or Dstop.

The AMX flow control error shown in the Symptoms section is the key message.  The system will recover automatically via ASR (automatic system recovery).  After recording the Dstop information, SMS restarts the domain(s).

Any SMS 1.1 installation without patch 112080-05 or later is susceptible to this problem.  SMS 1.2 and higher are not affected by this issue.
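
The exposure rule above can be sketched as a small check. This is an illustration only, not an SMS tool: the function name `is_susceptible` and its inputs (an SMS version string and a list of installed patch IDs) are hypothetical, and gathering those inputs from a system controller is left out.

```python
import re

FIXED_PATCH_BASE = "112080"  # SMS 1.1 patch named in this bulletin
FIXED_PATCH_REV = 5          # revision -05 or later carries the fix

PATCH_ID_RE = re.compile(r"^(\d{6})-(\d{2})$")


def is_susceptible(sms_version, installed_patches):
    """Apply the bulletin's rule: SMS 1.1 without 112080-05 or later.

    sms_version is a string such as "1.1"; installed_patches is a list
    of patch IDs such as ["112080-04"]. Both inputs are hypothetical.
    """
    major, minor = (int(part) for part in sms_version.split(".")[:2])
    if (major, minor) >= (1, 2):
        return False  # SMS 1.2 and higher are not affected
    if (major, minor) != (1, 1):
        raise ValueError("this bulletin only covers SMS 1.1 and later")
    for patch in installed_patches:
        match = PATCH_ID_RE.match(patch.strip())
        if not match:
            continue  # ignore anything that is not a patch ID
        base, rev = match.group(1), int(match.group(2))
        if base == FIXED_PATCH_BASE and rev >= FIXED_PATCH_REV:
            return False  # fix is installed
    return True


print(is_susceptible("1.1", ["112080-04"]))
print(is_susceptible("1.1", ["112080-05"]))
```

The first call describes an SMS 1.1 installation at patch revision -04, which the rule treats as exposed; the second has the fix installed.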

The root cause of the problem is the AMX ASIC, which does not handle port resets correctly.  The bug fix changes how POST performs the reset to ensure it is done safely.

A Dstop, or Domain Stop, occurs when the hardware detects an unrecoverable error.  The ASICs in the system cease processing transactions as quickly as possible to prevent further corruption of data and to facilitate debugging.  In this issue, a Dstop also occurs during the centerplane reset of ports, because the AMX mishandles port resets that are not done under domain sync.  Changing the reset so that it is done under domain sync makes the problem go away.

Symptoms

A message in the platform message log (/var/opt/SUNWSMS/adm/platform/messages) reports:

    Jan 17 20:25:55 2002 swmtft901 hwad[22514]: [1156 1693005732870614 ERR
    InterruptHandler.cc 2127] Domain Stop interrupt detected, domain XXX
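
As a rough sketch, a platform message log can be scanned for this message with a regular expression. The function name `find_dstop_domains` is hypothetical, and the pattern assumes the message text matches the sample above exactly; it is not an SMS utility.

```python
import re

# Message text as it appears in the sample log entry above.
DSTOP_RE = re.compile(r"Domain Stop interrupt detected, domain (\S+)")


def find_dstop_domains(log_text):
    """Return the domain tag from every Dstop message found in log_text."""
    return DSTOP_RE.findall(log_text)


# The sample entry from this bulletin, including its line wrap.
sample = (
    "Jan 17 20:25:55 2002 swmtft901 hwad[22514]: [1156 1693005732870614 ERR\n"
    "InterruptHandler.cc 2127] Domain Stop interrupt detected, domain XXX\n"
)
print(find_dstop_domains(sample))  # prints ['XXX']
```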
              
SMS then creates a Dstop dump file in /var/opt/SUNWSMS/adm/[XXX]/dump.  The file name has the form dsmd.dstop.YYMMDD.hhmm.ss (dsmd.dstop.020117.2025.55 for this example).  If this dump file is opened with "redx" and the "wfail" command is issued, the output below is reported.  For example:

        sc% redx -cl
        redx> dumpf load dsmd.dstop.020117.2025.55
        redx> wfail
        ...output below...
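
The dump file name pattern can be illustrated with a short sketch that formats a timestamp as dsmd.dstop.YYMMDD.hhmm.ss. The function name `dstop_dump_name` is hypothetical. In this bulletin's example the dump name matches the log entry's timestamp, but whether that always holds is an assumption.

```python
from datetime import datetime


def dstop_dump_name(stamp):
    """Format a 'Mon DD HH:MM:SS YYYY' timestamp as a Dstop dump file name."""
    when = datetime.strptime(stamp, "%b %d %H:%M:%S %Y")
    # YYMMDD.hhmm.ss, following the pattern given in this bulletin
    return when.strftime("dsmd.dstop.%y%m%d.%H%M.%S")


print(dstop_dump_name("Jan 17 20:25:55 2002"))  # dsmd.dstop.020117.2025.55
```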

The Dstop signature of this issue is as follows:

        SDI EX03/S0  Master_Stop_Status0[31:0] = 7004004F

              MStop0[3:0]: All SDI logic is DStopped + Recordstopped.

        SDI EX03/S0  Dstop0[31:0] = 12018200

              Dstop0[16]: D    DARB texp requests all Dstop (M)   

              Dstop0[25]: D 1E AXQ requests all Dstop (M)

              Dstop0[28]: D    Slot0 asserted Error, enabled to cause Dstop (M)

        AXQ EX03 ( 3) Error_Flag_02[31:0] = 04008400  Mask = 0000FFFF

              Err2[26]: D 1E AMX 0-3 hs flow control didn't arrive simultaneously

        FAIL EXB EX3:  Dstop/Rstop detected by AXQ.

        Primary service FRU is EXB EX3.

        SDI EX04/S0  Master_Stop_Status0[31:0] = 0004000F

              MStop0[3:0]: All SDI logic is DStopped + Recordstopped.

        SDI EX04/S0  Dstop0[31:0] = 02018200

              Dstop0[16]: D    DARB texp requests all Dstop (M)

              Dstop0[25]: D 1E AXQ requests all Dstop (M)

        AXQ EX04 ( 4) Error_Flag_03[31:0] = 30009000  Mask = 21005EFF

              Err3[28]: D 1E AMX data ECC uncorrectable error            

              Err3[29]: R    AMX data ECC correctable error      

        FAIL EXB EX4:  Dstop/Rstop detected by AXQ.

        Primary service FRU is EXB EX4.
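
The bit annotations in the output above (for example Dstop0[25] or Err2[26]) refer to individual bits of the 32-bit register values. A minimal sketch of that arithmetic follows; the helper names are hypothetical, and treating Err2[26] as the marker for this issue follows the key AMX flow control message in the signature above.

```python
def set_bits(value):
    """Return the positions of all bits set in a 32-bit register value."""
    return [bit for bit in range(32) if (value >> bit) & 1]


def has_amx_flow_control_error(error_flag_02):
    """True if Err2[26] (AMX 0-3 hs flow control) is set in Error_Flag_02."""
    return bool((error_flag_02 >> 26) & 1)


# Register values copied from the EX03 entries above.
print(set_bits(0x12018200))                    # [9, 15, 16, 25, 28]
print(has_amx_flow_control_error(0x04008400))  # True
```

The annotated bits 16, 25, and 28 of Dstop0 are among the bits set in 12018200, and bit 26 is set in the Error_Flag_02 value 04008400.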


Resolution

An Authorized Enterprise Services Field Representative may avoid the above-mentioned issue by following the recommendations below.

For a permanent fix, install SMS 1.1 Patch 112080-05 (or later), or upgrade to SMS 1.2.  This patch is specifically for SMS 1.1 and is not tied to any particular Solaris OS release.

References:

BugId:    4505473 - AMX data ECC uncorrectable error.
PatchId:  112080-05 - SMS 1.1: Patch IBIST for pause wafer change.
ESC:      534366 - All domain down due to DSTOP.
SunAlert: 42881

Previously Published As
100299
Internal Eng Business Unit Group
SSG ES (Enterprise Systems)

Internal Kasp FAB Legacy ID
100299, I0788-1 (FIN)

Internal SA-FAB Eng Submission
Running POST on one domain may Dstop all other running domains on Sun Fire 15K systems with SMS 1.1


Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.