Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-75-1321335.1
Update Date:2011-05-25

Solution Type  Troubleshooting Sure

Solution  1321335.1 :   Sun Enterprise[TM] 10000: Troubleshooting Recordstop Dumps  


Related Items
  • Sun Enterprise 10000 Server
Related Categories
  • GCS>Sun Microsystems>Servers>High-End Servers

In this Document
  Purpose
  Last Review Date
  Instructions for the Reader
  Troubleshooting Details


Applies to:

Sun Enterprise 10000 Server - Version: Not Applicable to Not Applicable - Release: N/A to N/A
Information in this document applies to any platform.

Purpose

This document provides troubleshooting information for various recordstop dump events.

Last Review Date

May 11, 2011

Instructions for the Reader

A Troubleshooting Guide is provided to assist in debugging a specific issue. When possible, diagnostic tools are included in the document to assist in troubleshooting.

Troubleshooting Details

Bogus Uncorrectable Error reported on bit 32, syndrome 13

There is a problem in the Starfire's XDB algorithm that checks the syndrome to identify the bad bit and to determine whether the error is a single-bit or a multiple-bit error. The XDB is coded to expect a syndrome of 12 for bit 32, but the actual syndrome for bit 32 is 13. As a result, the XDB requests a recordstop but, instead of recording a single-bit error (CE), it records a multiple-bit error (UE).

From the wfail output, we see something like the following:

redxl> wfail
(output omitted)
XDB 7.2 EccErrFlags[11:0] = 220
EccFlg[5]: Uncorrectable error in ldat bus lo half, bits [71:0]
EccFlg[11:8]: Error count = 2
ldat[ 71: 0]= D3 00FD0ECD 00000004 (xmux_par[5:0]= 02) syn= 13: bit 32 [06]
(output omitted)

Bear in mind that only the XDB misreports this error as a UE; Solaris detects and reports it correctly. As a result, only the XDB output in the recordstop dump file shows a UE with bit 32 in error. The flip side of this problem is that the XDB reports syndrome 12 as a Correctable Error but does not identify which bit was corrected. In reality, syndrome 12 maps to an Uncorrectable Error (UE) and cannot be mapped to a single bit.

  • Only a recordstop is generated.
  • Solaris detects, handles, and reports the error properly. Only the XDB output in the recordstop file is in error.
  • This error is in XDB code and has no relation to system board hardware.
  • This problem will not be fixed.
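
The reinterpretation described above can be sketched as a small script. This is only an illustration, not a Sun-supplied tool: the function names reinterpret_xdb and check_xdb_block are hypothetical, the patterns are derived solely from the sample wfail lines shown earlier, and the syndrome is compared as the text printed by wfail so that no assumption is made about its radix.

```python
import re

# Patterns taken from the sample wfail output in this article only.
# Other XDB output layouts are not covered (assumption).
SYN_RE = re.compile(r"\bsyn=\s*([0-9A-Fa-f]+)")
UE_RE = re.compile(r"Uncorrectable error")


def reinterpret_xdb(reported_kind, syndrome):
    """Map what the XDB reported to what the article says really happened.

    reported_kind: "CE" or "UE" as recorded by the XDB.
    syndrome: the syndrome exactly as printed by wfail (kept as text).
    """
    if reported_kind == "UE" and syndrome == "13":
        return "CE"  # single-bit error on bit 32, misreported as UE
    if reported_kind == "CE" and syndrome == "12":
        return "UE"  # syndrome 12 cannot be mapped to a single bit
    return reported_kind  # all other cases are taken at face value


def check_xdb_block(wfail_text):
    """Return the reinterpreted kind for a UE block, or None if not matched."""
    syn = SYN_RE.search(wfail_text)
    if syn is None or UE_RE.search(wfail_text) is None:
        return None
    return reinterpret_xdb("UE", syn.group(1))


sample = """XDB 7.2 EccErrFlags[11:0] = 220
EccFlg[5]: Uncorrectable error in ldat bus lo half, bits [71:0]
EccFlg[11:8]: Error count = 2
ldat[ 71: 0]= D3 00FD0ECD 00000004 (xmux_par[5:0]= 02) syn= 13: bit 32 [06]"""

print(check_xdb_block(sample))
```

Only UE blocks are parsed here because the article shows no sample of the syndrome 12 output; that case is handled by calling reinterpret_xdb directly.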

Correctable ECC Error (CE) Processor X Dtags

From the wfail output, we see something like the following:

redxl> wfail
(output omitted)
CIC 7.2 ErrFlags[61:0] = 00000001 00000002 (after mask)
ErrFlag[1]: Correctable ECC Error (CE) Processor 1 Dtags
ErrFlag[32]: Repeated Error
Proc 1 Dtag ECCSyn[13: 8] = 23: CE: bit 00 Dtag SRAM 7.2.0
FAIL Proc 7.1 in all configs using CIC2: : Arbstop/Recordstop detected by cic
(*** NOTE: Implicated FRU is sysboard 7)
(output omitted)

The above error should be analyzed in the same way as other Correctable Error recordstops. The first event for any given error against a particular Dtag SRAM (in this example, CIC 7.2, Dtag SRAM 0) should be diagnosed as a soft error, and no action should be taken against it.

Swap the "Implicated FRU" (SB7 in example) when the third failure occurs on any one CIC.

NOTE: Blacklisting the affected CPU (proc 7.1 in the example) can be used as a short-term workaround.
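
The first-event and third-failure rules above could be tracked with a short script such as the following. It is a sketch under stated assumptions: the names parse_dtag_ce and DtagCeTracker are hypothetical, the pattern is based only on the sample "Dtag SRAM 7.2.0" line (read as CIC 7.2, SRAM 0), and the "unspecified" result marks cases the article gives no guidance for.

```python
import re
from collections import Counter

# Matches e.g. "Proc 1 Dtag ECCSyn[13: 8] = 23: CE: bit 00 Dtag SRAM 7.2.0"
# and splits "7.2.0" into CIC "7.2" and SRAM "0" (layout is an assumption).
DTAG_RE = re.compile(r"CE: bit \d+ Dtag SRAM (\d+\.\d+)\.(\d+)")


def parse_dtag_ce(wfail_text):
    """Return (cic, sram) for the first Dtag CE line found, else None."""
    m = DTAG_RE.search(wfail_text)
    return (m.group(1), m.group(2)) if m else None


class DtagCeTracker:
    """Counts Dtag CE recordstops per CIC and per Dtag SRAM."""

    def __init__(self):
        self.per_cic = Counter()
        self.per_sram = Counter()

    def record(self, cic, sram):
        self.per_cic[cic] += 1
        self.per_sram[(cic, sram)] += 1
        if self.per_cic[cic] >= 3:
            return "swap_fru"     # third failure on any one CIC
        if self.per_sram[(cic, sram)] == 1:
            return "no_action"    # first event on this SRAM: soft error
        return "unspecified"      # the article does not cover this case


tracker = DtagCeTracker()
line = "Proc 1 Dtag ECCSyn[13: 8] = 23: CE: bit 00 Dtag SRAM 7.2.0"
cic, sram = parse_dtag_ce(line)
results = [tracker.record(cic, sram) for _ in range(3)]
print(results)
```

The counts live only in memory here; in practice they would have to be kept across recordstop dump files, which is outside the scope of this sketch.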




  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.