Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-71-1006140.1
Update Date:2010-09-13
Keywords:

Solution Type  Technical Instruction Sure

Solution  1006140.1 :   Sun Fire[TM] 12K/15K/E20K/E25K: How to translate an E$ Slot SubSlot message from the System Controller platform messages file into a physical location.


Related Items
  • Sun Fire E25K Server
  • Sun Fire E20K Server
  • Sun Fire 12K Server
  • Sun Fire 15K Server
Related Categories
  • GCS>Sun Microsystems>Servers>High-End Servers

PreviouslyPublishedAs
208600


Applies to:

Sun Fire 12K Server
Sun Fire E25K Server
Sun Fire 15K Server
Sun Fire E20K Server
All Platforms

Goal

Identify the location of the Ecache DIMM referenced in the following message:
Jul 30 20:00:20 2003 sysconf1-1 dsmd[556]: [0 1384804385762372 ERR SoftErrorHandler.cc 660] E$ Slot 16 SubSlot 7
Jul 30 20:00:20 2003 sysconf1-1 dsmd[556]: [2552 1384805023735665 ERR SoftErrorHandler.cc 675] Soft Error:
Comp ID : 0x72 Error Code: 7 Error Type: 1 Error Bit/Pin: 0

Solution

Steps to Follow

When Solaris[TM] on the domain detects an ECC error, it logs an error in the /var/adm/messages file detailing the problem. It then sends an error notification to SMS (System Management Services) on the SC so that errors can be tracked against FRUs. When SMS receives this notification from the domain, it prints a message like the one above. Using the information in this document, such messages can be matched to the error messages produced by Solaris in the domain. The rest of this section examines the SMS error message itself.

First, the message shows that the component in error is an E$ (Ecache, L2SRAM).

NOTE:  If your message shows "DIMM Slot SubSlot" messages, you must use 
<Document 1011988.1> How to translate a DIMM Slot SubSlot message
from the System Controller platform messages file into a physical
location as a guide for those errors.

Second, it identifies a "Slot" location.

Third, it identifies a "SubSlot" location.

The Slot and SubSlot locations together identify the E$ DIMM that reported the error.

"Slot" refers to the system board that reports the error. A platform can hold up to 18 system boards (depending on the platform type), so this type of error reports a Slot from 0 through 17.

"SubSlot" identifies which E$ DIMM on the system board reports the error. There are 8 E$ DIMMs on a system board, so SubSlots are numbered 0 through 7. You might expect E$ DIMM 0 for CPU 0 to be SubSlot 0, but that is not the case: SubSlot 0 is E$ DIMM 1 for CPU 0. See the chart below for the translations.

--------------------------------
SubSlot Translation Table
--------------------------------
SubSlot   Physical    J####
#         Location
--------------------------------
0         CPU0/E1     J4300
1         CPU0/E0     J4400
2         CPU1/E1     J5300
3         CPU1/E0     J5400
4         CPU2/E1     J6300
5         CPU2/E0     J6400
6         CPU3/E1     J7300
7         CPU3/E0     J7400
--------------------------------
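
The Slot and SubSlot lookup described above can be sketched in a few lines of Python. This is only an illustration, not an SMS utility: the function name locate_ecache and the regular expression are assumptions modeled on the sample message in this document, and the table is copied from the SubSlot Translation Table.

```python
import re

# SubSlot -> (physical location, J-number), copied from the SubSlot Translation Table.
SUBSLOT_TABLE = {
    0: ("CPU0/E1", "J4300"),
    1: ("CPU0/E0", "J4400"),
    2: ("CPU1/E1", "J5300"),
    3: ("CPU1/E0", "J5400"),
    4: ("CPU2/E1", "J6300"),
    5: ("CPU2/E0", "J6400"),
    6: ("CPU3/E1", "J7300"),
    7: ("CPU3/E0", "J7400"),
}

# Assumed pattern, modeled on the sample "E$ Slot 16 SubSlot 7" message text.
ECACHE_RE = re.compile(r"E\$ Slot (\d+) SubSlot (\d+)")


def locate_ecache(line):
    """Return (slot, physical location, J-number), or None if the line does not match."""
    match = ECACHE_RE.search(line)
    if match is None:
        return None
    slot, subslot = int(match.group(1)), int(match.group(2))
    if not 0 <= slot <= 17:
        raise ValueError(f"Slot {slot} is outside the documented range 0-17")
    if subslot not in SUBSLOT_TABLE:
        raise ValueError(f"SubSlot {subslot} is outside the documented range 0-7")
    location, jnumber = SUBSLOT_TABLE[subslot]
    return slot, location, jnumber


sample = ("Jul 30 20:00:20 2003 sysconf1-1 dsmd[556]: "
          "[0 1384804385762372 ERR SoftErrorHandler.cc 660] E$ Slot 16 SubSlot 7")
print(locate_ecache(sample))  # (16, 'CPU3/E0', 'J7400')
```

For the sample message, the result points to the E$ DIMM at CPU3/E0 (J7400) on the system board in Slot 16.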

On the second line, SMS provides more detailed information about the error.

The "Comp ID" (Component ID) is another encoding of the Slot and SubSlot. The component ID can be broken down as follows:

-------------------------
Component ID Upper Nibble
-------------------------
Upper     CPU
Nibble
-------------------------
0         MaxCAT CPU 0
1         MaxCAT CPU 1
4         SB CPU 0
5         SB CPU 1
6         SB CPU 2
7         SB CPU 3
-------------------------

The lower nibble details the specific dimm associated with the component identified by the upper nibble.

-------------------------
Component ID Lower Nibble
-------------------------
Lower     DIMM
Nibble
-------------------------
0         MaxCAT E$ 0
1         MaxCAT E$ 1
2         CPU E$ 0 (Jx400)
3         CPU E$ 1 (Jx300)
6         B0/D0
7         B1/D0
8         B0/D1
9         B1/D1
a         B0/D2
b         B1/D2
c         B0/D3
d         B1/D3
-------------------------

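Thus, a Comp ID of 0x76 corresponds to P3/B0/D0 on the System Board of the identified Slot. Likewise, the Comp ID of 0x72 in the example message corresponds to SB CPU 3, E$ 0 (J7400), which agrees with SubSlot 7 in the SubSlot Translation Table.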
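
The nibble breakdown can be illustrated with a short Python sketch. The function name decode_comp_id is hypothetical, and treating the Comp ID as a single byte is an assumption based on the two-digit hex values shown in this document; the dictionaries are copied from the two nibble tables.

```python
# Upper nibble -> CPU, copied from the Component ID Upper Nibble table.
UPPER_NIBBLE = {
    0x0: "MaxCAT CPU 0",
    0x1: "MaxCAT CPU 1",
    0x4: "SB CPU 0",
    0x5: "SB CPU 1",
    0x6: "SB CPU 2",
    0x7: "SB CPU 3",
}

# Lower nibble -> DIMM, copied from the Component ID Lower Nibble table.
LOWER_NIBBLE = {
    0x0: "MaxCAT E$ 0",
    0x1: "MaxCAT E$ 1",
    0x2: "CPU E$ 0 (Jx400)",
    0x3: "CPU E$ 1 (Jx300)",
    0x6: "B0/D0",
    0x7: "B1/D0",
    0x8: "B0/D1",
    0x9: "B1/D1",
    0xA: "B0/D2",
    0xB: "B1/D2",
    0xC: "B0/D3",
    0xD: "B1/D3",
}


def decode_comp_id(comp_id):
    """Split a one-byte Comp ID into (CPU, DIMM); None marks a nibble not in the tables."""
    upper = (comp_id >> 4) & 0xF  # high four bits select the CPU
    lower = comp_id & 0xF         # low four bits select the DIMM
    return UPPER_NIBBLE.get(upper), LOWER_NIBBLE.get(lower)


print(decode_comp_id(0x76))  # ('SB CPU 3', 'B0/D0')
print(decode_comp_id(0x72))  # ('SB CPU 3', 'CPU E$ 0 (Jx400)')
```

The second call uses the Comp ID from the sample message in the Goal section; with x = 7 for CPU 3, Jx400 reads as J7400.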

The "Error Code" is decoded as follows:

--------------------------------------------------
Code  Error
--------------------------------------------------
0     UNKNOWN
1     CE   Correctable ECC error
2     UE   Uncorrectable ECC error
3     EDC  Correctable ECC error from E$
4     EDU  Uncorrectable ECC error from E$
5     WDC  Correctable E$ write-back ECC
6     WDU  Uncorrectable E$ write-back ECC
7     CPC  Copy-out correctable ECC error
8     CPU  Copy-out uncorrectable ECC error
9     UCC  SW handled correctable ECC
a     UCU  SW handled uncorrectable ECC
b     EMC  Correctable MTAG ECC error
c     EMU  Uncorrectable MTAG ECC error
--------------------------------------------------

The "Error Type" is decoded as follows:

-------------------------
Error Type
-------------------------
Value     Error Type
-------------------------
0         Unknown
1         Single bit error
2         Double bit error
3         Triple bit error
4         Quad bit error
5         Multiple bit error
-------------------------
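
As a rough sketch, the "Error Code" and "Error Type" fields of the second message line can be decoded with the two tables above. The function name decode_soft_error and the regular expression are assumptions modeled on the single sample line in this document. The sketch also assumes the Error Code is printed in hexadecimal without a 0x prefix, because the table lists codes a through c; this is not confirmed by the sample, where the code is 7.

```python
import re

# Error Code -> (mnemonic, description), copied from the Error Code table.
ERROR_CODES = {
    0x0: ("UNKNOWN", ""),
    0x1: ("CE", "Correctable ECC error"),
    0x2: ("UE", "Uncorrectable ECC error"),
    0x3: ("EDC", "Correctable ECC error from E$"),
    0x4: ("EDU", "Uncorrectable ECC error from E$"),
    0x5: ("WDC", "Correctable E$ write-back ECC"),
    0x6: ("WDU", "Uncorrectable E$ write-back ECC"),
    0x7: ("CPC", "Copy-out correctable ECC error"),
    0x8: ("CPU", "Copy-out uncorrectable ECC error"),
    0x9: ("UCC", "SW handled correctable ECC"),
    0xA: ("UCU", "SW handled uncorrectable ECC"),
    0xB: ("EMC", "Correctable MTAG ECC error"),
    0xC: ("EMU", "Uncorrectable MTAG ECC error"),
}

# Error Type value -> description, copied from the Error Type table.
ERROR_TYPES = {
    0: "Unknown",
    1: "Single bit error",
    2: "Double bit error",
    3: "Triple bit error",
    4: "Quad bit error",
    5: "Multiple bit error",
}

# Assumed pattern, modeled on the sample "Comp ID : 0x72 Error Code: 7 ..." line.
SOFT_ERROR_RE = re.compile(
    r"Comp ID\s*:\s*0x([0-9a-fA-F]+)\s+"
    r"Error Code:\s*([0-9a-fA-F]+)\s+"
    r"Error Type:\s*(\d+)"
)


def decode_soft_error(line):
    """Return the decoded fields as a dict, or None if the line does not match."""
    match = SOFT_ERROR_RE.search(line)
    if match is None:
        return None
    code = int(match.group(2), 16)  # assumption: hex digits without a 0x prefix
    etype = int(match.group(3))
    return {
        "comp_id": int(match.group(1), 16),
        "error_code": ERROR_CODES.get(code),   # None if the code is not in the table
        "error_type": ERROR_TYPES.get(etype),  # None if the value is not in the table
    }


sample = "Comp ID : 0x72 Error Code: 7 Error Type: 1 Error Bit/Pin: 0"
print(decode_soft_error(sample))
```

For the sample line, this reports a CPC (copy-out correctable ECC error), single bit error, on Comp ID 0x72.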

Replace the implicated component identified through use of this document when the event is Uncorrectable. A component reporting a correctable event should be replaced only as directed by Best Practices recommendations. Contact Sun Support Services for details of Best Practices or to open a service request for an error of this type. When contacting service, please reference this document and provide log data (SC explorer data preferred) so that support can confirm the analysis.



Internal Comments
*CAUTION*
 Using the /var/opt/SUNWSMS/SMS/adm/platform/messages file alone to identify an E$ DIMM failure is not recommended.
This document is to be used solely to explain how to translate these error messages into the correct DIMM being reported in error.
Standard troubleshooting of E$ related errors should involve analysis of rstop/dstop/xcstate files and POST logs, following the instructions
laid out by the FABs, Sun Alerts, and Problem Resolution articles that relate to analysis of these types of errors. This error alone only
identifies the E$ DIMM issue and should be used solely as confirmation of the analysis of the other files previously mentioned.

References:
See for details about the above error codes.
See http://sunsolve.central.sun.com/handbook_internal/Devices/System_Board/SYSBD_SunFire_USIIICu.html for a picture of a USIII system board.
See http://sunsolve.central.sun.com/handbook_internal/Devices/System_Board/SYSBD_SunFire_USIV.html for a picture of a USIV system board.
Alert <Document 1000922.1> Sun Fire 3800/4800/4810/6800, Sun Fire 12K/15K, Sun Fire V1280, Netra 1280 Server Domains with 900MHz CPUs May Panic or Hang Due to Incorrect L2 SRAM Parameter Settings.

12k, 15k, e20k, e25k, E$, ecache, L2SRAM, system controllers, SC, translate, subslot, slot, SubSlot, Slot, DIMM
Previously Published As 71043

  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.