Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-71-1011988.1
Update Date:2011-06-03
Keywords:

Solution Type  Technical Instruction Sure

Solution  1011988.1 :   Sun Fire[TM] 12K/15K/E20K/E25K: How to translate a DIMM Slot SubSlot message from the System Controller platform messages file into a physical location.  


Related Items
  • Sun Fire E25K Server
  • Sun Fire E20K Server
  • Sun Fire 12K Server
  • Sun Fire 15K Server
Related Categories
  • GCS>Sun Microsystems>Servers>High-End Servers

PreviouslyPublishedAs
216430


Applies to:

Sun Fire 12K Server
Sun Fire 15K Server
Sun Fire E20K Server
Sun Fire E25K Server
All Platforms

Goal

How can the following error, logged to the system controller's (SC) /var/opt/SUNWSMS/SMS/adm/platform/messages file, be decoded to identify which DIMM is being reported in error?
Jul 14 12:30:11 2003 sc0 dsmd[1686]: [0 277533849565489 ERR SoftErrorHandler.cc 665] DIMM Slot 0 SubSlot 24
Jul 14 12:30:12 2003 sc0 dsmd[1686]: [2552 277534570695148 ERR SoftErrorHandler.cc 675] Soft Error: Comp ID : 0x76 Error Code: 1 Error Type: 1 Error Bit/Pin: 25

Solution

When Solaris[TM] on the domain detects an ECC error, it logs an error detailing the problem encountered. It then sends an error notification to SMS (System Management Services) on the SC to allow for error tracking of FRUs. SMS prints a message like the one above when it receives this notification from the domain. Using the information in this document, messages like the one above can be matched to the error messages produced by Solaris in the domain.

First, the message shows that a DIMM is in error. Here, "DIMM" means a main memory DIMM, not an Ecache or L2SRAM DIMM. If your message shows "E$ Slot SubSlot" instead, use Technical Instruction Document 1006140.1 as the guide for those errors.

Second, the message identifies a "Slot" location.

Third, the message identifies a "SubSlot" location.
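The three parts of the first message line can be pulled out with a short script. The following is a minimal sketch, assuming the line layout matches the sample message in this document; the function name parse_dimm_line is hypothetical, and the range checks use the limits this document gives for these platforms (Slot 0-17, SubSlot 0-31).

```python
import re

# Pattern for the first dsmd line. The layout is taken from the sample
# message in this document; other SMS releases may format it differently.
DIMM_LINE = re.compile(r"DIMM Slot (\d+) SubSlot (\d+)")


def parse_dimm_line(line):
    """Return (slot, subslot) from a 'DIMM Slot N SubSlot M' line, else None."""
    match = DIMM_LINE.search(line)
    if match is None:
        # Not a main memory DIMM message (for example an E$ message).
        return None
    slot, subslot = int(match.group(1)), int(match.group(2))
    # Limits per this document: up to 18 system boards, 32 DIMM slots each.
    if not 0 <= slot <= 17:
        raise ValueError(f"Slot {slot} is outside 0-17")
    if not 0 <= subslot <= 31:
        raise ValueError(f"SubSlot {subslot} is outside 0-31")
    return slot, subslot


sample = ("Jul 14 12:30:11 2003 sc0 dsmd[1686]: [0 277533849565489 ERR "
          "SoftErrorHandler.cc 665] DIMM Slot 0 SubSlot 24")
print(parse_dimm_line(sample))
```

For the sample line, the function returns Slot 0 and SubSlot 24. Lines that do not contain a "DIMM Slot" message return None, so E$ messages are left for the separate E$ procedure.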

The Slot and SubSlot values must now be translated into the physical DIMM that is reporting the error.

"Slot" refers to the system board that reports the error. A platform can hold up to 18 system boards (depending on the platform type), so this type of error reports a Slot between 0 and 17.

"SubSlot" identifies the DIMM slot on that system board. There are 32 DIMM slots on a system board; therefore, the numbering goes from 0 to 31.

----------------------------
SubSlot Translation Table
----------------------------
SubSlot	Physical    J#####
#	Location
----------------------------
0	CPU0/B0/D0  J13300
1	CPU0/B1/D0  J13301
2	CPU0/B0/D1  J13400
3	CPU0/B1/D1  J13401
4	CPU0/B0/D2  J13500
5	CPU0/B1/D2  J13501
6	CPU0/B0/D3  J13600
7	CPU0/B1/D3  J13601
8	CPU1/B0/D0  J14300
9	CPU1/B1/D0  J14301
10	CPU1/B0/D1  J14400
11	CPU1/B1/D1  J14401
12	CPU1/B0/D2  J14500
13	CPU1/B1/D2  J14501
14	CPU1/B0/D3  J14600
15	CPU1/B1/D3  J14601
16	CPU2/B0/D0  J15300
17	CPU2/B1/D0  J15301
18	CPU2/B0/D1  J15400
19	CPU2/B1/D1  J15401
20	CPU2/B0/D2  J15500
21	CPU2/B1/D2  J15501
22	CPU2/B0/D3  J15600
23	CPU2/B1/D3  J15601
24	CPU3/B0/D0  J16300
25	CPU3/B1/D0  J16301
26	CPU3/B0/D1  J16400
27	CPU3/B1/D1  J16401
28	CPU3/B0/D2  J16500
29	CPU3/B1/D2  J16501
30	CPU3/B0/D3  J16600
31	CPU3/B1/D3  J16601
----------------------------
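The table follows a regular pattern, so the lookup can also be computed. The sketch below is an illustration only: the arithmetic is inferred from the 32 rows above and is not taken from SMS source, and the function name subslot_to_location is hypothetical. When in doubt, the table remains the reference.

```python
def subslot_to_location(subslot):
    """Map a SubSlot number (0-31) to (physical location, J-number).

    The formula is inferred from the translation table in this document:
    each CPU owns 8 consecutive SubSlots, even SubSlots are bank 0 and
    odd SubSlots are bank 1, and each pair of SubSlots shares a D number.
    """
    if not 0 <= subslot <= 31:
        raise ValueError(f"SubSlot {subslot} is outside 0-31")
    cpu = subslot // 8          # CPU0..CPU3
    bank = subslot % 2          # B0 or B1
    dimm = (subslot % 8) // 2   # D0..D3
    # J-number digits as they appear in the table: J1 <3+cpu> <3+dimm> 0 <bank>
    j_number = f"J1{3 + cpu}{3 + dimm}0{bank}"
    return f"CPU{cpu}/B{bank}/D{dimm}", j_number


print(subslot_to_location(24))
```

For SubSlot 24 from the sample message, this yields CPU3/B0/D0 and J16300, the same as the table row.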

On the second line of the message, SMS provides more detailed information about the error.

The "Comp ID" (Component ID) is another encoding of the Slot and SubSlot. The component ID can be broken down as follows:

-------------------------
Component ID Upper Nibble
-------------------------
Upper    CPU
Nibble
-------------------------
0    MaxCAT CPU 0
1    MaxCAT CPU 1
4    SB CPU 0
5    SB CPU 1
6    SB CPU 2
7    SB CPU 3

The lower nibble details the specific dimm associated with the component identified by the upper nibble.

-------------------------
Component ID Lower Nibble
-------------------------
Lower    dimm
Nibble
-------------------------
0    MaxCAT E$ 0
1    MaxCAT E$ 1
2    CPU E$ 0 (Jx400)
3    CPU E$ 1 (Jx300)
6    B0/D0
7    B1/D0
8    B0/D1
9    B1/D1
a    B0/D2
b    B1/D2
c    B0/D3
d    B1/D3

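Thus, a Comp ID of 0x76 breaks down into upper nibble 7 (SB CPU 3) and lower nibble 6 (B0/D0), which corresponds to CPU3/B0/D0 on the system board in the identified Slot. This matches SubSlot 24 in the SubSlot Translation Table.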
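The nibble tables can be applied mechanically. The following sketch decodes a Comp ID using the two tables above; the function names are hypothetical. The comp_id_to_subslot helper is an inference from comparing the nibble tables with the SubSlot Translation Table, not a documented SMS rule, and it is offered only as a way to cross-check the two message lines against each other.

```python
# Upper and lower nibble meanings, copied from the tables in this document.
UPPER_NIBBLE = {
    0x0: "MaxCAT CPU 0", 0x1: "MaxCAT CPU 1",
    0x4: "SB CPU 0", 0x5: "SB CPU 1", 0x6: "SB CPU 2", 0x7: "SB CPU 3",
}
LOWER_NIBBLE = {
    0x0: "MaxCAT E$ 0", 0x1: "MaxCAT E$ 1",
    0x2: "CPU E$ 0 (Jx400)", 0x3: "CPU E$ 1 (Jx300)",
    0x6: "B0/D0", 0x7: "B1/D0", 0x8: "B0/D1", 0x9: "B1/D1",
    0xA: "B0/D2", 0xB: "B1/D2", 0xC: "B0/D3", 0xD: "B1/D3",
}


def decode_comp_id(comp_id):
    """Return (component, dimm) descriptions for a one-byte Comp ID."""
    upper, lower = (comp_id >> 4) & 0xF, comp_id & 0xF
    return (UPPER_NIBBLE.get(upper, "not listed"),
            LOWER_NIBBLE.get(lower, "not listed"))


def comp_id_to_subslot(comp_id):
    """Inferred cross-check: SubSlot implied by a main memory Comp ID, else None."""
    upper, lower = (comp_id >> 4) & 0xF, comp_id & 0xF
    if 0x4 <= upper <= 0x7 and 0x6 <= lower <= 0xD:
        return (upper - 0x4) * 8 + (lower - 0x6)
    return None  # MaxCAT or E$ component, no main memory SubSlot


print(decode_comp_id(0x76), comp_id_to_subslot(0x76))
```

For Comp ID 0x76, the decode gives SB CPU 3 and B0/D0, and the cross-check gives SubSlot 24, which agrees with the first line of the sample message.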

The "Error Code" is decoded as follows:

--------------------------------------------------
Code     Error
--------------------------------------------------
0    UNKNOWN
1    CE   Correctable ECC error
2    UE   Uncorrectable ECC error
3    EDC  Correctable ECC error from E$
4    EDU  Uncorrectable ECC error from E$
5    WDC  Correctable E$ write-back ECC
6    WDU  Uncorrectable E$ write-back ECC
7    CPC  Copy-out correctable ECC error
8    CPU  Copy-out uncorrectable ECC error
9    UCC  SW handled correctable ECC
a    UCU  SW handled uncorrectable ECC
b    EMC  Correctable MTAG ECC error
c    EMU  Uncorrectable MTAG ECC error
--------------------------------------------------
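As an illustration, the error code table can be held in a small lookup. This is a sketch under one assumption: because the table lists codes a through c, the value printed in the message is treated as a hexadecimal digit. The function name decode_error_code and the "uncorrectable" flag (derived from the wording of each description) are hypothetical conveniences.

```python
# Error codes copied from the table in this document.
ERROR_CODES = {
    0x0: ("UNKNOWN", "Unknown"),
    0x1: ("CE", "Correctable ECC error"),
    0x2: ("UE", "Uncorrectable ECC error"),
    0x3: ("EDC", "Correctable ECC error from E$"),
    0x4: ("EDU", "Uncorrectable ECC error from E$"),
    0x5: ("WDC", "Correctable E$ write-back ECC"),
    0x6: ("WDU", "Uncorrectable E$ write-back ECC"),
    0x7: ("CPC", "Copy-out correctable ECC error"),
    0x8: ("CPU", "Copy-out uncorrectable ECC error"),
    0x9: ("UCC", "SW handled correctable ECC"),
    0xA: ("UCU", "SW handled uncorrectable ECC"),
    0xB: ("EMC", "Correctable MTAG ECC error"),
    0xC: ("EMU", "Uncorrectable MTAG ECC error"),
}


def decode_error_code(text):
    """Decode the 'Error Code' field; the text is assumed to be a hex digit."""
    code = int(text, 16)
    name, description = ERROR_CODES.get(code, ("UNKNOWN", "Code not listed"))
    return {
        "code": code,
        "name": name,
        "description": description,
        # Flag derived from the description wording, not from SMS itself.
        "uncorrectable": "uncorrectable" in description.lower(),
    }


print(decode_error_code("1"))
```

For the sample message, Error Code 1 decodes to CE, a correctable ECC error.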

The "Error Type" is decoded as follows:

-------------------------
Error Type
-------------------------
Value    Error Type
-------------------------
0    Unknown
1    Single bit error
2    Double bit error
3    Triple bit error
4    Quad bit error
5    Multiple bit error
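The second message line can be split into its fields in the same way as the first. The sketch below assumes the field layout of the sample "Soft Error" line in this document; the function name parse_soft_error is hypothetical. It decodes only the Error Type and leaves the Error Code as raw text.

```python
import re

# Pattern for the second dsmd line, based on the sample in this document.
SOFT_ERROR = re.compile(
    r"Soft Error: Comp ID : (0x[0-9a-fA-F]+) Error Code: (\w+) "
    r"Error Type: (\d+) Error Bit/Pin: (\d+)"
)

# Error types copied from the table in this document.
ERROR_TYPES = {
    0: "Unknown",
    1: "Single bit error",
    2: "Double bit error",
    3: "Triple bit error",
    4: "Quad bit error",
    5: "Multiple bit error",
}


def parse_soft_error(line):
    """Return the fields of a 'Soft Error' line as a dict, or None if no match."""
    match = SOFT_ERROR.search(line)
    if match is None:
        return None
    return {
        "comp_id": int(match.group(1), 16),
        "error_code": match.group(2),  # raw text, see the Error Code table
        "error_type": ERROR_TYPES.get(int(match.group(3)), "not listed"),
        "bit_pin": int(match.group(4)),
    }


sample = ("Jul 14 12:30:12 2003 sc0 dsmd[1686]: [2552 277534570695148 ERR "
          "SoftErrorHandler.cc 675] Soft Error: Comp ID : 0x76 Error Code: 1 "
          "Error Type: 1 Error Bit/Pin: 25")
print(parse_soft_error(sample))
```

For the sample line, this gives Comp ID 0x76, Error Code 1, a single bit error, and Bit/Pin 25.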

Replace the component identified through use of this document when the event is Uncorrectable. For a Correctable event, replace the component only as directed by Best Practices recommendations. Contact Sun Support Services for details of Best Practices or to open a service request for an error of this type. When contacting service, reference this document and provide log data (SC explorer data preferred) so that support can confirm the analysis.



Product
Sun Fire 15K Server
Sun Fire 12K Server
Sun Fire E25K Server
Sun Fire E20K Server


Internal Section

CAUTION
Using the /var/opt/SUNWSMS/SMS/adm/platform/messages file alone to identify an E$ DIMM failure is not recommended. This document is intended solely to explain how to translate these error messages into the correct DIMM being reported in error. Standard troubleshooting of E$ related errors should involve analysis of rstop/dstop/xcstate files and POST logs, following the instructions laid out in the FINs, Sun Alerts, and Problem Resolution articles that relate to analysis of these types of errors. This error alone only identifies the E$ DIMM issue and should be used solely as confirmation of the analysis of the other files previously mentioned.

References:
  • See Technical Instruction Document 1004903.1 for details about the above error codes.
  • See here for a picture of a USIII system board.
  • See here for a picture of a USIV system board.
  • FAB Document 1000799.1: Too many Memory DIMMs are being unnecessarily replaced on the UltraSPARC II, III, III+, IIIi, and IV families of systems, increasing customers' service actions and related system downtime.
  • Document 1004903.1: Event Messages for UltraSPARC-III[R], UltraSPARC-III+[R], UltraSPARC-IIIi[R], UltraSPARC-IV[R] and UltraSPARC-IV+[R] CPU Modules
  • Document 1018209.1: L2SRAM / DIMM Misdiagnosis Issues
  • Sun Alert Document: 1000922.1 Sun Fire 3800/4800/4810/6800, Sun Fire 12K/15K, Sun Fire V1280, Netra 1280 Server Domains with 900MHz CPUs May Panic or Hang Due to Incorrect L2 SRAM Parameter Settings

Keywords: 12k, 15k, e20k, e25k, memory dimm, system controllers, SC, translate, subslot, slot, SubSlot, Slot, DIMM

Previously Published As 71129



Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.