Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-75-1006517.1
Update Date:2011-06-03
Keywords:

Solution Type  Troubleshooting Sure

Solution  1006517.1 :   Troubleshooting Sun Fire[TM] Uncorrectable CPU and Memory Error(s) on Solaris[TM] 8 and 9  


Related Items
  • Sun Fire E6900 Server
  • Sun Blade 2000 Workstation
  • Sun Fire V480 Server
  • Sun Fire E25K Server
  • Sun Fire 280R Server
  • Sun Fire E20K Server
  • Sun Fire 3800 Server
  • Sun Fire 6800 Server
  • Sun Fire V880z Visualization Server
  • Sun Netra 1280 Server
  • Sun Fire V890 Server
  • Sun Fire E4900 Server
  • Sun Fire 12K Server
  • Sun Netra 20 Server
  • Sun Fire 4800 Server
  • Sun Fire V880 Server
  • Sun Fire V1280 Server
  • Sun Fire 15K Server
  • Sun Fire E2900 Server
  • Sun Blade 1000 Workstation
  • Sun Netra 1290 Server
  • Sun Fire 4810 Server
  • Sun Fire V490 Server
Related Categories
  • GCS>Sun Microsystems>Servers>Misc
  • GCS>Sun Microsystems>Servers>NEBS-Certified Servers
  • GCS>Sun Microsystems>Servers>Midrange Servers
  • GCS>Sun Microsystems>Servers>Entry-Level Servers
  • GCS>Sun Microsystems>Desktops>Workstations
  • GCS>Sun Microsystems>Servers>High-End Servers

PreviouslyPublishedAs
209116


Applies to:

Sun Blade 1000 Workstation
Sun Blade 2000 Workstation
Sun Netra 1280 Server
Sun Netra 1290 Server
Sun Netra 20 Server
All Platforms

Purpose

This document addresses uncorrectable CPU/Memory errors reported on systems running Solaris[TM] 8 and Solaris[TM] 9.

Your system may have one or more of the following symptoms:
  • The system may have rebooted unexpectedly and the cause is unknown.
  • The system may have received UE, ECC errors, or recoverable memory errors.
  • The system may be described as crashed, gone down, paniced, panic'd, panic'ed, panicked, rebooted, or received CPU or memory errors
  • Example error messages which may have been reported are as follows:

A. Uncorrectable ECC error on a read from system memory

Main memory uncorrectable ECC error detected by CPU3 from the bank of DIMMs in Slot A: J8100 J8101 J8201 J8200

SUNW,UltraSPARC-IV:
WARNING: [AFT1] Uncorrectable system bus (UE) Event detected by CPU3 in
Privileged mode at TL=0, errID 0x... AFSR 0x00100004.000000aa AFAR
0x000000a0.0c06f1e0  Fault_PC 0x1015725c Esynd 0x00aa Slot A: J8100
J8101 J8201 J8200
SUNW,UltraSPARC-IV: [AFT1] errID 0x... Two Bits were in error

Main memory uncorrectable ECC error for a prefetch or store queue fill read.

SUNW,UltraSPARC-IV: [ID 581396 kern.warning] WARNING: [AFT1] DUE Event detected by CPU0 at TL=0, errID 0x... AFSR 0x00400000.000000aa AFAR 0x000000a0.0c0ab1f0 Fault_PC 0xff1c1c80 Esynd 0x00aa Slot A: J8100 J8101 J8201 J8200
SUNW,UltraSPARC-IV: [ID 468316 kern.notice] [AFT1] errID 0x... Two Bits were in error

Main memory uncorrectable ECC error detected by Schizo id 9

pcisch: WARNING: uncorrectable error detected by pci0 (safari id 00000000.00000009) during DVMA read transaction
pcisch:     Transaction was a block operation.
pcisch:     dvma access, Memory safari command, address 000000d0.cb1489a0, owned_in not asserted.
pcisch:     AFSR=40000000.89000063 AFAR=000000d0.cb1489a0, quad word offset 00000000.00000002, Memory Module Slot D: J3100 J3101 J3201 J3200 id 9.
pcisch:     mtag 0, mtag ecc syndrome 0

Uncorrectable Mtag ECC errors from main memory cause a fatal reset, domain pause or dstop depending on the platform.

B. CPU Uncorrectable ECC errors

SUNW,UltraSPARC-III+: WARNING: [AFT1] EDU Event detected by CPU1 at TL=0, errID 0x.... AFSR 0x00000018.0000017c AFAR 0x000000a0.0c0ab1f0 Fault_PC 0x1000c19c Esynd 0x017c
SUNW,UltraSPARC-III+: [AFT1] errID 0x.... Four Bits were in error

UCU     uncorrectable E$ ECC event
EDU:ST  uncorrectable E$ ECC event for store merge
EDU:BLD uncorrectable E$ ECC event for block load
WDU     uncorrectable E$ ECC event for writeback (victimization)
CPU     uncorrectable E$ ECC event for copyout (snoop request)
L3_TUE_SH multiple-bit ECC error on L3 cache tag access due to copyback, or tag update from foreign Fireplane device, snoop request
L3_TUE    multiple-bit ECC error on L3 cache tag access due to core specific tag access
L3_EDU    multiple-bit ECC error on L3 cache data access for P-cache and W-cache request
L3_UCU    multiple-bit ECC error on L3 cache data access for I-cache and D-cache request
L3_CPU    multiple-bit ECC error on L3 cache data access for copyout
L3_WDU    multiple-bit ECC error on L3 cache data access for writeback
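
The example [AFT1] messages above share a common layout: an event name, the detecting CPU, the AFSR and AFAR register values, and in some cases an ECC syndrome and a memory slot. The following is a minimal Python sketch of how those fields might be pulled out of a single-line message for side-by-side comparison. The pattern is inferred only from the examples in this document, the function and field names are hypothetical, and the layout may differ on other kernel update patch versions. Messages that wrap across lines on the console would need to be joined into one line first.

```python
import re

# Assumption: the layout matches the example messages in this document.
# Real messages may vary with the kernel update patch version.
AFT_EVENT = re.compile(
    r"\[AFT(?P<level>\d)\]\s+"
    r"(?P<event>.+?)\s+Event detected by CPU(?P<cpu>\d+)"
    r".*?AFSR\s+(?P<afsr>0x[0-9a-fA-F.]+)"
    r"\s+AFAR\s+(?P<afar>0x[0-9a-fA-F.]+)"
    r"(?:.*?Esynd\s+(?P<esynd>0x[0-9a-fA-F]+))?"
    r"(?:\s+Slot\s+(?P<slot>[A-Z]):\s*(?P<dimms>(?:J\d+\s*)+))?"
)


def parse_aft_event(line):
    """Return a dict of fields from one AFT event line, or None if no match."""
    match = AFT_EVENT.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["level"] = int(fields["level"])
    fields["cpu"] = int(fields["cpu"])
    # DIMM locations are only present when the message names a memory slot.
    fields["dimms"] = fields["dimms"].split() if fields["dimms"] else []
    return fields


if __name__ == "__main__":
    sample = (
        "SUNW,UltraSPARC-IV: [ID 581396 kern.warning] WARNING: [AFT1] DUE "
        "Event detected by CPU0 at TL=0, errID 0x... AFSR 0x00400000.000000aa "
        "AFAR 0x000000a0.0c0ab1f0 Fault_PC 0xff1c1c80 Esynd 0x00aa "
        "Slot A: J8100 J8101 J8201 J8200"
    )
    print(parse_aft_event(sample))
```

The sketch only extracts fields. It does not decode the AFSR bits or interpret the syndrome.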

Error Messaging Notes
  • When browsing messages files and observing console output, note that [AFT1] is included in these messages. The 1 denotes the "Asynchronous Fault Trap" level for uncorrectable and unrecoverable errors. AFT0 is used for correctable errors; AFT2 and AFT3 can be ignored in almost all cases.
  • The above error messaging may change slightly depending on your kernel update patch version. 
  • It is important to understand that uncorrectable ECC errors can be reported by multiple components.  At no point will the corrupted data actually be used.
This document does not apply to Solaris[TM] 10, as FMA automates the diagnosis of these types of faults. See <Document:1018939.1> Solaris[TM] 10 Operating System: Displaying the list of Fault Management Architecture (FMA) resources currently believed to be faulted.

If Solaris has not panicked, crashed, or rebooted and you are seeing only correctable errors, see <Document:1006513.1> Troubleshooting Sun Fire[TM] Correctable CPU and Memory Error(s) on Solaris[TM] 8 and 9.
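
As a rough illustration of the AFT levels described in the notes above, the following Python sketch labels a messages-file line by its [AFTn] tag. The function name and the label strings are hypothetical. The mapping simply restates the notes: AFT1 for uncorrectable and unrecoverable errors, AFT0 for correctable errors, and AFT2 and AFT3 as ignorable in almost all cases.

```python
import re

# Matches the [AFT0] .. [AFT3] tag that appears in the error messages.
AFT_TAG = re.compile(r"\[AFT([0-3])\]")


def aft_severity(line):
    """Return a coarse label for the AFT level in a line, or None if absent."""
    match = AFT_TAG.search(line)
    if match is None:
        return None
    level = int(match.group(1))
    if level == 1:
        return "uncorrectable"   # covered by this document
    if level == 0:
        return "correctable"     # see the correctable-error document instead
    return "usually ignorable"   # AFT2 and AFT3


if __name__ == "__main__":
    line = "SUNW,UltraSPARC-IV: [AFT1] errID 0x... Two Bits were in error"
    print(aft_severity(line))
```

A label of "usually ignorable" is a triage hint only; the notes say "almost all cases", not all.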

Last Review Date

June 3, 2011

Instructions for the Reader

A Troubleshooting Guide is provided to assist in debugging a specific issue. When possible, diagnostic tools are included in the document to assist in troubleshooting.

Troubleshooting Details

Steps to Follow
Please validate that each troubleshooting step below is true for your environment.

Each step provides instructions, or a link to a document, for validating the step and taking corrective action as necessary. The steps are ordered in the most appropriate sequence to isolate the issue and identify the proper resolution. Please do not skip a step.

Note:  A service case is likely to be required in almost all cases where uncorrectable errors from CPU or Memory have been observed.

1. Some known software bugs which can cause uncorrectable CPU/Memory errors

  • <Document:1001431.1> Sun Fire V480 and V880 With 900 MHz CPUs May Panic or "Red State" Due to Incorrect L2 SRAM Parameter Settings in Firmware
  • <Document:1000922.1> Sun Fire 3800/4800/4810/6800, Sun Fire 12K/15K, Sun Fire V1280, and Netra 1280 Server Domains with 900MHz CPUs May Panic or Hang Due to Incorrect L2 SRAM Parameter Settings

2. Software bugs where correctable errors can result in a panic

  • See <Document:1006513.1> Troubleshooting correctable CPU/Memory errors on Solaris[TM] 8 and 9.

3. Collect Data to allow Sun Support to progress your call

Uncorrectable errors can generate very large amounts of error information in messages files. Diagnosing a fault from a small number of messages when a thousand have been reported greatly increases the chance of misdiagnosis. On the midrange and high-end platforms, the System Controllers capture extensive hardware-level failure data, which is also important.

Collect at minimum, the following:

  • For diagnosis of the error:
    • /var/adm/messages
    • uname -a
      • To confirm that you are not hitting known error reporting bugs
  • So that the correct FRU can be ordered if required:
    • prtdiag -v
      • Required to see what FRUs are installed.
      • Also contains the OBP revision, for the OBP you can also use prtconf -V
    • prtfru -x
      • FRU part and serial numbers required for some FCO checks and to confirm if a FRU is RoHS or not.
      • On the 3800-6900 class systems the prtfru -x output can only be collected using an explorer
Alternatively, it is much easier to send an explorer, which collects all of the above data and more:
  • Download latest explorer version from <Document:1002383.1> Oracle Explorer Data Collector
    • For the 1280, 1290 and E2900 systems use the command /opt/SUNWexpo/bin/explorer -w default,1280extended
      • See <Document:1009102.1> Sun Fire[TM] V1280, E2900 And Netra[TM] 1280 Servers: Gathering 1280extended Information Using Sun[TM] Explorer Software
    • For the 3800, 4800, 4810, E4900, 6800, and E6900 systems use the command /opt/SUNWexpo/bin/explorer -w default,scextended,fru
      • See <Document:1011830.1> Sun Fire[TM] Servers (3800/4800/4810/6800/E4900/E6900): How to run scextended Explorer
    • If needing to collect loghost data or needing to configure a loghost, see <Document:1008676.1> 'Best Practices' and configuring loghost on Sun Fire[TM] 3800,4800,4900,6800, and E6900 servers
    • For the 12K/15K/E20K/E25K run an explorer with no options on the Domain and Main System Controller.
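
The per-platform explorer invocations listed above can be summarized as a small lookup. The Python sketch below is illustrative only: the model labels used as keys and the function name are hypothetical, and the command strings are copied from the list above. Platforms for which this document gives no specific invocation return None rather than a guess.

```python
EXPLORER = "/opt/SUNWexpo/bin/explorer"

# Hypothetical model labels; "1280" in this document covers the
# Sun Fire V1280 and the Netra 1280.
ENTRY_MIDRANGE_1280 = {"V1280", "Netra 1280", "Netra 1290", "E2900"}
MIDRANGE_SC = {"3800", "4800", "4810", "E4900", "6800", "E6900"}
HIGH_END = {"12K", "15K", "E20K", "E25K"}


def explorer_command(model):
    """Return the explorer argv listed in this document for a model, or None."""
    if model in ENTRY_MIDRANGE_1280:
        return [EXPLORER, "-w", "default,1280extended"]
    if model in MIDRANGE_SC:
        return [EXPLORER, "-w", "default,scextended,fru"]
    if model in HIGH_END:
        # No options; run on both the Domain and the Main System Controller.
        return [EXPLORER]
    return None  # this document lists no specific invocation


if __name__ == "__main__":
    for model in ("E2900", "6800", "E25K"):
        print(model, " ".join(explorer_command(model)))
```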

Note: Explorer version 5.6 and above will take a lot less time to capture the required data due to bugs fixed in the scextended and sf15k modules.


To discuss this information further with Oracle experts and industry peers, we encourage you to review, join or start a discussion in an appropriate My Oracle Support Community, Oracle Sun Technologies Community.

Internal Only Troubleshooting Information

Uncorrectable errors travel
A UE event that begins in memory will be reported by at least one CPU and, in many cases, by multiple CPUs. A UE event that begins with a CPU will be written to memory and then read by other CPUs. As such, diagnosing from the last error message reported prior to a panic will almost always result in misdiagnosis. Diagnosing from the first error message reported is more reliable, but will still often result in misdiagnosis. For best results, look at all error messages and base the diagnosis on the complete set.
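
One way to get an overview of all reported messages, rather than only the first or last, is to tally which CPUs reported [AFT1] events and which memory slots were named. The Python sketch below assumes single-line messages in the layout of the examples in this document; the function name and result keys are hypothetical. It summarizes only and does not perform a diagnosis.

```python
import re
from collections import Counter

# Assumption: layout follows the example [AFT1] messages in this document.
CPU_RE = re.compile(r"\[AFT1\].*?Event detected by CPU(\d+)")
SLOT_RE = re.compile(r"Slot [A-Z]: (?:J\d+\s*)+")


def summarize_ue_events(lines):
    """Tally [AFT1] events by detecting CPU and by named memory slot."""
    by_cpu = Counter()
    by_slot = Counter()
    first_reporter = None
    for line in lines:
        cpu_match = CPU_RE.search(line)
        if cpu_match is None:
            continue  # not an [AFT1] event line
        cpu = int(cpu_match.group(1))
        if first_reporter is None:
            first_reporter = cpu
        by_cpu[cpu] += 1
        slot_match = SLOT_RE.search(line)
        if slot_match:
            # Key on the full slot text, including the DIMM locations.
            by_slot[slot_match.group(0).strip()] += 1
    return {"first_reporter": first_reporter,
            "by_cpu": by_cpu,
            "by_slot": by_slot}


if __name__ == "__main__":
    with open("/var/adm/messages") as messages:
        print(summarize_ue_events(messages))
```

If many CPUs report the same slot, or one CPU dominates the tally, that is a prompt for the detailed diagnosis documents referenced below, not a conclusion in itself.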

Ensure you have as much of the required information as possible and follow the steps detailed below:
1. Determine if you are dealing with a Memory or CPU fault.

  • See <Document:1006520.1> How to identify whether you have a Memory or CPU ECC Uncorrectable Fault
2. Based on the results of Step 1, you will then diagnose either the CPU or Memory event.
  • <Document:1002430.1> Diagnose Sun Fire[TM] Uncorrectable Error(s) from Memory on Solaris[TM] 8 and 9
  • <Document:1009371.1> Diagnose Sun Fire[TM] Uncorrectable Error(s) from a CPU on Solaris[TM] 8 and 9
3. If FRU replacement has not fixed the fault see the following reference.
  • <Document:1012314.1> What to look for if hardware errors persist after an onsite visit

Not sure of what is at fault?

It is much better to raise an escalation for diagnostic assistance before changing parts. Never simply re-enable parts and re-test FRUs without first understanding why the components were disabled in the first place.


Previously Published As 83224


Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.