Document Audience:INTERNAL
Document ID:I0954-1
Title:Diagnosing Main Memory errors versus L2SRAM errors on Sun Fire 3800/4800/4810/6800, Sun Fire 12K/15K and Sun Fire V1280 systems. SunAlert: No
Copyright Notice:Copyright © 2005 Sun Microsystems, Inc. All Rights Reserved
Update Date:2003-04-09

---------------------------------------------------------
            - Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------  
                        FIELD INFORMATION NOTICE
               (For Authorized Distribution by SunService)
FIN #: I0954-1
Synopsis: Diagnosing Main Memory errors versus L2SRAM errors on Sun Fire 3800/4800/4810/6800, Sun Fire 12K/15K and Sun Fire V1280 systems.
Create Date: Apr/07/03
SunAlert: No
Top FIN/FCO Report: No
Products Reference: Sun Fire 3800/4800/4810/6800/12K/15K/V1280
Product Category: Server / Service
Product Affected: 
Systems Affected:
-----------------  
Mkt_ID     Platform     Model      Description        Serial Number
------     --------     -----      -----------        -------------
  -          LW8         ALL       Netra 1280               -
  -          A40         ALL       Sun Fire V1280           -
  -          S8          ALL       Sun Fire 3800            -
  -          S12         ALL       Sun Fire 4800            -
  -          S12i        ALL       Sun Fire 4810            -
  -          S24         ALL       Sun Fire 6800            -
  -          F12K        ALL       Sun Fire 12000           -
  -          F15K        ALL       Sun Fire 15000           -  


X-Options Affected:
-------------------
Mkt_ID     Platform   Model   Description        Serial Number
------     --------   -----   -----------        -------------
  -           -         -         -                   -


Part Number   Description   Model
-----------   -----------   -----
     -             -          -
References: 
BugID:     4829924 - DUE followed by EDU:ST can be reported in reverse 
                     order. 
           4830028 - US-III L2 SRAM messages need improvement.

PatchID:   108528-18 or higher : SunOS 5.8: kernel update patch 
           112233-04: SunOS 5.9: Kernel Patch

FIN:       I0909-2

URL:       http://onestop/programs/us3quality

Sun Alert: 50471

Infodoc:   43642
Issue Description: 
On Sun Fire 3800/4800/4810/6800, Sun Fire 12K/15K and Sun Fire V1280
systems, main memory DIMM errors may be misdiagnosed as L2SRAM errors,
or L2SRAM errors may be misdiagnosed as main memory errors.  This may
result in the wrong component being replaced, leaving the system
vulnerable to future failures.

L2 SRAM errors may occur when accessing the CPU's Level 2 SRAM cache
memory.  The reports of errors vary with workload and data patterns.
One type of known L2SRAM issue is described in Sun Alert 50471.  See
the Corrective Action section for the recommended SMS, firmware, and
Solaris kernel patches required to resolve this L2SRAM timing issue.

There is a type of error condition, generated by a main memory DIMM,
that is propagated to the L2SRAM and results in a system panic.  When
reviewing the messages file, the last entry in the /var/adm/messages
file appears to call out a bad L2SRAM when in fact the source of the
problem is a DIMM, which may even reside on a completely different
system board than the L2SRAM that is reporting the error.  These types
of errors have been observed to occur when an ECC error occurs on a
"memory prefetch" operation followed by ECC errors on the associated
data.

If the troubleshooting engineer does not perform due diligence on
problem diagnostics and only looks at the last entry in the messages
file, this may lead the engineer to recommend replacing the L2SRAM
(i.e. system board) when in fact the DIMM was the source of the error.


DIMM Errors Misdiagnosed as L2SRAM Errors
=========================================

  NOTE: Refer to Internal Infodoc 43642 for abbreviated definitions 
        for errors such as DUE, EDU, etc.

  This condition appears as Syndrome 0x003 errors
  (i.e. "*Bad* Esynd=0x003") accompanied in the /var/adm/messages file
  by both a "DUE" event (i.e. WARNING: [AFT1] DUE Event on CPU) and an
  "EDU:ST" event (i.e. WARNING: [AFT1] EDU:ST Event on CPU).

  In some cases, the last error captured in the log file that originated
  from a DIMM error may even be reported as a syndrome 0x071 (i.e.
  "*Bad* Esynd=0x071"), which is typically associated with an L2SRAM
  failure, but in these cases it is not an L2SRAM failure.

  Syndrome 0x071 errors are almost always caused by a prior uncorrectable
  error, which may be from either L2SRAM or Main Memory.  You need to
  find the original uncorrectable error that caused the corruption.

  Syndrome 0x003 and Syndrome 0x11c errors should be ignored if the
  error is a UCU, EDU, WDU or CPU Event, but not if the error is a DUE
  or UE.  If a syndrome 0x003 or 0x11c error in L2SRAM is flushed to
  memory, it is turned into a Syndrome 0x071 error in memory.
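
  The 0x003/0x11c rule above can be sketched as a small Python check.
  This is only an illustration: the function and set names are
  hypothetical, and treating "EDU:ST" as a form of EDU is an assumption
  made here, not something the rule states explicitly.

```python
# Hypothetical helper names; only the syndrome values and event types
# come from the text above.
SPECIAL_SYNDROMES = {0x003, 0x11C}
IGNORABLE_EVENTS = {"UCU", "EDU", "WDU", "CPU"}
NEVER_IGNORED = {"DUE", "UE"}


def ignore_special_syndrome(event_type: str, esynd: int) -> bool:
    """Return True if the 0x003/0x11c rule says to ignore this event."""
    if esynd not in SPECIAL_SYNDROMES:
        return False  # the rule says nothing about other syndromes
    if event_type in NEVER_IGNORED:
        return False  # a DUE or UE is never ignored under this rule
    # Assumption: "EDU:ST" is treated like a plain "EDU" here.
    base_type = event_type.split(":")[0]
    return base_type in IGNORABLE_EVENTS
```

  With this sketch, an EDU:ST carrying syndrome 0x003 is ignored, while
  a DUE carrying the same syndrome is not.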

  The point to be noted when such cases are encountered is that the
  device reporting the DUE (not to be confused with the EDU and EDU:ST
  also reported during this event) is the source of the error and the
  device reporting the EDU:ST (not to be confused with a plain EDU) is
  the recipient of the error.  The diagnostic engineer must retrace the
  events that led up to the error because, in most cases, the EDU:ST is
  the last entry in the /var/adm/messages file, which might lead to
  replacing the recipient of the error and not the source of the error.

  It is also important to note that in most cases the DUE (source of the
  error) will precede the EDU:ST (destination of the error) in the
  messages file, but this may not always be the case.  What is important
  is that when the DUE and the EDU:ST are seen together, the DUE is
  reporting the source of the error and the EDU:ST is the destination of
  the error.  Proper matching can be performed by examining the AFAR
  associated with each event.  Events with AFARs that are the same when
  rounded down to either a 32-byte or a 64-byte boundary can be
  associated with each other, irrespective of the order in which they
  occur.  The EDU:ST syndrome will also match the DUE syndrome, or be one
  of the special syndromes of 0x003 or 0x071.
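
  The matching rule above can be sketched in Python as follows.  The
  function names are hypothetical; the sketch assumes the AFAR is taken
  as printed in the messages file (0xHHHHHHHH.LLLLLLLL) and that the
  boundaries are powers of two, as 32 and 64 are.

```python
def parse_reg(text: str) -> int:
    """Convert a value printed as 0xHHHHHHHH.LLLLLLLL into an integer."""
    return int(text.replace(".", ""), 16)


def round_down(afar: int, boundary: int) -> int:
    """Round an address down to a power-of-two boundary."""
    return afar & ~(boundary - 1)


def afars_match(a: int, b: int) -> bool:
    """True if the AFARs agree at either a 32-byte or a 64-byte boundary."""
    return any(round_down(a, n) == round_down(b, n) for n in (32, 64))


def edu_st_matches_due(edu_afar, edu_synd, due_afar, due_synd) -> bool:
    """Apply both the AFAR test and the syndrome test described above."""
    synd_ok = edu_synd == due_synd or edu_synd in (0x003, 0x071)
    return afars_match(edu_afar, due_afar) and synd_ok
```

  For instance, an EDU:ST at AFAR 0x00000000.35dd9790 with syndrome
  0x003 matches a DUE at AFAR 0x00000000.35dd9780 with syndrome 0x1b6,
  because both addresses round down to 0x35dd9780.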

  If multiple DUEs occur to different AFARs, multiple EDU:STs may be
  interleaved among them.

  It is also important to note that sometimes what is really an EDU:ST
  will be reported as a plain EDU.  This has been seen when both the ME
  and EDU bits are on in the AFSR, and also when both the DUE and EDU
  bits are on in the AFSR.  There may be other combinations of bits that
  will cause an EDU:ST to masquerade as a plain EDU.  A plain EDU can be
  assumed to be an EDU:ST if it has the same errID as a DUE or if it can
  be matched via its AFAR and syndrome to a DUE.
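
  A minimal Python sketch of this check is shown below.  The Event
  record and the function names are hypothetical, and the sketch uses
  the 64-byte boundary for the AFAR comparison, since any two AFARs
  that agree at a 32-byte boundary also agree at a 64-byte boundary.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Event:
    """Hypothetical record holding the fields used for matching."""
    kind: str      # e.g. "EDU", "DUE"
    err_id: str    # errID exactly as printed in the messages file
    afar: int
    esynd: int


def same_line(a: int, b: int) -> bool:
    """True if both addresses round down to the same 64-byte boundary."""
    return (a & ~63) == (b & ~63)


def edu_is_really_edu_st(edu: Event, dues: Iterable[Event]) -> bool:
    """Treat a plain EDU as an EDU:ST if it can be tied to some DUE."""
    for due in dues:
        if edu.err_id == due.err_id:
            return True  # reported together under one errID
        synd_ok = edu.esynd in (due.esynd, 0x003, 0x071)
        if same_line(edu.afar, due.afar) and synd_ok:
            return True  # matched via AFAR and syndrome
    return False
```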

  Note that one can substitute ordinary UEs for DUEs above, and the same
  rules apply; if an EDU:ST is associated (by AFAR) with a UE, ignore
  it.  Because UEs are more likely to bring the system down right away,
  however, misdiagnosis due to associated EDU:STs is less likely.

  Note also that when this condition occurs, the error exhibited on the
  victim L2SRAM may be reported on one or more L2SRAMs, which may span
  one or more system boards.

  When reviewing the messages that call out L2SRAM during this condition,
  the EDU, EDU:ST, WDU, CPU, UCU, and UE event messages may display error
  text such as the following:

       "likely from E$ WDU/CPU"
       "likely from E$ EDU:ST" 

  While the message is technically correct, as stated previously, the
  L2SRAM is the victim of the error and the error originated from main
  memory. 

L2SRAM Errors Misdiagnosed as DIMM Errors
=========================================

  It should also be noted that blind application of the above could lead
  one to misdiagnose true L2SRAM errors as DIMM errors.  A main memory
  DIMM UE or DUE with a syndrome of 0x071 is most likely a secondary
  error and should be ignored for diagnostic purposes.

  Similarly, L2SRAM xxU events with syndromes of 0x003, 0x071, and 0x11c
  are most likely secondary errors and should be ignored.  An L2SRAM xxU
  event with a syndrome other than 0x003, 0x071, or 0x11c, that can be
  matched with a UE or DUE event with the same AFAR (rounded down to
  either a 32-byte or a 64-byte boundary), especially if it has the same
  syndrome, should also be ignored.

  A UE or DUE with a syndrome other than 0x071 may indicate a possible
  memory error, but this must be confirmed by correlating it with the
  recordstop (F15K) or loghost logs (F3800/4800/4810/6800) information.
  Similarly, an L2SRAM xxU event with a syndrome other than 0x003, 0x071,
  or 0x11c that cannot be matched with a UE or DUE event with the same
  AFAR (rounded down to either a 32-byte or a 64-byte boundary) AND the
  same syndrome may indicate a possible L2SRAM error, but again the
  recordstop (F15K) or loghost logs (F3800/4800/4810/6800) information
  needs to be checked to make sure of this.
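
  The triage rules in this section can be summarized in a short Python
  sketch.  The function names and result strings are hypothetical, and
  the sketch takes the stricter reading of the text: an L2SRAM event
  that matches a UE or DUE by AFAR alone is treated as secondary.  The
  result is only a first-pass hint; the recordstop or loghost logs
  remain the deciding evidence.

```python
SECONDARY_L2_SYNDROMES = {0x003, 0x071, 0x11C}


def same_line(a: int, b: int) -> bool:
    """True if both addresses round down to the same 64-byte boundary."""
    return (a & ~63) == (b & ~63)


def triage_memory_event(esynd: int) -> str:
    """First-pass reading of a main memory UE or DUE."""
    if esynd == 0x071:
        return "secondary: ignore"
    return "possible DIMM error: confirm with recordstop/loghost logs"


def triage_l2_event(esynd: int, afar: int, mem_afars) -> str:
    """First-pass reading of an L2SRAM xxU event.

    mem_afars holds the AFARs of the UE and DUE events in the same log.
    """
    if esynd in SECONDARY_L2_SYNDROMES:
        return "secondary: ignore"
    if any(same_line(afar, mem_afar) for mem_afar in mem_afars):
        return "secondary: ignore"
    return "possible L2SRAM error: confirm with recordstop/loghost logs"
```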

  Sometimes correctable errors are also reported identifying a particular
  DIMM in the same bank as identified by the UE or DUE reports.  This can
  help focus attention on a suspect DIMM.
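
  As a rough illustration, the correlation between CE reports and a UE
  or DUE bank can be done by comparing component paths.  The function
  names below are hypothetical, and the sketch assumes that a DIMM path
  is the bank path followed by a "/Dn" component, as in the messages
  shown in this document.

```python
def dimm_in_bank(dimm_path: str, bank_path: str) -> bool:
    """True if the DIMM path lies under the given bank path."""
    bank = bank_path.rstrip("/")
    return dimm_path.rstrip("/").startswith(bank + "/")


def suspect_dimms(ce_dimm_paths, bank_path):
    """DIMMs named in CE reports that sit in the UE/DUE bank, in log order."""
    suspects = []
    for path in ce_dimm_paths:
        if dimm_in_bank(path, bank_path) and path not in suspects:
            suspects.append(path)
    return suspects
```

  For example, CE reports naming /N0/SB4/P2/B1/D2 would be associated
  with a DUE that calls out bank /N0/SB4/P2/B1.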

  The associated recordstop file on the F15K server and the loghost logs
  on 3800/4800/4810/6800 systems contain additional messages which are
  critical in proper diagnosis of the L2SRAM and DIMM errors.


  The following examples are indicative of main memory DIMM errors that 
  could be misdiagnosed as L2SRAM errors.

  EXAMPLE 1:
  ==========

    The following is an example of the condition where the last device
    reported in the messages file is an L2SRAM, but in fact the error
    originated on a DIMM.  Each section contains a description of what
    the /var/adm/messages file is reporting and what the diagnosing
    engineer should be reviewing.

-------------------------------------------------------------------------------

  Section 1:
  ----------

  A Correctable Error (CE) condition occurred on /N0/SB4/P2/B1/D2 DIMM
  J15501.  Bit 116 is identified as the troublesome bit.  In the cache
  dump it has the value "1", which means it read as "0" before being
  corrected.

  Feb  6 17:46:23 la001 SUNW,UltraSPARC-III+: [ID 226472 kern.notice] NOTICE: 
     [AFT0] Corrected system bus (CE) Event on CPU18 at TL=0, errID 
     0x0000240b.240fca00
  Feb  6 17:46:23 la001   AFSR 0x00000002.00000070 AFAR 0x00000000.35e61780
  Feb  6 17:46:23 la001   Fault_PC 0x1009ebc8 Esynd 0x0070 /N0/SB4/P2/B1/D2 
     J15501
  Feb  6 17:46:23 la001 SUNW,UltraSPARC-III+: [ID 991331 kern.notice] 
     [AFT0] errID 0x0000240b.240fca00 Corrected Memory Error on /N0/SB4/P2/B1/D2 
     J15501 is Intermittent
  Feb  6 17:46:23 la001 SUNW,UltraSPARC-III+: [ID 477466 kern.notice] 
     [AFT0] errID 0x0000240b.240fca00 Data Bit 116 was in error and corrected
  Feb  6 17:46:23 la001 SUNW,UltraSPARC-III+: [ID 135948 kern.info] 
     [AFT2] errID 0x0000240b.240fca00 PA=0x00000000.35e61780
  Feb  6 17:46:23 la001     E$tag 0x00000000.d7124924 E$state_6 Modified
  Feb  6 17:46:23 la001 SUNW,UltraSPARC-III+: [ID 895151 kern.info] 
     [AFT2] E$Data (0x00) 0x6d345f73.68617265 0x61726773.006c6d5f ECC 0x17e
  Feb  6 17:46:23 la001 SUNW,UltraSPARC-III+: [ID 895151 kern.info] 
     [AFT2] E$Data (0x10) 0x6164645f.626c6f63 0x6b006c6d.5f676c6f ECC 0x121
  Feb  6 17:46:23 la001 SUNW,UltraSPARC-III+: [ID 895151 kern.info] 
     [AFT2] E$Data (0x20) 0x62616c5f.6e6c6d69 0x64005f69.6e697400 ECC 0x0d2
  Feb  6 17:46:23 la001 SUNW,UltraSPARC-III+: [ID 895151 kern.info] 
     [AFT2] E$Data (0x30) 0x7864725f.6e6c6d5f 0x6c6f636b.61726773 ECC 0x15c
  Feb  6 17:46:23 la001 SUNW,UltraSPARC-III+: [ID 929717 kern.info] 
     [AFT2] D$ data not available
  Feb  6 17:46:23 la001 SUNW,UltraSPARC-III+: [ID 335345 kern.info] 
     [AFT2] I$ data not available
-------------------------------------------------------------------------------

  Section 2:
  ----------

  Another Correctable Error (CE) occurs on /N0/SB4/P2/B1/D2 DIMM
  J15501.  This time bit 117 is identified as the troublesome bit.  It
  has the value "0" in the cache dump, which means it read as "1"
  before being corrected.

  Feb  6 17:46:30 la001 SUNW,UltraSPARC-III+: [ID 621556 kern.notice] NOTICE: 
     [AFT0] Corrected system bus (CE) Event on CPU18 at TL=0, errID 
     0x0000240b.244ed150
  Feb  6 17:46:30 la001   AFSR 0x00000002.000001e8 AFAR 0x00000000.3549db90
  Feb  6 17:46:30 la001   Fault_PC 0x100336e4 Esynd 0x01e8 /N0/SB4/P2/B1/D2 
     J15501
  Feb  6 17:46:30 la001 SUNW,UltraSPARC-III+: [ID 700027 kern.notice] 
     [AFT0] errID 0x0000240b.244ed150 Corrected Memory Error on /N0/SB4/P2/B1/D2 
     J15501 is Intermittent
  Feb  6 17:46:30 la001 SUNW,UltraSPARC-III+: [ID 714893 kern.notice] 
     [AFT0] errID 0x0000240b.244ed150 Data Bit 117 was in error and corrected
  Feb  6 17:46:30 la001 SUNW,UltraSPARC-III+: [ID 195358 kern.info] 
     [AFT2] errID 0x0000240b.244ed150 PA=0x00000000.3549db80
  Feb  6 17:46:30 la001     E$tag 0x00000000.d5900124 E$state_6 Modified
  Feb  6 17:46:30 la001 SUNW,UltraSPARC-III+: [ID 895151 kern.info] 
     [AFT2] E$Data (0x00) 0x00000000.00000000 0x00000300.0a8fa970 ECC 0x00f
  Feb  6 17:46:30 la001 SUNW,UltraSPARC-III+: [ID 895151 kern.info] 
     [AFT2] E$Data (0x10) 0x00000001.0c43db60 0x00000300.0bdab890 ECC 0x154
  Feb  6 17:46:30 la001 SUNW,UltraSPARC-III+: [ID 895151 kern.info] 
     [AFT2] E$Data (0x20) 0x00000000.00000000 0x00600188.7b730000 ECC 0x041
  Feb  6 17:46:30 la001 SUNW,UltraSPARC-III+: [ID 895151 kern.info] 
     [AFT2] E$Data (0x30) 0x00000000.00000000 0x00000300.0a4fa998 ECC 0x014
  Feb  6 17:46:30 la001 SUNW,UltraSPARC-III+: [ID 929717 kern.info] 
     [AFT2] D$ data not available
  Feb  6 17:46:30 la001 SUNW,UltraSPARC-III+: [ID 335345 kern.info] 
     [AFT2] I$ data not available
-------------------------------------------------------------------------------

  Section 3:
  ----------

  A DUE and an EDU:ST occur together.  Because they occur together,
  Solaris reports the EDU:ST as a plain EDU event.  It also prints the
  EDU report first, even though the DUE actually occurred first.  We
  know they occurred together because both reports have the same errID
  (0x0000240e.76f08db0), so we assume they are matched.
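
  Grouping reports by errID can be sketched in Python as shown below.
  The regular expression and function name are hypothetical, and the
  sketch assumes each message occupies a single line, as it does in the
  actual messages file, rather than being wrapped as in this document.

```python
import re
from collections import defaultdict

# Matches the first line of an AFT event report, for example
# "[AFT1] EDU:ST Event on CPU9 at TL=0, errID 0x0000240e.76fd8150".
EVENT_RE = re.compile(
    r"\(?(?P<kind>[A-Z][A-Z:]*)\)? Event (?:on|detected by) CPU(?P<cpu>\d+)"
    r" at TL=\d+, errID (?P<errid>0x[0-9a-f]+\.[0-9a-f]+)"
)


def group_by_errid(lines):
    """Map each errID to the list of event types reported under it."""
    groups = defaultdict(list)
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            groups[match.group("errid")].append(match.group("kind"))
    return dict(groups)


# Event header lines from Sections 3 and 4, joined onto single lines.
PREFIX = "Feb  6 17:46:37 la001 SUNW,UltraSPARC-III+: "
SAMPLE = [
    PREFIX + "[ID 487947 kern.warning] WARNING: [AFT1] EDU Event on CPU9 "
    "at TL=0, errID 0x0000240e.76f08db0",
    PREFIX + "[ID 583311 kern.warning] WARNING: [AFT1] DUE Event on CPU9 "
    "at TL=0, errID 0x0000240e.76f08db0",
    "Feb  6 17:46:45 la001 SUNW,UltraSPARC-III+: [ID 911731 kern.warning] "
    "WARNING: [AFT1] EDU:ST Event on CPU9 at TL=0, errID 0x0000240e.76fd8150",
]
```

  Calling group_by_errid(SAMPLE) places the EDU and the DUE under the
  same errID, while the later EDU:ST falls under its own errID and has
  to be matched by AFAR instead.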

  The DUE (Uncorrectable system bus data ECC for prefetch queue) event
  calls out DIMM bank /N0/SB4/P2/B1.  CPU9 and its L2 bank
  /N0/SB2/P1/E0 J5400 are innocent victims of the DUE.

  Note that the cache dump (the lines containing the string "E$Data")
  shows two syndromes, the 0x1b6 on the even checkwords, that is also
  reported in the AFSR, and an 0x02d on the odd checkwords.  This is
  the data as CPU9 received it.

  Note also the "5" and the "a" in the third nibble from the left in
  each checkword.  This is the nibble that contains bits 116 and 117,
  which were identified as troublesome in the CE reports, although now
  it appears that bits 118 and 119 may also be affected in the even and
  odd checkwords, respectively.  (A syndrome of 0x1b6 is consistent
  with a flip in both data bit 116 and 118.  Similarly, a syndrome of
  0x02d is consistent with a flip in both data bit 117 and 119.)
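
  The nibble reading above can be checked with a short Python sketch.
  The function name is hypothetical, and the sketch assumes that each
  E$Data line prints one 128-bit checkword as two 64-bit words, with
  data bit 127 leftmost and data bit 0 rightmost.  That numbering is
  consistent with the statement that the third nibble from the left
  holds bits 116 and 117.

```python
def checkword_nibble(first_word: int, second_word: int, index_from_left: int):
    """Return (nibble value, data-bit numbers set in that nibble)."""
    value = (first_word << 64) | second_word
    shift = 124 - 4 * index_from_left   # lowest bit number in the nibble
    nibble = (value >> shift) & 0xF
    bits = [shift + i for i in range(4) if (nibble >> i) & 1]
    return nibble, bits


# E$Data (0x00) and (0x10) from the EDU report in this section.
EVEN = checkword_nibble(0x005000001044C758, 0x0000000000000008, 2)
ODD = checkword_nibble(0x00A029CF0100FFF1, 0x0000000010428830, 2)
```

  Here EVEN evaluates to (0x5, [116, 118]) and ODD to (0xa, [117, 119]),
  which are the bit pairs named above for the even and odd checkwords.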

  The Invalid AFSR message can be ignored.  It is an artifact of a
  misunderstanding between Solaris and the CPU that will be fixed in a
  future release.

  Feb  6 17:46:37 la001 SUNW,UltraSPARC-III+: [ID 487947 kern.warning] 
     WARNING: [AFT1] EDU Event on CPU9 at TL=0, errID 0x0000240e.76f08db0
  Feb  6 17:46:37 la001     AFSR 0x00500000.000001b6 AFAR 
     0x00000000.35dd9780 AMBIGUOUS
  Feb  6 17:46:37 la001     Fault_PC 0x1000ba50 Esynd 0x01b6 AMBIGUOUS 
     /N0/SB2/P1/E0 J5400
  Feb  6 17:46:37 la001 SUNW,UltraSPARC-III+: [ID 907614 kern.notice] 
     [AFT1] errID 0x0000240e.76f08db0 Two Bits were in error
  Feb  6 17:46:37 la001 SUNW,UltraSPARC-III+: [ID 693922 kern.info] 
     [AFT2] errID 0x0000240e.76f08db0 PA=0x00000000.35dd9780
  Feb  6 17:46:37 la001     E$tag 0x00000000.d7124924 E$state_6 Modified
  Feb  6 17:46:37 la001 SUNW,UltraSPARC-III+: [ID 819380 kern.info] 
     [AFT2] E$Data (0x00) 0x00500000.1044c758 0x00000000.00000008 ECC 0x082 
     *Bad* Esynd=0x1b6
  Feb  6 17:46:37 la001 SUNW,UltraSPARC-III+: [ID 819380 kern.info] 
     [AFT2] E$Data (0x10) 0x00a029cf.0100fff1 0x00000000.10428830 ECC 0x1e3 
     *Bad* Esynd=0x02d
  Feb  6 17:46:37 la001 SUNW,UltraSPARC-III+: [ID 819380 kern.info] 
     [AFT2] E$Data (0x20) 0x00500000.00000008 0x000029dd.0200fff1 ECC 0x1a7 
     *Bad* Esynd=0x1b6
  Feb  6 17:46:37 la001 SUNW,UltraSPARC-III+: [ID 819380 kern.info] 
     [AFT2] E$Data (0x30) 0x00a00000.10016cf4 0x00000000.00000024 ECC 0x076 
     *Bad* Esynd=0x02d
  Feb  6 17:46:37 la001 SUNW,UltraSPARC-III+: [ID 929717 kern.info] 
     [AFT2] D$ data not available
  Feb  6 17:46:37 la001 SUNW,UltraSPARC-III+: [ID 335345 kern.info] 
     [AFT2] I$ data not available
  Feb  6 17:46:37 la001 unix: [ID 321153 kern.notice] NOTICE: Scheduling 
     clearing of error on page 0x00000000.35dd8000
  Feb  6 17:46:37 la001 unix: [ID 221039 kern.notice] NOTICE: Previously 
     reported error on page 0x00000000.35dd8000 cleared
  Feb  6 17:46:37 la001 SUNW,UltraSPARC-III+: [ID 583311 kern.warning] 
     WARNING: [AFT1] DUE Event on CPU9 at TL=0, errID 0x0000240e.76f08db0
  Feb  6 17:46:37 la001     AFSR 0x00500000.000001b6 AFAR 
     0x00000000.35dd9780
  Feb  6 17:46:37 la001    Fault_PC 0x1000ba50 Esynd 0x01b6  /N0/SB4/P2/B1
  Feb  6 17:46:37 la001 SUNW,UltraSPARC-III+: [ID 907614 kern.notice] 
     [AFT1] errID 0x0000240e.76f08db0 Two Bits were in error
  Feb  6 17:46:37 la001 unix: [ID 321153 kern.notice] 
     NOTICE: Scheduling clearing of error on page 0x00000000.35dd8000
  Feb  6 17:46:38 la001 unix: [ID 221039 kern.notice] 
     NOTICE: Previously reported error on page 0x00000000.35dd8000 cleared
  Feb  6 17:46:38 la001 SUNW,UltraSPARC-III+: [ID 647234 kern.warning] 
     WARNING: [AFT1] Invalid AFSR CPU9  at TL=0, errID 0x0000240e.76f377a0
  Feb  6 17:46:38 la001     AFSR 0x00000000.00000000 AFAR 0x00000000.35dd9780 
     INVALID
  Feb  6 17:46:38 la001     Fault_PC 0x1009ebc4
-------------------------------------------------------------------------------

  Section 4:
  ----------

  Following the DUE, an EDU:ST (Uncorrectable Ecache data ECC error for
  store merge or block load or prefetch queue operation) is reported
  calling out /N0/SB2/P1/E1, but this is the destination of the error.

  Note that the AFAR is 0x00000000.35dd9790, which is the same as the
  DUE AFAR (0x00000000.35dd9780) when both are rounded down to a
  32-byte boundary.  It also has an Esynd of 0x0003.

  Feb  6 17:46:45 la001 SUNW,UltraSPARC-III+: [ID 911731 kern.warning] 
     WARNING: [AFT1] EDU:ST Event on CPU9 at TL=0, errID 0x0000240e.76fd8150
  Feb  6 17:46:45 la001     AFSR 0x00000008.00000003 AFAR 
     0x00000000.35dd9790
  Feb  6 17:46:45 la001     Fault_PC 0x1009eb4c Esynd 0x0003 /N0/SB2/P1/E1 
     J5300
  Feb  6 17:46:45 la001 SUNW,UltraSPARC-III+: [ID 662991 kern.notice] 
     [AFT1] errID 0x0000240e.76fd8150 Two Bits were in error
  Feb  6 17:46:45 la001 SUNW,UltraSPARC-III+: [ID 102248 kern.info] 
     [AFT2] errID 0x0000240e.76fd8150 PA=0x00000000.35dd9780
  Feb  6 17:46:45 la001     E$tag 0x00000000.d7924924 E$state_6 Modified
  Feb  6 17:46:45 la001 SUNW,UltraSPARC-III+: [ID 819380 kern.info] 
     [AFT2] E$Data (0x00) 0x00000000.1044c758 0x00000000.00000008 ECC 0x081 
     *Bad* Esynd=0x003
  Feb  6 17:46:45 la001 SUNW,UltraSPARC-III+: [ID 819380 kern.info] 
     [AFT2] E$Data (0x10) 0x000029cf.0100fff1 0x00000000.10428830 ECC 0x1e0 
     *Bad* Esynd=0x003
  Feb  6 17:46:45 la001 SUNW,UltraSPARC-III+: [ID 895151 kern.info] 
     [AFT2] E$Data (0x20) 0x00000000.00000008 0x000029dd.0200fff1 ECC 0x1a7
  Feb  6 17:46:45 la001 SUNW,UltraSPARC-III+: [ID 895151 kern.info] 
     [AFT2] E$Data (0x30) 0x00000000.10016cf4 0x00000000.00000024 ECC 0x076
  Feb  6 17:46:45 la001 SUNW,UltraSPARC-III+: [ID 929717 kern.info] 
     [AFT2] D$ data not available
  Feb  6 17:46:45 la001 SUNW,UltraSPARC-III+: [ID 335345 kern.info] 
     [AFT2] I$ data not available
  Feb  6 17:46:45 la001 unix: [ID 321153 kern.notice] 
     NOTICE: Scheduling clearing of error on page 0x00000000.35dd8000
  Feb  6 17:46:45 la001 unix: [ID 868141 kern.warning] 
     WARNING: Uncorrectable Error occurred at PA 0x00000000.35dd9780 while 
     attempting to clear previously reported error; page removed from service
-------------------------------------------------------------------------------

  Section 5:
  ----------

  This is similar to the messages in Section 3.  An EDU and DUE are
  reported together (same errID), as is an Invalid AFSR that can be
  ignored.  The DUE indicts /N0/SB4/P2/B1.  Note that the AFAR
  (0x00000000.35de5790) is different from the Section 3 AFAR
  (0x00000000.35dd9780), but the syndromes in the third and fourth
  checkwords are similar (0x1b6 and 0x02d, respectively).  This is
  consistent with a bad DRAM on a memory DIMM, but it is also consistent
  with a component writing bad data into memory.  The recordstop logs and
  loghost logs need to be consulted to determine the true source of the
  error.

  Feb  6 17:46:51 la001 SUNW,UltraSPARC-III+: [ID 638863 kern.warning] 
     WARNING: [AFT1] EDU Event on CPU9 at TL=0, errID 0x0000240e.77173e60
  Feb  6 17:46:51 la001     AFSR 0x00500000.0000002d AFAR 
     0x00000000.35de5790 AMBIGUOUS
  Feb  6 17:46:51 la001     Fault_PC 0x100071bc Esynd 0x002d AMBIGUOUS 
     /N0/SB2/P1/E1 J5300
  Feb  6 17:46:51 la001 SUNW,UltraSPARC-III+: [ID 432893 kern.notice] 
     [AFT1] errID 0x0000240e.77173e60 Two Bits were in error
  Feb  6 17:46:51 la001 SUNW,UltraSPARC-III+: [ID 892145 kern.info] 
     [AFT2] errID 0x0000240e.77173e60 PA=0x00000000.35de5780
  Feb  6 17:46:51 la001     E$tag 0x00000000.d7124924 E$state_6 Modified
  Feb  6 17:46:51 la001 SUNW,UltraSPARC-III+: [ID 819380 kern.info] 
     [AFT2] E$Data (0x00) 0x00000000.1012529c 0x00000000.00000018 ECC 0x0eb 
     *Bad* Esynd=0x003
  Feb  6 17:46:51 la001 SUNW,UltraSPARC-III+: [ID 819380 kern.info] 
     [AFT2] E$Data (0x10) 0x00a18d18.0100fff1 0x00000000.104a41e0 ECC 0x175 
     *Bad* Esynd=0x003
  Feb  6 17:46:51 la001 SUNW,UltraSPARC-III+: [ID 819380 kern.info] 
     [AFT2] E$Data (0x20) 0x00500000.00000008 0x00018d49.0200fff1 ECC 0x150 
     *Bad* Esynd=0x1b6
  Feb  6 17:46:51 la001 SUNW,UltraSPARC-III+: [ID 819380 kern.info] 
     [AFT2] E$Data (0x30) 0x00a00000.100940bc 0x00000000.000000d0 ECC 0x08e 
     *Bad* Esynd=0x02d
  Feb  6 17:46:51 la001 SUNW,UltraSPARC-III+: [ID 929717 kern.info] 
     [AFT2] D$ data not available
  Feb  6 17:46:51 la001 SUNW,UltraSPARC-III+: [ID 335345 kern.info] 
     [AFT2] I$ data not available
  Feb  6 17:46:51 la001 unix: [ID 321153 kern.notice] 
     NOTICE: Scheduling clearing of error on page 0x00000000.35de4000
  Feb  6 17:46:52 la001 unix: [ID 221039 kern.notice] 
     NOTICE: Previously reported error on page 0x00000000.35de4000 cleared
  Feb  6 17:46:52 la001 SUNW,UltraSPARC-III+: [ID 998431 kern.warning] 
     WARNING: [AFT1] DUE Event on CPU9 at TL=0, errID 0x0000240e.77173e60
  Feb  6 17:46:52 la001     AFSR 0x00500000.0000002d AFAR 
     0x00000000.35de5790
  Feb  6 17:46:52 la001     Fault_PC 0x100071bc Esynd 0x002d  /N0/SB4/P2/B1
  Feb  6 17:46:52 la001 SUNW,UltraSPARC-III+: [ID 432893 kern.notice] 
     [AFT1] errID 0x0000240e.77173e60 Two Bits were in error
  Feb  6 17:46:52 la001 unix: [ID 321153 kern.notice] 
     NOTICE: Scheduling clearing of error on page 0x00000000.35de4000
  Feb  6 17:46:52 la001 unix: [ID 221039 kern.notice] 
     NOTICE: Previously reported error on page 0x00000000.35de4000 cleared
  Feb  6 17:46:52 la001 SUNW,UltraSPARC-III+: [ID 379899 kern.warning] 
     WARNING: [AFT1] Invalid AFSR CPU9  at TL=0, errID 0x0000240e.77197090
  Feb  6 17:46:52 la001     AFSR 0x00000000.00000000 AFAR 0x00000000.35de5790 
     INVALID
  Feb  6 17:46:52 la001     Fault_PC 0x1009ebc4
-------------------------------------------------------------------------------

  Section 6:
  ----------

  Similar to Section 4, this is a subsequent EDU:ST which follows the
  Section 5 DUE event.  The EDU:ST AFAR (0x00000000.35de5790) is
  identical to the Section 5 DUE AFAR (0x00000000.35de5790).

  Feb  6 17:46:59 la001 SUNW,UltraSPARC-III+: [ID 254914 kern.warning] 
     WARNING: [AFT1] EDU:ST Event on CPU9 at TL=0, errID 0x0000240e.771e6500
  Feb  6 17:46:59 la001     AFSR 0x00000008.00000003 AFAR 
     0x00000000.35de5790
  Feb  6 17:46:59 la001     Fault_PC 0x1009eb4c Esynd 0x0003 /N0/SB2/P1/E1 
     J5300
  Feb  6 17:46:59 la001 SUNW,UltraSPARC-III+: [ID 402940 kern.notice] 
     [AFT1] errID 0x0000240e.771e6500 Two Bits were in error
  Feb  6 17:46:59 la001 SUNW,UltraSPARC-III+: [ID 317376 kern.info] 
     [AFT2] errID 0x0000240e.771e6500 PA=0x00000000.35de5780
  Feb  6 17:46:59 la001     E$tag 0x00000000.d7924924 E$state_6 Modified
  Feb  6 17:46:59 la001 SUNW,UltraSPARC-III+: [ID 819380 kern.info] 
     [AFT2] E$Data (0x00) 0x00000000.1012529c 0x00000000.00000018 ECC 0x0eb 
     *Bad* Esynd=0x003
  Feb  6 17:46:59 la001 SUNW,UltraSPARC-III+: [ID 819380 kern.info] 
     [AFT2] E$Data (0x10) 0x00018d18.0100fff1 0x00000000.104a41e0 ECC 0x158 
     *Bad* Esynd=0x003
  Feb  6 17:46:59 la001 SUNW,UltraSPARC-III+: [ID 895151 kern.info] 
     [AFT2] E$Data (0x20) 0x00000000.00000008 0x00018d49.0200fff1 ECC 0x150
  Feb  6 17:46:59 la001 SUNW,UltraSPARC-III+: [ID 895151 kern.info] 
     [AFT2] E$Data (0x30) 0x00000000.100940bc 0x00000000.000000d0 ECC 0x08e
  Feb  6 17:46:59 la001 SUNW,UltraSPARC-III+: [ID 929717 kern.info] 
     [AFT2] D$ data not available
  Feb  6 17:46:59 la001 SUNW,UltraSPARC-III+: [ID 335345 kern.info] 
     [AFT2] I$ data not available
  Feb  6 17:46:59 la001 unix: [ID 321153 kern.notice] 
     NOTICE: Scheduling clearing of error on page 0x00000000.35de4000
  Feb  6 17:47:00 la001 unix: [ID 221039 kern.notice] 
     NOTICE: Previously reported error on page 0x00000000.35de4000 cleared

-------------------------------------------------------------------------------

  Section 7:
  ----------

  This is similar to Sections 3 and 5.  The AFAR is 0x00000000.35de9780.
  Note the same pattern of Esynd in the cache dump.

  Feb  6 17:47:06 la001 SUNW,UltraSPARC-III+: [ID 498093 kern.warning] 
     WARNING: [AFT1] EDU Event on CPU9 at TL=0, errID 0x0000240e.772342f0
  Feb  6 17:47:06 la001     AFSR 0x00500000.000001b6 AFAR 
     0x00000000.35de9780 AMBIGUOUS
  Feb  6 17:47:06 la001     Fault_PC 0x1000ba50 Esynd 0x01b6 AMBIGUOUS 
     /N0/SB2/P1/E0 J5400
  Feb  6 17:47:06 la001 SUNW,UltraSPARC-III+: [ID 279337 kern.notice] 
     [AFT1] errID 0x0000240e.772342f0 Two Bits were in error
  Feb  6 17:47:06 la001 SUNW,UltraSPARC-III+: [ID 317260 kern.info] 
     [AFT2] errID 0x0000240e.772342f0 PA=0x00000000.35de9780
  Feb  6 17:47:06 la001     E$tag 0x00000000.d7124924 E$state_6 Modified
  Feb  6 17:47:06 la001 SUNW,UltraSPARC-III+: [ID 819380 kern.info] 
     [AFT2] E$Data (0x00) 0x0051f505.0100fff1 0x00000000.104eb230 ECC 0x126 
     *Bad* Esynd=0x1b6
  Feb  6 17:47:06 la001 SUNW,UltraSPARC-III+: [ID 819380 kern.info] 
     [AFT2] E$Data (0x10) 0x00a00000.000001b0 0x0001f50e.0200fff1 ECC 0x06f 
     *Bad* Esynd=0x02d
  Feb  6 17:47:06 la001 SUNW,UltraSPARC-III+: [ID 819380 kern.info] 
     [AFT2] E$Data (0x20) 0x00500000.1018ecd8 0x00000000.000000c4 ECC 0x1c6 
     *Bad* Esynd=0x1b6
  Feb  6 17:47:06 la001 SUNW,UltraSPARC-III+: [ID 819380 kern.info] 
     [AFT2] E$Data (0x30) 0x00a1f51f.0200fff1 0x00000000.101a0388 ECC 0x05f 
     *Bad* Esynd=0x02d
  Feb  6 17:47:06 la001 SUNW,UltraSPARC-III+: [ID 929717 kern.info] 
     [AFT2] D$ data not available
  Feb  6 17:47:06 la001 SUNW,UltraSPARC-III+: [ID 335345 kern.info] 
     [AFT2] I$ data not available
  Feb  6 17:47:06 la001 unix: [ID 321153 kern.notice] 
     NOTICE: Scheduling clearing of error on page 0x00000000.35de8000
  Feb  6 17:47:07 la001 unix: [ID 221039 kern.notice] 
     NOTICE: Previously reported error on page 0x00000000.35de8000 cleared
  Feb  6 17:47:07 la001 SUNW,UltraSPARC-III+: [ID 214841 kern.warning] 
     WARNING: [AFT1] DUE Event on CPU9 at TL=0, errID 0x0000240e.772342f0
  Feb  6 17:47:07 la001     AFSR 0x00500000.000001b6 AFAR 
     0x00000000.35de9780
  Feb  6 17:47:07 la001     Fault_PC 0x1000ba50 Esynd 0x01b6  /N0/SB4/P2/B1
  Feb  6 17:47:07 la001 SUNW,UltraSPARC-III+: [ID 279337 kern.notice] 
     [AFT1] errID 0x0000240e.772342f0 Two Bits were in error
  Feb  6 17:47:07 la001 unix: [ID 321153 kern.notice] 
     NOTICE: Scheduling clearing of error on page 0x00000000.35de8000
  Feb  6 17:47:07 la001 unix: [ID 221039 kern.notice] 
     NOTICE: Previously reported error on page 0x00000000.35de8000 cleared
  Feb  6 17:47:07 la001 SUNW,UltraSPARC-III+: [ID 736351 kern.warning] 
     WARNING: [AFT1] Invalid AFSR CPU9  at TL=0, errID 0x0000240e.77257750
  Feb  6 17:47:07 la001     AFSR 0x00000000.00000000 AFAR 0x00000000.35de9780 
     INVALID
  Feb  6 17:47:07 la001     Fault_PC 0x1009ebc4
-------------------------------------------------------------------------------

  Section 8:
  ----------

  This EDU:ST relates to Section 7 in the same way that Sections 4 and 6
  relate to Sections 3 and 5, respectively.

  Feb  6 17:47:14 la001 SUNW,UltraSPARC-III+: [ID 945805 kern.warning] 
     WARNING: [AFT1] EDU:ST Event on CPU9 at TL=0, errID 0x0000240e.7726bb10
  Feb  6 17:47:14 la001     AFSR 0x00000008.00000003 AFAR 
     0x00000000.35de9790
  Feb  6 17:47:14 la001     Fault_PC 0x1009eb4c Esynd 0x0003 /N0/SB2/P1/E1 
     J5300
  Feb  6 17:47:14 la001 SUNW,UltraSPARC-III+: [ID 260570 kern.notice] 
     [AFT1] errID 0x0000240e.7726bb10 Two Bits were in error
  Feb  6 17:47:14 la001 SUNW,UltraSPARC-III+: [ID 887708 kern.info] 
     [AFT2] errID 0x0000240e.7726bb10 PA=0x00000000.35de9780
  Feb  6 17:47:14 la001     E$tag 0x00000000.d7924924 E$state_6 Modified
  Feb  6 17:47:14 la001 SUNW,UltraSPARC-III+: [ID 819380 kern.info] 
     [AFT2] E$Data (0x00) 0x0001f505.0100fff1 0x00000000.104eb230 ECC 0x125 
     *Bad* Esynd=0x003
  Feb  6 17:47:14 la001 SUNW,UltraSPARC-III+: [ID 819380 kern.info] 
     [AFT2] E$Data (0x10) 0x00000000.000001b0 0x0001f50e.0200fff1 ECC 0x06c 
     *Bad* Esynd=0x003
  Feb  6 17:47:14 la001 SUNW,UltraSPARC-III+: [ID 895151 kern.info] 
     [AFT2] E$Data (0x20) 0x00000000.1018ecd8 0x00000000.000000c4 ECC 0x1c6
  Feb  6 17:47:14 la001 SUNW,UltraSPARC-III+: [ID 895151 kern.info] 
     [AFT2] E$Data (0x30) 0x0001f51f.0200fff1 0x00000000.101a0388 ECC 0x05f
  Feb  6 17:47:14 la001 SUNW,UltraSPARC-III+: [ID 929717 kern.info] 
     [AFT2] D$ data not available
  Feb  6 17:47:14 la001 SUNW,UltraSPARC-III+: [ID 335345 kern.info] 
     [AFT2] I$ data not available
  Feb  6 17:47:14 la001 unix: [ID 321153 kern.notice] NOTICE: 
     Scheduling clearing of error on page 0x00000000.35de8000
  Feb  6 17:47:14 la001 unix: [ID 221039 kern.notice] 
     NOTICE: Previously reported error on page 0x00000000.35de8000 cleared
-------------------------------------------------------------------------------

If the troubleshooting engineer only looked at the last entry, they
might incorrectly conclude that the L2SRAM is at fault, when in fact the
suspect part is DIMM /N0/SB4/P2/B1/D2 (i.e. J15501), which reported the
first CE error.  For this case, the corrective action was to replace
DIMM /N0/SB4/P2/B1/D2 (i.e. J15501), which resolved the problem.


  EXAMPLE 2:
  ==========

    The following is a second example of the problem where the last
    recorded entry looks like an L2SRAM problem due to the syndrome
    0x071 error message, but the error originated in the DIMM.  Only a
    few of the error messages from the messages file are shown.

---------------------------------------------------------------------------------

  Section 1:
  ----------

  A CE event is recorded on a DIMM read calling out SB3/P2/B0/D1 J15400 
  on data bit 120.

  Feb 19 14:22:25 ht01da SUNW,UltraSPARC-III+: [ID 964002 kern.info] 
     NOTICE: [AFT0] Corrected system bus (CE) Event detected by CPU449 at 
     TL=0, errID 0x0003c931.296a7ee0
  Feb 19 14:22:25 ht01da     AFSR 0x00000002.00000068 AFAR 
     0x00000061.eab2e2a0
  Feb 19 14:22:25 ht01da     Fault_PC 0x104fe04 Esynd 0x0068 SB3/P2/B0/D1 
     J15400
  Feb 19 14:22:25 ht01da SUNW,UltraSPARC-III+: [ID 270288 kern.info] 
     [AFT0] errID 0x0003c931.296a7ee0 Corrected Memory Error on SB3/P2/B0/D1 
     J15400 is Intermittent
  Feb 19 14:22:25 ht01da SUNW,UltraSPARC-III+: [ID 634372 kern.info] 
     [AFT0] errID 0x0003c931.296a7ee0 Data Bit 120 was in error and corrected
  Feb 19 14:22:25 ht01da SUNW,UltraSPARC-III+: [ID 491712 kern.info] 
     [AFT2] errID 0x0003c931.296a7ee0 PA=0x00000061.eab2e280
  Feb 19 14:22:25 ht01da     E$tag 0x00000187.aa000124 E$state_2 Modified
  Feb 19 14:22:25 ht01da SUNW,UltraSPARC-III+: [ID 895151 kern.info] 
     [AFT2] E$Data (0x00) 0x00000000.00000000 0x00000000.00000000 ECC 0x000
  Feb 19 14:22:25 ht01da SUNW,UltraSPARC-III+: [ID 895151 kern.info] 
     [AFT2] E$Data (0x10) 0x00000000.00000000 0x00000000.00000000 ECC 0x000
  Feb 19 14:22:25 ht01da SUNW,UltraSPARC-III+: [ID 895151 kern.info] 
     [AFT2] E$Data (0x20) 0x00000000.00000000 0x00000000.00000000 ECC 0x000
  Feb 19 14:22:25 ht01da SUNW,UltraSPARC-III+: [ID 895151 kern.info] 
     [AFT2] E$Data (0x30) 0x00000000.00000000 0x00000000.00000000 ECC 0x000
  Feb 19 14:22:25 ht01da SUNW,UltraSPARC-III+: [ID 929717 kern.info] 
     [AFT2] D$ data not available
  Feb 19 14:22:25 ht01da SUNW,UltraSPARC-III+: [ID 335345 kern.info] 
     [AFT2] I$ data not available
  Feb 19 14:22:25 ht01da unix: [ID 868141 kern.warning] 
     WARNING: Uncorrectable Error occurred at PA 0x00000061.eab2e280 while 
     attempting to clear previously reported error; page removed from service
  Feb 19 14:22:26 ht01da SUNW,UltraSPARC-III+: [ID 895151 kern.info] 
     [AFT2] E$Data (0x20) 0x00000000.00000000 0x00000000.00000000 ECC 0x000
  Feb 19 14:22:26 ht01da SUNW,UltraSPARC-III+: [ID 895151 kern.info] 
     [AFT2] E$Data (0x30) 0x00000000.00000000 0x00000000.00000000 ECC 0x000
  Feb 19 14:22:26 ht01da SUNW,UltraSPARC-III+: [ID 929717 kern.info] 
     [AFT2] D$ data not available
  Feb 19 14:22:26 ht01da SUNW,UltraSPARC-III+: [ID 335345 kern.info] 
     [AFT2] I$ data not available
  Feb 19 14:22:26 ht01da unix: [ID 321153 kern.notice] 
     NOTICE: Scheduling clearing of error on page 0x00000061.eab2e000
-------------------------------------------------------------------------------

  Section 2:
  ----------

  In this case, the DUE is reported after the EDU:ST.  The important
  thing to note, however, is that there is a DUE and EDU:ST pair.  The
  EDU:ST shows the recipient of the bad data, not the source.  The
  recipient of the bad data is SB14/P1/E0 J5400.

  Note that the AFAR is the same as the CE AFAR in Section 1 when both
  are rounded down to a 64-byte boundary.
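
  The rounding can be checked by hand or with a few lines of code.  The
  following is a minimal sketch, not a Sun tool; the helper names
  parse_dotted_hex and line_base are hypothetical.  It assumes the
  "high.low" notation in the log is a 64-bit address split into two
  32-bit halves, as the excerpts suggest.

```python
CACHE_LINE_BYTES = 64  # boundary named in the text above


def parse_dotted_hex(value):
    """Convert the log's split notation, e.g. '0x00000061.eab2e2a0', to an int.

    Assumes the part after the dot is always 8 hex digits, as in the excerpts.
    """
    return int(value.replace(".", ""), 16)


def line_base(afar):
    """Round an AFAR string down to its 64-byte boundary."""
    return parse_dotted_hex(afar) & ~(CACHE_LINE_BYTES - 1)


ce_afar = "0x00000061.eab2e2a0"   # CE event in Section 1
edu_afar = "0x00000061.eab2e280"  # EDU:ST event in Section 2
same_line = line_base(ce_afar) == line_base(edu_afar)
print(hex(line_base(ce_afar)), same_line)
```

  Both addresses round down to 0x61eab2e280, so the CE and the EDU:ST
  refer to the same 64-byte block of memory.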

  Feb 19 14:22:26 ht01da SUNW,UltraSPARC-III+: [ID 947761 kern.warning] 
     WARNING: [AFT1] EDU:ST Event detected by CPU449 at TL=0, errID 
     0x0003c931.297eb2d5
  Feb 19 14:22:26 ht01da     AFSR 0x00000008.0000018c AFAR 
     0x00000061.eab2e280
  Feb 19 14:22:26 ht01da     Fault_PC 0x1007580 Esynd 0x018c SB14/P1/E0 
     J5400
  Feb 19 14:22:26 ht01da SUNW,UltraSPARC-III+: [ID 119791 kern.notice] 
     [AFT1] errID 0x0003c931.297eb2d5 Two Bits were in error
  Feb 19 14:22:26 ht01da SUNW,UltraSPARC-III+: [ID 643204 kern.info] 
     [AFT2] errID 0x0003c931.297eb2d5 PA=0x00000061.eab2e280
  Feb 19 14:22:26 ht01da     E$tag 0x00000187.aa000100 E$state_2 Modified
  Feb 19 14:22:26 ht01da SUNW,UltraSPARC-III+: [ID 819380 kern.info] 
     [AFT2] E$Data (0x00) 0xfcff0000.00000000 0x00000000.00000000 ECC 0x134 
     *Bad* Esynd=0x18c
  Feb 19 14:22:26 ht01da SUNW,UltraSPARC-III+: [ID 895151 kern.info] 
     [AFT2] E$Data (0x10) 0x00000000.00000000 0x00000000.00000000 ECC 0x000
  Feb 19 14:22:26 ht01da SUNW,UltraSPARC-III+: [ID 895151 kern.info] 
     [AFT2] E$Data (0x20) 0x00000000.00000000 0x00000000.00000000 ECC 0x000
  Feb 19 14:22:26 ht01da SUNW,UltraSPARC-III+: [ID 895151 kern.info] 
     [AFT2] E$Data (0x30) 0x00000000.00000000 0x00000000.00000000 ECC 0x000
  Feb 19 14:22:26 ht01da SUNW,UltraSPARC-III+: [ID 929717 kern.info] 
     [AFT2] D$ data not available
  Feb 19 14:22:26 ht01da SUNW,UltraSPARC-III+: [ID 335345 kern.info] 
     [AFT2] I$ data not available
  Feb 19 14:22:26 ht01da unix: [ID 321153 kern.notice] 
     NOTICE: Scheduling clearing of error on page 0x00000061.eab2e000
-------------------------------------------------------------------------------

  Section 3:
  ----------

  As mentioned in Section 2, the DUE is shown after the EDU:ST.

  Feb 19 14:22:27 ht01da SUNW,UltraSPARC-III+: [ID 879788 kern.warning] 
     WARNING: [AFT1] DUE Event detected by CPU449 at TL=0, errID 
     0x0003c931.297f3ed5
  Feb 19 14:22:27 ht01da     AFSR 0x00500000.0000018c AFAR 
     0x00000061.eab2e280
  Feb 19 14:22:27 ht01da     Fault_PC 0x104fe04 Esynd 0x018c SB3/P2/B0 
     J15300 J15400 J15500 J15600
  Feb 19 14:22:27 ht01da SUNW,UltraSPARC-III+: [ID 439813 kern.notice] 
     [AFT1] errID 0x0003c931.297f3ed5 Two Bits were in error
  Feb 19 14:22:27 ht01da unix: [ID 321153 kern.notice] 
     NOTICE: Scheduling clearing of error on page 0x00000061.eab2e000
-------------------------------------------------------------------------------

  Section 4:
  ----------

  A second EDU:ST is reported, this time with Esynd=0x003.  Its AFAR
  still matches the earlier AFARs when rounded down to a 64-byte
  boundary.

  Feb 19 14:22:28 ht01da SUNW,UltraSPARC-III+: [ID 632355 kern.warning] 
     WARNING: [AFT1] EDU:ST Event detected by CPU449 at TL=0, errID 
     0x0003c931.2a71fa89
  Feb 19 14:22:28 ht01da     AFSR 0x00000008.00000003 AFAR 
     0x00000061.eab2e290
  Feb 19 14:22:28 ht01da     Fault_PC 0x104fe04 Esynd 0x0003 SB14/P1/E1 
     J5300
  Feb 19 14:22:29 ht01da SUNW,UltraSPARC-III+: [ID 143088 kern.notice] 
     [AFT1] errID 0x0003c931.2a71fa89 Two Bits were in error
  Feb 19 14:22:29 ht01da SUNW,UltraSPARC-III+: [ID 535859 kern.info] 
     [AFT2] errID 0x0003c931.2a71fa89 PA=0x00000061.eab2e280
  Feb 19 14:22:29 ht01da     E$tag 0x00000187.aa924900 E$state_2 Modified
  Feb 19 14:22:29 ht01da SUNW,UltraSPARC-III+: [ID 819380 kern.info] 
     [AFT2] E$Data (0x00) 0xfcffffff.ffffffff 0xffffffff.ffffffff ECC 0x00c 
     *Bad* Esynd=0x003
  Feb 19 14:22:29 ht01da SUNW,UltraSPARC-III+: [ID 819380 kern.info] 
     [AFT2] E$Data (0x10) 0xffffffff.ffffffff 0xffffffff.ffffffff ECC 0x180 
     *Bad* Esynd=0x003
  Feb 19 14:22:29 ht01da SUNW,UltraSPARC-III+: [ID 895151 kern.info] 
     [AFT2] E$Data (0x20) 0xffffffff.ffffffff 0xffffffff.ffffffff ECC 0x183
  Feb 19 14:22:29 ht01da SUNW,UltraSPARC-III+: [ID 895151 kern.info] 
     [AFT2] E$Data (0x30) 0xffffffff.ffffffff 0xffffffff.ffffffff ECC 0x183
  Feb 19 14:22:29 ht01da SUNW,UltraSPARC-III+: [ID 929717 kern.info] 
     [AFT2] D$ data not available
  Feb 19 14:22:29 ht01da SUNW,UltraSPARC-III+: [ID 335345 kern.info] 
     [AFT2] I$ data not available
  Feb 19 14:22:29 ht01da unix: [ID 321153 kern.notice] 
     NOTICE: Scheduling clearing of error on page 0x00000061.eab2e000
-------------------------------------------------------------------------------

  Section 5:
  ----------

  Data with two bits in error was stored in the L2SRAM on board 14.
  When the cache line is evicted, it results in a WDU event with
  syndrome 0x003.  Note that the error message states that this
  error likely originated from an earlier EDU:ST (the fourth line in
  the trace).  (The DUE that brought this line into the cache, and
  the EDU:ST that rewrote the syndrome, are not shown in this excerpt,
  but do appear in the messages file from which this example was
  taken.)

  Feb 19 14:22:48 ht01da SUNW,UltraSPARC-III+: [ID 963611 kern.warning] 
     WARNING: [AFT1] WDU Event detected by CPU449 at TL=0, errID 
     0x0003c931.2a93c40a
  Feb 19 14:22:48 ht01da     AFSR 0x00000020.00000003 AFAR 
     0x00000061.eab2f690
  Feb 19 14:22:48 ht01da     Fault_PC 0x1170690 Esynd 0x0003 SB14/P1/E1 
     J5300
  Feb 19 14:22:48 ht01da SUNW,UltraSPARC-III+: [ID 744453 kern.notice] 
     [AFT1] errID 0x0003c931.2a93c40a Two Bits in error, likely from E$ 
     EDU:ST
  Feb 19 14:22:48 ht01da SUNW,UltraSPARC-III+: [ID 744087 kern.info] 
     [AFT2] errID 0x0003c931.2a93c40a E$tag PA=0x00000000.0032f680 does 
     not match AFAR=0x00000061.eab2f680
  Feb 19 14:22:48 ht01da SUNW,UltraSPARC-III+: [ID 936089 kern.info] 
     [AFT2] errID 0x0003c931.2a93c40a PA=0x00000000.0032f680
  Feb 19 14:22:48 ht01da     E$tag 0x00000000.00000000 E$state_2 Invalid
  Feb 19 14:22:48 ht01da SUNW,UltraSPARC-III+: [ID 819380 kern.info] 
     [AFT2] E$Data (0x00) 0xffffffff.ffffffff 0xffffffff.ffffffff ECC 0x180 
     *Bad* Esynd=0x003
  Feb 19 14:22:48 ht01da SUNW,UltraSPARC-III+: [ID 819380 kern.info] 
     [AFT2] E$Data (0x10) 0xffffffff.ffffffff 0xffffffff.ffffffff ECC 0x180 
     *Bad* Esynd=0x003
  Feb 19 14:22:48 ht01da SUNW,UltraSPARC-III+: [ID 895151 kern.info] 
     [AFT2] E$Data (0x20) 0xffffffff.ffffffff 0xffffffff.ffffffff ECC 0x183
  Feb 19 14:22:48 ht01da SUNW,UltraSPARC-III+: [ID 895151 kern.info] 
     [AFT2] E$Data (0x30) 0xffffffff.ffffffff 0xffffffff.ffffffff ECC 0x183
  Feb 19 14:22:48 ht01da SUNW,UltraSPARC-III+: [ID 744087 kern.info] 
     [AFT2] errID 0x0003c931.2a93c40a E$tag PA=0x000001e1.86f2f680 does not 
     match AFAR=0x00000061.eab2f680
  Feb 19 14:22:48 ht01da SUNW,UltraSPARC-III+: [ID 936089 kern.info] 
     [AFT2] errID 0x0003c931.2a93c40a PA=0x000001e1.86f2f680
  Feb 19 14:22:48 ht01da     E$tag 0x00000786.1b000000 E$state_2 Invalid
  Feb 19 14:22:48 ht01da SUNW,UltraSPARC-III+: [ID 895151 kern.info] 
     [AFT2] E$Data (0x00) 0x00000000.00000000 0x00000000.00000000 ECC 0x000
  Feb 19 14:22:48 ht01da SUNW,UltraSPARC-III+: [ID 895151 kern.info] 
     [AFT2] E$Data (0x10) 0x00000000.00000000 0x00000000.00000000 ECC 0x000
  Feb 19 14:22:48 ht01da SUNW,UltraSPARC-III+: [ID 895151 kern.info] 
     [AFT2] E$Data (0x20) 0x00000000.00000000 0x00000000.00000000 ECC 0x000
  Feb 19 14:22:48 ht01da SUNW,UltraSPARC-III+: [ID 895151 kern.info] 
     [AFT2] E$Data (0x30) 0x00000000.00000000 0x00000700.8ef58e40 ECC 0x0af
  Feb 19 14:22:48 ht01da SUNW,UltraSPARC-III+: [ID 929717 kern.info] 
     [AFT2] D$ data not available
  Feb 19 14:22:48 ht01da SUNW,UltraSPARC-III+: [ID 335345 kern.info] 
     [AFT2] I$ data not available
  Feb 19 14:22:48 ht01da unix: [ID 321153 kern.notice] 
     NOTICE: Scheduling clearing of error on page 0x00000061.eab2e000
------------------------------------------------------------------------------

  Section 6:
  ----------

  Other data with two bits in error was stored in the L2SRAM on board
  14.  When this cache line is evicted, it also results in a WDU event
  and the syndrome *Bad* Esynd=0x003.  (Again, the DUE that brought
  this line into the cache, and the EDU:ST that rewrote the syndrome,
  are not shown in this excerpt, but do appear in the messages file
  from which this example was taken.)

  Feb 19 14:22:50 ht01da SUNW,UltraSPARC-III+: [ID 963783 kern.warning] 
     WARNING: [AFT1] WDU Event detected by CPU449 at TL=0, errID 
     0x0003c931.2a96a1cd
  Feb 19 14:22:50 ht01da     AFSR 0x00000020.00000003 AFAR 
     0x00000061.eab2ee90
  Feb 19 14:22:50 ht01da     Fault_PC 0x1170690 Esynd 0x0003 SB14/P1/E1 
     J5300
  Feb 19 14:22:51 ht01da SUNW,UltraSPARC-III+: [ID 322410 kern.notice] 
     [AFT1] errID 0x0003c931.2a96a1cd Two Bits in error, likely from E$ 
     EDU:ST
  Feb 19 14:22:51 ht01da SUNW,UltraSPARC-III+: [ID 555008 kern.info] 
     [AFT2] errID 0x0003c931.2a96a1cd E$tag PA=0x00000000.0032ee80 does 
     not match AFAR=0x00000061.eab2ee80
  Feb 19 14:22:51 ht01da SUNW,UltraSPARC-III+: [ID 421903 kern.info] 
     [AFT2] errID 0x0003c931.2a96a1cd PA=0x00000000.0032ee80
  Feb 19 14:22:51 ht01da     E$tag 0x00000000.00000000 E$state_2 Invalid
  Feb 19 14:22:51 ht01da SUNW,UltraSPARC-III+: [ID 819380 kern.info] 
     [AFT2] E$Data (0x00) 0xffffffff.ffffffff 0xffffffff.ffffffff ECC 0x180 
     *Bad* Esynd=0x003
  Feb 19 14:22:51 ht01da SUNW,UltraSPARC-III+: [ID 819380 kern.info] 
     [AFT2] E$Data (0x10) 0xffffffff.ffffffff 0xffffffff.ffffffff ECC 0x180 
     *Bad* Esynd=0x003
  Feb 19 14:22:51 ht01da SUNW,UltraSPARC-III+: [ID 895151 kern.info] 
     [AFT2] E$Data (0x20) 0xffffffff.ffffffff 0xffffffff.ffffffff ECC 0x183
  Feb 19 14:22:51 ht01da SUNW,UltraSPARC-III+: [ID 895151 kern.info] 
     [AFT2] E$Data (0x30) 0xffffffff.ffffffff 0xffffffff.ffffffff ECC 0x183
  Feb 19 14:22:51 ht01da SUNW,UltraSPARC-III+: [ID 555008 kern.info] 
     [AFT2] errID 0x0003c931.2a96a1cd E$tag PA=0x00000000.0072ee80 does 
     not match AFAR=0x00000061.eab2ee80
  Feb 19 14:22:51 ht01da SUNW,UltraSPARC-III+: [ID 421903 kern.info] 
     [AFT2] errID 0x0003c931.2a96a1cd PA=0x00000000.0072ee80
  Feb 19 14:22:51 ht01da     E$tag 0x00000000.01000000 E$state_2 Invalid
  Feb 19 14:22:51 ht01da SUNW,UltraSPARC-III+: [ID 819380 kern.info] 
     [AFT2] E$Data (0x00) 0x8143c000.9ba01a2c 0xae34c013.973d2007 ECC 0x1b8 
     *Bad* Esynd=0x003
  Feb 19 14:22:51 ht01da SUNW,UltraSPARC-III+: [ID 819380 kern.info] 
     [AFT2] E$Data (0x10) 0x99150013.b1a01894 0x11800003.9684c014 ECC 0x03f 
     *Bad* Esynd=0x003
  Feb 19 14:22:51 ht01da SUNW,UltraSPARC-III+: [ID 819380 kern.info] 
     [AFT2] E$Data (0x20) 0xe248001b.ada018d2 0xa634c013.8143c000 ECC 0x101 
     *Bad* Esynd=0x003
  Feb 19 14:22:51 ht01da SUNW,UltraSPARC-III+: [ID 819380 kern.info] 
     [AFT2] E$Data (0x30) 0xa93d0014.81ab0a28 0xb3a000a5.988cf4e6 ECC 0x038 
     *Bad* Esynd=0x003
  Feb 19 14:22:51 ht01da SUNW,UltraSPARC-III+: [ID 929717 kern.info] 
     [AFT2] D$ data not available
  Feb 19 14:22:51 ht01da SUNW,UltraSPARC-III+: [ID 335345 kern.info] 
     [AFT2] I$ data not available
  Feb 19 14:22:51 ht01da unix: [ID 321153 kern.notice] 
     NOTICE: Scheduling clearing of error on page 0x00000061.eab2e000
-----------------------------------------------------------------------------

  Section 7:
  ----------

  Finally, an error event occurs which results in an uncorrectable system
  bus error (UE) and a system panic.  The event reports syndrome 0x071,
  which is typically associated with an L2SRAM error, when in fact the
  data error originated in the DIMM, as recorded by the initial CE
  error.  An engineer who looked only at this entry might incorrectly
  conclude that the L2SRAM is at fault.

  Feb 19 14:22:54 ht01da SUNW,UltraSPARC-III+: [ID 918604 kern.warning] 
     WARNING: [AFT1] Uncorrectable system bus (UE) Event detected by 
     CPU449 Privileged Data Access at TL=0, errID 0x0003c931.2aa7c82f
  Feb 19 14:22:54 ht01da     AFSR 0x00100004.00000071 AFAR 
     0x00000061.eab2e280
  Feb 19 14:22:54 ht01da     Fault_PC 0x104fe68 Esynd 0x0071 SB3/P2/B0 
     J15300 J15400 J15500 J15600
  Feb 19 14:22:54 ht01da SUNW,UltraSPARC-III+: [ID 685643 kern.notice] 
     [AFT1] errID 0x0003c931.2aa7c82f Two Bits in error, likely from E$ 
     WDU/CPU
  Feb 19 14:22:54 ht01da SUNW,UltraSPARC-III+: [ID 577198 kern.info] 
     [AFT2] errID 0x0003c931.2aa7c82f PA=0x00000061.eab2e280
  Feb 19 14:22:54 ht01da     E$tag 0x00000187.aa000049 E$state_2 Shared
  Feb 19 14:22:54 ht01da SUNW,UltraSPARC-III+: [ID 819380 kern.info] 
     [AFT2] E$Data (0x00) 0x3cffffff.ffffffff 0xffffffff.ffffffff ECC 0x00f 
     *Bad* Esynd=0x071
  Feb 19 14:22:54 ht01da SUNW,UltraSPARC-III+: [ID 819380 kern.info] 
     [AFT2] E$Data (0x10) 0x3fffffff.ffffffff 0xffffffff.ffffffff ECC 0x183 
     *Bad* Esynd=0x071
  Feb 19 14:22:54 ht01da SUNW,UltraSPARC-III+: [ID 895151 kern.info] 
     [AFT2] E$Data (0x20) 0xffffffff.ffffffff 0xffffffff.ffffffff ECC 0x183
  Feb 19 14:22:54 ht01da SUNW,UltraSPARC-III+: [ID 895151 kern.info] 
     [AFT2] E$Data (0x30) 0xffffffff.ffffffff 0xffffffff.ffffffff ECC 0x183
  Feb 19 14:22:54 ht01da SUNW,UltraSPARC-III+: [ID 929717 kern.info] 
     [AFT2] D$ data not available

  Analysis of suspicious bits and comparison with CE reports in the log
  allow us to narrow down the suspect from the entire SB3/P2/B0 bank to
  the single DIMM at SB3/P2/B0/D1 J15400.
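
  The CE-comparison half of that analysis can be sketched as a simple
  prefix match between the bank named in the UE and the DIMM locations
  named in earlier CE reports.  This is an illustration only; the
  function name dimms_in_bank is hypothetical, and the sketch does not
  attempt the bit-level analysis.

```python
def dimms_in_bank(bank, ce_locations):
    """Return CE-reported DIMM locations inside the given bank, first seen first."""
    prefix = bank.rstrip("/") + "/"
    matches = []
    for location in ce_locations:
        if location.startswith(prefix) and location not in matches:
            matches.append(location)
    return matches


ue_bank = "SB3/P2/B0"            # bank called out by the UE in Section 7
ce_reports = ["SB3/P2/B0/D1"]    # DIMM called out by the CE in Section 1
candidates = dimms_in_bank(ue_bank, ce_reports)
print(candidates)
```

  A single candidate is a strong hint, not proof; it should agree with
  the bits reported in error before a part is replaced.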

The corrective action is to replace DIMM SB3/P2/B0/D1 J15400.

The conclusion from the two examples above is:

As with any type of error, not only is it important to review the last
error in the log file, it is absolutely critical to follow the steps
that led up to the error in order to find the source of the error and
replace the appropriate component.  Not retracing the steps may result
in the wrong part being replaced.
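
Retracing the steps amounts to listing, in log order, every event that
touched the same 64-byte block as the final error.  The following is a
minimal sketch under stated assumptions: the regular expression is
modeled on the event headers in the excerpts above rather than on any
documented format, the function name events_by_cache_line is
hypothetical, and the sample text is trimmed from Example 2 (timestamps,
host names and most detail lines removed).

```python
import re

# Assumed header shape: "[AFTn] <kind> Event detected ... AFAR <hi>.<lo>"
EVENT_RE = re.compile(
    r"\[AFT\d\]\s+(?P<kind>[^\[\n]+?)\s+Event\s+detected.*?"
    r"AFAR\s+(?P<afar>0x[0-9a-fA-F]{8}\.[0-9a-fA-F]{8})",
    re.S,
)


def events_by_cache_line(log_text, line_bytes=64):
    """Map each line-aligned AFAR to the event kinds seen on it, in log order."""
    grouped = {}
    for match in EVENT_RE.finditer(log_text):
        addr = int(match.group("afar").replace(".", ""), 16)
        base = addr & ~(line_bytes - 1)
        grouped.setdefault(base, []).append(match.group("kind"))
    return grouped


SAMPLE = """\
NOTICE: [AFT0] Corrected system bus (CE) Event detected by CPU449 at TL=0
    AFSR 0x00000002.00000068 AFAR 0x00000061.eab2e2a0
[AFT0] errID 0x0003c931.296a7ee0 Data Bit 120 was in error and corrected
WARNING: [AFT1] EDU:ST Event detected by CPU449 at TL=0
    AFSR 0x00000008.0000018c AFAR 0x00000061.eab2e280
WARNING: [AFT1] DUE Event detected by CPU449 at TL=0
    AFSR 0x00500000.0000018c AFAR 0x00000061.eab2e280
WARNING: [AFT1] WDU Event detected by CPU449 at TL=0
    AFSR 0x00000020.00000003 AFAR 0x00000061.eab2f690
WARNING: [AFT1] Uncorrectable system bus (UE) Event detected by CPU449
    AFSR 0x00100004.00000071 AFAR 0x00000061.eab2e280
"""

history = events_by_cache_line(SAMPLE)[0x61EAB2E280]
print(history[0], "->", history[-1])
```

For the block that caused the panic, the first event in the history is
the CE reported against the DIMM, and the UE with syndrome 0x071 is only
the last.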
Implementation: 
         ---
        |   |   MANDATORY (Fully Proactive)
         ---    
         
  
         ---
        |   |   CONTROLLED PROACTIVE (per Sun Geo Plan) 
         --- 
         
                                
         ---
        | X |   REACTIVE (As Required)
         ---
Corrective Action: 
The following recommendation is provided as a guideline for authorized
Sun Services Field Representatives who may encounter the above
mentioned problem.

1. Use the guidelines provided above when diagnosing main memory and
   L2SRAM errors on the listed platforms.

2. Apply the patches recommended in Sun Alert 50471 to prevent
   unnecessary L2SRAM issues. These patches provide the following
   levels of firmware/SMS software:

   . Upgrade Sun Fire 12K/15K customers to SMS 1.3 (114608-01 or later)
     or SMS 1.2 (112488-11 or later) at the earliest opportunity.

   . Upgrade Sun Fire 3800 - 6800 customers to 5.14.4 firmware (112883-05
     or later) or 5.13.5 firmware (112494-08 or later) at the earliest
     opportunity.

   . Upgrade Sun Fire V1280 and Netra 1280 customers to 5.13.0012 firmware
     (113751-02 or later).

3. Kernel Update and SunVTS Requirements:
 
   . Per FIN I0909-2, 108528-18 (Solaris 8) and 112233-04 (Solaris 9) are
     the minimum recommended kernel updates to be deployed with the firmware
     and SMS patches shown above.

   . Per FIN I0909-2, SunVTS version 5.1 is the minimum recommended SunVTS 
     version. 

4. Escalations/CIC:
   
   Customers that require replacement of the 900 MHz CPU boards or
   servers for suspected L2 SRAM issue(s) will need to follow standard
   Escalation/CIC processes.  The escalation and CIC request will be
   reviewed by the appropriate technical teams.  If CIC action is
   required, the customer will be prioritized for distribution of
   hardware.

   The escalation policy is published at:

        http://onestop/programs/us3quality
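
The "or later" patch levels in items 2 and 3 can be compared against an
installed patch ID with a short check.  This is a sketch only: the table
is copied from the list above, the function name meets_minimum is
hypothetical, and it assumes patch IDs take the form base-revision with
a numeric revision.

```python
# Minimum revisions taken from items 2 and 3 above.
MINIMUM_REVISION = {
    "114608": 1,   # SMS 1.3
    "112488": 11,  # SMS 1.2
    "112883": 5,   # 5.14.4 firmware
    "112494": 8,   # 5.13.5 firmware
    "113751": 2,   # 5.13.0012 firmware
    "108528": 18,  # Solaris 8 kernel update
    "112233": 4,   # Solaris 9 kernel update
}


def meets_minimum(patch_id):
    """Return True or False for a listed patch, or None if it is not in the table."""
    base, _, revision = patch_id.strip().partition("-")
    if base not in MINIMUM_REVISION or not revision.isdigit():
        return None
    return int(revision) >= MINIMUM_REVISION[base]


for installed in ("112883-05", "112488-09", "108528-20"):
    print(installed, meets_minimum(installed))
```

A result of None means the patch is not one of those listed here; it
says nothing about whether that patch is needed.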
Comments: 
None.

============================================================================
Implementation Footnote: 
i)   In case of MANDATORY FINs, Sun Services will attempt to contact   
     all affected customers to recommend implementation of the FIN. 
   
ii)  For CONTROLLED PROACTIVE FINs, Sun Services mission critical    
     support teams will recommend implementation of the FIN  (to their  
     respective accounts), at the convenience of the customer. 

iii) For REACTIVE FINs, Sun Services will implement the FIN as the   
     need arises.
----------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network 
browser as follows:
 
SunWeb Access:
-------------- 
* Access the top level URL of http://sdpsweb.central/FIN_FCO/

* From there, select the appropriate link to query or browse the FIN and
  FCO Homepage collections.
 
SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.central/

* From there, select the appropriate link to browse the FIN or FCO index.

Internet Access:
----------------
* Access the top level URL of https://spe.sun.com
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to [email protected]
--------------------------------------------------------------------------
Status:active