Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-72-1007736.1
Update Date: 2010-12-28
Keywords:

Solution Type  Problem Resolution Sure

Solution  1007736.1 :   Diagnosing multiple DIMM CE errors occurring on multiple DIMMs  


Related Items
  • Sun Fire E6900 Server
  • Sun Fire E25K Server
  • Sun Fire E20K Server
  • Sun Fire 6800 Server
  • Sun Fire 12K Server
  • Sun Fire 15K Server
  • Sun Enterprise 10000 Server
Related Categories
  • GCS>Sun Microsystems>Boards>Memory Module
  • GCS>Sun Microsystems>Servers>Midrange Servers
  • GCS>Sun Microsystems>Servers>High-End Servers

Previously Published As
210714


Symptoms
Multiple correctable (CE) errors that occur across multiple DIMMs do not necessarily indicate a problem with a DIMM. These errors can happen on a variety of systems or domains. They are characterized by a single common Data Bit in error on multiple DIMMs, and they recur rather than appearing as one-time CE events.

Output from cediag may show something similar to the following:

cediag: #### CE Summary since last detected reboot ###########################
cediag: #### last detected reboot at Mar  8 19:35:21 #########################
cediag: findings: 7 DIMM(s) having CEs with Esynd of 0x01b1 found
cediag: advice:HIGH: possible datapath fault - refer to Sun Support [A]s [S]oon  [A]s [P]ossible
cediag: findings: 0 UE(s) found - there is no rule#3 match
cediag: findings: 0 DIMMs with a failure pattern matching rule#4
cediag: findings: 0 DIMMs with a failure pattern matching rule#5

Note - For information on cediag, including downloading, licensing, and usage instructions, see Memory DIMM Replacement Management Tool.

The CE DIMM messages show:

Mar 25 05:27:58 host1 SUNW,UltraSPARC-III+: [ID 212399 kern.info] NOTICE: [AFT0] Corrected system bus (CE) Event detected by CPU162 at TL=0, errID  0x00050a1f.449595f8
Mar 25 05:27:58 host1     AFSR 0x00000002<CE>.000001b1 AFAR 0x00000161.1f50eb60
Mar 25 05:27:58 host1     Fault_PC 0x100460a8 Esynd 0x01b1 SB0/P1/B1/D0 J14301
Mar 25 05:27:58 host1 SUNW,UltraSPARC-III+: [ID 566620 kern.info] [AFT0] errID 0x00050a1f.449595f8 Corrected Memory Error on SB0/P1/B1/D0 J14301 is Intermittent
Mar 25 05:27:58 host1 SUNW,UltraSPARC-III+: [ID 701539 kern.info] [AFT0] errID 0x00050a1f.449595f8 Data Bit 70 was in error and corrected
Mar 25 05:28:04 host1 SUNW,UltraSPARC-III+: [ID 106369 kern.info] NOTICE: [AFT0] Corrected system bus (CE) Event detected by CPU162 at TL=0, errID 0x00050a20.a9f4263a
Mar 25 05:28:04 host1     AFSR 0x00000002<CE>.000001b1 AFAR 0x00000161.f82e8930
Mar 25 05:28:04 host1     Fault_PC <unknown> Esynd 0x01b1 SB0/P0/B1/D0 J13301
Mar 25 05:28:04 host1 SUNW,UltraSPARC-III+: [ID 375245 kern.info] [AFT0] errID 0x00050a20.a9f4263a Corrected Memory Error on SB0/P0/B1/D0 J13301 is Intermittent
Mar 25 05:28:04 host1 SUNW,UltraSPARC-III+: [ID 434215 kern.info] [AFT0] errID 0x00050a20.a9f4263a Data Bit 70 was in error and corrected
Mar 25 05:50:37 host1 SUNW,UltraSPARC-III+: [ID 888311 kern.info] NOTICE: [AFT0] Corrected system bus (CE) Event detected by CPU162 at TL=0, errID 0x00050b5b.bdee793d
Mar 25 05:50:37 host1     AFSR 0x00000002<CE>.000001b1 AFAR 0x00000161.1f50ea90
Mar 25 05:50:37 host1     Fault_PC 0x1014c858 Esynd 0x01b1 SB0/P2/B0/D0 J15300
Mar 25 05:50:37 host1 SUNW,UltraSPARC-III+: [ID 742879 kern.info] [AFT0] errID 0x00050b5b.bdee793d Corrected Memory Error on SB0/P2/B0/D0 J15300 is Intermittent
Mar 25 05:50:37 host1 SUNW,UltraSPARC-III+: [ID 925230 kern.info] [AFT0] errID 0x00050b5b.bdee793d Data Bit 70 was in error and corrected
Mar 25 05:50:43 host1 SUNW,UltraSPARC-III+: [ID 262634 kern.info] NOTICE: [AFT0] Corrected system bus (CE) Event detected by CPU162 at TL=0, errID 0x00050b5d.2390d851
Mar 25 05:50:43 host1     AFSR 0x00000002<CE>.000001b1 AFAR 0x00000161.fa8b7ef0
Mar 25 05:50:43 host1     Fault_PC <unknown> Esynd 0x01b1 SB0/P3/B0/D0 J16300
Mar 25 05:50:43 host1 SUNW,UltraSPARC-III+: [ID 689839 kern.info] [AFT0] errID 0x00050b5d.2390d851 Corrected Memory Error on SB0/P3/B0/D0 J16300 is Intermittent
Mar 25 05:50:43 host1 SUNW,UltraSPARC-III+: [ID 885340 kern.info] [AFT0] errID 0x00050b5d.2390d851 Data Bit 70 was in error and corrected
Mar 25 05:50:49 host1 SUNW,UltraSPARC-III+: [ID 725878 kern.info] NOTICE: [AFT0] Corrected system bus (CE) Event detected by CPU162 at TL=0, errID 0x00050b5e.89319d5e
Mar 25 05:50:49 host1     AFSR 0x00000002<CE>.000001b1 AFAR 0x00000161.1f50ea90
Mar 25 05:50:49 host1     Fault_PC <unknown> Esynd 0x01b1 SB0/P2/B0/D0 J15300
Mar 25 05:50:49 host1 SUNW,UltraSPARC-III+: [ID 664224 kern.info] [AFT0] errID 0x00050b5e.89319d5e Corrected Memory Error on SB0/P2/B0/D0 J15300 is Intermittent
Mar 25 05:50:49 host1 SUNW,UltraSPARC-III+: [ID 406614 kern.info] [AFT0] errID 0x00050b5e.89319d5e Data Bit 70 was in error and corrected
Mar 25 05:51:52 host1 SUNW,UltraSPARC-III+: [ID 349173 kern.info] NOTICE: [AFT0] Corrected system bus (CE) Event detected by CPU162 at TL=0, errID 0x00050b6d.4f1ba200
Mar 25 05:51:52 host1     AFSR 0x00000002<CE>.000001b1 AFAR 0x00000161.1fd52110
Mar 25 05:51:52 host1     Fault_PC <unknown> Esynd 0x01b1 SB0/P0/B1/D0 J13301
Mar 25 05:51:52 host1 SUNW,UltraSPARC-III+: [ID 993022 kern.info] [AFT0] errID 0x00050b6d.4f1ba200 Corrected Memory Error on SB0/P0/B1/D0 J13301 is Intermittent
Mar 25 05:51:52 host1 SUNW,UltraSPARC-III+: [ID 897387 kern.info] [AFT0] errID 0x00050b6d.4f1ba200 Data Bit 70 was in error and corrected

These messages characterize the type of DIMM errors for which this Symptom Resolution is written: a single bit in error across several DIMMs. In this example, the Data Bit in error is always Data Bit 70, the Esynd is always 0x01b1, and four different DIMMs (J14301, J13301, J15300, J16300) are reported.
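
The pattern described above can be checked mechanically. The following Python code is a minimal sketch, not part of cediag or any Sun tool. It assumes the AFT message format shown in the example. The function name dimms_by_data_bit and the regular expressions are illustrative. It joins the "Corrected Memory Error" and "Data Bit" lines on their shared errID and reports which DIMMs share each Data Bit.

```python
import re
from collections import defaultdict

# Illustrative patterns for the AFT message lines shown above (assumed format).
ERRID_RE = re.compile(r"errID\s+(0x[0-9a-f]+\.[0-9a-f]+)")
DIMM_RE = re.compile(r"Corrected Memory Error on (\S+ J\d+)")
BIT_RE = re.compile(r"Data Bit (\d+) was in error")


def dimms_by_data_bit(lines):
    """Return {data_bit: set of DIMM names}, joining lines on their errID."""
    dimm_of = {}  # errID -> DIMM name
    bit_of = {}   # errID -> data bit number
    for line in lines:
        err = ERRID_RE.search(line)
        if not err:
            continue  # continuation lines (AFSR, Fault_PC) carry no errID
        err_id = err.group(1)
        dimm = DIMM_RE.search(line)
        if dimm:
            dimm_of[err_id] = dimm.group(1)
        bit = BIT_RE.search(line)
        if bit:
            bit_of[err_id] = int(bit.group(1))
    grouped = defaultdict(set)
    for err_id, bit in bit_of.items():
        if err_id in dimm_of:
            grouped[bit].add(dimm_of[err_id])
    return dict(grouped)


# Two of the events from the example above.
sample_lines = [
    "Mar 25 05:27:58 host1 SUNW,UltraSPARC-III+: [ID 566620 kern.info] [AFT0] errID 0x00050a1f.449595f8 Corrected Memory Error on SB0/P1/B1/D0 J14301 is Intermittent",
    "Mar 25 05:27:58 host1 SUNW,UltraSPARC-III+: [ID 701539 kern.info] [AFT0] errID 0x00050a1f.449595f8 Data Bit 70 was in error and corrected",
    "Mar 25 05:28:04 host1 SUNW,UltraSPARC-III+: [ID 375245 kern.info] [AFT0] errID 0x00050a20.a9f4263a Corrected Memory Error on SB0/P0/B1/D0 J13301 is Intermittent",
    "Mar 25 05:28:04 host1 SUNW,UltraSPARC-III+: [ID 434215 kern.info] [AFT0] errID 0x00050a20.a9f4263a Data Bit 70 was in error and corrected",
]

for bit, dimms in sorted(dimms_by_data_bit(sample_lines).items()):
    print(bit, sorted(dimms))
```

A Data Bit that maps to more than one DIMM matches the symptom described here. A Data Bit that maps to a single DIMM points toward that DIMM instead.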

These errors may be called bad reader or bad writer errors.



Resolution
This resolution applies specifically when multiple DIMMs report the same bit in error and a single CPU detects all of the errors.

For these errors the solution comes from the first line of the message:

Mar 25 05:27:58 host1 SUNW,UltraSPARC-III+: [ID 212399 kern.info] NOTICE: [AFT0] Corrected system bus (CE) Event detected by CPU162 at TL=0, errID 0x00050a1f.449595f8

The CPU detecting the error, CPU162 in this case, is the same for all of the messages. Because the only element common to every event is that CPU rather than any one DIMM, this indicates that the CPU is either reading or writing the data incorrectly.

Either the CPU or the system board (SB) on which the CPU resides needs to be replaced.
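
The single-CPU check can also be scripted. The following Python code is a minimal sketch under the same assumptions about the message format. The function names detecting_cpus and single_detector are hypothetical and are not part of any Sun tool. It counts CE events per detecting CPU, using the first line of each message.

```python
import re
from collections import Counter

# Illustrative pattern for the first line of each AFT CE message (assumed format).
CE_EVENT_RE = re.compile(r"\(CE\) Event detected by (CPU\d+)")


def detecting_cpus(lines):
    """Count CE events per detecting CPU."""
    counts = Counter()
    for line in lines:
        match = CE_EVENT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def single_detector(lines):
    """Return the CPU name if exactly one CPU detected every CE event, else None."""
    counts = detecting_cpus(lines)
    if len(counts) == 1:
        return next(iter(counts))
    return None


# First lines of two of the events from the example above.
first_lines = [
    "Mar 25 05:27:58 host1 SUNW,UltraSPARC-III+: [ID 212399 kern.info] NOTICE: [AFT0] Corrected system bus (CE) Event detected by CPU162 at TL=0, errID  0x00050a1f.449595f8",
    "Mar 25 05:28:04 host1 SUNW,UltraSPARC-III+: [ID 106369 kern.info] NOTICE: [AFT0] Corrected system bus (CE) Event detected by CPU162 at TL=0, errID 0x00050a20.a9f4263a",
]

print(single_detector(first_lines))
```

If the function returns a CPU name and the same Data Bit appears on multiple DIMMs, the case matches this resolution. If it returns None, more than one CPU detected the errors (or no CE events were found), and this resolution may not apply.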



Product
Sun Fire E6900 Server
Sun Fire 6800 Server
Sun Enterprise 10000 Server
Sun Fire E25K Server
Sun Fire E20K Server
Sun Fire 15K Server
Sun Fire 12K Server

Internal Comments
Solutions for Diagnosing multiple DIMM CE errors

See also:

<Document: 1010642.1>  Diagnosis of bad writers and datapath faults from Solaris messages

<Document: 1005028.1>  Sun Fire [TM] 12K/15K/E20K/E25K: Distinguishing a CPU Which is a BAD Writer From One Which is a BAD Reader

<Document: 1010934.1>  Findaft an AFT, CPU, Memory and PCI ECC error message summary script


CE, memory, ECC, bad reader, bad writer, 12K, 15K, E20K, E25K
Previously Published As
80950

Change History
Date: 2010-04-27
User Name: Cootware
Action: Added link to cediag doc.
Verified valid information.
Comment: Document should be archived at Starfire EOSL


Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.