Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type: FAB (standard) Sure Solution

1001165.1: System Hangs After MCE (Machine Check Exception) Correctable Memory Errors With 1GB Micron DIMMs In Slots 1 And 2

Previously Published As: 201559
Product: Sun Fire X2100 Server, Sun Ultra 20 Workstation
Bug Id: <SUNBUG: 6408744>
Part
Impact

When marginal 1GB Micron DIMMs are installed in DIMM slots 1 and 2, correctable ECC memory errors can occur. In extreme cases the system may also hang after the memory errors. The cause is DIMMs manufactured at the edge of tolerance levels being placed in slots 1 and 2, which are more susceptible to signal integrity issues because of bus layout and length.

Some examples of errors that can be seen when marginal 1GB Micron DIMMs are placed in slots 1 and 2:

    Mar 27 13:36:04 testsystem sshd(pam_unix)[16554]: session opened for user root by (uid=0)
    Mar 27 15:20:14 testsystem kernel: CPU 0: Silent Northbridge MCE
    Mar 27 15:20:14 testsystem kernel: Northbridge status 946ac001:00000813
    Mar 27 15:20:14 testsystem kernel: Error ecc error
    Mar 27 15:20:14 testsystem kernel: bus error local node origin, request didn't time out
    Mar 27 15:20:14 testsystem kernel: generic read
    Mar 27 15:20:14 testsystem kernel: memory access, level generic
    Mar 27 15:20:14 testsystem kernel: link number 0
    Mar 27 15:20:14 testsystem kernel: err cpu1
    Mar 27 15:20:14 testsystem kernel: corrected ecc error
    Mar 27 15:20:14 testsystem kernel: previous error lost
    Mar 27 15:20:14 testsystem kernel: NB error address 000000004ed1a430
    Mar 27 15:31:09 testsystem automount[19280]: lookup(ldap): got answer, but no first entry for (&(objectclass=nisObject)(cn=budny))

This second example shows the hang and reboot:

    Mar 28 03:33:42 testsystem kernel: CPU 0: Silent Northbridge MCE
    Mar 28 03:33:42 testsystem kernel: Northbridge status 946ac002:00000813
    Mar 28 03:33:42 testsystem kernel: Error ecc error
    Mar 28 03:33:42 testsystem kernel: bus error local node origin, request didn't time out
    Mar 28 03:33:42 testsystem kernel: generic read
    Mar 28 03:33:42 testsystem kernel: memory access, level generic
    Mar 28 03:33:42 testsystem kernel: link number 0
    Mar 28 03:33:42 testsystem kernel: err cpu0
    Mar 28 03:33:42 testsystem kernel: corrected ecc error
    Mar 28 03:33:42 testsystem kernel: previous error lost
    Mar 28 03:33:42 testsystem kernel: NB error address 000000004ed0c630
    Mar 28 08:19:52 testsystem syslogd 1.4.1: restart.
    Mar 28 08:19:52 testsystem syslog: syslogd startup succeeded

PcCheck can be run to verify the issue. An example of the errors PcCheck reports:

    Failed Microtopology test(uTL)Dimm slot A0 "Last failure 00000000:0AF5CC70 Coupled bits detected, read 0008000AH"

To run the PcCheck diagnostics, follow the steps below:
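Separately from PcCheck, the kernel messages quoted in the examples above can be picked out of a saved syslog file with a short script. The following is a minimal sketch, not a Sun-supplied tool: the function name `scan_northbridge_mce` and the record fields are hypothetical, and the patterns are modelled only on the message wording shown in this document, so other kernel versions may need different patterns.

```python
import re

# Patterns are modelled on the example messages quoted in this document.
# Assumption: other kernel versions may word these lines differently.
STATUS_RE = re.compile(r"Northbridge status ([0-9a-fA-F]{8}:[0-9a-fA-F]{8})")
ADDRESS_RE = re.compile(r"NB error address ([0-9a-fA-F]+)")


def scan_northbridge_mce(lines):
    """Collect one record per 'Silent Northbridge MCE' event in syslog lines."""
    events = []
    current = None
    for line in lines:
        if "Silent Northbridge MCE" in line:
            # A new event starts here; classic syslog timestamps are 15 characters wide.
            current = {"time": line[:15], "status": None,
                       "address": None, "corrected": False}
            events.append(current)
            continue
        if current is None:
            continue  # ignore everything before the first MCE line
        status = STATUS_RE.search(line)
        address = ADDRESS_RE.search(line)
        if status:
            current["status"] = status.group(1)
        elif address:
            current["address"] = address.group(1)
        elif "corrected ecc error" in line:
            current["corrected"] = True
    return events


if __name__ == "__main__":
    sample = [
        "Mar 27 15:20:14 testsystem kernel: CPU 0: Silent Northbridge MCE",
        "Mar 27 15:20:14 testsystem kernel: Northbridge status 946ac001:00000813",
        "Mar 27 15:20:14 testsystem kernel: corrected ecc error",
        "Mar 27 15:20:14 testsystem kernel: NB error address 000000004ed1a430",
    ]
    for event in scan_northbridge_mce(sample):
        print(event)
```

Repeated events with `corrected` set to true on a system with 1GB Micron DIMMs in slots 1 and 2 would be the pattern this bulletin describes; the script only summarizes log lines and does not diagnose the DIMMs.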
Root Cause

Resolution

When 1GB Micron DIMMs installed in slots 1 and 2 exhibit correctable ECC memory errors, move them to slots 3 and 4 and retest before replacing any DIMMs.

If the errors continue in slots 3 and 4, assume a failing DIMM and replace the pair using normal DIMM replacement procedures and policies.

If the errors go away after the move, leave the DIMMs in slots 3 and 4 and do not replace them, as these are likely marginal DIMMs.

Note 1: The DIMMs are not defective; they are built at the edge of certain tolerance levels, so when placed on the far end of the memory bus (slots 1 and 2) they can exhibit errors. When removed from slots 1 and 2 and placed in other slots, the DIMMs will function properly.

Note 2: If customers later wish to expand the memory configuration by adding another pair of DIMMs to slots 1 and 2, no further issues will be experienced even if marginal DIMMs are placed in slots 1 and 2. Adding another pair of DIMMs changes the signal of the whole memory bus enough that even marginal DIMMs produce no further errors, regardless of which slots they occupy.

Previously Published As: 102448

Internal Comments: None.
Internal Contributor/submitter: [email protected]
Internal Eng Business Unit Group KE Authors
Internal Eng Responsible Engineer: [email protected]
Internal Services Knowledge Engineer: [email protected]
Internal Kasp FAB Legacy ID: 102448

Internal Sun Alert & FAB Admin Info
Critical Category:
Significant Change Date:
Avoidance: Service Procedure
Responsible Manager: null
Original Admin Info: null

Product_uuid
28c0502a-fd60-11d9-a8ca-080020a9ed93|Sun Fire X2100 Server
372415be-961d-11d9-9adf-080020a9ed93|Sun Ultra 20 Workstation

Attachments: This solution has no attachment
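The slot-move procedure in the Resolution section reduces to a single decision after the retest. It is sketched here as a small function for anyone scripting a service checklist; the names `Action` and `triage_after_move` are hypothetical and are not part of any Sun tool.

```python
from enum import Enum


class Action(Enum):
    REPLACE_PAIR = "replace the DIMM pair"
    LEAVE_IN_SLOTS_3_4 = "leave the DIMMs in slots 3 and 4"


def triage_after_move(errors_in_slots_3_and_4):
    """Decide what to do once the pair has been moved from slots 1 and 2
    to slots 3 and 4 and retested, following the Resolution text."""
    if errors_in_slots_3_and_4:
        # Errors followed the DIMMs: treat as a failing DIMM.
        return Action.REPLACE_PAIR
    # Errors stopped: the DIMMs are likely marginal, not defective.
    return Action.LEAVE_IN_SLOTS_3_4
```

For example, `triage_after_move(False)` returns `Action.LEAVE_IN_SLOTS_3_4`, matching the guidance not to replace DIMMs that run cleanly in slots 3 and 4.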