Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-73-1017422.1
Update Date: 2010-11-10
Keywords:

Solution Type  FAB (standard) Sure

Solution  1017422.1 :   Hardware: A limited number of AMD Opteron CPUs in X4100, X4200, X4500 and X4600 systems can cause the system to shut down unexpectedly, without warning or trace evidence.


Related Items
  • Sun Fire X4600 Server
  • Sun Fire X4100 Server
  • Sun Fire X4500 Server
  • Sun Fire X4200 Server
Related Categories
  • GCS>Sun Microsystems>Sun FAB>Hardware Remediation>Reactive

Previously Published As
228519


Product
Sun Fire X4100 Server
Sun Fire X4600 Server
Sun Fire X4200 Server
Sun Fire X4500 Server

Bug Id
<SUNBUG: 6439409> (X4100/X4200)
<SUNBUG: 6515060> (X4600)

Date of Resolved Release
24-DEC-2006


Impact

A limited number of AMD Opteron Revision E CPUs manufactured prior to December 24, 2006, under specific conditions, can generate a false internal temperature reading which can cause the platform to power down without warning or trace evidence.

AMD CPUs contain an internal CPU thermsense circuit, called the ThermSenseMacro (TSM). The TSM circuit is designed to protect the CPU and system from over-temperature conditions. In a small number of AMD single and dual core Revision E CPUs (manufactured prior to week 52, 2006), the TSM may generate a false temperature reading above the 125 degrees C set point and induce a platform power down.

AMD Failure analysis data indicates a full field Defects Per Million rate of well below 500, i.e. 0.05%.
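
As a rough illustration of that figure, the sketch below converts a Defects Per Million rate to a fraction and estimates the chance that a system with several CPUs contains at least one defective part. It assumes defects are independent across CPUs and treats 500 DPM as an upper bound, since the data above only states that the rate is well below 500. The function names are hypothetical.

```python
def dpm_to_fraction(dpm: float) -> float:
    """Convert Defects Per Million to a plain fraction (500 DPM -> 0.0005)."""
    return dpm / 1_000_000


def prob_any_defective(cpu_count: int, dpm: float) -> float:
    """Chance that at least one of cpu_count CPUs is defective.

    Assumption: each CPU is defective independently with the same rate.
    """
    p = dpm_to_fraction(dpm)
    return 1.0 - (1.0 - p) ** cpu_count


if __name__ == "__main__":
    # 1, 2 and 8 CPUs cover the X4100/X4200, X4500 and fully populated X4600.
    for count in (1, 2, 8):
        print(count, f"{prob_any_defective(count, 500):.4%}")
```

Because 500 DPM is an upper bound, the printed values are upper bounds as well.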


Contributing Factors

The issue occurs when AMD Opteron Single and Dual Core Revision E Series 200 and 800 CPUs operate at cooler CPU case temperatures while executing applications that generate high levels of floating point and memory access activity.

The following system types and part numbers could be impacted:

X4500 (has 2 CPUs):
Sun P/N      AMD OPN        Description
371-0856-01  OSA285FAA6CB   285 AMD Opteron Dual Core CPU (2.6 GHZ) - E STEP

X4600 (has up to 8 CPUs):
Sun P/N      AMD OPN        Description
370-7961-01  OSA854FAA5BM   854 AMD Opteron CPU (2.8 GHZ) - E STEP
371-1759-01  OSA856FAA5BM   856 AMD Opteron CPU (3.0 GHZ) - E STEP
371-0291-01  OSA880FAA6CC   880 AMD Opteron Dual Core CPU (2.4 GHZ) - E STEP
371-1760-01  OSA885FAA6CC   885 AMD Opteron Dual Core CPU (2.6 GHZ) - E STEP

X4100 and X4200 (have 1 or 2 CPUs):
Sun P/N      AMD OPN        Description
370-7711-01  OSA248FAA5BL   248 AMD Opteron CPU (2.2 GHZ) - E STEP
370-7937-01  OSA252FAA5BL   252 AMD Opteron CPU (2.6 GHZ) non-RoHS - E STEP
370-7272-01  OSA252FAA5BL   252 AMD Opteron CPU (2.6 GHZ) - E STEP
370-7934-01  OSA252FAA5BL   252 AMD Opteron CPU (2.6 GHZ) RoHS - E STEP
370-7962-01  OSA254FAA5BL   254 AMD Opteron CPU (2.8 GHZ) - E STEP
371-1776-01  OSA256FAA5BL   256 AMD Opteron (3.0 GHZ) - E STEP
370-7798-01  OSA265FAA6CB   265 AMD Opteron Dual Core CPU (1.8 GHZ) - E STEP
370-7799-01  OSA270FAA6CB   270 AMD Opteron Dual Core CPU (2.0 GHZ) - E STEP
370-7800-01  OSA275FAA6CB   275 AMD Opteron Dual Core CPU (2.2 GHZ) - E STEP
371-0839-01  OSA280FAA6CB   280 AMD Opteron Dual Core CPU (2.4 GHZ) RoHS - E STEP 95 Watt
370-7938-01  OSY280FAA6CB   280 AMD Opteron Dual Core CPU (2.4 GHZ) - E STEP 120 Watt
371-0856-01  OSA285FAA6CB   285 AMD Opteron Dual Core CPU (2.6 GHZ) - E STEP 95 Watt
371-0935-01  OSY285FAA6CB   285 AMD Opteron Dual Core CPU (2.6 GHZ) - E STEP 120 Watt
371-1779-01  OSA290FAA6CB   290 AMD Opteron Dual Core CPU (2.8 GHZ) - E STEP 95 Watt
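
The part-number tables above can be captured as a simple lookup, for example to check an OPN read from a CPU lid against the platforms listed in this FAB. This is a minimal sketch: the dictionary layout and the function name are illustrative and are not part of any Sun or AMD tool.

```python
# AMD OPNs per platform, transcribed from the tables in this FAB.
OPNS_BY_PLATFORM = {
    "X4500": {"OSA285FAA6CB"},
    "X4600": {"OSA854FAA5BM", "OSA856FAA5BM", "OSA880FAA6CC", "OSA885FAA6CC"},
}

# The X4100 and X4200 share one table.
_X4100_X4200 = {
    "OSA248FAA5BL", "OSA252FAA5BL", "OSA254FAA5BL", "OSA256FAA5BL",
    "OSA265FAA6CB", "OSA270FAA6CB", "OSA275FAA6CB", "OSA280FAA6CB",
    "OSY280FAA6CB", "OSA285FAA6CB", "OSY285FAA6CB", "OSA290FAA6CB",
}
OPNS_BY_PLATFORM["X4100"] = _X4100_X4200
OPNS_BY_PLATFORM["X4200"] = _X4100_X4200


def platforms_for_opn(opn: str) -> list:
    """Return the sorted platforms whose table lists this OPN (may be empty)."""
    wanted = opn.strip().upper()
    return sorted(p for p, opns in OPNS_BY_PLATFORM.items() if wanted in opns)


if __name__ == "__main__":
    print(platforms_for_opn("OSA285FAA6CB"))
```

An empty result means the OPN is not listed in this FAB. A listed OPN is only a first filter; the Datecode still decides whether a given CPU is affected.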

This FAB does not support CPU replacements for the V40z platform because that platform captures THERMTRIP events in the SEL, and because we have not had any confirmed false THERMTRIPs on the V40z platform.


Symptoms

The platform powers down without any warning and goes into standby mode. Neither the SP log nor the SEL entries will indicate the cause of the power down (except on X4600 systems running BIOS 44).


Root Cause

An AMD CPU manufacturing test process did not adequately screen TSM faults.

On December 24, 2006 (calendar week 52), AMD implemented a new manufacturing test process that separates out suspect CPUs.

A Regional Stocking Location (RSL) purge will not be implemented due to the extremely low potential for experiencing this issue.

Special Considerations:

There will be no charge to customers for any onsite activities or materials used related to this Field Action Bulletin.

Based on the successful CPU replacements at the five HPC sites, Sun field engineers are expected to replace CPUs for this FAB.  CPU daughter card replacements are not funded or supported by AMD.

Replacement CPUs will not be stored at Sun RSLs.  Instead, AMD will provide the logistics support and CPU shipments directly to/from customer sites.

AMD will provide CPUs in support of this activity until April 30, 2008.


Resolution

Replacement Time Estimate: 15 minutes (per CPU)

Hot Swappable: No

Special Considerations:

Sun's Systems Group Quality Office, in advance of this FAB, actively engaged AMD to provide root cause. Replacement CPUs were provided to five high performance computing (HPC) grid accounts that experienced false THERMTRIP events. Affected CPUs were replaced by Sun field engineers. The purpose of this FAB is to provide support for other customers who experience spurious power downs due to the false THERMTRIP event.

Customer does not need to be under contract to have their product repaired if affected by this issue.

A BIOS upgrade for X4600 is available which includes diagnostics to confirm a THERMTRIP event. This FAB includes instructions for diagnosing X4600 systems both with and without the BIOS upgrade, should the customer decide to not upgrade to BIOS 44.

Final Resolution:

1. Verify that the system has reset to a standby condition.
2. If the system has rebooted, this is not a false THERMTRIP issue. Stop; this FAB
does not apply to your event.
3. Check SP logs for SEL entries by entering the following IPMI command:
ipmitool -I lanplus -H <SP_IP address> -U root sel elist
  <After pressing Return, you will be prompted for the ILOM password>
4. Verify that the BIOS screen does not show 'log full'. If it does, your SP is
unable to store new events, and the log is unusable to diagnose whether you have had
a false THERMTRIP event. To clear the log, go to the BIOS setup screen and follow
the instructions to clear the log. The SP log can store a large number of entries
and will reach this 'full' condition only under unusual conditions/multiple issues.
5. For X4100, X4200, X4500, and X4600 without the BIOS 44 upgrade:
An SEL entry is not captured for THERMTRIP events on Galaxy platforms (with the
exception of X4600 with the BIOS 44 upgrade; see section 6). A false THERMTRIP event
can only be identified by ruling out other power-down events: true over-temperature
conditions and manual power downs.
Ask the customer if any manual intervention to power down the system occurred. If so,
the SEL entries associated with the manual power down should be ignored, and not
considered a false THERMTRIP event.
Verify that the system did not experience a true over-temperature condition caused
by the environment.  In a true over-temp condition, the system reacted properly and
shut down as expected.  This condition is not a false THERMTRIP, and therefore does
not apply to this FAB.
In that case, the logs would show temperature thresholds being exceeded before the
platform powered down.  There are three thresholds: Upper Non Recoverable, Upper
Critical and Upper Non Critical.  You should see these thresholds in the SEL as they
are exceeded.
Note: If the SEL contains entries like the following, this FAB does not apply.
Over-temperature Output Example:
1f04 | 05/11/2007 | 11:10:38 | Temperature p0.t_core | Upper Critical going high | Reading 68 > Threshold 67 degrees C
2004 | 05/11/2007 | 11:10:43 | Processor p0.fail | Predictive Failure Asserted
2104 | 05/11/2007 | 11:11:51 | Temperature p1.t_core | Upper Critical going high | Reading 68 > Threshold 67 degrees C
2204 | 05/11/2007 | 11:11:55 | Processor p1.fail | Predictive Failure Asserted
2304 | 05/11/2007 | 11:12:31 | Temperature p0.t_core | Upper Non-recoverable going high | Reading 76 > Threshold 75 degrees C
2404 | 05/11/2007 | 11:13:08 | Power Supply ps0.pwrok | State Deasserted **
2504 | 05/11/2007 | 11:13:10 | Power Supply ps1.pwrok | State Deasserted **
  ** System has been forcefully shut down by the SP.
Note: SEL log time stamps are in GMT by default.
If you have ruled out a manual power down (verified by the customer),
a true over-temperature condition (as described in the example above),
and there are no other conditions recorded in the SEL logs to explain the
power down, then proceed with this FAB.
6. For X4600 with optional BIOS 44 Upgrade installed (available only on X4600):
The BIOS 44, 0ABHA044, will provide an SEL entry for a THERMTRIP event (both false
THERMTRIP and true over-temperature events).
6.1. When a THERMTRIP error occurs, the system will power down by default.
 
6.2. If the user powers on the system, the BIOS will detect the error and display
an error message in three locations:
     1. POST: "A Thermal Event from SouthBridge occurred on last boot"
     2. DMI event log (in F2 Setup): "A Thermal Event from SouthBridge occurred
        on last boot"
     3. IPMItool: 1800 | 02/21/2007 | 11:04:42 | Processor | Thermal Trip | Asserted
6.3. If you have the above POST/DMI/IPMItool messages and there were no other warnings
by the Service Processor (SP) on over temperature of the ambient condition (a real
THERMTRIP condition) just prior to the shutdown, then you have affected CPUs which
should be replaced, after confirming the Datecode (reference step 9.4).
7. Verify that your system has an affected AMD CPU by entering the following IPMI command:
ipmitool -I lanplus -H <SP_IP address> -U root fru print
  <After pressing Return, you will be prompted for the ILOM password>
Example of Output:
FRU Device Description : p0.fru (ID 6)
Product Manufacturer  : ADVANCED MICRO DEVICES
Product Name          : DUAL CORE AMD OPTERON(TM) PROCESSOR 290
Product Part Number   : 0F21
Product Version       : 02
FRU Device Description : p1.fru (ID 7)
Product Manufacturer  : ADVANCED MICRO DEVICES
Product Name          : DUAL CORE AMD OPTERON(TM) PROCESSOR 290
Product Part Number   : 0F21
Product Version       : 02

The Product Name should match one of the CPU numbers listed in the Contributing Factors section above.

 
8. Identification of Affected Parts before CPUs arrive onsite:
8.1. Verify the system symptoms match the "Final Resolution" requirements.
8.2. Do not remove the heatsink or CPU until you have received replacement CPUs.
     To minimize the amount of downtime at the customer site, CPUs will be shipped
     in advance of opening the system.

  To request CPUs:

1. Complete the 'TSM-CPU-Tracker' template located at...
       http://sdpsweb.central/FIN_FCO/FAB/102880/SPE/TSM-CPU-tracker.ods
   Note: Browser may show garbage on screen depending on your browser settings.
         If this occurs perform a File -> Save Page As to your disk, then
         open it from your local disk.
2. Create an email with the following information:
   1. Address the email to [email protected]
   2. Enter in the Subject line: 'TSM RMA Request'
   3. Enter in email body:
      1. Customer Company Name
      2. Customer Contact Name
      3. Customer Location
      4. Sun Contact Name
      5. Sun Contact Phone
      6. Sun Contact email
      7. Complete Ship-to Address
      8. Complete OPN Part # & Quantity Requested
         Currently it is not possible to identify the failing CPU, so
         the total number of CPUs that are installed on the platform
         will need to be ordered.
         Note: not all CPUs will need to be fitted (see section 9.1).
      9. If you have ruled out other possible THERMTRIP causes, such as those
         that would be recorded in the SEL logs, then proceed with the FAB.
   4. Attach partially completed 'TSM-CPU-Tracker' template
3. Send email
 
8.3. Upon validation of your SEL feedback, AMD will ship:
     1. CPUs, thermal grease and alcohol wipes
     2. Return shipping documentation via email response, including
        1. RMA number
        2. An updated 'TSM-CPU-Tracker' template
 
8.4. AMD detailed handling, ESD requirements, packing, CPU removal, and
       installation instructions are located at...

    http://sdpsweb.central/FIN_FCO/FAB/102880/SPE/AMD_Handling_070502.pdf

 

9. Identification of Affected Parts after the CPUs arrive onsite:
9.1. System design does not allow us to identify the specific offending CPU, so a
visual check of each CPU is required BEFORE removing the CPU.
9.2. Follow AMD handling instructions for careful removal of CPU heatsink and CPU,
per the 'TSM Field Remediation Process for Sun Microsystems' document found
via the below link...
http://sdpsweb.central/FIN_FCO/FAB/102880/SPE/AMD_Handling_070502.pdf
   Note: Browser may show a blank screen depending on your browser settings.
         If this occurs perform a File -> Save Page As to your disk, then
         open it from your local disk.
 
9.3. Capture all CPU and slot information in the 'TSM-CPU-Tracker' template located
via the below link...
    http://sdpsweb.central/FIN_FCO/FAB/102880/SPE/TSM-CPU-tracker.ods
   Note: Browser may show garbage on screen depending on your browser settings.
         If this occurs perform a File -> Save Page As to your disk, then open
         it from your local disk.
9.4. Verify the Datecode of each CPU:
     1. Reference CPU photo located in either the TSM-CPU-Tracker or Handling
        instructions for location of Datecode & 'screening mark'.
     2. Affected CPUs have Datecodes of 0651 or earlier (0650, 0649, 0648, ...)
     3. CPUs with Datecodes later than 0651, or CPUs that have been etched with the
        'screening mark', should not be removed.
     4. It is the FE's responsibility to ensure that only "affected" CPUs are removed.
     5. Use the alcohol wipes provided by AMD to thoroughly clean the used thermal
        grease from the bottom of the heatsink and lid of the CPU.  Each thermal
        grease syringe provided by AMD has sufficient grease for the application
        of (1) CPU.  For detailed CPU installation instructions, please reference
        the 'TSM Field Remediation Process for Sun Microsystems' document located
        via the below URL...
        http://sdpsweb.central/FIN_FCO/FAB/102880/SPE/AMD_Handling_070502.pdf
     6. Reattach the heatsink to the original CPU and return the 'good' CPU back
        to AMD, along with any replaced CPUs taken from other slots.
9.5. Install new CPU for Datecode-validated or rescreen-validated CPU removals.
9.6. Pack and Ship CPUs per AMD handling instructions document.
   1. Label package with the RMA number and ship to:
      AMD
      5204 E. Ben White Blvd, MS 574
      Austin, TX 78741
      USA
      Attn: Ed Zahradnik
      TSM RMA # : 6XXX XXXX [8 Digits]        QTY : ____
   2. Send the AWB and TSM-CPU-Tracker file to [email protected]
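
The SEL triage described in steps 5 and 6 can be sketched as a small parser over the text printed by the 'sel elist' command. This is an illustration only. It assumes the pipe-separated layout shown in the examples in those steps, and the category names it returns are hypothetical. It cannot detect a manual power down, which must still be confirmed with the customer.

```python
def triage_sel(sel_text: str) -> str:
    """Classify SEL text as 'over-temperature', 'thermtrip-logged' or 'unexplained'."""
    over_temp = False
    thermtrip = False
    for line in sel_text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 5:
            continue  # not an SEL record (blank line, footnote, ...)
        sensor, event = fields[3], fields[4]
        if sensor.startswith("Temperature") and "going high" in event:
            over_temp = True  # a temperature threshold was crossed
        elif sensor.startswith("Processor") and event == "Thermal Trip":
            thermtrip = True  # entry logged by X4600 BIOS 44
    if over_temp:
        return "over-temperature"  # true over-temp: this FAB does not apply
    if thermtrip:
        return "thermtrip-logged"  # THERMTRIP with no over-temp warnings
    return "unexplained"           # nothing in the SEL explains the power down


OVER_TEMP_SAMPLE = """\
1f04 | 05/11/2007 | 11:10:38 | Temperature p0.t_core | Upper Critical going high | Reading 68 > Threshold 67 degrees C
2004 | 05/11/2007 | 11:10:43 | Processor p0.fail | Predictive Failure Asserted
2404 | 05/11/2007 | 11:13:08 | Power Supply ps0.pwrok | State Deasserted
"""

THERMTRIP_SAMPLE = "1800 | 02/21/2007 | 11:04:42 | Processor | Thermal Trip | Asserted\n"

if __name__ == "__main__":
    print(triage_sel(OVER_TEMP_SAMPLE))
    print(triage_sel(THERMTRIP_SAMPLE))
```

An 'unexplained' result on a system without BIOS 44 matches the condition in step 5 only after a manual power down has been ruled out. A 'thermtrip-logged' result corresponds to step 6.3 and still requires the Datecode check in step 9.4.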

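The FRU check in step 7 can be sketched as a parser over the text printed by the 'fru print' command. It assumes the layout shown in the step 7 example and that the processor number follows the word PROCESSOR in the Product Name, as it does in that example. The model numbers are the ones listed in the Contributing Factors section, and the function names are hypothetical.

```python
import re

# CPU model numbers listed in the Contributing Factors section of this FAB.
LISTED_MODELS = {248, 252, 254, 256, 265, 270, 275, 280, 285, 290,
                 854, 856, 880, 885}


def parse_fru(text: str) -> list:
    """Split 'fru print' text into one dict per FRU record."""
    records = []
    current = None
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        if key == "FRU Device Description":
            current = {"description": value}  # a new record starts here
            records.append(current)
        elif current is not None:
            current[key] = value
    return records


def listed_cpus(records: list) -> dict:
    """Map each CPU slot (e.g. 'p0') to (model number, listed in this FAB?)."""
    result = {}
    for record in records:
        slot = re.match(r"(p\d+)\.fru\b", record["description"])
        model = re.search(r"PROCESSOR\s+(\d{3})\b", record.get("Product Name", ""))
        if slot and model:
            number = int(model.group(1))
            result[slot.group(1)] = (number, number in LISTED_MODELS)
    return result


FRU_SAMPLE = """\
FRU Device Description : p0.fru (ID 6)
Product Manufacturer  : ADVANCED MICRO DEVICES
Product Name          : DUAL CORE AMD OPTERON(TM) PROCESSOR 290
Product Part Number   : 0F21
Product Version       : 02
FRU Device Description : p1.fru (ID 7)
Product Manufacturer  : ADVANCED MICRO DEVICES
Product Name          : DUAL CORE AMD OPTERON(TM) PROCESSOR 290
Product Part Number   : 0F21
Product Version       : 02
"""

if __name__ == "__main__":
    print(listed_cpus(parse_fru(FRU_SAMPLE)))
```

A matching model number only shows that the CPU type is listed. The Datecode check in step 9.4 is still required before any CPU is removed.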
 
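The Datecode rule in step 9.4 can be expressed as a short check. The sketch assumes the four-digit Datecode is a two-digit year followed by a two-digit week, so that 0651 means week 51 of 2006. That format is inferred from the examples in step 9.4 and the week 52, 2006 cut-off; it is not stated explicitly in this FAB. The function name is hypothetical.

```python
import re

# Last affected Datecode per step 9.4: year 06, week 51 (assumed YYWW format).
LAST_AFFECTED = (6, 51)


def is_affected(datecode: str, has_screening_mark: bool = False) -> bool:
    """Return True if a CPU with this Datecode should be replaced.

    CPUs etched with the 'screening mark' are never removed, whatever
    their Datecode.
    """
    if not re.fullmatch(r"[0-9]{4}", datecode):
        raise ValueError(f"expected a 4-digit Datecode, got {datecode!r}")
    if has_screening_mark:
        return False
    year, week = int(datecode[:2]), int(datecode[2:])
    return (year, week) <= LAST_AFFECTED


if __name__ == "__main__":
    for code in ("0549", "0651", "0652", "0701"):
        print(code, is_affected(code))
```

Comparing (year, week) tuples keeps the year boundary correct, so a CPU from week 1 of 2007 is treated as later than one from week 51 of 2006.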


Previously Published As
102880

Comments

This issue was evaluated as, and determined not to meet criteria for, an FCO due to the low potential of exposure involving very specific configurations.


For replacement materials sent from AMD to the customer site:

Shipment terms: CIP (Carriage and Insurance Paid to customer destination)

Exporter of record: AMD

Importer of record: customer

Declared value of the shipment: AMD's current market price for the respective Ordering Part Number (OPN)



For replaced material returning to AMD:

Shipment terms: FCA (Free Carrier - customer pick up location)

Exporter of record: customer

Importer of record: AMD

Declared value of the shipment: AMD's current market price for the respective OPN



In some cases, the customer may be a Sun Field Engineer responsible for servicing the customer account who is handling the receipt/return of CPUs.



AMD assumes all freight and customs costs and, therefore, will pay for the freight for each movement to and from the customer site.


Related Information
  • URL: http://sdpsweb.central/FIN_FCO/FAB/102880/SPE/


Internal Contributor/submitter
[email protected], [email protected]

Internal Eng Business Unit Group
KE Authors

Internal Eng Responsible Engineer
[email protected], [email protected]

Internal Services Knowledge Engineer
[email protected]

Internal Escalation ID
1-16348496, 1-17642328, 1-17935330, 1-20735147, 1-20835776

Internal Kasp FAB Legacy ID
102880

Internal Sun Alert & FAB Admin Info
Critical Category:
Significant Change Date: 2007-05-15
Avoidance: Hardware
Responsible Manager: [email protected]
Original Admin Info: WF - Initiated draft and awtg feedback from questions asked
during initial review. - Joe 4/11/07
WF - awtg a resubmission from Kim Mayman, who is waiting
on updated info from Duncan Morton, Sun Supply Chain
Manager for AMD products. - Joe 4/20/07
WF - FAB resubmitted by sponsor w/updates - Joe 4/23/07
WF - finalized draft and sent to extended review - Joe 5/2/07
WF - updated by submitter, still in review - Joe 5/4/07
WF - final updates per submitter, sending to publish - Joe 5/15/07
WF - FAB not showing as published. Put word "Hardware" at the
beginning of Synopsis and will republish. - Joe 5/16/07
WF - corrected impitool command in Resolution. - Joe 5/17/07
WF - added "| Asserted" to end of step 6.2.3. - Joe 5/22/07
Product_uuid
54e2ac49-df71-11d9-89e6-080020a9ed93|Sun Fire X4100 Server
72cdbb85-7cd3-11da-8990-080020a9ed93|Sun Fire X4600 Server
c6e795ef-df6f-11d9-89e6-080020a9ed93|Sun Fire X4200 Server
f4bbfa5f-e6e5-11da-ac3d-080020a9ed93|Sun Fire X4500 Server

Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.