Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-71-1010407.1
Update Date:2010-09-08
Keywords:

Solution Type  Technical Instruction Sure

Solution  1010407.1 :   DTAG parity error Troubleshooting and Analysis  


Related Items
  • Sun Enterprise 3000 Server
  • Sun Enterprise 4500 Server
  • Sun Enterprise 5500 Server
  • Sun Enterprise 4000 Server
  • Sun Enterprise 5000 Server
  • Sun Enterprise 6000 Server
  • Sun Enterprise 3500 Server
  • Sun Enterprise 6500 Server
Related Categories
  • GCS>Sun Microsystems>Servers>Midrange Servers

PreviouslyPublishedAs
214288


Applies to:

Sun Enterprise 3000 Server
Sun Enterprise 3500 Server
Sun Enterprise 4000 Server
Sun Enterprise 4500 Server
Sun Enterprise 5000 Server
All Platforms

Goal

Description:

This document describes how to perform analysis of DTAG parity error events on Sun Enterprise 3x00/4x00/5x00/6x00 (aka Classic) Servers and determine if a replacement action is necessary.

Examples:

A DTAG Parity Error event is often only visible on the system console (sometimes called the console log since this is often logged on a console server) and is usually seen within Fatal Reset output.

An example from console log data is below:

17-OCT-2001 17:07:55.17 LBC5   Fatal Reset 
17-OCT-2001 17:07:56.69 LBC5 0,0>FATAL ERROR
17-OCT-2001 17:07:57.15 LBC5 0,0> At time of error: System software was running.
17-OCT-2001 17:07:57.37 LBC5 0,0> Diagnosis: Board 2, Dtag B (UPA Port1),AC
17-OCT-2001 17:07:57.37 LBC5 0,0>Log Date: Oct 17 21:17:19 GMT 2001
17-OCT-2001 17:07:57.37 LBC5 0,0>
17-OCT-2001 17:07:57.58 LBC5 0,0>RESET INFO for CPU/Memory board in slot 2
17-OCT-2001 17:07:57.58 LBC5 0,0> AC ESR 00000010.00000000 DT_PERRB
17-OCT-2001 17:07:57.59 LBC5 0,0> DC[0] 00
17-OCT-2001 17:07:57.59 LBC5 0,0> DC[1] 00
17-OCT-2001 17:07:57.59 LBC5 0,0> DC[2] 00
17-OCT-2001 17:07:57.59 LBC5 0,0> DC[3] 00
17-OCT-2001 17:07:57.59 LBC5 0,0> DC[4] 00
17-OCT-2001 17:07:57.59 LBC5 0,0> DC[5] 00
17-OCT-2001 17:07:57.59 LBC5 0,0> DC[6] 00
17-OCT-2001 17:07:57.80 LBC5 0,0> DC[7] 00
17-OCT-2001 17:07:57.80 LBC5 0,0> FHC CSR 00050030 LOC_FATAL SYNC BRD_LED_M BRD_LED_R
17-OCT-2001 17:07:57.80 LBC5 0,0> FHC RCSR 02000000 FATAL
17-OCT-2001 17:07:57.80 LBC5 0,0> Config policy change
17-OCT-2001 17:07:57.80 LBC5 0,0>
17-OCT-2001 17:07:57.80 LBC5 0,0>@(#) POST 3.9.28 2000/12/20 12:29
17-OCT-2001 17:07:58.02 LBC5 0,0>Copyright 2000 Sun Microsystems, Inc. All rights reserved.

In the example above the DTAG parity error occurred on System Board 2. 

NOTE that the port can also be Port A.
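
As a rough illustration of how the console log above can be scanned, the following Python sketch looks for the RESET INFO line to pick up the board slot and then for DT_PERRA or DT_PERRB on the AC ESR line. The function name and the regular expressions are assumptions based only on the sample output shown here; other POST versions or console servers may format these lines differently.

```python
import re

# Patterns are assumptions derived from the sample console log above.
SLOT_RE = re.compile(r"RESET INFO for CPU/Memory board in slot (\d+)")
ESR_RE = re.compile(r"AC ESR\s+\S+\s+(.*)")
DTAG_FLAGS = ("DT_PERRA", "DT_PERRB")


def find_dtag_events(lines):
    """Return a list of {'board': int, 'flag': str} for each DTAG flag seen."""
    events = []
    board = None  # slot number from the most recent RESET INFO line
    for line in lines:
        slot = SLOT_RE.search(line)
        if slot:
            board = int(slot.group(1))
            continue
        esr = ESR_RE.search(line)
        if esr:
            # Everything after the ESR value is a list of decoded flag names.
            for flag in esr.group(1).split():
                if flag in DTAG_FLAGS:
                    events.append({"board": board, "flag": flag})
    return events


if __name__ == "__main__":
    sample = [
        "17-OCT-2001 17:07:57.58 LBC5 0,0>RESET INFO for CPU/Memory board in slot 2",
        "17-OCT-2001 17:07:57.58 LBC5 0,0> AC ESR 00000010.00000000 DT_PERRB",
    ]
    print(find_dtag_events(sample))
```

The sketch only reports what the log states; whether a replacement is warranted still depends on the error history for that board.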

The Solution section of this article explains this event in further detail.

  • Console log data is essential for analyzing such an event. If console logging still needs to be configured, see Document 1008702.1, Console Logging Options to capture Fatal Reset output for Sun systems.

A DTAG event may also be seen in prtdiag output (in the section called Analysis of most recent Fatal Hardware Watchdog). This is about the only type of Fatal Error event that can be diagnosed from prtdiag output alone.

A DTAG error looks like the following in prtdiag:
     AC: UPA Port B Dtag Parity Error

NOTE that the port can also be Port A, for example:

     AC: UPA Port A Dtag Parity Error
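
A saved copy of prtdiag output can be checked for these lines with a short Python sketch such as the one below. The pattern is an assumption built from the two example lines above, and the function name is hypothetical.

```python
import re

# Pattern assumed from the example prtdiag lines above.
PRTDIAG_DTAG_RE = re.compile(r"AC:\s*UPA Port ([AB]) Dtag Parity Error")


def dtag_ports_in_prtdiag(text):
    """Return the UPA port letters ('A' or 'B') reported with a DTAG parity error."""
    return PRTDIAG_DTAG_RE.findall(text)


if __name__ == "__main__":
    saved_output = "     AC: UPA Port B Dtag Parity Error\n"
    print(dtag_ports_in_prtdiag(saved_output))
```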

Once you have determined that your event matches what has been described above, proceed to the Solution section of this article to resolve the event.

Solution

What is a DTAG parity Error?

The event DT_PERR indicates a Duplicate Tag SRAM (DTAG) parity error. These DTAG SRAMs reside on CPU/Memory boards in Sun Enterprise 3x00/4x00/5x00/6x00 (Classic) Servers. DTAGs are duplicates of the CPUs' ETAGs on the system board.
  • DT_PERRA refers to DTAG SRAMs supporting CPU location 0.
  • DT_PERRB refers to DTAG SRAMs supporting CPU location 1.
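
The mapping in the two bullets above can be written down as a small lookup, sketched here in Python. The names are hypothetical and the table contains only what this article states.

```python
# Mapping taken from the two bullets above; nothing else is assumed.
DTAG_FLAG_TO_CPU_LOCATION = {"DT_PERRA": 0, "DT_PERRB": 1}


def describe_dtag_flag(flag):
    """Translate a DT_PERRA/DT_PERRB flag into its DTAG group and CPU location."""
    location = DTAG_FLAG_TO_CPU_LOCATION[flag]  # KeyError for unknown flags
    group = flag[-1]  # trailing letter, 'A' or 'B'
    return "DTAG group %s (CPU location %d)" % (group, location)


if __name__ == "__main__":
    print(describe_dtag_flag("DT_PERRB"))
```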

Notes about troubleshooting DTAG errors:

DTAG errors are usually caused by bit flips in DTAG SRAM, which is located on the system board. The same factors that cause bit flips in memory (alpha particles, handling, and environmental conditions) cause bit flips in DTAG SRAM.

The CPUs and memory on a System Board which receives a DTAG parity error are never the cause.

Repair vendor testing of system boards that received DTAG parity errors shows that more than 90% of the time these errors are transient and never occur again.

For this reason, Oracle's Best Practices (originally "Sun's Best Practices") recommend the following when a DTAG parity error occurs:

  1. Power cycle the system with max diags to re-POST the hardware.
    • The DTAG error event may have disabled the board.
    • To bring it back online and test its sanity, a power cycle with max diag POST execution is recommended.
  2. If no error is detected in POST, monitor the system for repeat errors but do not replace the system board.
  3. If a second DTAG error occurs on the same system board and same DTAG group within 6 months, replace the system board that reported the errors.
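
The repeat-error rule in steps 2 and 3 can be sketched as a small Python check. This is only an illustration: the function name, the tuple layout, and the choice of 183 days to stand in for "6 months" are assumptions, not part of the Best Practice text.

```python
from datetime import datetime, timedelta

# Assumption: "6 months" is approximated as 183 days.
SIX_MONTHS = timedelta(days=183)


def recommend_action(history, new_event, window=SIX_MONTHS):
    """Suggest an action for a new DTAG parity error.

    history:   list of (timestamp, board, group) tuples for earlier DTAG errors
    new_event: (timestamp, board, group) for the error just seen
    group is 'A' or 'B' (from DT_PERRA / DT_PERRB).
    """
    when, board, group = new_event
    for prev_when, prev_board, prev_group in history:
        same_place = (prev_board, prev_group) == (board, group)
        recent = timedelta(0) <= when - prev_when <= window
        if same_place and recent:
            return "replace system board %d" % board
    return "power cycle with max diags; if POST is clean, monitor and do not replace"


if __name__ == "__main__":
    event = (datetime(2001, 10, 17), 2, "B")
    print(recommend_action([], event))
    print(recommend_action([(datetime(2001, 6, 1), 2, "B")], event))
```

An earlier error on a different board, or on the other DTAG group of the same board, does not count toward the replacement threshold in this sketch, which matches the wording of step 3.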

From the example in the Goal section:

The DTAG parity error occurred on System Board 2. The DTAG memory that suffered a bit flip was associated with CPU location 1 (DT_PERRB). If this was the second occurrence of the same error on the same system board in the past 6 months, system board 2 should be replaced. If this was a first error, Best Practice dictates that the board should not be replaced.

Additional Information:

Environmental factors are one leading cause of DTAG errors. A good environmental resource is Document 1011650.1, Sun Enterprise[TM] 3X00-6X00 Servers: Board Temperature Information.

@ Previously Published As 40760

Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.