Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-71-1002710.1
Update Date:2011-05-23
Keywords:

Solution Type  Technical Instruction Sure

Solution  1002710.1 :   Sun Fire[TM] V1280, 3800, 4800, 4810, 6800, E2900, E4900, E6900, and Netra[TM] 1280 and 1290 systems: Incoming versus Outgoing errors.


Related Items
  • Sun Fire 4810 Server
  • Sun Fire 3800 Server
  • Sun Netra 1290 Server
  • Sun Fire E6900 Server
  • Sun Fire 6800 Server
  • Sun Fire V1280 Server
  • Sun Fire 4800 Server
  • Sun Fire E2900 Server
  • Sun Fire E4900 Server
  • Sun Netra 1280 Server
Related Categories
  • GCS>Sun Microsystems>Servers>Midrange Servers
  • GCS>Sun Microsystems>Servers>Midrange V and Netra Servers

Previously Published As
203717


Applies to:

Sun Netra 1290 Server - Version: Not Applicable and later    [Release: N/A and later]
Sun Fire E6900 Server - Version: Not Applicable and later    [Release: N/A and later]
Sun Fire E4900 Server - Version: Not Applicable and later    [Release: N/A and later]
Sun Fire 4810 Server - Version: Not Applicable and later    [Release: N/A and later]
Sun Fire V1280 Server - Version: Not Applicable and later    [Release: N/A and later]
All Platforms

Goal

Description
This document applies to Sun Fire[TM] V1280, 3800, 4800, 4810, 6800, E2900, E4900, E6900, and Netra[TM] 1280 and 1290 systems.

This document covers the diagnosis of error events that are logged to the error buffer on the System Controller (SC) of the systems listed above.  The error buffer data is collected by the showerrorbuffer command when Explorer is run with the scextended or 1280extended option.  Alternatively, a user can display this information directly on the System Controller by running the command as follows (this example is from the lom prompt on an E2900 server):
lom> showerrorbuffer

ErrorData[0]
Date: Sat Aug 18 09:50:39 EDT 2007
Device: /SB0/dx3
ErrorID: 0x33071ff3
Port: 3
Syndrome: 0xd(CE bit 41)
Direction: outgoing read
TargetAid: 0x3
Transid: 0x1
ErrorData[1]
Date: Sat Aug 18 09:50:39 EDT 2007
Device: /SB2/dx3
ErrorID: 0x33071ff3
Port: 3
Syndrome: 0xd(CE bit 41)
Direction: incoming read
First error: true
TargetAid: 0x3
Transid: 0x1

The error example above is used throughout the remainder of this article to explain how Incoming and Outgoing relate to each other in error message diagnosis.
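Because each record in the output above is a block of "Key: value" lines introduced by an ErrorData[n] header, it can be split into records with a few lines of script. The following is only a minimal sketch in Python, assuming the layout matches the example above; the function name parse_error_buffer is hypothetical and is not part of any Sun or Oracle tool.

```python
import re

# Sample text copied from the showerrorbuffer example in this article.
SAMPLE = """\
ErrorData[0]
Date: Sat Aug 18 09:50:39 EDT 2007
Device: /SB0/dx3
ErrorID: 0x33071ff3
Port: 3
Syndrome: 0xd(CE bit 41)
Direction: outgoing read
TargetAid: 0x3
Transid: 0x1
ErrorData[1]
Date: Sat Aug 18 09:50:39 EDT 2007
Device: /SB2/dx3
ErrorID: 0x33071ff3
Port: 3
Syndrome: 0xd(CE bit 41)
Direction: incoming read
First error: true
TargetAid: 0x3
Transid: 0x1
"""


def parse_error_buffer(text):
    """Split showerrorbuffer text into one dict per ErrorData[n] record.

    Hypothetical helper: assumes every field is a 'Key: value' line.
    """
    records = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = re.fullmatch(r"ErrorData\[(\d+)\]", line)
        if header:
            # A new record starts at each ErrorData[n] header.
            current = {"Index": int(header.group(1))}
            records.append(current)
        elif current is not None and ":" in line:
            # Split on the first colon only, so the Date value stays intact.
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return records


records = parse_error_buffer(SAMPLE)
for rec in records:
    print(rec["Index"], rec["Device"], rec["Direction"])
```

Lines that appear before the first ErrorData[n] header, such as the lom prompt, are ignored by this sketch.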

Solution

Diagnosing incoming versus outgoing errors in the showerrorbuffer file.

What is the relation of the terms Incoming and Outgoing?

The answer is straightforward: both terms describe the direction of a data transaction.  There are two possible directions for an error event to "travel", and the direction is always stated "as it relates to the dx asic" (picture below illustrates the data path in question here between DX and DCDS):

  • Outgoing - An error that is moving away from the dx asic (ultimately to a DCDS/CPU/memory on the same board, or off to some other board).
  • Incoming - An error that is moving towards the dx asic (from a DCDS/CPU/memory on the reporting dx asic's board).

Why do we care about what direction the error "travels"?

The short answer is: because this is an error.

The longer answer is that the event(s) may indicate defective hardware if the errors are uncorrectable or excessive (exceeding Oracle's Memory Error Best Practice).  Knowing the direction of the event allows a user to identify the source of the error, which is crucial to resolving the event and stopping the errors.

The direction of the transaction identifies the source, and thus the Root Cause, of the event.

Now, how do we identify the direction in which an event is "traveling" and identify the source?  Using the same error example as before:

ErrorData[0]
Date: Sat Aug 18 09:50:39 EDT 2007
Device: /SB0/dx3 <--- This dx is reporting the event.
ErrorID: 0x33071ff3
Port: 3 <--- This is the CPU number implicated.
Syndrome: 0xd(CE bit 41) <--- This is the error syndrome.
Direction: outgoing read <--- This is the direction of the event as it relates to the dx.
TargetAid: 0x3
Transid: 0x1

Outgoing means that the error traveled from the dx asic (SB0/dx3) to the CPU (SB0/P3) or its memory (through the DCDS).  This is called a "Victim" event because the error came from somewhere else and the dx asic "passed it along".

The next error from the example error log file shows a "Source" event.  Source events are root cause events.

ErrorData[1]
Date: Sat Aug 18 09:50:39 EDT 2007
Device: /SB2/dx3 <--- This dx is reporting the event.
ErrorID: 0x33071ff3
Port: 3 <--- This is the CPU number implicated.
Syndrome: 0xd(CE bit 41) <--- This is the error syndrome.
Direction: incoming read <--- This is the direction of the event as it relates to the dx.
First error: true
TargetAid: 0x3
Transid: 0x1

Incoming means that the error traveled from the CPU (SB2/P3) or its memory, via the DCDS, to the dx asic (SB2/dx3).  This means that the error was sourced from the DCDS, the CPU, or its memory (the CPU is a memory controller).  The dx is simply reporting that a CPU it monitors has seen the error, and it forwards the error along, where it becomes a different dx asic's Outgoing event.
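The Source/Victim reasoning above can be sketched as a small Python helper. This is an illustration only: the function name classify is hypothetical, and the sketch assumes that the board name can be read from the Device field (for example /SB2/dx3) and that the Port value is the CPU number on that board, as in this article's example.

```python
import re


def classify(event):
    """Return (role, cpu) for one error-buffer event.

    Hypothetical helper: 'incoming' is treated as a Source event and
    'outgoing' as a Victim event, following the explanation above.
    """
    match = re.search(r"SB\d+", event["Device"])
    board = match.group(0) if match else "unknown board"
    cpu = f"{board}/P{event['Port']}"
    # The first word of the Direction field is incoming or outgoing.
    direction = event["Direction"].split()[0].lower()
    role = {"incoming": "Source", "outgoing": "Victim"}.get(direction, "Unknown")
    return role, cpu


# The two events from this article's example, reduced to the fields used here.
events = [
    {"Device": "/SB0/dx3", "ErrorID": "0x33071ff3", "Port": "3",
     "Direction": "outgoing read"},
    {"Device": "/SB2/dx3", "ErrorID": "0x33071ff3", "Port": "3",
     "Direction": "incoming read"},
]

for event in events:
    role, cpu = classify(event)
    print(f"{event['Device']}: {role} event, CPU {cpu}")
```

For the example data this reports SB0/P3 as the Victim and SB2/P3 as the Source, matching the manual diagnosis above.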

In the above example, the Root Cause suspects would be SB2 DIMM pair J16500/J16501 because data bit 41 (ESYN 0xd) translates to that DIMM pair. 

  • If there were correlating ECC errors in the domain's /var/adm/messages file that showed only one DIMM bank in error, then the error would be further isolated to a single DIMM (either Bank 0 or Bank 1).
  • The suspect(s) should be replaced ONLY IF they meet the Best Practice rules as defined in Document 1010905.1 Oracle Enhanced Memory DIMM Replacement Policy.


NOTES:

  • It is worth mentioning that this document discusses one of the easiest error examples to diagnose as it relates to Incoming/Outgoing directions.  It showed "read" transactions. 
    • A read is almost always sourced to a memory DIMM.
  • If you see an "incoming write" from a single CPU location together with many different "outgoing reads", suspect the CPU associated with the "incoming write" transaction as the Root Cause.
  • Big rule:  CPUs "write" and DIMMs "read", so when only "reads" are reported, suspect a DIMM.
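The rules of thumb in these notes can be sketched in Python as follows. This is a rough illustration, not a diagnostic tool: the function name suggest_suspect is hypothetical, "many" is assumed here to mean more than one outgoing read, and the second data set is invented purely to exercise the write case.

```python
def suggest_suspect(events):
    """Apply the notes' rules of thumb to a list of error-buffer events.

    Hypothetical helper; each event needs Device, Port and Direction.
    """
    incoming_writes = {(e["Device"], e["Port"]) for e in events
                       if e["Direction"] == "incoming write"}
    outgoing_reads = [e for e in events if e["Direction"] == "outgoing read"]
    incoming_reads = [e for e in events if e["Direction"] == "incoming read"]
    any_write = any(e["Direction"].endswith("write") for e in events)

    # One CPU location with an incoming write plus several outgoing reads:
    # suspect that CPU.
    if len(incoming_writes) == 1 and len(outgoing_reads) > 1:
        device, port = next(iter(incoming_writes))
        return f"CPU at {device} port {port}"
    # Reads only: a read is almost always sourced to a memory DIMM.
    if incoming_reads and not any_write:
        first = incoming_reads[0]
        return f"memory (DIMM) behind {first['Device']} port {first['Port']}"
    return "undetermined"


# Events from this article's example.
article_events = [
    {"Device": "/SB0/dx3", "Port": "3", "Direction": "outgoing read"},
    {"Device": "/SB2/dx3", "Port": "3", "Direction": "incoming read"},
]

# Invented events (not from a real system) for the incoming-write rule.
hypothetical_events = [
    {"Device": "/SB1/dx0", "Port": "0", "Direction": "incoming write"},
    {"Device": "/SB0/dx1", "Port": "1", "Direction": "outgoing read"},
    {"Device": "/SB2/dx2", "Port": "2", "Direction": "outgoing read"},
]

print(suggest_suspect(article_events))
print(suggest_suspect(hypothetical_events))
```

Any result from such a script is only a starting point; the suspect still has to be confirmed against the syndrome data and the replacement policy referenced above.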

Internal Comments
There is an ESYN Translator located at http://panacea.uk.oracle.com/twiki/bin/view/Tools/ToolPageEsynDecoderUniboard
which can be used to translate ECC syndromes as shown in this article's example.

Previously Published As 90269

Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.