Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-72-1017926.1
Update Date:2010-07-15
Keywords:

Solution Type  Problem Resolution Sure

Solution  1017926.1 :   Sun Fire[TM] 3800-6800: Troubleshooting NCPQ_TO errors  


Related Items
  • Sun Fire 3800 Server
  • Sun Fire 6800 Server
  • Sun Fire 4800 Server
Related Categories
  • GCS>Sun Microsystems>Servers>Midrange Servers

PreviouslyPublishedAs
229185


Applies to:

Sun Fire 3800 Server
Sun Fire 4800 Server
Sun Fire 6800 Server
All Platforms

Symptoms

Description:

This document aids in troubleshooting Non-Cacheable Pending Queue Time Outs (NCPQ_TO) on Sun Fire 3800/4800/4810/6800 systems.  An NCPQ_TO occurs when a data request to Non-Cacheable address space does not complete its transaction.  Non-Cacheable address space comprises the Safari Device config space and the I/O address space.

In most cases this type of error requires Support Services to resolve.  Review this document to see whether it resolves the issue; if it does not, log a Service Request with Support Services and mention this article when doing so.

Symptoms:

Error messages indicating that an NCPQ_TO occurred appear on the Domain Console.  They are also stored in the Domain Console Buffer and can be retrieved with the Sun Fire System Controller (SC) command showlogs.  If a loghost is configured, the error messages are stored on the loghost.  NCPQ_TOs can occur during normal operation of the Domain or during POST.

Here is an example log of an NCPQ_TO error:
Feb 26 10:46:02 systemx DomC.SC: ErrorMonitor:Domain C has a SYSTEM ERROR
Feb 26 10:46:02 systemx DomC.SC: /N0/SB1 encountered the first error
Feb 26 10:46:02 systemx DomC.SC: RepeaterSbbcAsic reported first error on /N0/SB1
Feb 26 10:46:02 systemx DomC.SC: /partition1/domain0/SB1/bbcGroup0/sbbc0:
FE [15:15] : 0x1
ErrSum [31:31] : 0x1
SafErr [09:08] : 0x1 Fireplane device asserted an error
Feb 26 12:20:47 systemx DomC.SC: /partition1/domain0/SB1/bbcGroup0/cpuAB/cpusafariagent0:
AFAR (high)[0x531] : 0x0000063c
AFAR [42:32] [10:00] : 0x63c
AFAR (low)[0x541] : 0xff800000
AFAR_2 (high)[0x571] : 0x0000063c      <<<to be used later (for decoding)
AFAR_2 [42:32] [10:00] : 0x63c
AFAR_2 (low)[0x581] : 0xff800000       <<<to be used later (for decoding)
AFSR (high)[0x551] : 0x00080000
PERR [19:19] : 0x1
AFSR_2 (high)[0x591] : 0x00080000
PERR [19:19] : 0x1
EMU B[0x511] : 0x03000000
AID_LK [24:24] : 0x1
NCPQ_TO [25:25] : 0x1
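
The raw register values in the log can be checked against the flag lines printed beneath them. The following is a minimal Python sketch of that check, not an official tool. The bit positions are copied from the example log itself (AID_LK [24:24], NCPQ_TO [25:25], PERR [19:19]); the names EMU_B_FLAGS, AFSR_HIGH_FLAGS and set_flags are hypothetical and exist only for this illustration.

```python
# Bit positions as printed in the example log (e.g. "NCPQ_TO [25:25]").
# Only the flags that appear in that log are listed; this is not a full
# register map.
EMU_B_FLAGS = {"AID_LK": 24, "NCPQ_TO": 25}
AFSR_HIGH_FLAGS = {"PERR": 19}


def set_flags(value, flags):
    """Return the names of the flags whose bit is set in value, lowest bit first."""
    ordered = sorted(flags.items(), key=lambda item: item[1])
    return [name for name, bit in ordered if (value >> bit) & 1]


# Values taken from the example log: EMU B and AFSR (high).
print(set_flags(0x03000000, EMU_B_FLAGS))      # ['AID_LK', 'NCPQ_TO']
print(set_flags(0x00080000, AFSR_HIGH_FLAGS))  # ['PERR']
```

A value of 0x03000000 has bits 24 and 25 set, which is why the log reports both AID_LK and NCPQ_TO for EMU B.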

Cause

Interpretation of the example error above:

A System error is detected and Domain C is PAUSED.  The device path in the error messages shows that the error was detected on SB1 CPU A:
/partition1/domain0/SB1/bbcGroup0/cpuAB/cpusafariagent0
  • The Error Type is an NCPQ_TO.
  • The AFAR_2 is 0x0000063c.ff800000 (formed from the AFAR_2 high and low entries in the example above), which decodes to:
Non-Cacheable Schizo Device Pair Agent ID 1E Leaf B (I/O Boat 9 Slots 0,1,2).
Use Document 1006063.1 for decoding.
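
The AFAR_2 value used for decoding is the AFAR_2 (high) word followed by the AFAR_2 (low) word. The Python sketch below shows one way to pull both words out of saved log text and join them into the dotted form. It assumes the log lines look exactly like the example in this document; the names AFAR2_RE and afar2_from_log are hypothetical. It does not decode the address to an Agent ID, since that mapping is defined in Document 1006063.1 and varies with the firmware version.

```python
import re

# Matches lines such as "AFAR_2 (high)[0x571] : 0x0000063c".
# Lines for AFAR, AFSR_2 or the "AFAR_2 [42:32]" field are not matched.
AFAR2_RE = re.compile(
    r"AFAR_2 \((high|low)\)\[0x[0-9a-fA-F]+\]\s*:\s*(0x[0-9a-fA-F]{1,8})"
)


def afar2_from_log(text):
    """Return AFAR_2 as '0xHHHHHHHH.LLLLLLLL', or None if either half is missing."""
    words = {}
    for half, value in AFAR2_RE.findall(text):
        words.setdefault(half, int(value, 16))  # keep the first occurrence
    if set(words) != {"high", "low"}:
        return None
    return f"0x{words['high']:08x}.{words['low']:08x}"


sample = """\
AFAR_2 (high)[0x571] : 0x0000063c      <<<to be used later (for decoding)
AFAR_2 [42:32] [10:00] : 0x63c
AFAR_2 (low)[0x581] : 0xff800000       <<<to be used later (for decoding)
"""
print(afar2_from_log(sample))  # 0x0000063c.ff800000
```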

Possible Causes:

There are many possible hardware and software root causes for NCPQ_TOs.  They can be caused by faulty CPUs, I/O Bridge ASICs (Schizo), or PCI cards, as well as by bugs in the microcode of cPCI/PCI cards.
The following scenarios have been known to cause NCPQ_TOs on Sun Fire 3800-6800 systems:
  • An incorrectly performed upgrade from Firmware 5.11.x to 5.12.x may result in an NCPQ_TO on the first reboot after the upgrade.  This firmware is very old, however, so this should no longer be an issue (5.20.x is the most recent release).
  • Improper seating of cPCI/PCI adapters.  If cards have recently been installed or replaced and the error decodes to their location, consider reseating the card to see whether the errors cease.
  • Down-revision (downrev) cPCI/PCI adapter firmware.

Solution

Troubleshooting:

In general, the device indicated by the AFAR_2 is the likely cause of the NCPQ_TO.  However, the device reporting the error can also be the cause.

It is advised to investigate whether the errors are a result of newly installed or moved cPCI or PCI adapters.  Make sure to reseat any newly installed or relocated adapters.  Make sure that drivers are up to date on the cards as well.  

Assuming this is not a newly installed PCI card (or driver issue), please collect an extended Explorer (see Document 1019066.1) and open a Service Request with Support Services.


Internal Troubleshooting Instructions:

If an NCPQ_TO occurs, take the following steps to isolate the suspect FRU:

Run POST with a diag level set to default or higher.
- If the error is not reproducible, escalate the issue to the next level of technical support.
   Provide the necessary logs and Explorer data for the Domain and the Sun Fire SC.

- If the error is reproducible, use the AFAR_2.  See Document 1006063.1 for how to
   decode the AFAR_2.  Depending on the decoding, two cases can be distinguished.

Note that the decoding of the AFAR_2 varies with the Firmware Version.
The AFAR_2 decodes to an address in one of the following areas:
     Safari device config area :
     - The AFAR_2 decoding results in a Safari Agent ID # which points to a CPU on a
      CPU/Memory board or a Schizo on an I/O Boat. (Example for firmware 5.12.X)
      
0x00000400.0a400010 -> Safari Agent ID 14(hex), CPU0 on CPU/Memory board 5.

     - By either disabling the CPU or deleting the CPU/Memory board or I/O Boat from the
      configuration, the suspect FRU can be isolated. Run POST for verification.

     Schizo PCI Config & IO and Schizo I/O Board area :
     - The AFAR_2 decoding results in a Safari Agent ID # which is a Schizo and the Leaf on
       that Schizo. (Example for firmware 5.12.X)
 0x00000402.61000380 -> Safari Agent 18(hex)
       Schizo 0 Leaf B, on I/O Boat 6 P0 B1.

     - This decoding only isolates the fault down to a leaf. Leaf A supports one card slot,
       leaf B supports multiple card slots.

     - By disabling the Schizo or removing cPCI/PCI cards, the error can be isolated to a single
       component. Do not use the disablecomponent command for slots or leaves on I/O Boats when
       debugging an NCPQ_TO.
       To narrow it down to a single card, cards need to be physically removed.

     - If parts are replaced because of an NCPQ_TO, attach the error log to the failing part
       and send it in for CPAS.
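
The isolation procedure above can be summarized as a small decision function. The Python sketch below only restates the steps in this document as an aid to reading them; it is not a diagnostic tool. The names Afar2Area and next_step are hypothetical, and the area must still be determined by decoding the AFAR_2 with Document 1006063.1.

```python
from enum import Enum


class Afar2Area(Enum):
    """The two address areas distinguished in the procedure above."""
    SAFARI_CONFIG = "Safari device config area"
    SCHIZO_IO = "Schizo PCI Config & IO and Schizo I/O Board area"


def next_step(reproducible, area=None):
    """Return the next action after running POST at diag level default or higher."""
    if not reproducible:
        return ("Escalate to the next level of technical support and provide "
                "logs and Explorer data for the Domain and the Sun Fire SC.")
    if area is Afar2Area.SAFARI_CONFIG:
        return ("Disable the CPU, or delete the CPU/Memory board or I/O Boat "
                "from the configuration, then run POST for verification.")
    if area is Afar2Area.SCHIZO_IO:
        return ("Disable the Schizo or physically remove cPCI/PCI cards on the "
                "indicated leaf; do not use disablecomponent on I/O Boat "
                "slots or leaves.")
    return "Decode the AFAR_2 using Document 1006063.1 to determine the area."


print(next_step(True, Afar2Area.SCHIZO_IO))
```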

References and bug IDs:
Document 1008674.1  Sun Fire (3800-6800): Physical Device Mapping for I/O boats
Document 1006063.1 Address Space Assignment

Keyword:
Sun Fire 6800, Sun Fire 4800, Sun Fire 3800, NCPQ_TO
Previously Published As
48834

Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.