Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-72-1009008.1
Update Date:2010-07-15
Keywords:

Solution Type  Problem Resolution Sure

Solution  1009008.1 :   Sun Fire[TM] 3800, 4800, 4810, 6800, E4900, E6900, V1280, E2900, Netra 1280, or 1290 server: NCPQ_TO workarounds  


Related Items
  • Sun Fire E4900 Server
  • Sun Fire 3800 Server
  • Sun Fire 6800 Server
  • Sun Netra 1280 Server
  • Sun Fire E6900 Server
  • Sun Fire 4800 Server
  • Sun Fire E2900 Server
  • Sun Netra 1290 Server
  • Sun Fire V1280 Server
  • Sun Fire 4810 Server
Related Categories
  • GCS>Sun Microsystems>Servers>Midrange V and Netra Servers
  • GCS>Sun Microsystems>Servers>Midrange Servers

Previously Published As
212426


Applies to:

Sun Netra 1280 Server
Sun Fire V1280 Server
Sun Fire 3800 Server
Sun Fire 4800 Server
Sun Fire 4810 Server
All Platforms

Symptoms

An NCPQ_TO (Non-Coherent Pending Queue Time-Out) will cause a domain to error pause.  This means the domain in question will be reset in order to be recovered (it will shut down and restart - reboot).

This error pause will appear on the SC platform shell (console messaging) or in the showlogs file as:

ErrorMonitor: Domain A has a SYSTEM ERROR

Note:  The domain in question will change depending on which domain actually encounters the error.

An NCPQ_TO will also appear on the SC domain shell and in showlogs as:

/partition0/domain0/SB0/bbcGroup1/sbbc1:
>>> ErrorStatus[0x80] : 0x80008200
FE [15:15] : 0x1
ErrSum [31:31] : 0x1
SafErr [09:08] : 0x2 Fireplane device asserted an error
/partition0/domain0/SB0/bbcGroup1/cpuCD/cpusafariagent1:
AFSR (high)[0x551] : 0x00080000
PERR [19:19] : 0x1
EMU B[0x511] : 0x00c00000
AID_LK [22:22] : 0x1 ATransID leakage error
NCPQ_TO [23:23] : 0x1 NCPQ system bus time-out
>>

On an UltraSPARC[R] III machine, the bit signalling NCPQ_TO changes slightly:

/partition0/domain0/SB0/bbcGroup1/cpuCD/cpusafariagent0:
AFAR (high)[0x531] : 0x0000041c
AFAR [42:32] [10:00] : 0x41c
AFAR (low)[0x541] : 0x009115e0
AFAR_2 (high)[0x571] : 0x0000041c
AFAR_2 [42:32] [10:00] : 0x41c
AFAR_2 (low)[0x581] : 0x009115e0
AFSR (high)[0x551] : 0x00080000
PERR [19:19] : 0x1
AFSR_2 (high)[0x591] : 0x00080000
PERR [19:19] : 0x1
EMU B[0x511] : 0x03000000
AID_LK [24:24] : 0x1 ATransID leakage error
NCPQ_TO [25:25] : 0x1 NCPQ system bus time-out
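
The bit fields in the dumps above can be checked by hand or with a small helper. The following is a minimal sketch in Python, not a Sun-supplied tool. The function names and the layout labels ("first_sample", "ultrasparc_iii") are hypothetical; the bit positions are taken only from the two sample dumps shown above.

```python
def field(value, hi, lo):
    """Return bits [hi:lo] of a register value, as printed in the dumps."""
    width = hi - lo + 1
    return (value >> lo) & ((1 << width) - 1)


# Bit positions copied from the two sample dumps in this article.
# The layout labels are hypothetical names used only in this sketch.
EMU_B_LAYOUTS = {
    "first_sample": {"AID_LK": 22, "NCPQ_TO": 23},
    "ultrasparc_iii": {"AID_LK": 24, "NCPQ_TO": 25},
}


def decode_emu_b(value, layout):
    """Report which of the EMU B error bits are set for the given layout."""
    bits = EMU_B_LAYOUTS[layout]
    return {name: bool(field(value, bit, bit)) for name, bit in bits.items()}


if __name__ == "__main__":
    # EMU B values from the two sample dumps.
    print(decode_emu_b(0x00C00000, "first_sample"))
    print(decode_emu_b(0x03000000, "ultrasparc_iii"))
    # SafErr [09:08] from ErrorStatus 0x80008200.
    print(hex(field(0x80008200, 9, 8)))
```

With the values from the dumps, both EMU B words decode to AID_LK and NCPQ_TO set, and the SafErr field of ErrorStatus decodes to 0x2, matching the text printed by the SC.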


Cause

Sun Fire[TM] 3800, 4800, 4810, 6800, E4900, E6900, V1280 and Netra[TM] 1280 & 1290 servers can be susceptible to NCPQ_TO error pause conditions that might not be resolved by hardware replacement.

Repeat NCPQ_TO error pause conditions can usually be resolved by following the process described in Document 1017926.1.

On this range of server platforms, NCPQ_TO error pauses involving JNI 1063 Fibre Channel, Marconi/FORE HE155 ATM, or Sun cPCI Dual Fibre Channel Network Adapter (Part Number 375-0118) HBAs tend not to be resolved by hardware replacement as described in Document 1017926.1. They are more likely due to the problems described in Sun BugIDs 4836915, 4859295, 4919824, and 4408474.

The JNI 1063 HBA is not supported on these servers, but it is often sold with Hitachi Data Systems (HDS) 99xx disk arrays.  A JNI 1063 card on one of these servers will appear in prtdiag output as:

fibre-channel-pci1242,4643.0

The card itself will have a sticker with the following on it:

FCI-1063

Another clue is that pkginfo will list a package:

JNIfcaPCI

A format command will list a device path like:

/ssm@0,0/pci@1d,700000/fibre-channel@3/sd@7d,0

The Marconi/FORE ATM 155 card is also unsupported.  A Marconi/FORE ATM 155 card will appear in prtdiag output as:

FORE,HE-155

A 375-0118 cPCI Dual Fibre Channel HBA will appear in prtdiag output as:

SUNW,qlc-pci1077,2200.1077.4084.+

A format command will list a device path like:

/ssm@0,0/pci@1d,600000/pci@1/SUNW,qlc@4/fp@0,0/ssd@w50060e80039d5d07,0


Workarounds exist for cases involving JNI 1063 Fibre Channel, Marconi/FORE HE155 ATM, and Sun cPCI Dual Fibre Channel (Part Number 375-0118) HBAs.
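
The prtdiag signatures listed above can be searched for in saved output. The following is a minimal sketch in Python, assuming the prtdiag output has already been captured as text; the function and variable names are hypothetical, and only the three signature strings come from this article.

```python
# Signature strings copied from this article; matching is a plain substring
# search, so trailing revision suffixes such as ".0" or ".+" are ignored.
HBA_SIGNATURES = {
    "fibre-channel-pci1242,4643": "JNI 1063 Fibre Channel HBA",
    "FORE,HE-155": "Marconi/FORE HE155 ATM HBA",
    "SUNW,qlc-pci1077,2200.1077.4084": "Sun cPCI Dual Fibre Channel HBA (375-0118)",
}


def find_affected_hbas(prtdiag_text):
    """Return the names of the HBAs from this article found in the text."""
    found = []
    for signature, name in HBA_SIGNATURES.items():
        if signature in prtdiag_text:
            found.append(name)
    return found


if __name__ == "__main__":
    # Hypothetical fragment; real prtdiag output has more columns.
    sample = "pci  33  fibre-channel-pci1242,4643.0\npci  66  FORE,HE-155\n"
    for name in find_affected_hbas(sample):
        print(name)
```

An empty result only means that none of these three signatures were seen; it does not rule out other causes of NCPQ_TO.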


Solution

First, verify that ScApp is at version 5.20.6 or later (5.20.6 includes the latest NCPQ_TO fix). If ScApp needs to be upgraded, see Patch ID 114527; loading the latest release is advised.

  • If ScApp 5.20.6 or higher is installed, and your configuration includes one of the HBAs detailed below, proceed to the instructions in the Workaround section below.
  • If the information in this article does not help you resolve this issue, please open a Service Request with Support Service and provide Explorer data from the main SC per the instructions in Document 1019066.1.
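
When comparing ScApp versions, compare the numeric components rather than the strings (as text, "5.20.10" sorts before "5.20.6"). The following is a minimal sketch in Python, assuming the version has already been read from the SC and is a plain dotted number; the function names are hypothetical.

```python
MINIMUM_SCAPP = (5, 20, 6)  # minimum version named in this article


def parse_version(text):
    """Turn a dotted version string such as '5.20.6' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def scapp_is_current(version_text):
    """True if the version is 5.20.6 or later."""
    return parse_version(version_text) >= MINIMUM_SCAPP


if __name__ == "__main__":
    for version in ("5.14.0", "5.20.6", "5.20.10"):
        print(version, scapp_is_current(version))
```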

Workaround:

For cases involving a JNI 1063 Fibre Channel HBA:

NCPQ_TO error pause has been observed on these servers with the JNI cards when they have been left disconnected.

Typically, the JNI 1063 card is used in a Fibre Channel-Arbitrated Loop (FC-AL) topology. When the loop is open (as is the case where no device or FC-AL hub is attached), the card will continuously reset. The resets have been linked to eventual NCPQ_TO error pause.

The following repeating message is often seen in the /var/adm/messages file on Sun Fire systems with JNI 1063 cards that have experienced NCPQ_TO error pause:

fca-pci0: Link Failure. Resetting...

The resets can be eliminated by completing the FC-AL loop, or removing the JNI 1063 card. The loop can be completed by attaching an FC-AL device, an FC-AL hub, or by inserting an external loopback plug into the JNI 1063 HBA.

Replacing the JNI 1063 card with a Sun PCI Dual FC Network Adapter+ (Part Number 375-3030) HBA might also resolve the problem.
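
To see how often the link-failure message described above repeats, the messages file can be scanned per card instance. The following is a minimal sketch in Python; the function name is hypothetical, and the timestamp and host prefix on the sample lines are assumptions about the log format. Only the "fca-pci0: Link Failure. Resetting..." text comes from this article.

```python
import re

# Matches the message quoted in this article; the instance number may vary.
LINK_FAILURE = re.compile(r"(fca-pci\d+): Link Failure\. Resetting\.\.\.")


def count_link_failures(lines):
    """Count link-failure resets per fca-pci instance."""
    counts = {}
    for line in lines:
        match = LINK_FAILURE.search(line)
        if match:
            instance = match.group(1)
            counts[instance] = counts.get(instance, 0) + 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines; on a live system, read /var/adm/messages.
    sample = [
        "Jan  1 00:00:01 host fca-pci0: Link Failure. Resetting...",
        "Jan  1 00:00:09 host fca-pci0: Link Failure. Resetting...",
        "Jan  1 00:00:10 host unrelated message",
    ]
    print(count_link_failures(sample))
```

A steadily growing count for one instance suggests an open loop on that card.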


For cases involving a Marconi/FORE HE155 ATM HBA:

NCPQ_TO error pause has been observed on these servers with these cards, when they share a PCI bus with other PCI devices.

The other devices can be other HBA cards, or embedded devices which share a PCI bus with PCI slots 0, 1, and 2 on a Sun Fire server 8 slot PCI I/O Assembly (PN 501-4404) I/O boat.  Isolating the Marconi card to its own PCI bus has been shown to prevent the NCPQ_TO error pause conditions.

To accomplish this, the card should EITHER be placed in one of the 66 MHz slots:

  • slot 3 for Schizo 0, leaf A
  • slot 7 for Schizo 1, leaf A

OR be the ONLY card installed on one of the 33 MHz slots:

  • slots 0, 1 and 2 for Schizo 0, leaf B
  • slots 4, 5 and 6 for Schizo 1, leaf B

If installed in one of the 33 MHz slots, the Marconi card should be the ONLY CARD in any of those three slots, and the other two slots on the bus should be EMPTY.
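
The placement rule above can be expressed as a simple check. The following is a minimal sketch in Python, assuming the 8 slot PCI I/O Assembly layout described in this article; the function name and the way the slot population is passed in are hypothetical.

```python
FAST_SLOTS = {3, 7}                  # 66 MHz slots (leaf A)
SLOW_BUSES = [(0, 1, 2), (4, 5, 6)]  # 33 MHz slot groups (leaf B)


def marconi_placement_ok(occupied_slots, marconi_slot):
    """Check the Marconi card placement against the rule in this article.

    occupied_slots: set of slot numbers holding any card, Marconi included.
    marconi_slot:   slot number holding the Marconi/FORE HE155 card.
    """
    if marconi_slot in FAST_SLOTS:
        return True
    for bus in SLOW_BUSES:
        if marconi_slot in bus:
            # The other two slots on the same 33 MHz bus must be empty.
            neighbours = set(bus) - {marconi_slot}
            return not (neighbours & set(occupied_slots))
    raise ValueError("slot number must be between 0 and 7")


if __name__ == "__main__":
    print(marconi_placement_ok({0, 1, 3}, 3))  # card in a 66 MHz slot
    print(marconi_placement_ok({0, 1}, 0))     # shares a 33 MHz bus
```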


For cases involving a Sun cPCI Dual Fibre Channel HBA (Part Number 375-0118):

NCPQ_TO error pause has been diagnosed as Sun BugIDs 4859295 and 4408474 when this HBA is installed.

The problem may be prevented by using only one port on this dual-port Fibre Channel card. It may be necessary to supply an additional Sun cPCI Dual Fibre Channel HBA (PN 375-0118) to spread the fibre connections to one per HBA.

It is also possible to swap all of the cards within an I/O boat from cPCI to their PCI equivalents. For Sun Fire 4800-6800 servers which use the 4 slot cPCI I/O boats, this will also require swapping 4 slot cPCI I/O Assemblies (501-4868) for 8 slot PCI I/O Assemblies (501-4404).

For Sun Fire 3800 servers, this will require swapping the entire server for another UltraSPARC III platform.


Internal Only Additional Information

Sun Bugs:
CR 4836915 NCPQ timeouts on Serengeti difficult to diagnose
CR 4859295 NCPQ_TO error hangs the 3800/cPCI
CR 4919824 NCPQ_TOs fails Serengeti DS-6800 s9 domain
CR 4408474 DSTOP during disk I/O
CR 4899682 NCPQ_TO System hang occurs during solaris reboot using FW 5.14

Knowledge article references:
Document 1006063.1 SF 3800-6800: Non-Cacheable Address Space tables
Document 1017926.1 SF 3800-6800: Troubleshooting NCPQ_TO errors

Previously Published As
71760 & 212426

Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.