Document Audience: INTERNAL
Document ID: I0852-1
Title: Some pcisch driver panics on F15K systems are unrelated to failed hardware.
Copyright Notice: Copyright © 2005 Sun Microsystems, Inc. All Rights Reserved
Update Date: 2004-04-02

---------------------------------------------------------
           - Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------  
                        FIELD INFORMATION NOTICE
               (For Authorized Distribution by Sun Services)
FIN #: I0852-1
Synopsis: Some pcisch driver panics on F15K systems are unrelated to failed hardware.
Create Date: Mar/07/03
SunAlert: No
Top FIN/FCO Report: No
Products Reference: Sun Fire 15K
Product Category: Server / Service
Product Affected: 
Systems Affected:
-----------------  
Mkt_ID   Platform   Model   Description          Serial Number
------   --------   -----   -----------          -------------
  -        F15K      ALL    Sun Fire 15000             -
  

X-Options Affected:
-------------------
Mkt_ID   Platform   Model   Description                      Serial Number
------   --------   -----   -----------                      -------------
X6727A      -         -     PCI Dual FC Network Adapter+           -
Parts Affected: 
Part Number            Description                         Model
-----------            -----------                         -----
375-3030-01            PCI Dual FC Network Adapter+          -	
375-3019-01            PCI Single FC Host Adapter            -
501-6302-03 or lower   hsPCI I/O Board (w/ Cassettes)        -
501-5397-11 or lower   hsPCI I/O Board (w/o Cassettes)       -
501-5599-07 or lower   3.3V hsPCI Cassette                   -
References: 
BugId:  4699182 - OS panics w/ PCI SERR that H/W replacements 
                  don't alleviate.

ESC:    537306 - SWON/LT/ system generated core file.

MANUAL: 806-3512-10: Sun Fire 15K System Service Manual.

URL:    http://sunsolve.central.sun.com/data/816/816-5002/pdf/
Issue Description: 
The pcisch driver may panic on Sun Fire 15000 domains due to a parity
error on the PCI Bus.  In most cases this is due to a faulty hardware
component.  However, in some cases the panic cannot be corrected by
replacing a hardware FRU.  This second scenario may result in multiple
unexpected domain failures if not corrected.  This FIN describes how to
diagnose and correct this type of pcisch driver panic.

It is important to note that the panic stack for this problem is
IDENTICAL to the panic stack that is produced as a result of bad
hardware.  It is imperative when diagnosing these types of errors that
the field troubleshoot the issue as faulty hardware first.  Only if the
panic persists, or repeatedly moves between instances, should the field
attribute the problem to the issue outlined in this FIN.

Panics in the pcisch driver cover a wide range of possible failures.
In this case, the control status register (CSR) calls out the detection
of bad parity on the PCI bus:

  WARNING: pcisch-19: PCI fault log start:
  PCI SERR
  PCI error occurred on device #0
  dwordmask=0 bytemask=0
  pcisch-19: PCI primary error (0):pcisch-19: PCI secondary error (0):pcisch-19: 
       PBM AFAR 0.00000000:WARNING: pcisch19: PCI config space 
       CSR=0xc2a0
  pcisch-19: PCI fault log end.

  panic[cpu128]/thread=2a10001fd20: pcisch-19: PCI bus 3 error(s)!

  000002a10001bea0 pcisch:pbm_error_intr+148 (30000b643d8, 2772, 30000b84548, 3, 
        30000b643d8, 3)
    %l0-3: 00000300008b9860 0000000000004000 0000000000000000 0000030000b86584
    %l4-7: 00000300009978c8 0000030008d03ea8 0000000000000000 0000030008d03ed0
  000002a10001bf50 unix:current_thread+44 (0, ffffffffffffffff, 0, 300335b3528, 
        0, 1044f340)
    %l0-3: 0000000010007450 000002a10001f061 000000000000000e 0000000000000016
    %l4-7: 0000000000010000 00000300339922a8 000000000000000b 000002a10001f910
  000002a10001f9b0 unix:disp_getwork+40 (1044e398, 0, 1044f340, 10457310, 2, 0)
    %l0-3: 000000001010e2d8 0000000010509e00 00000300335bd518 000002a100c37d20
    %l4-7: 000002a100cebd20 0000000002736110 0000000000000000 000002a10001f9c0
  000002a10001fa60 unix:idle+a4 (0, 0, 80, 1044e398, 3000096d980, 0)
    %l0-3: 0000000010043d58 2030205b275d2076 616c20696e646578 000002a10011dd20
    %l4-7: 70636220290a2020 202e22202073703a 20222031205b275d 2076616c20696e64

NOTE:  The stack itself can differ from case to case.  What matters is
       the CSR value (specifically the "detected-parity-error" bit).
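
As an illustration of checking that bit, the following is a minimal Python
sketch.  It assumes the CSR value in the fault log is the standard 16-bit
PCI configuration status register, in which bit 15 is the detected parity
error flag; the log itself does not state this.  The function names and
the bit labels other than "detected-parity-error" are hypothetical.

```python
# Assumption: the logged CSR is the standard 16-bit PCI status register.
# Only the error-reporting bits are listed; the labels are illustrative.
PCI_STATUS_ERROR_BITS = {
    15: "detected-parity-error",
    14: "signaled-system-error",
    13: "received-master-abort",
    12: "received-target-abort",
    11: "signaled-target-abort",
    8: "master-data-parity-error",
}


def decode_csr(csr):
    """Return the labels of the error bits set in csr, highest bit first."""
    return [
        label
        for bit, label in sorted(PCI_STATUS_ERROR_BITS.items(), reverse=True)
        if csr & (1 << bit)
    ]


def has_detected_parity_error(csr):
    """True when bit 15 (detected parity error) is set in csr."""
    return bool(csr & (1 << 15))


if __name__ == "__main__":
    # CSR value taken from the fault log above.
    print(decode_csr(0xC2A0))
    print(has_detected_parity_error(0xC2A0))
```

Under that assumption, the value 0xc2a0 from the log above has both the
detected parity error and signaled system error bits set, which is
consistent with the "PCI SERR" line in the same log.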

With every other panic of this nature, a hardware replacement has
resolved the case.  However, with one customer, repeated hardware
replacements did not resolve the issue.  The customer's issue has since
been replicated on multiple machines in an engineering environment.
There are some unique factors that are needed to create this scenario:

  A. To date, this problem has only been seen on 375-3030 (Crystal+) 
     cards.
  B. All the panics have been in either slot 0 or slot 2 of the I/O boat.
     (Slots 0 and 2 are the lower 66 MHz slots.)
  C. Schizo 2.3 seems to bring the problem out with more regularity.
  D. Veritas software (specifically adding mirrors to volumes) seems 
     to increase the likelihood of failure.

Steps for Diagnosis
===================

As a reminder, when looking at an F15K I/O boat, the slots are designated:

   -----------------------------------------------------
  | Schizo 1, leaf B (33MHz) | Schizo 0, leaf B (33MHz) |
  |--------------------------+--------------------------|
  | Schizo 1, leaf A (66MHz) | Schizo 0, leaf A (66MHz) | 
   -----------------------------------------------------

  		OR

   -----------------
  | Slot 3 | Slot 1 |
  |   OR   |   OR   |
  | X.1.1.1| X.1.0.1|
  |--------+--------|
  | Slot 2 | Slot 0 |
  |   OR   |   OR   |
  | X.1.1.0| X.1.0.0|
   -----------------

  NOTE: X = hsPCI number (0-17)
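
The relationship between the two diagrams above can be sketched in Python
as follows.  This is only an illustration: the function name and the
returned field names are hypothetical, and the mapping is read off the
diagrams by position (slots 2 and 3 on Schizo 1, even slots on leaf A).

```python
def slot_to_location(hspci, slot):
    """Describe slot (0-3) on hsPCI board hspci (0-17).

    Returns the X.1.<schizo>.<leaf> notation from the diagram together
    with the Schizo number, leaf letter and bus speed in MHz.
    """
    if not 0 <= hspci <= 17:
        raise ValueError("hsPCI number must be in the range 0-17")
    if slot not in (0, 1, 2, 3):
        raise ValueError("slot must be 0, 1, 2 or 3")
    schizo = slot >> 1  # slots 0 and 1 -> Schizo 0, slots 2 and 3 -> Schizo 1
    leaf = slot & 1     # even slots -> leaf A (66 MHz), odd -> leaf B (33 MHz)
    return {
        "notation": "%d.1.%d.%d" % (hspci, schizo, leaf),
        "schizo": schizo,
        "leaf": "A" if leaf == 0 else "B",
        "mhz": 66 if leaf == 0 else 33,
    }


if __name__ == "__main__":
    # Slot 2 on hsPCI 4 is one of the lower 66 MHz slots.
    print(slot_to_location(4, 2))
```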

To diagnose the pcisch panic from the above stack, follow these steps:

 1. Use the /etc/path_to_inst file on the domain or the cfgadm/rcfgadm
    commands to isolate the slot.  For example, using the two methods with 
    the panic above (pcisch-19):

       # grep pcisch /etc/path_to_inst

    "/pci@3d,600000" 7 "pcisch"
    "/pci@1c,700000" 0 "pcisch"
    "/pci@3c,700000" 4 "pcisch"
    "/pci@9d,600000" 19 "pcisch"  <----------
    "/pci@9c,600000" 17 "pcisch"
    "/pci@3c,600000" 5 "pcisch"
    "/pci@5d,600000" 11 "pcisch"
    "/pci@7d,600000" 15 "pcisch"
    "/pci@1c,600000" 1 "pcisch"
    "/pci@1d,600000" 3 "pcisch"
    "/pci@5c,700000" 8 "pcisch"
    "/pci@7c,700000" 12 "pcisch"
    "/pci@7c,600000" 13 "pcisch"
    "/pci@9c,700000" 16 "pcisch"
    "/pci@9d,700000" 18 "pcisch"
    "/pci@3d,700000" 6 "pcisch"
    "/pci@5c,600000" 9 "pcisch"
    "/pci@1d,700000" 2 "pcisch"
    "/pci@7d,700000" 14 "pcisch"
    "/pci@5d,700000" 10 "pcisch"
    "/pci@11c,700000" 20 "pcisch"
    "/pci@11c,600000" 21 "pcisch"
    "/pci@11d,700000" 22 "pcisch"
    "/pci@11d,600000" 23 "pcisch" 

    In this case, instance 19 is "/pci@9d,600000".  To translate that into a
    slot location, convert the 9d to binary <10011101>, then group the bits
    as <100 1110 1>.  The leading group gives expander 4 (100), the middle
    group (1110) is skipped, and the trailing bit gives pci 1 (the PCI slots
    on the left of the diagram above).

    The other option is to use the conversion which the dynamic 
    reconfiguration interface provides:

       # rcfgadm -d a -la | grep pcisch

    pcisch0:e00b1slot1    pci-pci/hp   connected   configured     ok
    pcisch10:e02b1slot3   unknown      connected   unconfigured   unknown
    pcisch11:e02b1slot2   pci-pci/hp   connected   configured     ok
    pcisch12:e03b1slot1   pci-pci/hp   connected   configured     ok
    pcisch13:e03b1slot0   pci-pci/hp   connected   configured     ok
    pcisch14:e03b1slot3   unknown      connected   unconfigured   unknown
    pcisch15:e03b1slot2   pci-pci/hp   connected   configured     ok
    pcisch16:e04b1slot1   unknown      connected   unconfigured   unknown
    pcisch17:e04b1slot0   pci-pci/hp   connected   configured     ok
    pcisch18:e04b1slot3   unknown      connected   unconfigured   unknown
--> pcisch19:e04b1slot2   unknown      empty       unconfigured   unknown
    pcisch1:e00b1slot0    unknown      empty       unconfigured   unknown
    pcisch20:e08b1slot1   unknown      empty       unconfigured   unknown
    pcisch21:e08b1slot0   pci-pci/hp   connected   configured     ok
    pcisch22:e08b1slot3   unknown      empty       unconfigured   unknown
    pcisch23:e08b1slot2   unknown      empty       unconfigured   unknown
    pcisch2:e00b1slot3    unknown      connected   unconfigured   unknown
    pcisch3:e00b1slot2    pci-pci/hp   connected   configured     ok
    pcisch4:e01b1slot1    pci-pci/hp   connected   configured     ok
    pcisch5:e01b1slot0    unknown      empty       unconfigured   unknown
    pcisch6:e01b1slot3    unknown      connected   unconfigured   unknown
    pcisch7:e01b1slot2    pci-pci/hp   connected   configured     ok
    pcisch8:e02b1slot1    pci-pci/hp   connected   configured     ok
    pcisch9:e02b1slot0    unknown      connected   unconfigured   unknown

    In this case, the issue is on expander 4 (e04), I/O board (b1), slot 2.

 2. Once you identify the correct location, there are three FRUs
    which could be causing the parity error: the hsPCI (also called the
    I/O boat), the 3.3V cassette, or the adapter itself.  Sun has
    agreed to allow the replacement of all three of these FRUs for a
    single failure.  Replace all three.
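
The two slot-isolation methods in step 1 can be sketched in Python as
follows.  This is an illustration only, and all function names are
hypothetical.  The expander and pci number follow the bit grouping
described in step 1; applying it to 9-bit addresses such as 11c, and
deriving the slot from the 600000/700000 offset, are assumptions inferred
from the sample path_to_inst and rcfgadm listings above, not documented
rules.

```python
import re

AP_ID_RE = re.compile(r"pcisch(\d+):e(\d+)b(\d+)slot(\d+)")
PATH_RE = re.compile(r"/pci@([0-9a-f]+),([0-9a-f]+)")
# Assumption inferred from the sample listings: offset 600000 pairs with
# the even slot of a Schizo and offset 700000 with the odd slot.
LEAF_BY_OFFSET = {0x600000: 0, 0x700000: 1}


def find_instance_path(path_to_inst_text, instance, driver="pcisch"):
    """Return the device path bound to driver instance, or None."""
    for line in path_to_inst_text.splitlines():
        fields = line.split()
        if (len(fields) >= 3 and fields[1] == str(instance)
                and fields[2].strip('"') == driver):
            return fields[0].strip('"')
    return None


def decode_pci_path(path):
    """Return (expander, pci, slot) for a '/pci@<addr>,<offset>' path."""
    match = PATH_RE.fullmatch(path)
    if match is None:
        raise ValueError("unrecognized path: %r" % path)
    addr = int(match.group(1), 16)
    offset = int(match.group(2), 16)
    if offset not in LEAF_BY_OFFSET:
        raise ValueError("unexpected offset in path: %r" % path)
    expander = addr >> 5  # leading bit group, e.g. 100 in <100 1110 1>
    pci = addr & 1        # trailing bit; the middle four bits are skipped
    return expander, pci, 2 * pci + LEAF_BY_OFFSET[offset]


def parse_ap_id(ap_id):
    """Split an attachment point ID such as 'pcisch19:e04b1slot2'."""
    match = AP_ID_RE.fullmatch(ap_id)
    if match is None:
        raise ValueError("unrecognized attachment point: %r" % ap_id)
    instance, expander, board, slot = (int(g) for g in match.groups())
    return {"instance": instance, "expander": expander,
            "board": board, "slot": slot}


if __name__ == "__main__":
    sample = '"/pci@9d,600000" 19 "pcisch"\n"/pci@9c,600000" 17 "pcisch"'
    path = find_instance_path(sample, 19)
    print(path, decode_pci_path(path))
    print(parse_ap_id("pcisch19:e04b1slot2"))
```

For the pcisch-19 panic above, both routes point at expander 4, slot 2.
Treat the path-based slot number as a cross-check only and confirm it
against the rcfgadm output before replacing any FRU.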

The root cause for pcisch driver panics, which are unrelated to faulty
hardware, is still under investigation.  There is no final fix at this
time.  In the meantime, use the recommended workarounds mentioned in
the Corrective Action section below.

Because some customers have X6727A (Crystal+) adapters, an FCO is necessary
to resolve this problem, and all Crystal+ cards within the domain should be
replaced.
Implementation: 
         ---
        |   |   MANDATORY (Fully Proactive)
         ---    
         
  
         ---
        |   |   CONTROLLED PROACTIVE (per Sun Geo Plan) 
         --- 
         
                                
         ---
        | X |   REACTIVE (As Required)
         ---
Corrective Action: 
The following recommendation is provided as a guideline for authorized
Enterprise Services Field Representatives who may encounter the above
mentioned issue.

Sun policy is to replace all three potential FRUs.  Therefore, replace
all three.  Where the adapter is a Crystal+ card (X6727A), an FCO to
implement this FRU replacement is currently in process.  Until the FCO
is released, follow the procedure below.

Troubleshoot pcisch driver panics on F15K domains as outlined above.
If the problem is determined NOT to be caused by faulty hardware,
implement one of the three workarounds below.

  A. Replace the 375-3030 (Crystal+) cards with 375-3019 (Amber) cards.
     This has been shown to alleviate the issue after extensive testing. 
     
OR

  B. Move all 375-3030 cards to either slot 1 or slot 3.  This assumes  
     there are enough I/O boats.

OR

  C. Upgrade the 375-3030 (Crystal+) cards to 375-3108 (Crystal-2A). 
     This will require new drivers to be installed and LC-SC or LC-LC 
     Fibre Cables.  See Product Note 816-5002 for details:

     http://sunsolve.central.sun.com/data/816/816-5002/pdf/
Comments: 
None

============================================================================
Implementation Footnote: 
i)   In case of MANDATORY FINs, Enterprise Services will attempt to    
     contact all affected customers to recommend implementation of 
     the FIN. 
   
ii)  For CONTROLLED PROACTIVE FINs, Enterprise Services mission critical    
     support teams will recommend implementation of the FIN  (to their  
     respective accounts), at the convenience of the customer. 

iii) For REACTIVE FINs, Enterprise Services will implement the FIN as the   
     need arises.
----------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network 
browser as follows:
 
SunWeb Access:
-------------- 
* Access the top level URL of http://sdpsweb.central/FIN_FCO/index.html

* From there, select the appropriate link to query or browse the FIN and
  FCO Homepage collections.
 
SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.central/

* From there, select the appropriate link to browse the FIN or FCO index.

Internet Access:
----------------
* Access the top level URL of https://spe.Sun.COM
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to [email protected]
--------------------------------------------------------------------------
Status: active