Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-71-1010580.1
Update Date:2009-05-11
Keywords:

Solution Type  Technical Instruction Sure

Solution  1010580.1 :   How to recognize, diagnose, and troubleshoot PCI SERR errors on UltraSPARC(R) II, IIi, IIe based systems.  


Related Items
  • Sun Ultra 5 Workstation
  • Sun Enterprise 450 Server
  • Sun Ultra 450 Workstation
  • Sun Blade 100 Workstation
  • Sun Blade 150 Workstation
  • Sun Ultra 30 Workstation
  • Sun Ultra 80 Workstation
  • Sun Enterprise 220R Server
  • Sun Enterprise 150 Server
  • Sun Enterprise 250 Server
  • Sun Ultra 10 Workstation
  • Sun Ultra 60 Workstation
  • Sun Enterprise 420R Server
Related Categories
  • GCS>Sun Microsystems>Desktops>Workstations
  • GCS>Sun Microsystems>Servers>Entry-Level Servers

PreviouslyPublishedAs
214556


Description
This document explains what actually causes a PCI SERR, what the PCI
debug drivers actually do, and how to interpret the driver output after
a SERR on UltraSPARC II, IIi, and IIe based systems.

Steps to Follow
--- WHAT CAUSES A PCI ERROR?
      In a modern pci network, there can be many nested pci buses. PCI buses
are connected by pci-to-pci bridges. The bus nearer the machine is the
primary bus, and the one farther away is the secondary bus. In a fairly
complex system with an E1 pci extender unit, there can be six or more
hierarchical buses.
In a Solaris[TM] system, there is a device that translates from the
system bus (UPA or other architecture-specific system bus type) to the pci
bus(es). On UltraSPARC II, IIi, and IIe based systems, this chip is the
Psycho (upa-to-pci), and it is managed by PCI bus nexus drivers of
similar names.
The topology of the pci buses below these devices is machine
dependent. Many low cost machines have simba pci-to-pci bridges that link
one 64 bit bus to two separate 32 bit pci buses. Often one bus is used to
drive onboard devices such as consoles and ethernets, and the other bus is
used to drive the available pci slot.
Many pci cards have onboard pci-to-pci bridges, so that their multiple
devices are hidden behind a bridge.
The SERR signal may be pulsed by any pci device to report:
* address parity errors
  (when a parity error is detected during an address cycle, the address
  is bad and no device knows which one should respond, so every device
  that sees the parity error asserts SERR);
* data parity errors during special cycles;
* critical errors other than parity errors
  (for example, data trapped in a bridge with no way to inform the sender
  that it is being dropped).
As your data flows through the various pci-to-pci bridges that need
to be negotiated to get to the final device, any device on any of the
intermediate buses could assert SERR, not just the final device.
The SERR signal is passed up from the secondary side of a bridge to the
primary side until it percolates up to the pci nexus device, which gets an
interrupt and prints out a few lines before the machine panics.
The standard pci nexus drivers do not walk the bus looking to see
where an error was asserted.  The drivers just dump the status register
from the top level node, which is actually part of the nexus device.  All
one can say is that some component below the pci bus nexus chip asserted
SERR for some reason.  Some machine configurations have only one device per
bus, so identifying what is complaining is quite easy, but why is a
different matter.
---  WHAT DO THE DEBUG DRIVERS DO?
The debug driver works by walking the device tree below the nexus driver
that gets the SERR interrupt. At each node, it extracts the common pci
status/command registers. Then, for nodes that it recognizes, it extracts
more information (e.g. for a pci-to-pci bridge chip it gets registers
from both sides as well as chip specific error registers).
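The walk described above can be pictured with a small Python model. This is only a sketch under stated assumptions: the Node class, the walk function and the report list are hypothetical names, and the register values are copied from the example output in this document rather than read from hardware as the real debug driver does.

```python
class Node:
    """Hypothetical device-tree node with a few captured register values."""

    def __init__(self, name, common, extra=None, children=()):
        self.name = name
        self.common = common          # vid, did, header, command, status
        self.extra = extra or []      # chip-specific registers, if any
        self.children = list(children)


def walk(parent, report):
    """Visit every node below parent, depth first, one report entry each."""
    for child in parent.children:
        path = "{}/{}".format(parent.name, child.name)
        # Every node contributes the common registers; recognized nodes
        # (for example bridges) contribute their extra registers as well.
        report.append((path, child.common + child.extra))
        walk(child, report)
    return report


glm0 = Node("glm-0", [0x1000, 0xb, 0x80, 0x146, 0x210])
simba1 = Node("simba-1", [0x108e, 0x5000, 0x81, 0x147, 0x42a0],
              extra=[0x4280, 0x23], children=[glm0])
nexus = Node("pcipsy-0", [], children=[simba1])

report = walk(nexus, [])
for path, regs in report:
    print(path, " ".join(hex(r) for r in regs))
```

Each entry pairs a parent/child path with the registers gathered for that node, which is the same shape as the NOTICE lines the driver prints.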
---  DRIVER OUTPUT AFTER A SERR:
NOTICE: SIMBA pcipsy-0/simba-1   0x108e 0x5000 0x81 0x147 0x42a0 0x4280 0x23 0x0
NOTICE: SIMBA  pcipsy-0/simba-1  0x0 0x0 0x0 0x0
NOTICE:        simba-1/glm-0 -   0x1000 0xb 0x80 0x146 0x210
NOTICE:        simba-1/glm-1 -   0x1000 0xb 0x80 0x146 0x210
NOTICE: DEC 21152 simba-1/pci_pci-0 0x1011 0x22 0x1 0x147 0x4290 0x6280 0x23 0x0
NOTICE: DEC 21152 pci_pci-0/pci_pci-1 0x1011 0x26 0x1 0x147 0x4290 0x6280 0x23 0x0
NOTICE: DEC 21152 pci_pci-1/pci_pci-4 0x8086 0xb152 0x1 0x147 0x4290 0x2280 0x23 0x10
NOTICE:       pci_pci-4/pci108e,1000--1 - 0x108e 0x1000 0x80 0x2 0x280
NOTICE:       pci_pci-4/hme--1 - 0x108e 0x1001 0x80 0x146 0x280
NOTICE:       pci_pci-4/isp-1  - 0x1077 0x1020 0x0 0x157 0x200
panic[cpu0]/thread=300028cbce0: pcipsy-0: PCI bus 1 error(s)!
The format of each line is:
NOTICE: name parent/child   vid  did  conf  command  status
If the name is not null, then the driver has recognized the device and
will get more registers; those exceptions are listed below.
The "parent/child" field shows the device tree relationship, so a
good example of a generic line is:
NOTICE: simba-1/glm-0 - 0x1000 0xb 0x80 0x146 0x210
There is no "name", so we pull just the generic registers, which are the
following:
VALUE      NAME       DESCRIPTION
0x1000     vid        16 bit, offset 0, pci vendor id, 0x1000 = LSI
0xb        did        16 bit, offset 2, specific device id (see the
                      vendor's website)
0x80       header     8 bit, offset 0xe ("conf" in the format line above)
0x146      command    16 bit, offset 4, command register
0x210      status     16 bit, offset 6, status register
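For working through a long panic log, a generic line can be split into these fields with a few lines of Python. This is a minimal sketch, not part of the debug driver; the function name parse_generic_line and the dictionary keys are assumptions chosen to match the table above.

```python
def parse_generic_line(line):
    """Split a generic (unnamed) debug-driver line into named fields.

    Expected shape: NOTICE: parent/child - vid did header command status
    """
    text = line.strip()
    if text.startswith("NOTICE:"):
        text = text[len("NOTICE:"):]
    tokens = text.split()
    # The first token is the parent/child pair; the lone "-" marks "no name".
    path = tokens[0]
    values = [int(t, 16) for t in tokens[1:] if t != "-"]
    parent, child = path.split("/", 1)
    names = ("vid", "did", "header", "command", "status")
    record = {"parent": parent, "child": child}
    record.update(zip(names, values))
    return record


rec = parse_generic_line("NOTICE: simba-1/glm-0 - 0x1000 0xb 0x80 0x146 0x210")
print(rec["parent"], rec["child"], hex(rec["command"]), hex(rec["status"]))
```

Lines for recognized devices (SIMBA, DEC 21152) carry a name and extra registers, so they would need their own handling.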
--- INTERPRETING THE OUTPUT.
1) Treat the data path between the nexus driver and the final device as a
tree.
2) Start from the nexus driver and examine the status registers, looking
for some indication of a received error. In the example:
pcipsy-0/simba-1  0x108e 0x5000 0x81 0x147 0x42a0 0x4280 0x23 0x0
If we decode the status register for the primary bus, we get 0x42a0, which
means:
b0100 0010 1010 0000
bit 14:
signalled system error is the only abnormal bit set.
So we took the SERR panic because simba instance 1, which is below
pcipsy instance 0, signalled a SERR upwards.
Now look at the secondary bus status, 0x4280, which means:
b0100 0010 1000 0000
bit 14:
received SERR.
So we know that the simba bridge was just passing on the SERR it
received on its secondary bus.
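The binary expansions above can be checked mechanically. The following is a small sketch; the helper names set_bits and as_binary are assumptions made for illustration.

```python
def set_bits(value, width=16):
    """Return the bit positions that are 1 in value, highest first."""
    return [bit for bit in range(width - 1, -1, -1) if value & (1 << bit)]


def as_binary(value, width=16):
    """Format value as groups of four binary digits, like b0100 0010 ..."""
    digits = format(value, "0{}b".format(width))
    return "b" + " ".join(digits[i:i + 4] for i in range(0, width, 4))


for status in (0x42a0, 0x4280):
    print(hex(status), as_binary(status), set_bits(status))
```

Bit 14 is set in both values. The remaining set bits (5, 7 and 9) fall in the capability and timing fields of the register formats listed at the end of this document, which is why bit 14 is the only abnormal one.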
So which devices have simba-1 as a parent? Any one of them could be
asserting SERR.
NOTICE:           simba-1/glm-0 -      0x1000 0xb  0x80 0x146 0x210
NOTICE:           simba-1/glm-1 -      0x1000 0xb  0x80 0x146 0x210
NOTICE: DEC 21152 simba-1/pci_pci-0    0x1011 0x22 0x1  0x147 0x4290 0x6280 0x23 0x0
The status registers of the glm units are both 0x210, which is normal,
so they are not the culprit. The DEC 2115X bridge chip, however, has
0x4290 as its primary bus status, which means:
b0100 0010 1001 0000
bit 14:
set, so it was asserting SERR on the bus up to the simba-1
device.
Looking at its secondary bus status we see 0x6280, which means:
b0110 0010 1000 0000
bit 14:
received SERR on this bus.
bit 13:
received master abort, so while the bridge was a master on this bus,
a transaction ended in a master abort because no target claimed
it. Not good.
So now we know the SERR was sent up from pci_pci instance 0 because a
child node asserted SERR on its secondary bus.
--  SO, WHAT DEVICE IS BELOW THE PCI_PCI 0?
NOTICE: DEC 21152 pci_pci-0/pci_pci-1  0x1011 0x26 0x1 0x147 0x4290 0x6280 0x23 0x0
Again the same status, so let us move further down and determine who
has pci_pci-1 as a parent.
NOTICE: DEC 21152 pci_pci-1/pci_pci-4 0x8086 0xb152 0x1 0x147 0x4290 0x2280 0x23 0x10
Here the primary bus status, 0x4290, shows that this bridge asserted
SERR, but the secondary bus status, 0x2280, shows only a received master
abort and no received SERR. So we know where the SERR originated.
As for why, the last register for this device is 0x10. That is the
p_serr_l_status, which records why this device asserted SERR.
p_serr_l_status = 0x10 means master abort during posted write.
So we had some buffered data in the bridge that we had told the master had
been delivered, but when we sent it to the target we got back a master
abort. A master abort happens when no target claims the address of a
transaction.
So we know why we got a SERR, and we know who asserted it. At the end of
the path
pcipsy-0/simba-1
simba-1/pci_pci-0
pci_pci-0/pci_pci-1
pci_pci-1/pci_pci-4
the pci_pci-4 bridge got a master abort for some buffered data and had no
way to report the failure back to the sender. So which devices are below
the pci_pci-4 device?
NOTICE:  pci_pci-4/pci108e,1000--1 - 0x108e 0x1000 0x80 0x2 0x280
NOTICE:  pci_pci-4/hme--1          - 0x108e 0x1001 0x80 0x146 0x280
NOTICE:  pci_pci-4/isp-1           - 0x1077 0x1020 0x0 0x157 0x200
So can we work out which device? No! All we know is where to put our
analyzer, and which drivers to instrument in order to determine who is
accessing an address that no one currently owns.
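The whole walk-through can be condensed into a short Python sketch that follows bit 14 down the tree. It is an illustration only: the tuple layout, the NODES list and the function name trace_serr are assumptions, and the status values are copied from the example output in this document.

```python
SERR = 1 << 14          # primary: signaled SERR; secondary: received SERR
MASTER_ABORT = 1 << 13  # received master abort

# (parent, child, primary status, secondary status or None for non-bridges)
NODES = [
    ("pcipsy-0", "simba-1", 0x42a0, 0x4280),
    ("simba-1", "glm-0", 0x210, None),
    ("simba-1", "glm-1", 0x210, None),
    ("simba-1", "pci_pci-0", 0x4290, 0x6280),
    ("pci_pci-0", "pci_pci-1", 0x4290, 0x6280),
    ("pci_pci-1", "pci_pci-4", 0x4290, 0x2280),
    ("pci_pci-4", "pci108e,1000--1", 0x280, None),
    ("pci_pci-4", "hme--1", 0x280, None),
    ("pci_pci-4", "isp-1", 0x200, None),
]


def trace_serr(nodes, nexus):
    """Follow bit 14 down from the nexus; return the chain of signaling nodes."""
    chain = []
    parent = nexus
    while True:
        hits = [n for n in nodes if n[0] == parent and n[2] & SERR]
        if not hits:
            break
        node = hits[0]  # simplification: take the first signaling child
        chain.append(node)
        secondary = node[3]
        if secondary is None or not secondary & SERR:
            break  # nothing was received from below, so SERR started here
        parent = node[1]
    return chain


chain = trace_serr(NODES, "pcipsy-0")
origin = chain[-1]
suspects = [child for parent, child, _, _ in NODES if parent == origin[1]]
print(" -> ".join(n[1] for n in chain))
print("master abort at origin:", bool(origin[3] and origin[3] & MASTER_ABORT))
print("devices below origin:", suspects)
```

The sketch stops at pci_pci-4 and lists the three devices below it as suspects. As in the manual analysis, it cannot say which of them was being addressed.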
EXCEPTION DEVICES:
SIMBA DOCUMENTATION:
The simba pci device is a pci-to-pci bridge that has a 64 bit
primary bus and two 32 bit secondary buses, managed by the
Solaris simba driver.
The format is...
NOTICE: SIMBA pcipsy-0/simba-1  0x108e 0x5000 0x81 0x147 0x42a0 0x4280 0x23
NOTICE: SIMBA pcipsy-0/simba-1  0x0 0x0 0x0 0x0
So, there are 11 registers gathered.
Value      Name                  Description
0x108e     vid                   16 bit, offset 0, 0x108e = Sun
0x5000     did                   16 bit, offset 2, 0x5000 = simba
0x81       header                8 bit, offset 0xe
0x147      command               16 bit, offset 0x4, command register
0x42a0     status                16 bit, offset 0x6, status register
0x4280     secondary bus status  16 bit, offset 0x1e, secondary bus
                                 status register
0x23       bridge control        16 bit, offset 0x3e
0x0        dma_afsr              64 bit, offset 0xc8
0x0        dma_afar              64 bit, offset 0xd0
0x0        pio_afsr              64 bit, offset 0xe8
0x0        pio_afar              64 bit, offset 0xf0
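As a sketch of how the two SIMBA lines map onto the 11 registers, the hexadecimal values can be paired with the names from the table. The function name parse_simba and the underscore spellings of the register names are assumptions; the input lines are the format lines shown in this section.

```python
SIMBA_NAMES = (
    "vid", "did", "header", "command", "status",
    "secondary_status", "bridge_control",
    "dma_afsr", "dma_afar", "pio_afsr", "pio_afar",
)


def parse_simba(first_line, second_line):
    """Combine the two SIMBA lines into one name -> value mapping."""
    values = []
    for line in (first_line, second_line):
        # Keep only the hexadecimal tokens; skip NOTICE:, SIMBA and the path.
        values.extend(int(t, 16) for t in line.split() if t.startswith("0x"))
    if len(values) != len(SIMBA_NAMES):
        raise ValueError("expected %d registers, got %d"
                         % (len(SIMBA_NAMES), len(values)))
    return dict(zip(SIMBA_NAMES, values))


simba = parse_simba(
    "NOTICE: SIMBA pcipsy-0/simba-1  0x108e 0x5000 0x81 0x147 0x42a0 0x4280 0x23",
    "NOTICE: SIMBA pcipsy-0/simba-1  0x0 0x0 0x0 0x0",
)
print(hex(simba["status"]), hex(simba["secondary_status"]))
```

The length check is deliberate: a line with a different number of values raises an error instead of being silently mislabelled.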
DEC 2115[234] DOCUMENTATION:
The DEC 2115X pci-to-pci bridge is quite common, and the 2115X
family of pci-to-pci bridge chips is now released by Intel. It
is managed by the Solaris pci_pci driver.
NOTICE: DEC 21152 simba-1/pci_pci-0  0x1011 0x22 0x1 0x147 0x4290 0x6280 0x23 0x0
INTEL 21554 DOCUMENTATION (managed by the db21554 driver):
Value    Name               Description
0x1011   vid
0x22     did
         conf
0x147    command
0x4290   status
         secondary command  16 bits at 0x44
0x6280   secondary status   16 bits at 0x46
         cc0                16 bit diagnostics at 0xcc
         cc1                16 bit diagnostics at 0xce
         cc2                16 bit diagnostics at 0xd0
--  COMMAND REGISTER FORMAT(16 bits):
BIT       MEANING
0       I/O space address decoder enable
1       Memory space address decoder enable
2       Bus master enable
3       Special cycles enable
4       Memory write-and-invalidate enable
5       VGA palette snoop enable
6       PERR generation enable
7       Address stepping enable
8       SERR enable
9       FAST back-to-back enable
10-15   reserved
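The table above translates directly into a lookup. The following Python sketch decodes a command register value into the names listed; the dictionary COMMAND_BITS and the function decode_command are illustrative names, not part of any driver.

```python
COMMAND_BITS = {
    0: "I/O space address decoder enable",
    1: "Memory space address decoder enable",
    2: "Bus master enable",
    3: "Special cycles enable",
    4: "Memory write-and-invalidate enable",
    5: "VGA palette snoop enable",
    6: "PERR generation enable",
    7: "Address stepping enable",
    8: "SERR enable",
    9: "FAST back-to-back enable",
}


def decode_command(value):
    """Return the names of the command register bits set in value."""
    return [name for bit, name in sorted(COMMAND_BITS.items())
            if value & (1 << bit)]


for name in decode_command(0x147):
    print(name)
```

For 0x147, the command value of the bridges in the example, this lists bits 0, 1, 2, 6 and 8, so SERR reporting is enabled on those devices.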
-- STATUS REGISTER FORMAT (16 bits)
BIT      MEANING
0-3     Reserved
4       2.2 capable
5       66 MHz capable
6       Reserved
7       Fast back to back capable
8       Master data parity error
9-10    Dev sel timing
11      Signaled target abort
12      Received target abort
13      Received master abort
14      Signaled SERR on the bus
15      Detected a parity error
--  SECONDARY BUS STATUS REGISTER FORMAT (16 BITS)
BIT      MEANING
0-3      Reserved
4        2.2 capable
5        66 MHz capable
6        Reserved
7        Fast back to back capable
8        Data parity reported
9-10     Dev sel timing
11       Signaled target abort
12       Received target abort
13       Received master abort
14       Received SERR
15       Detected a parity error
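The two status formats differ only in bits 8 and 14, so one decoder can serve both. This is a sketch built from the two tables above; the names COMMON, PRIMARY, SECONDARY and decode_status are assumptions, and the Dev sel timing field (bits 9-10) is returned as a raw number because this document does not define its encoding.

```python
COMMON = {
    4: "2.2 capable",
    5: "66 MHz capable",
    7: "Fast back to back capable",
    11: "Signaled target abort",
    12: "Received target abort",
    13: "Received master abort",
    15: "Detected a parity error",
}
# Bits 8 and 14 are the ones whose meaning depends on the side of the bridge.
PRIMARY = {**COMMON, 8: "Master data parity error",
           14: "Signaled SERR on the bus"}
SECONDARY = {**COMMON, 8: "Data parity reported", 14: "Received SERR"}


def decode_status(value, secondary=False):
    """Return (flag names, raw Dev sel timing field) for a status value."""
    table = SECONDARY if secondary else PRIMARY
    flags = [table[bit] for bit in sorted(table) if value & (1 << bit)]
    devsel = (value >> 9) & 0x3  # bits 9-10, left undecoded
    return flags, devsel


print(decode_status(0x4290))
print(decode_status(0x6280, secondary=True))
print(decode_status(0x2280, secondary=True))
```

Applied to the pci_pci-4 line from the example, the primary value 0x4290 reports a signaled SERR while the secondary value 0x2280 reports only a received master abort, which is the combination that marks the origin of the SERR.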


Product
NEBS-Certified Servers
Sun Enterprise 450 Server
Sun Enterprise 420R Server
Sun Enterprise 250 Server
Sun Enterprise 220R Server
Sun Enterprise 150 Server
Ultra 80 Workstation
Ultra 450 Workstation
Ultra 60 Workstation
Ultra 5 Workstation
Ultra 30 Workstation
Ultra 10 Workstation
Sun Blade 150 Workstation
Sun Blade 100 Workstation

Internal Comments
Additional Documentation and Debug Drivers are available internally here:

http://clem.uk/~timu/pci/index.html

pci, serr, errors, simba, psycho, ultrasparc, II, IIi, IIe, decode, diagnose
Previously Published As
72402

Change History
Date: 2009-02-26
The product NEBS-Certified Servers is a folder in the Swordfish database. We need specific products in the product statement. If you can provide a list of the specific servers that this article applies to, the document can be published. The specific products will show up with a plus sign rather than a folder in http://krep.central.sun.com/stats/swordfish/.
Date: 2005-09-23
User Name: 7058
Action: Update Canceled
Comment: *** Restored Published Content *** Only fixed metadata.
Version: 0
Date: 2005-09-23
User Name: 7058
Action: Update Started
Comment: Fixing missing tech group.
Version: 0

Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.