Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-77-1001016.1
Update Date: 2011-02-09
Keywords:

Solution Type  Sun Alert Sure

Solution  1001016.1 :   Sun Fire 12K/15K/E20K/E25K Domains Running Solaris 8 2/04 May Experience Bus Error When Using Dynamic Reconfiguration  


Related Items
  • Sun Fire E25K Server
  • Sun Fire E20K Server
  • Sun Fire 12K Server
  • Sun Fire 15K Server
Related Categories
  • GCS>Sun Microsystems>Sun Alert>Criteria Category>Availability
  • GCS>Sun Microsystems>Sun Alert>Release Phase>Resolved

PreviouslyPublishedAs
201342


Product
Sun Fire 12K Server
Sun Fire E20K Server
Sun Fire 15K Server
Sun Fire E25K Server

Bug Id
<SUNBUG: 6532060>

Date of Workaround Release
24-Jul-2007

Date of Resolved Release
20-Jun-2008


1. Impact

On a Sun Fire 12K/15K/E20K/E25K domain that runs Solaris 8 2/04 and includes one or more HsPCI+ assemblies, using Dynamic Reconfiguration (DR) to detach the board hosting the domain's permanent memory may trigger a "Safari Bus Error", resulting in a domain outage.


2. Contributing Factors

This issue can occur on the following platforms:

SPARC Platform

  • Sun Fire 12K/15K/E20K/E25K domains running Solaris 8 2/04 without patch 116962-13
Note: Sun Fire 12K/15K/E20K/E25K domains running Solaris 9 and 10 are not affected by this issue.

This issue will occur only if both of the following conditions are true:

  1. A Dynamic Reconfiguration (DR) is attempted on the board hosting the permanent memory (kernel)
  2. One or more HsPCI+ boards are installed in the domain
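
The contributing factors above can be combined into a single pre-check. The following is a minimal sketch only: the `DomainState` type, its field names, and the `dr_detach_at_risk` function are hypothetical names introduced for illustration, and the caller is assumed to have gathered the values by other means (for example, from the commands shown in this section).

```python
from dataclasses import dataclass
from typing import Optional

# Revision of patch 116962 that this Sun Alert lists as resolving the issue.
FIXED_PATCH_REV = 13


@dataclass
class DomainState:
    """Hypothetical summary of a domain, filled in by the administrator."""
    solaris_8_0204: bool                 # domain runs Solaris 8 2/04
    patch_116962_rev: Optional[int]      # installed revision, None if absent
    hspci_plus_count: int                # number of HsPCI+ assemblies in the domain
    target_has_permanent_memory: bool    # board to detach hosts permanent memory


def dr_detach_at_risk(state: DomainState) -> bool:
    """Return True if the DR detach matches the conditions in this alert."""
    if not state.solaris_8_0204:
        # Solaris 9 and 10 domains are not affected.
        return False
    rev = state.patch_116962_rev
    if rev is not None and rev >= FIXED_PATCH_REV:
        # Patch 116962-13 or later is installed.
        return False
    # Both conditions must hold for the issue to occur.
    return state.hspci_plus_count >= 1 and state.target_has_permanent_memory


if __name__ == "__main__":
    unpatched = DomainState(True, None, 2, True)
    print(dr_detach_at_risk(unpatched))
```

A result of True means only that the detach matches the conditions listed here, not that a bus error is certain, since the alert states that the domain "may" be interrupted.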


To determine whether the domain includes HsPCI+ assemblies, run the following command on the System Controller:

    sms-svc% showboards -v -d 0 | grep HPCI
    IO0 On HPCI+ Active   Passed 0
    IO1 On HPCI+ Active   Passed 0
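
The same check can be scripted against saved `showboards` output. This is a minimal sketch that assumes each board line is whitespace-separated with the board name in the first column and the type (`HPCI+`) as a separate field, as in the two lines shown above; the function name `hspci_plus_boards` is hypothetical.

```python
from typing import List


def hspci_plus_boards(showboards_output: str) -> List[str]:
    """Return the names of boards whose type field is exactly 'HPCI+'."""
    boards = []
    for line in showboards_output.splitlines():
        fields = line.split()
        # Assumed layout: <board> <power> <type> <state> <test status> <domain>
        if len(fields) >= 3 and "HPCI+" in fields:
            boards.append(fields[0])
    return boards


if __name__ == "__main__":
    # The two lines reported in this article.
    sample = (
        "IO0 On HPCI+ Active   Passed 0\n"
        "IO1 On HPCI+ Active   Passed 0\n"
    )
    print(hspci_plus_boards(sample))  # prints ['IO0', 'IO1']
```

Unlike `grep HPCI`, this matches the `HPCI+` field exactly, so a board whose type contains "HPCI" without the plus sign would not be reported.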


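To determine whether the board to be detached hosts the kernel (permanent) memory, check the `cfgadm` output for a memory attachment point that reports permanent memory, as in the following example: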

    May 10 10:21:08 2007 root# cfgadm -av | grep perm
    May 10 10:21:10 2007 SB1::memory
    connected configured ok
    base address 0x1e000000000, 8388608 KBytes total, 2313832 KBytes permanent
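
For domains with many boards, the permanent-memory board can be picked out of saved `cfgadm -av` output programmatically. The sketch below assumes that the attachment point appears as `SBn::memory` and that the text "KBytes permanent" follows it on the same or a later line, as in the example above; the function name is hypothetical and the `SB2` entry in the sample is made up for illustration.

```python
import re
from typing import List

# Memory attachment point, e.g. "SB1::memory" (format taken from the example).
AP_ID = re.compile(r"\b(SB\d+)::memory\b")
PERMANENT = re.compile(r"\d+\s+KBytes permanent")


def boards_with_permanent_memory(cfgadm_output: str) -> List[str]:
    """Return system boards whose memory info reports permanent memory."""
    boards = []
    current = None  # most recent memory attachment point seen
    for line in cfgadm_output.splitlines():
        match = AP_ID.search(line)
        if match:
            current = match.group(1)
        if current and PERMANENT.search(line):
            if current not in boards:
                boards.append(current)
            current = None
    return boards


if __name__ == "__main__":
    sample = (
        "SB1::memory\n"
        "connected configured ok\n"
        "base address 0x1e000000000, 8388608 KBytes total, 2313832 KBytes permanent\n"
        # Hypothetical second board with no permanent memory:
        "SB2::memory\n"
        "connected configured ok\n"
        "base address 0x20000000000, 8388608 KBytes total\n"
    )
    print(boards_with_permanent_memory(sample))  # prints ['SB1']
```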

3. Symptoms


During the copy/rename operation, the domain will experience a "Safari Bus Error" causing a domain outage, as in the following example:

    May 10 10:26:01 2007 root# cfgadm -c disconnect SB1
    May 10 10:26:21 2007 System may be temporarily suspended, proceed (yes/no)? yes
    May 10 10:26:30 2007 May 10 10:26:23 DATA01 dr: OS unconfigure dr@0:SB1::cpu0
    May 10 10:26:32 2007 May 10 10:26:25 DATA01 dr: OS unconfigure dr@0:SB1::memory
    May 10 10:28:14 2007
    May 10 10:28:14 2007 DR: checking devices...
    May 10 10:28:14 2007 DR: suspending user threads...
    May 10 10:28:15 2007 DR: suspending kernel daemons...
    May 10 10:28:15 2007 DR: suspending drivers...
    May 10 10:28:15 2007 suspending pci108e,c416@2 (aka sbbc)
    May 10 10:28:15 2007 suspending pci100b,35@0 (aka ce)
    May 10 10:28:15 2007 suspending pci100b,35@1 (aka ce)
    May 10 10:28:15 2007 suspending sd@8,0
    May 10 10:28:15 2007 suspending sd@9,0
    May 10 10:28:15 2007 suspending pci1000,b@2 (aka glm)
    May 10 10:28:15 2007 suspending pci1000,b@2,1 (aka glm)
    May 10 10:28:15 2007 suspending pciclass,060400@1 (aka pci_pci)
    May 10 10:28:15 2007 suspending pci108e,1101@3,1 (aka eri)
    May 10 10:28:15 2007 suspending pciclass,0c0310@3,3 (aka ohci)
    May 10 10:28:15 2007 suspending pciclass,060400@1 (aka pci_pci)
    May 10 10:28:15 2007 suspending pci108e,8002@1c,700000 (aka pcisch)
    May 10 10:28:15 2007 suspending pci100b,35@0 (aka ce)
    May 10 10:28:15 2007 suspending pci100b,35@1 (aka ce)
    May 10 10:28:15 2007 suspending pci100b,35@2 (aka ce)
    May 10 10:28:15 2007 suspending pci100b,35@3 (aka ce)
    May 10 10:28:15 2007 suspending pciclass,060400@1 (aka pci_pci)
    May 10 10:28:15 2007 suspending pci108e,8002@1c,600000 (aka pcisch)
    May 10 10:28:15 2007 Safari bus error: CSR=0155555501c01e77 ErrCtrl=f8000000000003e0
    May 10 10:28:15 2007 IntrCtrl=80000000000fc017 ErrLog=0000000000080000
    May 10 10:28:15 2007 ECC_Ctrl=8000000000000000
    May 10 10:28:15 2007 UE_AFSR=000001025b890138 UE_AFAR=0000088276090900
    May 10 10:28:15 2007 CE_AFSR=0000000d86890111 CE_AFAR=0000014296d76a00
    May 10 10:28:15 2007 FirstErrLog=0000000000080000 FirstErrorAddr=0000000000000000
    May 10 10:28:15 2007 LeafStatus=0000000000000000
    May 10 10:28:15 2007 panic[cpu3]/thread=2a10034fd20: Safari bus error: CSR=0155555501c01e77 ErrCtrl=f8000000000003e0
    May 10 10:28:15 2007 IntrCtrl=80000000000fc017 ErrLog=0000000000080000
    May 10 10:28:15 2007 ECC_Ctrl=8000000000000000
    May 10 10:28:15 2007 UE_AFSR=0000010
    May 10 10:28:16 2007 syncing file systems... done

In the above example, the CSR value points to one of the HsPCI+ assemblies installed in the domain (in this case, CSR=0155555501c01e77 ==> IO0/P0).

As a consequence, 'dsmd.hwconfig' and 'dsmd.dump' files are generally produced. Running 'redx' against the 'dsmd.hwconfig' dump file reports a parity error in the internal memory of the I/O controller identified by the CSR value:

    redxl> shioc 0 1 0
    xmits IO00/P0 (0.1.0) Component ID = 34651049 TO_2.1
    ...
    Safari_Err_Log[63:0] = 00000000.00080000
    Safari_1st_Err_Log[63:0] = 00000000.00080000
    Safari_Err_Enbl[63:0] = F8000000.000003E0
    Safari_Err_Int_Enbl[63:0] = 80000000.000FC017
    ErrLog[19]: 1E Intrupt Internal Parity Error in PCI-B Leaf Logic
    1st_Err_Data[59:0] = 0000000.00000000
    ...
    ...

Note: Data is displayed from the currently loaded dump file.
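
The `ErrLog` value in the console message (0000000000080000) has only bit 19 set, which corresponds to the `ErrLog[19]` line in the redx output above. As a rough illustration, the following sketch extracts `ErrLog` from captured console text and tests that bit. The function names are hypothetical, and only bit 19 is interpreted, because this article does not describe the meaning of the other bits.

```python
import re
from typing import Optional

# Bit reported by redx as "Internal Parity Error in PCI-B Leaf Logic".
PCI_B_LEAF_PARITY_BIT = 19

# The leading \b keeps "FirstErrLog=" from matching.
ERRLOG = re.compile(r"\bErrLog=([0-9a-fA-F]{16})\b")


def errlog_value(console_text: str) -> Optional[int]:
    """Return the first ErrLog register value found, or None."""
    match = ERRLOG.search(console_text)
    return int(match.group(1), 16) if match else None


def has_pci_b_leaf_parity_error(console_text: str) -> bool:
    """True if ErrLog is present and bit 19 is set."""
    value = errlog_value(console_text)
    return value is not None and bool((value >> PCI_B_LEAF_PARITY_BIT) & 1)


if __name__ == "__main__":
    # Lines copied from the console output in this article.
    sample = (
        "Safari bus error: CSR=0155555501c01e77 ErrCtrl=f8000000000003e0\n"
        "IntrCtrl=80000000000fc017 ErrLog=0000000000080000\n"
    )
    print(hex(errlog_value(sample)))            # prints 0x80000
    print(has_pci_b_leaf_parity_error(sample))  # prints True
```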


4. Workaround

Until the patch can be applied (or the system is upgraded to a later Solaris OS release), avoid detaching the system board that hosts the kernel memory on domains that run Solaris 8 2/04 and include HsPCI+ assemblies.


5. Resolution

This issue is addressed in the following release:
  • Solaris 8 with patch 116962-13 or later
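
One way to confirm that a domain already has the fix is to inspect saved `showrev -p` output for patch 116962 at revision 13 or later. The sketch below assumes the usual `Patch: <id>-<rev> Obsoletes: ...` line layout; the function names and the sample lines are illustrative only and are not taken from an affected system.

```python
import re
from typing import Optional


def installed_revision(showrev_output: str, patch_id: str = "116962") -> Optional[int]:
    """Return the highest installed revision of patch_id, or None if absent."""
    pattern = r"^Patch:\s+%s-(\d+)\b" % re.escape(patch_id)
    revisions = [int(r) for r in re.findall(pattern, showrev_output, re.MULTILINE)]
    return max(revisions) if revisions else None


def is_resolved(showrev_output: str, min_rev: int = 13) -> bool:
    """True if patch 116962 is installed at min_rev or later."""
    rev = installed_revision(showrev_output)
    return rev is not None and rev >= min_rev


if __name__ == "__main__":
    # Illustrative sample; the package name is a placeholder.
    sample = (
        "Patch: 116962-12 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWexample\n"
        "Patch: 116962-13 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWexample\n"
    )
    print(installed_revision(sample))  # prints 13
    print(is_resolved(sample))         # prints True
```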

Modification History
20-Jun-2008: Updated Contributing Factors and Resolution sections; now Resolved

References

<SUNPATCH: 116962-13>

Previously Published As
103016
Internal Comments
Internal Eng Business Unit Group
SSG ES (Enterprise Systems)
Internal Escalation ID
1-21332753, 1-21669795, 1-21893336, 1-21890882, 1-21805750
Internal Sun Alert Kasp Legacy ID
103016
Internal Resolution Patches
116962-13
Internal Sun Alert & FAB Admin Info
Critical Category: Availability ==> Pervasive
Significant Change Date: 2007-07-24
Avoidance: Workaround

Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.