Solaris Reboot Triggers Spurious SYSTEM Error in Adjacent Domain

Asset ID:	1-77-1000008.1
Update Date:	2011-02-18
Keywords:

Solution Type Sun Alert Sure

Solution 1000008.1 : Solaris Reboot Triggers Spurious SYSTEM Error in Adjacent Domain

Related Items


Sun Fire E6900 Server
Sun Fire 3800 Server
Sun Fire 6800 Server
Sun Fire E4900 Server
Sun Fire 4800 Server
Sun Fire 4810 Server

Related Categories


GCS>Sun Microsystems>Sun Alert>Criteria Category>Availability
GCS>Sun Microsystems>Sun Alert>Release Phase>Resolved

PreviouslyPublishedAs
200010

Product
Sun Fire 3800 Server
Sun Fire 4800 Server
Sun Fire 4810 Server
Sun Fire 6800 Server
Sun Fire E6900 Server
Sun Fire E4900 Server

Bug Id
<SUNBUG: 6300392>

Date of Workaround Release
27-JUL-2005

Date of Resolved Release
13-FEB-2006

Impact

Hardware error pause for AR L2CheckError may be asserted, causing an abrupt halt to processing within a domain, and hardware replacement will not resolve the issue.

Note: Internal testing has shown that L2CheckErrors of the type described in this alert can be reproduced with any firmware version lower than 5.19.7 or 5.20.3 by simulating an IO board DC-DC converter failure.

Contributing Factors

This issue can occur on the following platforms:

Sun Fire 3800, 4800, 4810, E4900, 6800, and E6900 systems without ScApp firmware 5.19.8 or 5.20.3 (as delivered in patches 114526-09 and 114527-04)

Notes:

This error is rare, and only occurs in configurations with more than one domain per partition - those running Domain B with Domain A, or Domain D with Domain C.
Systems running only Domain A, or those running only Domains A and C are not affected by this issue.
In some cases, this type of error has been preceded by failure of a DC-DC converter on an I/O board in one of the affected domains. (This was reproduced in the lab by simulating an I/O board DC-DC converter failure.)
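
The configuration rule in the notes above can be expressed as a small check. The following Python sketch is illustrative only: the function name and the way active domains are passed in are assumptions, and the partition pairing (A with B, C with D) is taken directly from the notes.

```python
# Partition pairing as described in the notes: A shares a partition with B,
# and C shares a partition with D.
PARTITION_PAIRS = (("A", "B"), ("C", "D"))


def exposed_pairs(active_domains):
    """Return the domain pairs that share a partition and are both active.

    `active_domains` is any iterable of domain letters (hypothetical input;
    in practice this would come from the platform configuration).
    """
    active = {name.upper() for name in active_domains}
    return [pair for pair in PARTITION_PAIRS if set(pair) <= active]


# A system running only Domains A and C has no exposed pair.
print(exposed_pairs(["A", "C"]))
# A system running A, B and C is exposed through the A/B pair.
print(exposed_pairs(["A", "B", "C"]))
```

An empty result means the configuration matches the unaffected cases listed above.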

To determine the version of ScApp on a system, the following command can be run (from the platform shell):

    sc0:SC> showsc
    ...
    ScApp version: 5.19.4 Build_01
    RTOS version: 45
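
If showsc output has been captured to a file, the version comparison can be automated. The Python sketch below is a hedged example, not a supported tool: the function names are hypothetical, and the treatment of firmware branches other than 5.19 and 5.20 (older branches treated as lacking the fix, newer branches as carrying it) is an assumption based on the "or later" wording in the Resolution section.

```python
import re

# Fixed firmware levels named in this alert, keyed by (major, minor) branch.
FIXED_LEVELS = {(5, 19): 8, (5, 20): 3}


def scapp_version(showsc_output):
    """Extract (major, minor, patch) from the 'ScApp version:' line."""
    match = re.search(r"ScApp version:\s*(\d+)\.(\d+)\.(\d+)", showsc_output)
    if match is None:
        raise ValueError("no 'ScApp version:' line found")
    return tuple(int(part) for part in match.groups())


def has_fix(version):
    """Return True if the given ScApp version is at or above a fixed level."""
    major, minor, patch = version
    branch = (major, minor)
    if branch in FIXED_LEVELS:
        return patch >= FIXED_LEVELS[branch]
    # Assumption: branches newer than 5.20 carry the fix, older ones do not.
    return branch > max(FIXED_LEVELS)


sample = "ScApp version: 5.19.4 Build_01\nRTOS version: 45\n"
version = scapp_version(sample)
print(version, "fixed" if has_fix(version) else "needs patch 114526-09 or 114527-04")
```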

Symptoms

A Solaris reboot will cause the adjacent domain to fail with error pause. (Adjacent domains are those running within the same partition, either A and B or C and D).

For a case where a Solaris reboot of Domain A causes a failure in Domain B, messages similar to the following may be seen on the SC Platform shell:

    Domain Reboot A: Initiating keyswitch: on, domain A.
    ErrorMonitor: Domain B has a SYSTEM ERROR
    [AD] Event: SF6800.ASIC.AR.CMDV_SYNC_ERR.102420cf
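
When reviewing a captured platform log, the reboot-then-error pattern can be searched for mechanically. The Python sketch below assumes the two message formats shown above and nothing else; the function name is hypothetical, and real logs may contain variations that these regular expressions do not match.

```python
import re

# Each domain's partition partner, per the pairing described in this alert.
PARTNER = {"A": "B", "B": "A", "C": "D", "D": "C"}

REBOOT = re.compile(r"Domain Reboot ([A-D]):")
ERROR = re.compile(r"ErrorMonitor: Domain ([A-D]) has a SYSTEM ERROR")


def adjacent_failures(lines):
    """Return (rebooted, failed) pairs where the failed domain is the
    partition partner of the most recently rebooted domain."""
    hits = []
    last_reboot = None
    for line in lines:
        reboot = REBOOT.search(line)
        if reboot:
            last_reboot = reboot.group(1)
            continue
        error = ERROR.search(line)
        if error and last_reboot and PARTNER[last_reboot] == error.group(1):
            hits.append((last_reboot, error.group(1)))
    return hits


log = [
    "Domain Reboot A: Initiating keyswitch: on, domain A.",
    "ErrorMonitor: Domain B has a SYSTEM ERROR",
    "[AD] Event: SF6800.ASIC.AR.CMDV_SYNC_ERR.102420cf",
]
print(adjacent_failures(log))
```

A match here is only a hint; the domain shell output still has to show the AR L2CheckError signature described below.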

These messages may be seen on the SC Domain B shell:

    ErrorMonitor: Domain B has a SYSTEM ERROR
    /N0/SB3 encountered the first error
    /N0/IB8 encountered the first error
    ArAsic reported first error on /N0/SB3
    /partition0/domain1/SB3/ar0:
    >>> L2CheckError[0x6150] : 0x01808100
        AccIncSyncErr [24:21] : 0xc accumulated incoming mismatch
        FE [15:15] : 0x1
        INCSyncErr [08:05] : 0x8 Ports [9:6] incoming mismatched against internal expected incoming
    ArAsic reported first error on /N0/IB8
    /partition0/domain1/IB8/ar0:
    >>> L2CheckError[0x6150] : 0x18189010
        CMDVSyncErr [12:09] : 0x8 Ports [9:6] command valid mismatched against internal expected command valid
        PreqSyncErr [04:01] : 0x8 Ports [9:6] prereq mismatched against internal expected prereq
        AccCMDVSyncErr [28:25] : 0xc accumulated valid command mismatch
        FE [15:15] : 0x1
        AccPreqSyncErr [20:17] : 0xc accumulated prerequisite mismatch
    [AD] Event: SF6800.ASIC.AR.CMDV_SYNC_ERR.102420cf

In this case, each SB and IB in the failing domain will report AR L2CheckError with either INCSyncErr or CMDVSyncErr. The adjacent domain which was being rebooted may reboot just fine.
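
The bit ranges printed next to each field name in the sample output are enough to decode a raw L2CheckError value by hand or in a script. The Python sketch below uses only the seven fields that appear in the sample output above; the register has other fields (for example the DiffErr case mentioned elsewhere in this document) whose positions are not given here, so they are deliberately left out. The function name is hypothetical.

```python
# Field positions as printed in the sample output: name -> (high bit, low bit).
# Only fields shown in this alert are listed; this is not the full register map.
L2_FIELDS = {
    "AccCMDVSyncErr": (28, 25),
    "AccIncSyncErr": (24, 21),
    "AccPreqSyncErr": (20, 17),
    "FE": (15, 15),
    "CMDVSyncErr": (12, 9),
    "INCSyncErr": (8, 5),
    "PreqSyncErr": (4, 1),
}


def decode_l2checkerror(value):
    """Return the non-zero known fields of an AR L2CheckError value."""
    decoded = {}
    for name, (high, low) in L2_FIELDS.items():
        width = high - low + 1
        field = (value >> low) & ((1 << width) - 1)
        if field:
            decoded[name] = field
    return decoded


# The two register values from the sample Domain B output.
for raw in (0x01808100, 0x18189010):
    fields = decode_l2checkerror(raw)
    print(hex(raw), {name: hex(v) for name, v in fields.items()})
```

Decoding the two sample values reproduces the field values printed by the SC (0xc, 0x1 and 0x8), which is a useful sanity check on the bit positions.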

Note: ArAsic indicates that this error was detected by the Address Repeater (AR) ASIC (Application-Specific Integrated Circuit) within the Sun Fireplane Switch. The AR L2CheckError indicates unexpected behavior of the switch's distributed arbitration protocol.

The error is repeatable: each reboot of one domain causes the adjacent domain to fail until the master system controller (SC) has been rebooted. A failover to the spare SC clears the condition in the same way. Replacing the FRUs that contain the Sun Fireplane Switch has no effect.

Workaround

To temporarily work around the described issue, reboot the primary SC with the "reboot" command.

Resolution

This issue is addressed on the following platforms:

Sun Fire 3800, 4800, 4810, E4900, 6800, and E6900 systems with ScApp firmware 5.19.8 or 5.20.3 or later (as delivered in patches 114526-09 and 114527-04)

Modification History

13-Feb-2006:
Updated Impact, Contributing Factors and Resolution sections
State: Resolved

05-Dec-2006:
Updated Contributing Factors and Resolution sections

References

<SUNPATCH: 114526-09>
<SUNPATCH: 114527-04>

Previously Published As
101819
Internal Comments

Bug(s) added per [email protected]

There are a number of problems which manifest as L2CheckError. The key to this one is:

1) The error is reported by the AR, not the SDC.

2) A reboot of one domain causes the other domain within the partition to error pause.

3) The condition persists until the SC is rebooted
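
The three criteria above can be kept as a simple triage checklist. The Python sketch below is only an illustration of that checklist; the class and field names are hypothetical and the values would be filled in by the engineer from the SC logs.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    reporting_asic: str            # "AR" or "SDC", from the "ArAsic reported" lines
    adjacent_domain_rebooted: bool  # failure followed a reboot of the partner domain
    persists_until_sc_reboot: bool  # repeatable until the SC is rebooted


def unmet_criteria(obs):
    """Return the criteria from this alert that the observation fails."""
    unmet = []
    if obs.reporting_asic.upper() != "AR":
        unmet.append("error is not reported by the AR")
    if not obs.adjacent_domain_rebooted:
        unmet.append("no reboot of the adjacent domain preceded the failure")
    if not obs.persists_until_sc_reboot:
        unmet.append("condition does not persist until SC reboot")
    return unmet


print(unmet_criteria(Observation("AR", True, True)))
```

An empty list means all three criteria are met; anything else points toward one of the other L2CheckError causes.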

The more common type of AR L2CheckError includes DiffErr and is usually coupled with SafariPortError--the parity error (SafariPortError with AdrPErr) is the cause of the problem.

Another similar AR L2CheckError is described in SunAlert 101857.

If possible, please collect the following data for the problem described in this SunAlert. As these steps involve engineering mode commands, you will need to open an escalation with PTS to get an engineering mode password and to review the commands to be executed.

First, script (capture) the output of:

1. The tty of the master SC

2. A telnet session to the platform shell

3. A telnet session to the domain shell for each affected domain

4. Any platform shell session you will be using

Then, note each domain's current error recovery settings and disable domain error recovery. Also set diag-level to quick to minimize the time spent in POST while data is gathered:

for each domain:

    setupdomain -p boot
    diag-level = quick
    reboot-on-error = false
    hang-policy = notify
    OBP.error-reset-recovery = none
    showdomain (to confirm settings)

This will help to preserve the error condition and also prevent any ping-pong where the error recovery triggers another instance of the error in the adjacent domain.
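
Because the original settings must be restored at the end of the procedure, it helps to record them in a structured form before changing anything. The Python sketch below is a note-keeping aid only: it does not talk to the SC, the function name is hypothetical, and the "original" values in the example are made up for illustration.

```python
# Diagnostic values from the procedure above.
DIAGNOSTIC_OVERRIDES = {
    "diag-level": "quick",
    "reboot-on-error": "false",
    "hang-policy": "notify",
    "OBP.error-reset-recovery": "none",
}


def plan_settings(original):
    """Return (to_apply, to_restore) for one domain.

    `original` maps each parameter to the value noted from showdomain
    before any change was made.
    """
    missing = [key for key in DIAGNOSTIC_OVERRIDES if key not in original]
    if missing:
        raise KeyError(f"original value not recorded for: {missing}")
    to_apply = {k: v for k, v in DIAGNOSTIC_OVERRIDES.items() if original[k] != v}
    to_restore = {k: original[k] for k in to_apply}
    return to_apply, to_restore


# Hypothetical values noted for Domain A before the test.
domain_a = {
    "diag-level": "default",
    "reboot-on-error": "true",
    "hang-policy": "notify",
    "OBP.error-reset-recovery": "sync",
}
apply_a, restore_a = plan_settings(domain_a)
print("apply:", apply_a)
print("restore:", restore_a)
```

Keeping the restore dictionary with the case notes avoids guessing at the customer's settings once data collection is finished.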

With both domains in the affected partition up and running, gather the output of:

    showboards -ev
    showplatform -v

for each SB and RB:

    dumpregs //sb1
    dumpregs //sb3
    etc...
    dumpregs //rp0
    dumpregs //rp1
    etc...

for each IB:

    dumpregs //ib6/ar0
    dumpregs //ib6/sdc0
    dumpregs //ib6/dx0
    dumpregs //ib6/dx1
    dumpregs //ib8/ar0
    dumpregs //ib8/sdc0
    dumpregs //ib8/dx0
    dumpregs //ib8/dx1
    etc...

    nvci
    showdate
    history

Note that just using dumpregs //ibX will cause the Schizo ASIC on that IB to be scanned, and that will cause a domain failure. JTAG scan on a Schizo in an active domain remains a problem in Serengeti.
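
To reduce the risk of typing a bare //ibX by mistake, the command list can be generated ahead of time for the boards actually present. The Python sketch below only builds strings in the form used above; the function name is hypothetical and the board numbers in the example are the ones from this document, not a statement about any particular system.

```python
# Per-IB ASIC targets used in the procedure above. The IB is never addressed
# as a whole, because that would scan the Schizo ASIC as well.
IB_ASICS = ("ar0", "sdc0", "dx0", "dx1")


def dumpregs_commands(system_boards, repeater_boards, io_boards):
    """Build the dumpregs command list for the given board numbers."""
    commands = [f"dumpregs //sb{n}" for n in system_boards]
    commands += [f"dumpregs //rp{n}" for n in repeater_boards]
    for n in io_boards:
        commands += [f"dumpregs //ib{n}/{asic}" for asic in IB_ASICS]
    return commands


for command in dumpregs_commands([1, 3], [0, 1], [6, 8]):
    print(command)
```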

Now, start a trace of the Safari ASIC programming sequence. The messages produced will only appear on the involved tty/platform/domain shell; that is why we have scripted the output of all of the shells and the tty port.

showdate

print ConsoleComm.setDebugLevel(1)

showdate

(It is important to issue a showdate every once in a while so that all of the output can be sorted out later on.)
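
One way to make sure the timestamps are not forgotten is to interleave showdate into a prepared command list. The Python sketch below is a minimal helper for that; the function name and the default interval are arbitrary assumptions.

```python
def with_showdate(commands, every=4):
    """Return `commands` with showdate at the start, after every `every`
    commands, and at the end."""
    if every < 1:
        raise ValueError("every must be at least 1")
    result = ["showdate"]
    for index, command in enumerate(commands, start=1):
        result.append(command)
        if index % every == 0 or index == len(commands):
            result.append("showdate")
    return result


for line in with_showdate(["showboards -ev", "showplatform -v", "nvci"], every=2):
    print(line)
```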

Once the trace has started, issue a reboot in one of the affected domains in order to cause the error. Keep detailed notes about when the reboot was issued, on which domain, etc.

When the failure occurs, re-issue the dumpregs commands to get a picture of the ASICs in the failed state. (This is why domain error recovery was turned off earlier.)

    showdate
    showboards -ev
    showplatform -v

for each SB and RB:

    dumpregs //sb1
    dumpregs //sb3
    etc...
    dumpregs //rp0
    dumpregs //rp1
    etc...

for each IB:

    dumpregs //ib6/ar0
    dumpregs //ib6/sdc0
    dumpregs //ib6/dx0
    dumpregs //ib6/dx1
    dumpregs //ib8/ar0
    dumpregs //ib8/sdc0
    dumpregs //ib8/dx0
    dumpregs //ib8/dx1
    etc...

    nvci
    showdate
    history

Now, reboot the SC to correct the problem. This will also turn off tracing and take you out of engineering mode.

reboot (from the SC platform shell)

Bring the failed domain back up and attempt to recreate the problem.

Once both domains are up and stable, collect a final set of dumpregs data:

    showdate
    showboards -ev
    showplatform -v

for each SB and RB:

    dumpregs //sb1
    dumpregs //sb3
    etc...
    dumpregs //rp0
    dumpregs //rp1
    etc...

for each IB:

    dumpregs //ib6/ar0
    dumpregs //ib6/sdc0
    dumpregs //ib6/dx0
    dumpregs //ib6/dx1
    dumpregs //ib8/ar0
    dumpregs //ib8/sdc0
    dumpregs //ib8/dx0
    dumpregs //ib8/dx1
    etc...

    nvci
    showdate
    history
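
The procedure yields three sets of dumpregs output for each target (before the failure, in the failed state, and after the SC reboot), and a line-level comparison is a quick first pass at finding what changed. The Python sketch below uses the standard difflib module and makes no assumption about the dumpregs output format; the sample text in the example is invented purely to show the mechanics, and the function name is hypothetical.

```python
import difflib


def changed_lines(before, after):
    """Return the lines removed ('-') and added ('+') between two captures."""
    diff = list(
        difflib.unified_diff(
            before.splitlines(), after.splitlines(), lineterm="", n=0
        )
    )
    # The first two lines of a non-empty unified diff are the file headers;
    # lines starting with '@@' are hunk markers.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]


# Invented placeholder text, not real dumpregs output.
healthy = "register_one = 0x0\nregister_two = 0x0"
failed = "register_one = 0x0\nregister_two = 0x1"
for line in changed_lines(healthy, failed):
    print(line)
```

Comparing the failed-state capture against both the earlier and the later healthy captures helps separate state left behind by the error from values that simply differ between boots.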

Re-establish the customer's original domain recovery settings:

for each domain:

    setupdomain -p boot
    diag-level
    reboot-on-error
    hang-policy
    OBP.error-reset-recovery
    showdomain (to confirm settings)

Again, it is very important to script all of the ScApp shells and take notes throughout the data collection so that the persons analyzing the data can follow the sequence of events.

Internal Contributor/submitter
[email protected]

Internal Eng Business Unit Group
SSG ES (Enterprise Systems)

Internal Eng Responsible Engineer
[email protected]

Internal Services Knowledge Engineer
[email protected]

Internal Escalation ID
1-9647455, 1-10144987, 1-9568590, 1-10502561

Internal Resolution Patches
114526-09, 114527-04

Internal Sun Alert Kasp Legacy ID
101819

Internal Sun Alert & FAB Admin Info
Critical Category: Availability ==> Diagnosis
Significant Change Date: 2005-07-27, 2006-02-13
Avoidance: Patch, Workaround
Responsible Manager: [email protected]
Original Admin Info: [WF 05-Dec-2006, dave m: patch added, republish]
[WF 13-Feb-2006, Dave M: patch added, BugID removed per Mgmt and Tom Favara, rerelease]
[WF 27-Jul-2005, Dave M; correction to Synopsis]
[WF 27-Jul-2005, Dave M; corrections made per submitter; send for release]
[WF 26-Jul-2005, Dave M; sent for 24hr review]
[WF 25-Jul-2005, Dave M; corrected draft copy sent by submitter]
[WF 19-Jul-2005, Dave M; draft created]
Product_uuid
29d05214-0a18-11d6-92b2-a111614865b5|Sun Fire 3800 Server
29d3a694-0a18-11d6-92da-df959df44cdd|Sun Fire 4800 Server
29d6f808-0a18-11d6-8aa8-943929fbbdd8|Sun Fire 4810 Server
29da7938-0a18-11d6-8a41-9ed1ad6d6779|Sun Fire 6800 Server
4fe39727-0599-11d8-84cb-080020a9ed93|Sun Fire E6900 Server
bed24aa9-0598-11d8-84cb-080020a9ed93|Sun Fire E4900 Server
