Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-71-1012043.1
Update Date: 2010-07-06
Keywords:

Solution Type  Technical Instruction Sure

Solution  1012043.1 :   Processor May Be Incorrectly Offlined When It Encounters a UCC Event with the ME Bit Set in the AFSR


Related Items
  • Sun Fire E6900 Server
  • Sun Fire V440 Server
  • Sun Fire V480 Server
  • Sun Fire 3800 Server
  • Sun Fire 6800 Server
  • Sun Fire E4900 Server
  • Sun Fire 12K Server
  • Sun Netra 1280 Server
  • Sun Fire V880 Server
  • Sun Fire 4800 Server
  • Sun Netra 440 Server
  • Sun Fire E2900 Server
  • Sun Fire 15K Server
  • Sun Fire V1280 Server
  • Sun Fire 4810 Server
Related Categories
  • GCS>Sun Microsystems>Servers>Midrange V and Netra Servers
  • GCS>Sun Microsystems>Servers>Entry-Level Servers
  • GCS>Sun Microsystems>Servers>High-End Servers
  • GCS>Sun Microsystems>Servers>Midrange Servers

Previously Published As
216501


Description
UCC event with the Multiple Error (ME) bit set in the Asynchronous Fault Status Register (AFSR)

On susceptible systems, a single UCC event reported with the Multiple Error (ME) bit set in the Asynchronous Fault Status Register (AFSR) causes the reporting processor to be offlined. If only one such event has occurred, treat it as a single bit flip in one SRAM. No hardware replacement is recommended (see BugID# 4875077).

Susceptible versions of the Solaris[TM] Operating System: 8 and 9. BugID# 4875077 has been fixed in the following Solaris OS patch releases, and all later revisions:

Solaris[TM] 8 OS PatchID# 108528-29 + 117000-01.
Solaris[TM] 9 OS PatchID# 112233-12.
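
As an illustration of checking a system against the patch levels listed above, the following is a minimal sketch that scans saved `showrev -p` output for the required patch IDs. It assumes each installed patch appears on a line beginning with `Patch: NNNNNN-RR`; the function names and the `REQUIRED` table are hypothetical helpers written for this example, not part of any Sun tool.

```python
import re

# Minimum patch revisions quoted in this document for BugID 4875077.
REQUIRED = {
    "8": ["108528-29", "117000-01"],
    "9": ["112233-12"],
}


def installed_revisions(showrev_output):
    """Map each patch base ID to the highest revision seen in the text."""
    found = {}
    pattern = r"^Patch:\s+(\d{6})-(\d{2})\b"
    for match in re.finditer(pattern, showrev_output, re.MULTILINE):
        base, rev = match.group(1), int(match.group(2))
        found[base] = max(rev, found.get(base, -1))
    return found


def missing_patches(showrev_output, solaris_release):
    """Return required patches that are absent or below the fixed revision."""
    found = installed_revisions(showrev_output)
    missing = []
    for patch in REQUIRED[solaris_release]:
        base, rev = patch.split("-")
        if found.get(base, -1) < int(rev):
            missing.append(patch)
    return missing
```

An empty list from `missing_patches(text, "8")` or `missing_patches(text, "9")` would suggest that the fix is already installed; anything else names the patches still to be applied.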


Steps to Follow
Resolution
From the example logs below:
  1. Note errID 0x00002ca7.3a89e5f0 (see the blurb on "errID" that follows the example logs).
  2. The two errors logged are the same event, because they carry the same errID. The "First Error UCC Event" is reported first, and a UCC event with the ME bit set is reported afterward.
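
The comparison in step 2 can be sketched in a few lines of Python. This is only an illustration: it assumes the `[AFT0] ... Event detected by CPUn ... errID 0x....` wording shown in the example logs, and the function name `events_by_errid` is hypothetical.

```python
import re
from collections import defaultdict

# Matches an [AFT0] event notice and the errID that follows it.
# "[^\[]*?" keeps the match from running into the next bracketed tag.
NOTICE_RE = re.compile(
    r"\[AFT0\]\s+([A-Za-z ]+Event) detected by (CPU\d+)"
    r"[^\[]*?errID\s+(0x[0-9a-fA-F]{8}\.[0-9a-fA-F]{8})"
)


def events_by_errid(log_text):
    """Group [AFT0] event notices by errID.

    Returns {errID: [(cpu, event_name), ...]} in log order.
    """
    # Long messages are wrapped across lines, so flatten whitespace first.
    flat = " ".join(log_text.split())
    groups = defaultdict(list)
    for match in NOTICE_RE.finditer(flat):
        event_name, cpu, errid = match.groups()
        groups[errid].append((cpu, event_name))
    return dict(groups)
```

If one errID maps to both a "First Error UCC Event" and a later "UCC Event" on the same CPU, the two notices describe a single underlying event, which is the situation this document covers.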

Per BugID# 4740769, entitled "US-III cpus should be offlined after multiple correctable E$ ECC events", when the UCC and ME bits are both set in the AFSR:

"The special combination of a UCC event with the ME bit set is treated as if three distinct CEs as described above had occurred in very rapid succession. The SERD algorithm is short-circuited and the processor immediately becomes a candidate for offlining. Careful checking is done to make certain that the ME could have been the result only of the UCC and not of any other event."

However, because of the way the trap is handled, the Solaris[TM] OS can detect a single UCC event as UCC+ME; BugID# 4875077 describes this behavior. The offlining seen in the logs is the result of this bug, and the event should be treated as a single bit flip on the SRAM chip.

No hardware replacement is recommended.

Sep 21 04:26:18 mydomain SUNW,UltraSPARC-III+: [ID 322949 kern.info]
NOTICE: [AFT0] First Error UCC Event detected by CPU10 in User mode at
TL=0, errID 0x00002ca7.3a89e5f0
^^^^^^^^^^^^^^^^^^^^^^^^^^
Sep 21 04:26:18 mydomain AFSR 0x00000400.000001ca AFAR 0x00000000.f1141a10
Sep 21 04:26:18 mydomain Fault_PC 0x141a08 Esynd 0x01ca /N0/SB2/P2/E1 J6300
Sep 21 04:26:18 mydomain SUNW,UltraSPARC-III+: [ID 455778 kern.info]
[AFT0] errID 0x00002ca7.3a89e5f0 Data Bit 76 was in error and corrected
Sep 21 04:26:18 mydomain SUNW,UltraSPARC-III+: [ID 408847 kern.info]
[AFT2] errID 0x00002ca7.3a89e5f0 PA=0x00000000.f1141a00
Sep 21 04:26:18 mydomain E$tag 0x00000003.c4001249 E$state_0 Shared
Sep 21 04:26:18 mydomain SUNW,UltraSPARC-III+: [ID 895151 kern.info]
[AFT2] E$Data (0x00) 0x10bfffd5.030063b7 0x9de3bfa0.80a62000 ECC 0x04b
Sep 21 04:26:18 mydomain SUNW,UltraSPARC-III+: [ID 895151 kern.info]
[AFT2] E$Data (0x10) 0x02800034.01000000 0xd0062024.80a22000 ECC 0x172
Sep 21 04:26:18 mydomain SUNW,UltraSPARC-III+: [ID 895151 kern.info]
[AFT2] E$Data (0x20) 0x02800032.030063b7 0xd0062014.80a22000 ECC 0x0f5
Sep 21 04:26:18 mydomain SUNW,UltraSPARC-III+: [ID 895151 kern.info]
[AFT2] E$Data (0x30) 0x22800004.90102000 0xd0062014.90222070 ECC 0x183
Sep 21 04:26:18 mydomain SUNW,UltraSPARC-III+: [ID 929717 kern.info]
[AFT2] D$ data not available
Sep 21 04:26:18 mydomain SUNW,UltraSPARC-III+: [ID 335345 kern.info]
[AFT2] I$ data not available
Sep 21 04:26:30 mydomain SUNW,UltraSPARC-III+: [ID 173316 kern.info]
NOTICE: [AFT0] UCC Event detected by CPU10 in User mode at TL=0,
errID 0x00002ca7.3a89e5f0
^^^^^^^^^^^^^^^^^^^^^^^^
Sep 21 04:26:30 mydomain AFSR 0x00200400.000001ca AFAR 0x00000000.f1141a10
Sep 21 04:26:30 mydomain Fault_PC 0x141a08 Esynd 0x01ca /N0/SB2/P2/E1 J6300
Sep 21 04:26:30 mydomain SUNW,UltraSPARC-III+: [ID 455778 kern.info]
[AFT0] errID 0x00002ca7.3a89e5f0 Data Bit 76 was in error and corrected
Sep 21 04:26:30 mydomain SUNW,UltraSPARC-III+: [ID 408847 kern.info]
[AFT2] errID 0x00002ca7.3a89e5f0 PA=0x00000000.f1141a00
Sep 21 04:26:30 mydomain E$tag 0x00000003.c4001249 E$state_0 Shared
Sep 21 04:26:30 mydomain SUNW,UltraSPARC-III+: [ID 895151 kern.info]
[AFT2] E$Data (0x00) 0x10bfffd5.030063b7 0x9de3bfa0.80a62000 ECC 0x04b
Sep 21 04:26:30 mydomain SUNW,UltraSPARC-III+: [ID 895151 kern.info]
[AFT2] E$Data (0x10) 0x02800034.01000000 0xd0062024.80a22000 ECC 0x172
Sep 21 04:26:30 mydomain SUNW,UltraSPARC-III+: [ID 895151 kern.info]
[AFT2] E$Data (0x20) 0x02800032.030063b7 0xd0062014.80a22000 ECC 0x0f5
Sep 21 04:26:30 mydomain SUNW,UltraSPARC-III+: [ID 895151 kern.info]
[AFT2] E$Data (0x30) 0x22800004.90102000 0xd0062014.90222070 ECC 0x183
Sep 21 04:26:30 mydomain SUNW,UltraSPARC-III+: [ID 929717 kern.info]
[AFT2] D$ data not available
Sep 21 04:26:30 mydomain SUNW,UltraSPARC-III+: [ID 335345 kern.info]
[AFT2] I$ data not available
Sep 21 04:26:31 mydomain SUNW,UltraSPARC-III+: [ID 489146 kern.notice]
NOTICE: [AFT1] CPU10 offlined due to UCC Event with ME set
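
The two AFSR values in the logs above differ in a single bit, which can be checked with a short decoder. The sketch below assumes that ME is bit 53 and UCC is bit 42 of the 64-bit AFSR and that the ECC syndrome occupies the low 9 bits. These positions are consistent with the logged values (0x00000400.000001ca versus 0x00200400.000001ca, Esynd 0x01ca) but should be confirmed against the processor documentation. The function names are hypothetical.

```python
# Assumed AFSR bit positions, inferred from the example logs.
AFSR_ME = 1 << 53     # Multiple Error bit
AFSR_UCC = 1 << 42    # UCC event bit
ESYND_MASK = 0x1FF    # ECC syndrome, low 9 bits


def parse_reg(text):
    """Convert the '0xHHHHHHHH.LLLLLLLL' log notation to a 64-bit integer."""
    high, low = text.lower().replace("0x", "").split(".")
    return (int(high, 16) << 32) | int(low, 16)


def decode_afsr(text):
    """Report the UCC and ME flags and the syndrome for a logged AFSR value."""
    value = parse_reg(text)
    return {
        "UCC": bool(value & AFSR_UCC),
        "ME": bool(value & AFSR_ME),
        "Esynd": value & ESYND_MASK,
    }
```

Applied to the two logged values, the first decodes as UCC without ME and the second as UCC with ME, with the same syndrome and the same AFAR in both cases. That pattern, together with the shared errID, matches the single-event behavior described in BugID# 4875077.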

Blurb on errID:

errID is the value of the CPU's %stick register, which Solaris reads through its high-resolution timer function, hirestime, so each errID corresponds to a point in time. It is intended to identify each error uniquely by tagging it with this high-resolution timer value. Note, however, that because of the nature of traps and the order in which they are handled and reported, errID cannot always be used to order errors chronologically.



Product
Netra 1280 Server
Netra 1290 Server
Sun Fire V1280 Server
Sun Fire 6800 Server
Sun Fire 4810 Server
Sun Fire 4800 Server
Sun Fire 3800 Server
Sun Fire 15K Server
Sun Fire 12K Server
Sun Fire V480 Server
Sun Fire V440 Server
Sun Fire V880 Server
Netra 440 Server
Sun Fire E4900 Server
Sun Fire E6900 Server
Sun Fire E2900 Server

Internal Comments
BugID #'s:

4875077, 4740769


Note that bug 4875077 details a potential workaround: disable offlining of CPUs from correctable ECC events in the L2. There is a risk associated with this. While we won't improperly offline a working CPU that took a single correctable ECC event from its L2 SRAM, we also won't offline a CPU that is taking *any* correctable ECC events from L2 SRAM, regardless of how many. If the customer needs a workaround, this can be used, but be certain to explain the caveats to them!


BugID# 4875077 has been fixed as of the following Solaris patch releases and all later revisions:


Solaris[TM] 8 PatchID# 108528-29 + 117000-01. Patch 117000-01 was released on March 26, 2004.

Solaris[TM] 9 PatchID# 112233-12. Patch 112233-12 was officially released on May 13, 2004.



FYI only: Although this bug affects only UltraSPARC III CPUs, the bug fix code was also incorporated in the following x86 releases:

Solaris[TM] 8 x86 PatchID# 108529-29 + 117001-01. Patch 117001-01 was released on March 26, 2004.

Solaris[TM] 9 x86 PatchID# 112234-12. Patch 112234-12 was released on April 19, 2004.


UCC, ME, ECC, UCC+ME
Previously Published As
72159

Change History
Date: 2009-12-02
User Name: Josh Freeman
Action: Refreshed
Comment: Format changes to the document and that is it. ESG Content Team update...
Date: 2006-01-17
User Name: 18392
Action: Update Canceled
Comment: *** Restored Published Content *** SSH Audit

Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.