Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-77-1277973.1
Update Date: 2011-01-03
Keywords:

Solution Type  Sun Alert Sure

Solution  1277973.1 :   Sun4v CMT Systems May Hang/Panic/Reset or Power Off as a Result of Handling Correctable or Retryable Events  


Related Items
  • Sun Netra T5440 Server
  • Sun SPARC Enterprise T5440 Server
  • Sun SPARC Enterprise T5240 Server
  • Sun Blade T6340 Server Module
  • Sun SPARC Enterprise T5140 Server
  • Sun Netra T6340 Server Module
Related Categories
  • GCS>Sun Microsystems>Sun Alert>Criteria Category>Availability
  • GCS>Sun Microsystems>Sun Alert>Release Phase>Resolved




In this Document
  Description
  Likelihood of Occurrence
  Possible Symptoms
  Workaround or Resolution
  Modification History
  References


Applies to:

Sun SPARC Enterprise T5140 Server - Version: Not Applicable and later    [Release: N/A and later]
Sun SPARC Enterprise T5240 Server - Version: Not Applicable and later    [Release: N/A and later]
Sun SPARC Enterprise T5440 Server - Version: Not Applicable and later    [Release: N/A and later]
Sun Blade T6340 Server Module - Version: Not Applicable and later    [Release: N/A and later]
Sun Netra T5440 Server - Version: Not Applicable and later    [Release: N/A and later]
Sun SPARC Sun OS
________________________________

Date of Resolved Release: 01-Jan-2011

________________________________

Description


Multi-socket Sun4v CMT systems may, under certain conditions, lose system availability while handling fault events, resulting in a system reset, a panic, or possibly a system hang. The failure occurs when handling an event requires data from a remote CPU node: an uninitialised register is used, which triggers a fault condition.

Likelihood of Occurrence


This issue can occur on the following Sun4v CMT systems when they are running system firmware 7.2.10.a or earlier:

  • Sun SPARC Enterprise T5140, T5240, T5440
  • Sun Blade T6340
  • Sun Netra T6340, Netra T5440

Notes:

1. No other Blade, Enterprise, or Netra systems are affected by this issue.

2. There is no specific set of conditions likely to trigger this issue, nor any method of predicting when or how frequently this issue may occur. The risk of seeing this issue is regarded as low, but the impact to system availability is high since an outage may occur.

To determine the firmware version on the system, run the following command from the ILOM CLI and check the sysfw_version property:
-> show HOST

/HOST
 Targets:
     bootmode
     diag
     domain

Properties:
   autorestart = reset
   autorunonerror = false
   bootfailrecovery = poweroff
   bootrestart = none
   boottimeout = 0
   hypervisor_version = Hypervisor 1.7.2.b 2009/07/17 09:35
   macaddress = 00:14:4f:ef:1b:c4
   maxbootfail = 3
   obp_version = OBP 4.30.2.b 2009/06/16 07:02
   post_version = POST 4.30.2 2009/04/21 09:57
   send_break_action = (none)
   status = Solaris running
   sysfw_version = Sun System Firmware 7.2.2.g 2009/07/17 10:34  <<<<<

Commands:
   cd
   set
   show
->
or, from the ALOM compatibility shell:
sc> showhost
Sun System Firmware 7.2.7.b 2010/01/07 17:56

Host flash versions:
   Hypervisor 1.7.6 2009/12/01 14:30
   OBP 4.30.6 2009/12/01 12:41
   POST 4.30.6 2009/12/01 13:18
sc>
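
Once the sysfw_version line or the showhost banner has been captured, the comparison against the fixed release can be scripted. The Python sketch below is illustrative only: the function names parse_sysfw_version and is_affected are hypothetical, and it assumes the version string has the form "Sun System Firmware X.Y.Z[.letter]" as in the sample output above. It treats any release below 7.3.0 as affected and does not check whether the platform is one of those listed in this document.

```python
import re

# First fixed release according to this document.
FIXED_VERSION = (7, 3, 0)

def parse_sysfw_version(text):
    """Return (major, minor, micro) from a 'Sun System Firmware X.Y.Z' string, or None."""
    match = re.search(r"Sun System Firmware (\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())

def is_affected(text):
    """True if the firmware version found in text is older than FIXED_VERSION."""
    version = parse_sysfw_version(text)
    if version is None:
        raise ValueError("no firmware version found in input")
    # Tuples compare numerically field by field, so 7.2.10 sorts after 7.2.2.
    return version < FIXED_VERSION

if __name__ == "__main__":
    sample = "sysfw_version = Sun System Firmware 7.2.2.g 2009/07/17 10:34"
    print(parse_sysfw_version(sample), is_affected(sample))
```

Comparing integer tuples rather than raw strings avoids ordering 7.2.10 before 7.2.2, which a plain string comparison would do.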

Possible Symptoms


On multi-socket CMT systems, when the uninitialised register is used while reading data from a remote node, the system may lose availability in one of the following ways:

- A Solaris panic, such as one of the following:
panic: send_mondo timeout
panic: unrecoverable hardware error
- Hypervisor Abort (HVabort) followed by a system power off
- Red State Exception and system reset
- Watchdog Exception leading to a system reset
- System Hang
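
The two panic strings listed above can be searched for in saved console or message logs. The following Python sketch is a hedged illustration: the function name find_panic_signatures is hypothetical, the caller supplies the log lines, and a match is only a hint that a system may have hit this issue, not a diagnosis.

```python
# Panic strings quoted in this document.
PANIC_SIGNATURES = (
    "send_mondo timeout",
    "unrecoverable hardware error",
)

def find_panic_signatures(lines):
    """Return (signature, line) pairs for every log line containing a listed panic string."""
    hits = []
    for line in lines:
        for signature in PANIC_SIGNATURES:
            if signature in line:
                hits.append((signature, line.strip()))
    return hits

if __name__ == "__main__":
    log = [
        "panic: send_mondo timeout",
        "some unrelated console message",
    ]
    for signature, line in find_panic_signatures(log):
        print(signature, "->", line)
```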

Workaround or Resolution


To resolve this issue, upgrade system firmware to 7.3.0 (or above), using the appropriate patch listed below:  
  • Sun SPARC Enterprise T5140/T5240 patch 145676-01 or later
  • Sun SPARC Enterprise T5440 patch 145678-01 or later
  • Sun Netra T5440 patch 145677-01 or later
  • Sun Blade T6340 patch 145679-01 or later
  • Sun Netra T6340 patch 145680-01 or later
Note: Although the likelihood of experiencing this issue is low, upgrading to firmware 7.3.0 (or later) is recommended at the earliest opportunity your schedule allows.
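
The platform-to-patch mapping above lends itself to a small lookup. The Python sketch below is an illustration under stated assumptions: the names MINIMUM_PATCH, patch_satisfies and is_resolved are hypothetical, the platform keys are written here for convenience, and "or later" is interpreted as the same patch base number with an equal or higher revision suffix.

```python
# Minimum patch per platform, copied from the list above.
MINIMUM_PATCH = {
    "Sun SPARC Enterprise T5140": "145676-01",
    "Sun SPARC Enterprise T5240": "145676-01",
    "Sun SPARC Enterprise T5440": "145678-01",
    "Sun Netra T5440": "145677-01",
    "Sun Blade T6340": "145679-01",
    "Sun Netra T6340": "145680-01",
}

def split_patch_id(patch_id):
    """Split 'NNNNNN-RR' into (base, revision as int)."""
    base, _, revision = patch_id.partition("-")
    return base, int(revision)

def patch_satisfies(installed, required):
    """True if installed has the same base as required and an equal or later revision."""
    installed_base, installed_rev = split_patch_id(installed)
    required_base, required_rev = split_patch_id(required)
    return installed_base == required_base and installed_rev >= required_rev

def is_resolved(platform, installed_patch):
    """True if installed_patch meets the minimum listed for platform."""
    required = MINIMUM_PATCH.get(platform)
    if required is None:
        raise KeyError("platform not listed in this document: " + platform)
    return patch_satisfies(installed_patch, required)

if __name__ == "__main__":
    print(is_resolved("Sun Blade T6340", "145679-01"))
```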

Modification History

Date of Resolved Release: 01-Jan-2011

Internal Comments:
Please send technical questions to the following email:
[email protected]
and copy the Responsible Engineer and Knowledge Analyst

6983478 - Multi-node systems crashing after CE due to incorrect rerouting code

CR 6983478 relates to a coding issue within Hypervisor where an uninitialised
register is used to dereference the internal error table entry when reading
data from a remote node, such as Error Status Registers. Because the register
is uninitialised, dereferencing it can produce a number of failures; what is
most often seen is a system panic, hang, or reset, or an hvabort. This issue
applies only to multi-socket CMT systems, and it has been seen by a number of
customers to date.

For more in-depth detail on this issue, please review the CR referenced above.

Internal Contributor/Submitter:  [email protected], [email protected]
Internal Eng Responsible Engineer:  [email protected]
Internal Services Knowledge Analyst:  [email protected]
Internal Eng Business Unit Group:  Systems Group-SVS (SPARC Volume Systems,
Horizontal Systems (includes T2000/Ontario))
Internal Escalation ID: 2-8213384

References

<SUNPATCH:145676-01>
<SUNPATCH:145677-01>
<SUNPATCH:145678-01>
<SUNPATCH:145679-01>
<SUNPATCH:145680-01>
<SUNBUG:6983478>

Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.