Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-72-1018813.1
Update Date: 2010-06-15
Keywords:

Solution Type  Problem Resolution Sure

Solution  1018813.1 :   Sun Fire[TM] 3800-6800: Domains running firmware 5.15.x or later with hang-policy set to "notify" may lose critical troubleshooting data  


Related Items
  • Sun Fire 3800 Server
  • Sun Fire 6800 Server
  • Sun Fire 4800 Server
  • Sun Fire 4810 Server
Related Categories
  • GCS>Sun Microsystems>Servers>Midrange Servers

Previously Published As
230603


Applies to:

Sun Fire 3800 Server
Sun Fire 4800 Server
Sun Fire 4810 Server
Sun Fire 6800 Server
All Platforms

Symptoms
Starting with firmware level 5.15.0, ScApp detects hung domains and, depending
on the setting of the domain hang-policy variable, can attempt to reset them.
Systems initially installed with 5.15.0 or later default hang-policy to
"reset", so the SC attempts to reset a hung domain.

The hang-policy variable was also present in earlier firmware versions.
However, systems that were initially installed with an earlier firmware version
have hang-policy set to "notify" by default. When these systems are upgraded to
5.15.0 or later, the current value of hang-policy and all other existing domain
and platform settings are left intact. This causes two issues.

First, the SC will not attempt to automatically reset a domain with
hang-policy=notify, negating the effect of this new ScApp feature.

Second, and possibly more importantly, the new features in 5.15.x cause the SC
to log a notice that the domain is hung each time it polls the domain to
determine whether it is active. The SC logs this notice both to the loghost
server and to its internal log buffers, which supply the data displayed by the
showlogs command. The internal buffer is circular: each new entry displaces the
oldest entry still present in the buffer. As a result, a domain hang with
hang-policy set to notify fills the circular buffer with repeated notices and
eliminates from "showlogs -d x" any useful data that would indicate the initial
condition that caused the hang. An example of these messages:
...
Aug 09 07:55:12 sunfire-sc0 Domain-C.SC: [ID 180731 local0.notice] Domain C is active again
Aug 09 07:55:12 sunfire-sc0 Domain-C.SC: [ID 690470 local0.error] Domain watchdog timer expired.
Aug 09 07:55:12 sunfire-sc0 Domain-C.SC: [ID 398807 local0.notice] hang-policy is NOTIFY. Not resetting domain.
Aug 09 07:55:13 sunfire-sc0 Domain-C.SC: [ID 180731 local0.notice] Domain C is active again
Aug 09 07:55:13 sunfire-sc0 Domain-C.SC: [ID 690470 local0.error] Domain watchdog timer expired.
Aug 09 07:55:13 sunfire-sc0 Domain-C.SC: [ID 398807 local0.notice] hang-policy is NOTIFY. Not resetting domain.
...

If no working loghost is configured for the domain, the overwritten entries
are not preserved anywhere else and the failure cannot be troubleshot.
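
The circular-buffer behavior described above can be illustrated with a short
Python sketch. It is only a model, not ScApp code: the buffer capacity of 12
entries, the function name simulate_log_buffer, and the text of the "root
cause" entry are all assumptions chosen for illustration. The real size of the
SC log buffer is not stated in this document.

```python
from collections import deque


def simulate_log_buffer(capacity, initial_events, polls):
    """Model a circular log buffer during a hang with hang-policy=notify.

    capacity is a hypothetical buffer size; the three messages appended per
    poll are taken from the example log output in this document.
    """
    # A deque with maxlen silently drops the oldest entry once it is full,
    # which is the behavior described for the SC's internal log buffer.
    buffer = deque(maxlen=capacity)
    for event in initial_events:
        buffer.append(event)
    for _ in range(polls):
        buffer.append("Domain C is active again")
        buffer.append("Domain watchdog timer expired.")
        buffer.append("hang-policy is NOTIFY. Not resetting domain.")
    return list(buffer)


if __name__ == "__main__":
    # Hypothetical entry standing in for whatever was logged before the hang.
    root_cause = "hypothetical entry describing the original fault"

    after_two_polls = simulate_log_buffer(12, [root_cause], polls=2)
    after_ten_polls = simulate_log_buffer(12, [root_cause], polls=10)

    print(root_cause in after_two_polls)   # True: 7 entries fit in 12 slots
    print(root_cause in after_ten_polls)   # False: 31 entries, oldest evicted
```

In this model the original entry survives only while the total number of
entries stays within the buffer capacity. Because three messages are added per
poll, a hang that lasts more than a few polling intervals is enough to push it
out.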

Solution


Resolution
Use the "setupdomain" command to set hang-policy to "reset" in every domain on platforms that have been upgraded to 5.15.x or later. Always configure a working loghost for all Sun Fire 3800/4800/4810/6800 platforms and domains.
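
Where a loghost is already collecting SC messages, one way to find domains that
still run with hang-policy=notify is to search the loghost file for the NOTIFY
message shown in the Symptoms section. The Python sketch below is a hedged
example only: the function name count_notify_messages and the regular
expression are assumptions, and it assumes each syslog message occupies a
single line in the loghost file (unlike the wrapped example above).

```python
import re
from collections import Counter

# Matches the SC notice from the example output, e.g.
# "... Domain-C.SC: [ID 398807 local0.notice] hang-policy is NOTIFY. ..."
# The pattern is an assumption derived from that example.
NOTIFY_RE = re.compile(
    r"Domain-(\w+)\.SC: \[ID \d+ [\w.]+\]\s+hang-policy is NOTIFY"
)


def count_notify_messages(lines):
    """Count 'hang-policy is NOTIFY' messages per domain letter."""
    counts = Counter()
    for line in lines:
        match = NOTIFY_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Sample lines copied from the example in this document, unwrapped.
    sample = [
        "Aug 09 07:55:12 sunfire-sc0 Domain-C.SC: [ID 690470 local0.error] "
        "Domain watchdog timer expired.",
        "Aug 09 07:55:12 sunfire-sc0 Domain-C.SC: [ID 398807 local0.notice] "
        "hang-policy is NOTIFY. Not resetting domain.",
        "Aug 09 07:55:13 sunfire-sc0 Domain-C.SC: [ID 398807 local0.notice] "
        "hang-policy is NOTIFY. Not resetting domain.",
    ]
    for domain, count in sorted(count_notify_messages(sample).items()):
        print(f"Domain {domain}: {count} NOTIFY message(s)")
```

The function accepts any iterable of lines, so an open loghost file can be
passed in place of the sample list. Any domain it reports is a candidate for
the "setupdomain" change described above.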

Product
Sun Fire 6800 Server
Sun Fire 4810 Server
Sun Fire 4800 Server
Sun Fire 3800 Server

Internal Comments
Authored by David Re, PTS-AMER-MSG.

Bug 4906714 has been submitted re: hang-policy=notify behavior rolling the
SC logs.

An RFE may be forthcoming to have hang-policy on each domain changed to
reset upon the first upgrade to a FW level >= 5.15.x.

Keywords: hang-policy, showlogs, notify, reset
Previously Published As
71216




Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.