Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-72-1004712.1
Update Date:2009-12-03
Keywords:

Solution Type  Problem Resolution Sure

Solution  1004712.1 :   Sun Fire[TM] 12K/15K/E20K/E25K: Domain reboot hangs at "resetting..." and does not run HPOST  


Related Items
  • Sun Fire E25K Server
  • Sun Fire 12K Server
  • Sun Fire 15K Server
Related Categories
  • GCS>Sun Microsystems>Servers>High-End Servers

PreviouslyPublishedAs
206542


Symptoms
A Sun Fire[TM] 12K/15K/E20K/E25K domain is restarted with reboot, init 6, or a "halt + boot" (for example, init 0 followed by boot at the ok prompt). The domain goes down to OBP and prints "Resetting...", where it sits indefinitely. No post log is generated and no hpost process is ever started for the domain. A manual setkeyswitch off followed by setkeyswitch on finally brings the domain back up with no issues. Here is an example of this issue taken from a real case:

May 19 22:06:38 2003 # init 0
May 19 22:06:43 2003
May 19 22:06:44 2003 INIT: New run level: 0
May 19 22:06:44 2003 The system is coming down.  Please wait.
May 19 22:06:44 2003 System services are now being stopped.
May 19 22:06:54 2003 Print services already stopped.
May 19 22:07:26 2003 The system is down.
May 19 22:07:56 2003 syncing file systems... done
May 19 22:08:03 2003 Program terminated
May 19 22:08:43 2003 {2} ok boot
May 19 22:08:43 2003 Resetting...
May 19 22:53:12 2003
May 19 22:53:12 2003 @(#)OBP 4.5.20 2003/02/13 18:08 Sun Fire 15000
May 19 22:53:12 2003 IOSRAM based Console initialized
May 19 22:53:12 2003 Probing Pseudo NVRAM device
The customer ran init 0 and then issued the boot command at the ok prompt.
The domain sat at "Resetting..." for almost 45 minutes before the customer
finally intervened with a setkeyswitch off and on to restore the domain to
the OS.  Afterward, the customer issued a reboot to see whether the issue
occurred only with "init 0 + boot".  It did not; the reboot also hung at
"Resetting...".


Resolution
Here is an example taken from the same customer case during a successful boot cycle:
May  8 15:40:47 2003 rebooting...
May  8 15:40:47 2003 Resetting...
May  8 15:47:49 2003
May  8 15:48:02 2003
May  8 15:48:02 2003
May  8 15:48:02 2003 Sun Fire 15000, using IOSRAM based Console
May  8 15:48:03 2003 Copyright 1998-2002 Sun Microsystems, Inc. All rights reserved.
May  8 15:48:03 2003 OpenBoot 4.5, 94208 MB memory installed, Serial #44593284.
May  8 15:48:03 2003 Ethernet address 0:0:be:a8:70:84, Host ID: 82a87084.
May  8 15:48:03 2003
May  8 15:48:03 2003
May  8 15:48:03 2003
May  8 15:48:04 2003 Rebooting with command: boot
May  8 15:48:04 2003
May  8 15:48:05 2003 Boot device: /pci@1c,600000/pci@1/scsi@2/disk@0,0:a
File and args: /
During the successful boot cycle for this domain, about seven minutes pass
between the "Resetting..." message and the OBP banner.  The banner is an
indication that hpost has completed on the domain.
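The gap between "Resetting..." and the next console line can be measured
directly from a timestamped console log.  The following is a minimal sketch,
not a Sun-supplied tool.  It assumes every console line carries a timestamp
prefix in the form shown in the excerpts above (for example
"May 19 22:08:43 2003") and that month names are in English.  The names
parse_line and reset_gaps are illustrative only.

```python
from datetime import datetime

STAMP_FORMAT = "%b %d %H:%M:%S %Y"  # e.g. "May 19 22:08:43 2003"


def parse_line(line):
    """Return (timestamp, message), or None if the line has no timestamp prefix."""
    tokens = line.split()
    try:
        stamp = datetime.strptime(" ".join(tokens[:4]), STAMP_FORMAT)
    except ValueError:
        return None  # wrapped continuation line or other unstamped text
    return stamp, " ".join(tokens[4:])


def reset_gaps(lines):
    """Return the seconds between each "Resetting..." and the next stamped line."""
    gaps = []
    reset_at = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        stamp, message = parsed
        if reset_at is not None:
            gaps.append((stamp - reset_at).total_seconds())
            reset_at = None
        if message.startswith("Resetting"):
            reset_at = stamp
    return gaps


# Lines copied from the two console excerpts in this document.
hung = [
    "May 19 22:08:43 2003 {2} ok boot",
    "May 19 22:08:43 2003 Resetting...",
    "May 19 22:53:12 2003",
    "May 19 22:53:12 2003 @(#)OBP 4.5.20 2003/02/13 18:08 Sun Fire 15000",
]
healthy = [
    "May  8 15:40:47 2003 rebooting...",
    "May  8 15:40:47 2003 Resetting...",
    "May  8 15:47:49 2003",
]
print(reset_gaps(hung))     # [2669.0]  (about 44.5 minutes)
print(reset_gaps(healthy))  # [422.0]   (about 7 minutes)
```

A gap far beyond the roughly seven minutes seen in the healthy cycle suggests
that the domain only came back after manual intervention.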
When a domain resets at OBP, HPOST is supposed to run on the domain
components.  An hpost process should exist on the SC while this happens, and
if the domain was rebooted, that hpost process is passed the -Q option
(Quick POST).
-------------------
Below SMS 1.3
-------------------
A hang at the "Resetting..." stage might be the result of domain_asr
(domain Automatic System Recovery) being disabled, if the SMS version is
below SMS 1.3.  Domain ASR can be disabled in the dsmd_tuning.txt file
located in the /etc/opt/SUNWSMS/SMS1.X/config directory on the system
controller.
The dsmd_tuning.txt file is the configuration file for the Domain Status
Monitoring Daemon (dsmd).  It tells dsmd on the system controller how to
monitor and control the platform's domains.  The setting for domain_asr is
shown towards the bottom of the file.
From /etc/opt/SUNWSMS/SMS1.X/config/dsmd_tuning.txt:
--------------------------------------------------------------------
** The default monitoring controls are on.
*  To turn off all domains state monitoring, change domain_mon to 0.
*  To turn off all domains recovery actions, change domain_asr to 0.
*
domain_mon = 1
domain_asr = 1
--------------------------------------------------------------------
If "domain_asr = 0" and you are running a version of SMS older than 1.3,
this setting is why the domain hangs at "Resetting..." during normal
reboot or boot operations.
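The check described above can be sketched in a few lines.  This is an
illustration, not part of SMS.  It assumes the file format matches the
excerpt shown above: "name = value" settings, with lines that begin with "*"
treated as comments.  The function names read_tuning and asr_disabled are
hypothetical.

```python
def read_tuning(text):
    """Parse "name = value" settings; lines starting with "*" are comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("*") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        settings[name.strip()] = value.strip()
    return settings


def asr_disabled(text):
    """Return True only if the file explicitly sets domain_asr to 0."""
    return read_tuning(text).get("domain_asr") == "0"


# Text copied from the dsmd_tuning.txt excerpt in this document.
SAMPLE = """\
** The default monitoring controls are on.
*  To turn off all domains state monitoring, change domain_mon to 0.
*  To turn off all domains recovery actions, change domain_asr to 0.
*
domain_mon = 1
domain_asr = 1
"""
print(read_tuning(SAMPLE))   # {'domain_mon': '1', 'domain_asr': '1'}
print(asr_disabled(SAMPLE))  # False
```

The same check would apply to the platform file and to any domain-specific
dsmd_tuning.txt file, by reading each file's contents and passing the text to
asr_disabled.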
********************************************************************
NOTE:  Each domain can also have its own dsmd_tuning.txt file which
controls how dsmd behaves only for that specific domain.  The
domain-specific dsmd_tuning.txt file would be in the domain
configuration directory, /etc/opt/SUNWSMS/config/<A-R>.  Make
sure domain_asr is not disabled here either.
********************************************************************
Re-enable domain ASR by changing "domain_asr = 0" to "domain_asr = 1" in the
affected dsmd_tuning.txt files, and then restart dsmd so that it re-reads its
configuration file.  Dsmd is best restarted by stopping and starting SMS, but
first make sure that failover is off and that no platform configuration
changes are occurring when you stop and start SMS.  Make the changes on both
SCs so that the configuration of dsmd is the same regardless of which SC is
the MAIN.
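The edit itself is a one-line change.  The sketch below shows one way to
apply it to the text of a dsmd_tuning.txt file while leaving comment lines
and other settings untouched.  It is an illustration under the assumption
that the setting is written as "domain_asr = 0" on a line of its own, as in
the excerpt above.  The function name enable_asr is hypothetical, and the
sketch works on text only; it does not write any file or restart dsmd.

```python
import re

# Match a line that consists only of "domain_asr = 0", allowing spaces or tabs.
ASR_OFF = re.compile(r"^([ \t]*domain_asr[ \t]*=[ \t]*)0[ \t]*$", re.MULTILINE)


def enable_asr(text):
    """Return the file text with "domain_asr = 0" changed to "domain_asr = 1"."""
    return ASR_OFF.sub(r"\g<1>1", text)


before = (
    "*  To turn off all domains recovery actions, change domain_asr to 0.\n"
    "domain_mon = 1\n"
    "domain_asr = 0\n"
)
print(enable_asr(before))
```

Because the pattern is anchored to the start of the line, the comment line
that merely mentions domain_asr is not modified.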
-------------------
SMS 1.3 and Above
-------------------
The fix for Bug ID 4658538, introduced in SMS 1.3, allows a domain to reboot
properly regardless of the domain_asr setting.  So, if this behavior is
encountered and the SMS version is 1.3 or higher, the cause is something else.
The most likely cause of this behavior on SMS 1.3 and above is a permissions
problem on the files that configure HPOST for the platform or domain.  With
a permissions problem, you should expect to see the domain reboot, go down
to OBP, and appear to hang at "Resetting..." as described above.  In this
case, however, hpost is started for the domain and post logs are created.
The post logs (in /var/opt/SUNWSMS/SMS/adm/<A-R>/post) should show an error
like the following:
# Cmdline:  /opt/SUNWSMS/SMS1.3/bin/hpost -d B -Q
Unable to open .postrc file /etc/opt/SUNWSMS/config/B/.postrc
Permission denied
Errors in .postrc file. Bailing out!
As that message clearly indicates, hpost cannot read the .postrc file in
question, so the domain remains at "Resetting..." trying to execute HPOST
on the domain.  Ultimately, a setkeyswitch off and on is executed and the
domain posts just fine, and then boots back up.
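A post log can be scanned for this error automatically.  The sketch below is
an illustration only.  It assumes the error appears exactly as in the excerpt
above, with "Permission denied" on the line after the "Unable to open .postrc
file" message.  The function name unreadable_postrc is hypothetical.

```python
import re

# The file name is captured; "\s+" spans the line break before "Permission denied".
POSTRC_ERROR = re.compile(
    r"Unable to open \.postrc file (\S+)\s+Permission denied"
)


def unreadable_postrc(log_text):
    """Return the .postrc paths that hpost reported as unreadable."""
    return POSTRC_ERROR.findall(log_text)


# Text copied from the post log excerpt in this document.
post_log = """\
# Cmdline:  /opt/SUNWSMS/SMS1.3/bin/hpost -d B -Q
Unable to open .postrc file /etc/opt/SUNWSMS/config/B/.postrc
Permission denied
Errors in .postrc file. Bailing out!
"""
print(unreadable_postrc(post_log))  # ['/etc/opt/SUNWSMS/config/B/.postrc']
```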
When a domain is rebooted, the sms-dsmd user is responsible for executing
HPOST on the domain.  When a domain is keyswitched on or off, HPOST is
executed by the sms-svc user (or by a domain-specific user if ACLs, Access
Control Lists, are in use).  These different users must both have access to
the configuration files for HPOST in order to properly recover a domain if
necessary.
The .postrc files and blacklist files used in HPOST need to be world
readable (644) regardless of the owner of the file.  If world readable,
both sms-svc and sms-dsmd can read and configure a domain properly at this
"Resetting..." stage of OBP.

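The world-readable requirement can be verified from a file's mode bits.  The
sketch below is an illustration, assuming a POSIX system such as the Solaris
system controller.  The function names world_readable and not_world_readable
are hypothetical, and the demonstration uses a scratch file rather than a
real .postrc or blacklist file.

```python
import os
import stat
import tempfile


def world_readable(path):
    """Return True if the "other" read bit is set, as it is with mode 644."""
    return bool(os.stat(path).st_mode & stat.S_IROTH)


def not_world_readable(paths):
    """Return the subset of paths that lack the world-read bit."""
    return [p for p in paths if not world_readable(p)]


# Demonstration on a scratch file.
with tempfile.NamedTemporaryFile(delete=False) as scratch:
    demo_path = scratch.name
os.chmod(demo_path, 0o600)
print(world_readable(demo_path))  # False
os.chmod(demo_path, 0o644)
print(world_readable(demo_path))  # True
os.remove(demo_path)
```

On a system controller, the list passed to not_world_readable would contain
the .postrc and blacklist files in use, for example
/etc/opt/SUNWSMS/config/B/.postrc from the error shown earlier.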

Relief/Workaround
If the reboot that triggers this issue results from a cron job, or from a panic overnight or on a weekend when nobody is around, the domain may stay hung at "Resetting..." for a long time, until someone intervenes manually to bring it back up.
Disable domain_asr only when instructed to do so by Sun support, and understand the risk of doing so on a version of SMS older than 1.3. This issue also shows why the HPOST configuration files must have the correct permissions, regardless of SMS version.  These seemingly trivial changes can leave a domain down for extended periods after something as basic as a reboot.


Additional Information
The Problem Resolution below provides full details on HPOST configuration file permissions.
<Document: 1010600.1> Sun Fire[TM] 12K/15K/E20K/E25K: "Domain failed by hpost: ecode=39"

Product
Sun Fire 15K Server
Sun Fire 12K Server
Sun Fire E25K Server

Internal Comments
For internal Sun use only.

See Problem Resolution < Document: 1004778.1 > for details on why domain_asr might be disabled.

Bug ID 4521655 was filed for the domain_asr behavior. Just know that this
"Resetting..." hang isn't a bug. This is how asr worked prior to SMS 1.3.

Bug ID 4658538 allowed asr to be disabled and still allow for domain recovery
through a reboot.
starcat, 12k, 15k, resetting, rebooting, boot, hang, asr, dsmd, dsmd_tuning
Previously Published As
70064

Change History
Reviewed by ESG Content Team on Nov 24, 2009
Date: 2006-01-22
User Name: 18392
Action: Update Canceled
Comment: *** Restored Published Content *** SSH Audit

Date: 2006-01-22
User Name: 18392
Action: Update Started
Comment: SSH Audit
Version: 0

Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.