Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-71-1009222.1
Update Date:2011-05-30
Keywords:

Solution Type  Technical Instruction Sure

Solution  1009222.1 :   Sun Fire[TM] 15K/12K Servers: setkeyswitch ops report "[5358] Transmission or pcd(1M) handling of domain-down event failed: ecode=1711"  


Related Items
  • Sun Fire 15K Server
Related Categories
  • GCS>Sun Microsystems>Servers>High-End Servers

Previously Published As
212762


Applies to:

Sun Fire 15K Server
All Platforms

Goal

The SMS (System Management Services) command-line interface (CLI) command setkeyswitch(1M) changes the position of the virtual keyswitch to the specified value. setkeyswitch is responsible for powering boards on or off and for bringing up a domain.

This short article documents the troubleshooting process used to isolate the root cause of failed setkeyswitch operations that report the error message "[5358] Transmission or pcd(1M) handling of domain-down event failed: ecode=1711".

Solution

The earliest symptoms of this error condition vary; for example, a Solaris[TM] reboot or init 6 initiated from the domain OS appears to hang. Under such circumstances, the SMS domain logs contain the following messages:
Jan 26 11:28:54 2005 v4u-15ka-sc1 dsmd[2491]-R(): [2536 4842682682472752 NOTICE DomainsPatrol.cc 724] Reset domain R request received, restarting domain.
Jan 26 11:32:08 2005 v4u-15ka-sc1 dsmd[2491]-R(): [5304 4842876560970785 ERR SysControl.cc 1853] Domain failed by hpost: ecode=44
Jan 26 11:32:23 2005 v4u-15ka-sc1 dsmd[2491]-R(): [2507 4842891980127738 ERR Observers.cc 102] Failed to send DSMD_EVENT_DOMAIN_STOP 17 event to PCD, rc = 1711.


Further investigation of domain R's POST log directory (/var/opt/SUNWSMS/SMS1.4.1/adm/R/post) shows the following anomaly:
-rw-rw-rw-    1        sms-dsmd  sms      0       Jan 26 11:29 post050126.1129.00.log

The POST log captured for the above domain reboot event is empty (0 bytes), and the domain's status remains stuck at "In Recovery":
R       v4u-15ka-r    -      In Recovery


In addition, all attempts to recover the domain with the setkeyswitch CLI yield the ecode=1711 error message:
v4u-15ka-sc1:sms-svc:40> setkeyswitch -d R standby
Current virtual key switch position is "ON".
Are you sure you want to change to the "STANDBY" position (yes/no)? yes
Domain is up.
Sending domain shutdown request.
Domain failed to pick up shutdown request.
You can abort or force a shutdown.
Do you want to force a shutdown (yes/no)? yes
[5358] Transmission or pcd(1M) handling of domain-down event failed: ecode=1711


The SMS platform logs for the same time period contain the following entries:
Jan 26 11:32:09 2005 v4u-15ka-sc1 ssd[724]: [1310 4842877760286195 NOTICE StartupManager.cc 3239] software component shutdown successful: name=dxs-R
Jan 26 11:32:09 2005 v4u-15ka-sc1 pcd[2476]: [1754 4842878088665310 ERR PCDApp.cc 2533] PCD chkpt WRITE failed. session id: 128, status: 8
Jan 26 11:32:09 2005 v4u-15ka-sc1 pcd[2476]: [1764 4842878089336030 ERR PCDApp.cc 1532] PCD unable to checkpoint Domain Down event sequence
Jan 26 11:32:09 2005 v4u-15ka-sc1 pcd[2476]: [1711 4842878090954397 ERR DomainMgr.cc 356] Unable to write file: /var/opt/SUNWSMS/SMS1.4.1/.pcd/domain_info.tmp with errno = 28
Jan 26 11:32:09 2005 v4u-15ka-sc1 pcd[2476]: [1711 4842878092301630 ERR BoardMgr.cc 372] Unable to write file: /var/opt/SUNWSMS/SMS1.4.1/.pcd/sysboard_info.tmp with errno = 28


The above SMS platform log extracts show the following conditions:
  1. The SMS DXS daemon, which provides virtual console functionality, dynamic reconfiguration mailbox support, and PCI mailbox software support to its resident domains, shut down shortly after the domain OS reboot was initiated.
  2. The SMS PCD service was unable to acquire the current board list from its database and set up checkpointing information. In addition, it reported ENOSPC (errno 28) when attempting to write parts of the PCD database (domain_info and sysboard_info).
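The errno 28 entries noted above can be picked out of a saved log with a small filter. The following is a minimal sketch, not an SMS utility: the function name find_enospc_lines is hypothetical, and the log file path is supplied by the caller rather than assumed.

```shell
# Print pcd entries that report errno 28 (ENOSPC, "No space left on device").
# $1 is the path to the log file to scan; it is supplied by the caller
# and is not an SMS default.
find_enospc_lines() {
    grep 'pcd\[[0-9]*\]' "$1" | grep 'errno = 28'
}
```

For example, find_enospc_lines /path/to/saved/platform/log prints only the pcd lines that failed for lack of disk space; the path shown is a placeholder.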

Given these findings, the conclusions can be confirmed by examining the current contents of the PCD database repository:
# ls -l /var/opt/SUNWSMS/SMS1.4.1/.pcd/
-rw------- 1 sms-pcd sms    3557    Jan 24 02:25 domain_info
-rw------- 1 sms-pcd sms       0    Jan 26 13:03 domain_info.tmp
-rw------- 1 sms-pcd sms     171    Oct 15 07:34 platform_info
-rw------- 1 sms-pcd sms    1425    Jan 10 03:42 sysboard_info
-rw------- 1 sms-pcd sms       0    Jan 26 13:03 sysboard_info.tmp


As the listing shows, the two temporary files that PCD created for checkpointing are empty.
Given this data, and the fact that PCD reported ENOSPC (errno 28) when attempting to write parts of the PCD database, we can reasonably assume that the root cause is insufficient disk space for the two SMS operations involved in the Solaris reboot event:
  1. Writing the POST log (hpost -Q ops) generated as a result of the Solaris reboot event;
  2. Writing PCD's checkpoint information (for the Domain Down event sequence) to its PCD database repository.
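Zero-length files such as the empty POST log and the empty .tmp checkpoint files can be located with find(1). This is a minimal sketch under stated assumptions: list_empty_files is a hypothetical helper name, and the directory to search is passed in by the caller.

```shell
# List regular files of zero length under the directory given as $1.
# With find(1), -size 0 matches files that occupy zero 512-byte blocks
# of data, i.e. empty files.
list_empty_files() {
    find "$1" -type f -size 0 -print
}
```

Run with sufficient privileges against the directories named in this article, for example list_empty_files /var/opt/SUNWSMS/SMS1.4.1/.pcd or list_empty_files /var/opt/SUNWSMS/SMS1.4.1/adm/R/post.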

The error condition was finally isolated to a full root file system:
v4u-15ka-sc1:sms-svc:63> df -k
Filesystem         kbytes      used      avail      capacity        Mounted on
/dev/md/dsk/d10   6050182   5989723          0          100%        /
/proc                   0         0          0            0%        /proc
fd                      0         0          0            0%        /dev/fd
mnttab                  0         0          0            0%        /etc/mnttab
swap              1962016        16    1962000            1%        /var/run
swap              1962104       104    1962000            1%        /tmp
/dev/md/dsk/d30   3231203   2043615    1155276           64%        /export/install
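The same check can be scripted so that a nearly full file system is flagged before it affects SMS. The sketch below is illustrative only: full_filesystems is a hypothetical name, the 95% default threshold is an arbitrary assumption, and it assumes an awk that accepts -v (on Solaris this may mean nawk). It reads df -k style output on standard input and assumes each file system occupies a single line.

```shell
# Read `df -k` style output on stdin and print the mount point of every
# file system whose capacity column is at or above the threshold in $1
# (default 95, an arbitrary illustrative value).
full_filesystems() {
    awk -v limit="${1:-95}" '
        NR > 1 && NF >= 6 {          # skip the header line
            pct = $5
            sub(/%/, "", pct)        # "100%" -> "100"
            if (pct + 0 >= limit + 0) print $6
        }'
}
```

Typical use would be df -k | full_filesystems 95, which for the output shown above prints only the root file system.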


The final remedy is to free up sufficient disk space on the root file system to accommodate normal SMS operations for managing and monitoring its resident domains.
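One way to find candidates for cleanup is to list large files on the affected file system. This is a minimal sketch and not a procedure from the original article: large_files is a hypothetical helper, the default size threshold is an arbitrary assumption, and which files are safe to remove must be decided case by case.

```shell
# List regular files under directory $1 that are larger than $2
# 512-byte blocks (default 20480, roughly 10 MB, an arbitrary value).
# -xdev keeps the search on the file system that contains $1.
large_files() {
    find "$1" -xdev -type f -size +"${2:-20480}" -print
}
```

For example, large_files / lists files over roughly 10 MB on the root file system without descending into other mounted file systems.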

Product
System Management Services 1.4.1 Software and above
Sun Fire 12K/15K/20K/25K

Internal section

Keywords: starcat, hpost, pcd 1711, sysboard_info.tmp domain_info.tmp, checkpoint, chkpt, hang, reboot, hpost ecode=44, DSMD_EVENT_DOMAIN_STOP

Previously Published As 80097



Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.