Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-72-1007790.1
Update Date:2009-09-01
Keywords:

Solution Type  Problem Resolution Sure

Solution  1007790.1 :   Sun Fire[TM] 12K/15K/E20K/E25K: System Controller (SC) platform messages file reports "FRAD chkpt WRITE failed. session id: 128, return code: 8" errors.  


Related Items
  • Sun Fire E25K Server
  • Sun Fire E20K Server
  • Sun Fire 12K Server
  • Sun Fire 15K Server
Related Categories
  • GCS>Sun Microsystems>Servers>High-End Servers

Previously Published As
210776


Symptoms
In the /var/opt/SUNWSMS/adm/platform/messages file on a Sun Fire[TM] 12K/15K/E20K/E25K SC, all components in the platform report FruAccess errors and "FRAD chkpt WRITE failed" messages such as the following:

    Oct  4 07:00:43 2004 sc0 frad[660]: [10009 1379176584473204 ERR FRADFailoverService.cc 237] FRAD chkpt WRITE failed. session id: 128, return code: 8
    Oct  4 07:00:43 2004 sc0 esmd[1422]: [1994 1379176638578801 ERR FruAccess.cc 554] Failed to update the power summary record of fru FT5: rc=-2
    Oct  4 07:00:43 2004 sc0 esmd[1422]: [1994 1379176639358700 ERR DynamicFru.cc 256] Failed to update the power summary record of fru FT5: rc=-2
    Oct  4 07:00:43 2004 sc0 frad[660]: [10009 1379176775710362 ERR FRADFailoverService.cc 237] FRAD chkpt WRITE failed. session id: 128, return code: 8
    Oct  4 07:00:43 2004 sc0 esmd[1422]: [1991 1379176829667639 ERR FruAccess.cc 473] Failed to write the power event record of fru FT5: rc=-2
    Oct  4 07:00:43 2004 sc0 esmd[1422]: [1992 1379176830622863 ERR DynamicFru.cc 394] Failed to write the power event record, STILL_ON, of fru FT5: rc=-2
    Oct  4 07:03:43 2004 sc0 frad[660]: [10009 1379357012147050 ERR FRADFailoverService.cc 237] FRAD chkpt WRITE failed. session id: 128, return code: 8
    Oct  4 07:03:43 2004 sc0 esmd[1422]: [1994 1379357084158464 ERR FruAccess.cc 554] Failed to update the power summary record of fru SB14: rc=-2
    Oct  4 07:03:43 2004 sc0 esmd[1422]: [1994 1379357085045960 ERR DynamicFru.cc 256] Failed to update the power summary record of fru SB14: rc=-2
    Oct  4 07:03:43 2004 sc0 frad[660]: [10009 1379357173410163 ERR FRADFailoverService.cc 237] FRAD chkpt WRITE failed. session id: 128, return code: 8
    Oct  4 07:03:43 2004 sc0 esmd[1422]: [1993 1379357221559279 ERR FruAccess.cc 655] Failed to update the temperature summary record of fru SB14(sensor=0): rc=-2
    Oct  4 07:03:43 2004 sc0 esmd[1422]: [1993 1379357222339133 ERR DynamicFru.cc 210] Failed to update the temperature summary record of fru SB14(sensor=0): rc=-2
    Oct  4 07:03:43 2004 sc0 frad[660]: [10009 1379357334767963 ERR FRADFailoverService.cc 237] FRAD chkpt WRITE failed. session id: 128, return code: 8
    Oct  4 07:03:43 2004 sc0 esmd[1422]: [1993 1379357388884910 ERR FruAccess.cc 655] Failed to update the temperature summary record of fru SB14(sensor=1): rc=-2
    Oct  4 07:03:43 2004 sc0 esmd[1422]: [1993 1379357389675546 ERR DynamicFru.cc 210] Failed to update the temperature summary record of fru SB14(sensor=1): rc=-2
    Oct  4 07:03:43 2004 sc0 frad[660]: [10009 1379357502066976 ERR FRADFailoverService.cc 237] FRAD chkpt WRITE failed. session id: 128, return code: 8

The showenvironment command reports that all temperature and voltage status checks are fine for every component, and nothing appears to be wrong on the platform. So why are these messages occurring, and how can they be stopped?
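
To gauge how widespread the errors are, the messages can be tallied before any change is made. The following is a minimal sketch, assuming a POSIX-style shell on the SC; the function names are hypothetical and are not part of SMS, and the default path is the messages file named above.

```shell
# Hypothetical helpers, not part of SMS.
# Default log path is the platform messages file named in the Symptoms section.
MESSAGES=/var/opt/SUNWSMS/adm/platform/messages

# Print the number of "FRAD chkpt WRITE failed" lines in a messages file.
count_frad_errors() {
    grep -c 'FRAD chkpt WRITE failed' "${1:-$MESSAGES}"
}

# Print each FRU named in "... of fru <name>" error lines, with a count.
frus_with_errors() {
    sed -n 's/.* ERR .*of  *fru  *\([A-Za-z0-9]*\).*/\1/p' "${1:-$MESSAGES}" |
        sort | uniq -c
}
```

Called with no argument, each function reads the default messages file; a copy of the file can be passed as the first argument instead. If frus_with_errors lists many unrelated FRUs, that is consistent with a problem in the shared checkpoint path rather than in any single component.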



Resolution
The error message indicates that FRAD, the FRU Access Daemon, cannot write to a checkpoint file.
Q: Why can't a daemon, or a user for that matter, write to certain files?
A: Because the daemon or user does not have write permission on the file.

In the case of the FRAD chkpt error, the file in question is located in the /var/opt/SUNWSMS/data/.failover/chkpt directory on the SC. This file is a checkpoint file that is used as a reference by FOMD (Failover Management Daemon) for file propagation between SCs.

If the permissions on this chkpt file are incorrect, the SMS daemon cannot write to it and the error messages appear. One possible "fix" would be to simply open up the permissions on the file or directory, as root, so that the daemons can write to the chkpt file:

    chmod -R 777 /var/opt/SUNWSMS/data/.failover/chkpt

However, this is not a good solution: it leaves the directory world-writable, and it may not address the real root cause. There might be more problems that need to be resolved.

If the directory /var/opt/SUNWSMS/SMS1.4.1/data/.failover has the wrong group ownership, its subdirectories are not writable by the SMS daemons, and the error messages above appear.

Changing only the permissions on the chkpt files or the chkpt directory is not the correct course of action, because the parent directory may be the real root cause. The ownership of the whole directory structure needs to be corrected to head off possible future issues:

BAD CONFIGURATION (NOTE: ".cod" and ".failover" directories should be root:sms)

    sms-svc> cd /var/opt/SUNWSMS/SMS1.4.1/data/
    sms-svc> ls -la
    total 54
    drwxrwxr-x+ 23 root     sms          512 Oct  4 14:15 .
    drwxr-xr-x+  8 root     sys          512 Oct  2 00:52 ..
    drwxrwxr-x   2 root     bin          512 Jun 18  2002 .cod
    drwxrwxr-x   6 root     bin          512 Jun 18  2002 .failover
    -r--------   1 root     sys           17 Sep 16 17:46 .remotesc
    drwxr-xr-x   2 root     sms          512 Oct  2 01:55 .wcapp
    drwxrwx---+  2 root     sms          512 Oct  2 01:55 A
    drwxrwx---   2 root     sms          512 Sep 12 02:45 B
    drwxrwx---   2 root     sms          512 Sep 12 02:45 C
    drwxrwx---   2 root     sms          512 Sep 12 02:45 D
    drwxrwx---   2 root     sms          512 Sep 12 02:45 E
    drwxrwx---   2 root     sms          512 Sep 12 02:45 F
    drwxrwx---   2 root     sms          512 Sep 12 02:45 G
    drwxrwx---   2 root     sms          512 Sep 12 02:45 H
    drwxrwx---   2 root     sms          512 Sep 12 02:45 I
    drwxrwx---   2 root     sms          512 Sep 12 02:45 J
    drwxrwx---   2 root     sms          512 Sep 12 02:45 K
    drwxrwx---   2 root     sms          512 Sep 12 02:46 L
    drwxrwx---   2 root     sms          512 Sep 12 02:46 M
    drwxrwx---   2 root     sms          512 Sep 12 02:46 N
    drwxrwx---   2 root     sms          512 Sep 12 02:46 O
    drwxrwx---   2 root     sms          512 Sep 12 02:46 P
    drwxrwx---   2 root     sms          512 Sep 12 02:46 Q
    drwxrwx---   2 root     sms          512 Sep 12 02:46 R
    -rw-r-----   1 sms-dsmd sms          288 Oct  2 02:04 dsmd_domain_info
    srwxrwxrwx   1 sms-efe  sms            0 Oct  2 02:03 efeSock
    -rw-r--r--   1 sms-osd  sms           72 Oct  2 00:11 osdTimeDeltas
    -rw-r--r--   1 root     root           4 Oct  2 01:52 ssd_loop.pid

GOOD CONFIGURATION

    sms-svc> pwd
    /var/opt/SUNWSMS/SMS1.4.1/data
    sms-svc> ls -la
    total 54
    drwxrwxr-x+ 23 root     sms          512 Oct  2 16:30 .
    drwxr-xr-x+  8 root     sys          512 Sep 22 11:47 ..
    drwxrwxr-x   2 root     sms          512 Sep 22 11:51 .cod
    drwxrwxr-x   6 root     sms          512 Sep 22 11:46 .failover
    -r--------   1 root     sys           17 Sep 23 12:08 .remotesc
    drwxr-xr-x   2 root     sms          512 Oct  1 11:00 .wcapp
    drwxrwx---+  2 root     sms          512 Oct  1 11:00 A
    drwxrwx---+  2 root     sms          512 Sep 29 14:17 B
    drwxrwx---+  2 root     sms          512 Sep 27 10:27 C
    drwxrwx---+  2 root     sms          512 Sep 27 10:27 D
    drwxrwx---+  2 root     sms          512 Sep 22 11:51 E
    drwxrwx---+  2 root     sms          512 Sep 22 11:51 F
    drwxrwx---+  2 root     sms          512 Sep 22 11:51 G
    drwxrwx---+  2 root     sms          512 Sep 22 11:51 H
    drwxrwx---+  2 root     sms          512 Sep 22 11:51 I
    drwxrwx---+  2 root     sms          512 Sep 22 11:51 J
    drwxrwx---+  2 root     sms          512 Sep 22 11:51 K
    drwxrwx---+  2 root     sms          512 Sep 22 11:51 L
    drwxrwx---+  2 root     sms          512 Sep 22 11:51 M
    drwxrwx---+  2 root     sms          512 Sep 22 11:51 N
    drwxrwx---+  2 root     sms          512 Sep 22 11:51 O
    drwxrwx---+  2 root     sms          512 Sep 22 11:51 P
    drwxrwx---+  2 root     sms          512 Sep 30 14:21 Q
    drwxrwx---+  2 root     sms          512 Sep 22 11:51 R
    -rw-r-----   1 sms-dsmd sms          288 Oct  1 21:01 dsmd_domain_info
    srwxrwxrwx   1 sms-efe  sms            0 Oct  1 11:02 efeSock
    -rw-r--r--   1 sms-osd  bin           72 Oct  1 17:58 osdTimeDeltas
    -rw-r--r--   1 root     root           5 Oct  1 10:58 ssd_loop.pid
    sms-svc> cd .failover
    sms-svc> ls -la
    total 12
    drwxrwxr-x   6 root     sms          512 Sep 22 11:46 .
    drwxrwxr-x+ 23 root     sms          512 Oct  2 16:30 ..
    drwxrwxr-x   2 root     sms          512 Oct  5 10:15 chkpt
    drwxrwxr-x   2 root     sms          512 Sep 22 11:51 fomd
    drwxrwxr-x   2 root     sms          512 Sep 22 11:46 local
    drwxrwxrwx   2 root     sms          512 Oct  5 10:55 tmp
    sms-svc> cd chkpt
    sms-svc> ls -la
    total 10
    drwxrwxr-x   2 root     sms          512 Oct  5 10:15 .
    drwxrwxr-x   6 root     sms          512 Sep 22 11:46 ..
    -rw-r--r--   1 root     other        544 Oct  1 17:32 2.128.1.0
    -rw-r--r--   1 root     other        544 Oct  1 11:03 2.130.1.0
    -rw-rw-rw-   1 root     other        434 Oct  5 10:15 chkpt.list
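
Rather than comparing listings by eye, directories whose group differs from sms can be listed with find. This is a minimal sketch, assuming a find that supports the standard -group test; the function name is hypothetical and not part of SMS. It checks directories only, because regular files in the good listing above legitimately carry other groups (for example "other" on the chkpt files).

```shell
# Hypothetical helper, not part of SMS.
# List every directory under $1 whose group is not $2 (default: sms).
# Prints nothing when every directory has the expected group.
audit_dir_groups() {
    find "$1" -type d ! -group "${2:-sms}" -exec ls -ld {} \;
}
```

For example, run against the bad configuration shown earlier, the following would be expected to report .cod and .failover (group bin):

    audit_dir_groups /var/opt/SUNWSMS/SMS1.4.1/data

Whether every directory on a given SC should be group sms is an assumption drawn from the listing above; confirm against a known good SC before changing anything the audit reports.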

Ultimately, changing the permissions on only the /var/opt/SUNWSMS/SMS1.4.1/data/.failover/chkpt directory would allow SMS to write to that particular chkpt file, but it would leave the actual root cause, the bad group ownership of the top-level directories, in place. Correcting that ownership may also resolve other problems.

The fix is to run the following commands as root:

    cd /var/opt/SUNWSMS/SMS1.4.1/data
    chgrp -R sms .failover
    chgrp -R sms .cod
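
Because chgrp -R overwrites the existing group on every file it touches, it can be useful to save a listing of the current ownership first. The wrapper below is a hedged sketch of that idea, not an SMS tool; its name and the record file location are hypothetical.

```shell
# Hypothetical wrapper, not part of SMS: append an "ls -ld" listing of each
# target tree to a record file, then run the same "chgrp -R" on the targets.
# $1 = group, $2 = record file for the "before" listing, remaining = targets.
chgrp_with_record() {
    group=$1
    record=$2
    shift 2
    for target in "$@"; do
        find "$target" -exec ls -ld {} \; >> "$record" || return 1
    done
    chgrp -R "$group" "$@"
}
```

Used in place of the two chgrp commands, as root:

    cd /var/opt/SUNWSMS/SMS1.4.1/data
    chgrp_with_record sms /var/tmp/data-groups.before .failover .cod

The record file is kept outside the trees being changed so that it does not appear in its own listing.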

Please see Additional Information for more suggestions.



Additional Information
It is important to note that the group ownership issue could be the result of a tar or cpio restore that did not preserve the original group and owner settings, or it might simply be the result of someone having manually changed these ownership/permission settings for some reason. It would be hard to prove either way after the fact. If one directory or file is configured incorrectly, assume all are.
Confirming that the configuration is correct should be the next step. Obtain access to a separate "known good" SC to compare configurations, or log a case with Sun[TM] Support for help in making sure permissions and ownership are correct.
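
One way to compare against a known good SC is to produce the same sorted listing of mode, owner, and group on both SCs and diff the results. The following is a sketch under that assumption; the function name is hypothetical, and it does not handle path names that contain spaces or symbolic links (for a link, the last field of ls -ld is the link target).

```shell
# Hypothetical helper, not part of SMS.
# Print "mode owner group path" for everything under $1, sorted by path.
# Paths are relative to $1 so that listings from two SCs line up.
perm_snapshot() {
    ( cd "$1" && find . -exec ls -ld {} \; ) |
        awk '{print $1, $3, $4, $NF}' | sort -k 4
}
```

For example, run the following on each SC, copy one output file to the other SC, and compare the two with diff:

    perm_snapshot /var/opt/SUNWSMS/SMS1.4.1/data > /var/tmp/data-perms.`hostname`

Differences in the first three fields point at files or directories whose mode, ACL marker, owner, or group has drifted.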

It is also a good idea to confirm that the SMS daemons have the correct UIDs. In /etc/passwd, the entries for the various daemons are as follows:

sms-codd:x:10:54:SMS Capacity On Demand Daemon::
sms-dca:x:11:54:SMS Domain Configuration Agent::
sms-dsmd:x:12:54:SMS Domain Status Monitoring Daemon::
sms-dxs:x:13:54:SMS Domain Server::
sms-efe:x:14:54:SMS Event Front-End Daemon::
sms-esmd:x:15:54:SMS Environ. Status Monitoring Daemon::
sms-fomd:x:16:54:SMS Failover Management Daemon::
sms-frad:x:17:54:SMS FRU Access Daemon::
sms-osd:x:18:54:SMS OBP Service Daemon::
sms-pcd:x:19:54:SMS Platform Config. Database Daemon::
sms-tmd:x:20:54:SMS Task Management Daemon::
sms-svc:x:6:10:SMS Service User:/export/home/sms-svc:/bin/csh
sms-efhd:x:21:54:SMS Error and Fault Handling Daemon::
sms-elad:x:22:54:SMS Event Log Access Daemon::
sms-erd:x:23:54:SMS Event Reporting Daemon::
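
The comparison against this list can be automated. The sketch below assumes the entries above have been saved to a reference file in passwd format and that a POSIX awk is available (on Solaris, nawk or /usr/xpg4/bin/awk rather than the older /usr/bin/awk); the function name is hypothetical.

```shell
# Hypothetical helper, not part of SMS.
# $1 = reference file in passwd format (for example, the list above).
# $2 = passwd file to check (default: /etc/passwd).
# Prints one line for each sms-* account in the reference that is missing
# from $2 or has a different UID:GID there. Output order is unspecified.
check_sms_accounts() {
    awk -F: '
        NR == FNR { if ($1 ~ /^sms-/) want[$1] = $3 ":" $4; next }
        $1 in want { have[$1] = $3 ":" $4 }
        END {
            for (name in want) {
                if (!(name in have))
                    print name ": missing"
                else if (have[name] != want[name])
                    print name ": is " have[name] ", expected " want[name]
            }
        }
    ' "$1" "${2:-/etc/passwd}"
}
```

No output means every sms-* account in the reference file has the same UID and GID on the SC being checked. The values in the reference are the ones shown in this document; verify them against a known good SC running the same SMS version.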



Product
Sun Fire E25K Server
Sun Fire E20K Server
Sun Fire 15K Server
Sun Fire 12K Server

Internal Comments
Reference Apollo Escalation 1-4203640, Radiance case ID 64289130



Keywords: frad, esmd, sms, fru, fruaccess, chkpt, checkpoint, write failure
Previously Published As
78507

Change History
Date: 2004-10-05
User Name: 7058
Action: Approved
Comment: Fixed document format with STM.
Fixed a few grammar errors.
Added technology area metatags.
OK to publish now.
Version: 3
Date: 2004-10-05
User Name: 7058
Action: Accept
Comment:
Version: 0
Date: 2004-10-05
User Name: 146765
Action: Approved
Comment: Good document with great details.
Please publish.
Version: 0
Product_uuid
d842dd03-059b-11d8-84cb-080020a9ed93|Sun Fire E25K Server
1404a2d3-059a-11d8-84cb-080020a9ed93|Sun Fire E20K Server
29e4659c-0a18-11d6-9fa1-e67bbc033df8|Sun Fire 15K Server
077fd4c5-df8f-4320-ad69-7d01603a674d|Sun Fire 12K Server

Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.