Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-71-1003319.1
Update Date:2011-06-03
Keywords:

Solution Type  Technical Instruction Sure

Solution  1003319.1 :   Sun Fire[TM] 12K/15K/E20K/E25K: POST level increments with repeated error  


Related Items
  • Sun Fire E25K Server
  • Sun Fire E20K Server
  • Sun Fire 12K Server
  • Sun Fire 15K Server
Related Categories
  • GCS>Sun Microsystems>Servers>High-End Servers

PreviouslyPublishedAs
204606


Applies to:

Sun Fire E20K Server
Sun Fire 12K Server
Sun Fire 15K Server
Sun Fire E25K Server
All Platforms

Goal

On a Sun Fire 12K/15K/E20K/E25K, the POST level increments if an error repeats within a given timeframe. This is intended behavior: dsmd raises the POST level on the repeated error. However, manual intervention is still required to blacklist the component. A component that is not blacklisted continues to be included in the POST configuration. POST may eventually deconfigure the component, but deconfiguring DOES NOT blacklist it. The user must add the component to the blacklist file manually.

Once the domain recovers and is booted, any subsequent error within a four-hour period is treated as a repeated error. After this four-hour period, the domain is considered recovered and healthy.
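
The four-hour rule can be illustrated with a minimal Python sketch. It assumes the yymmdd.hhmm.ss stamp embedded in dump and POST log file names (for example dsmd.dstop.020929.1549.03) is a usable timestamp, and it uses the start of a full POST run as a rough stand-in for the time the domain recovered. The names REPEAT_WINDOW, parse_stamp and is_repeat are hypothetical, and treating the window boundary as exclusive is an assumption; this is not how dsmd itself is implemented.

```python
import re
from datetime import datetime, timedelta

# Four-hour window from the article; the constant name is hypothetical.
REPEAT_WINDOW = timedelta(hours=4)

# Matches the yymmdd.hhmm.ss stamp seen in names such as
# "dsmd.dstop.020929.1549.03" or "post020929.1550.13.log".
STAMP = re.compile(r"(\d{6}\.\d{4}\.\d{2})")

def parse_stamp(name):
    """Return the timestamp embedded in a dump or POST log file name."""
    match = STAMP.search(name)
    if match is None:
        raise ValueError("no yymmdd.hhmm.ss stamp in %r" % name)
    return datetime.strptime(match.group(1), "%y%m%d.%H%M.%S")

def is_repeat(recovered_at, error_at):
    """True if error_at falls inside the window that starts at recovered_at."""
    elapsed = error_at - recovered_at
    return timedelta(0) <= elapsed < REPEAT_WINDOW

recovered = parse_stamp("post020929.1550.13.log")  # full POST run (proxy for recovery)
next_error = parse_stamp("post020929.1558.16.log")  # next short POST (state capture)
print(is_repeat(recovered, next_error))
```

With the file names from the example in this document, the second error arrives about eight minutes after the first recovery, so it counts as a repeated error.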

NOTE: The level of the first POST run after the domain is considered "healthy" again depends on the operation that triggers it. If it is due to a reset, POST runs with -Q (same as level 7). If it is due to an operator-initiated action, such as setkeyswitch, the level is defined by the contents of the .postrc file in use. The default level is 16 if none is specified in the .postrc file.
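
The selection described in the NOTE can be summarized as a small Python sketch. The function name initial_post_level and the trigger strings "reset" and "operator" are hypothetical labels chosen for illustration; the sketch takes the .postrc level as an already-known value rather than parsing the file.

```python
RESET_LEVEL = 7      # article: a reset runs POST with -Q, same as level 7
DEFAULT_LEVEL = 16   # article: default when .postrc does not specify a level

def initial_post_level(trigger, postrc_level=None):
    """Level of the first POST run once the domain is healthy again.

    trigger is "reset" or "operator" (hypothetical labels);
    postrc_level is the level set in .postrc, or None if it sets none.
    """
    if trigger == "reset":
        return RESET_LEVEL
    if trigger == "operator":
        return DEFAULT_LEVEL if postrc_level is None else postrc_level
    raise ValueError("unknown trigger: %r" % trigger)

print(initial_post_level("reset"))         # 7
print(initial_post_level("operator"))      # 16
print(initial_post_level("operator", 32))  # 32
```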

Solution

The example shows excerpts from the POST logs and dstop dump(s) for a case where an error repeated within the given time period.

redxl> ld dsmd.dstop.020929.1549.03
Created Sun Sep 29 15:49:04 2002
By hpost v. 1.2 Generic 112488-06 Jun 18 2002 15:53:15 executing as pid=21248
On ssc name = starcat_sc0
Domain = 2=C = starcat_domC Platform = platform_1
Boards in dump: master SC CPs/CSBs[1:0]: 3
EXB[17:0]: 02010
Slot0[17:0]: 02010
Slot1[17:0]: 02010
-D option, -d
"DSMD DomainStop Dump"
0 errors occurred while creating this dump.
redxl> wfail
SDI EX04/S0: All SDI is DStopped and RStopped, requested by DARB.
SDI EX13/S0 Master_Stop_Status0[31:0] = 2004004F
MStop0[3:0]: All SDI logic is DStopped + Recordstopped.
SDI EX13/S0 Dstop0[31:0] = 10019000
Dstop0[16]: D DARB texp requests all Dstop (M)
Dstop0[28]: D 1E Slot0 asserted Error, enabled to cause Dstop (M)
EPLD SB13 Err1_Dom0: Mask= 00 Err= 80 1stErr= 80
Err1[7]: 1E+ Error reported by BBC1
BBC SB13/BB1 Device_Err_Stat[31:0] = 80008100
DevErr[ 8]: 1E Port 0 Safari device asserted error
Proc SB13/P2 (13.0.2) EmuShad[0:78] = 0020 00000000 00000000 (Note rev order)
EmuSh[ 9]: THUE: Etag ECC UE due to other access (P$, W$, wrback...).
AFSR [63:0] = 03900000.00000000 AFAR [42:4] = 1A3.9E821CE_
AFSR2[63:0] = 01100000.00000000 AFAR2[42:4] = 1A3.9E821CE_
AFSR[52]: 1E PRIV: Priviledged code access error(s) occurred.
AFSR[55]: TUE: Uncorrectable Ecache tag ECC error.
AFSR[56]: 1E TSCE: SW_handled Correctable Ecache tag ECC error.
AFSR[57]: THCE: Hardware corrected Ecache tag ECC error.
FAIL Proc SB13/P2: Dstop detected by Proc SB13/P2.
Primary service FRU is Slot SB13.
DARB C0: enabled ports (expanders) [17:0]: 07E3F
DARB C0: other darb req Dstop+Rstop for exps[17:0]: 02000
DARB C1: enabled ports (expanders) [17:0]: 07E3F
DARB C1: other darb req Dstop+Rstop for exps[17:0]: 02000
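
Several fields in the dump above are hexadecimal bit masks, such as EXB[17:0], the DARB expander masks and AFSR[63:0]. As a reading aid, the following Python sketch lists which bit positions are set in such a value. The helper names are hypothetical, and the AFSR bit names are copied from the decoded lines in the dump rather than from a processor manual.

```python
def parse_hex(text):
    """Convert a dump value such as "02010" or "03900000.00000000" to an int."""
    # 64-bit registers are printed as two halves separated by a dot.
    return int(text.replace(".", ""), 16)

def set_bits(value):
    """Return the positions of all bits set in value, lowest first."""
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]

# Bit names as decoded in the dump above (AFSR[52], [55], [56], [57]).
AFSR_NAMES = {52: "PRIV", 55: "TUE", 56: "TSCE", 57: "THCE"}

print(set_bits(parse_hex("02010")))  # EXB[17:0]: expanders 4 and 13
print(set_bits(parse_hex("02000")))  # DARB Dstop+Rstop request: expander 13
print([AFSR_NAMES[b] for b in set_bits(parse_hex("03900000.00000000"))])
```

The expander numbers 4 and 13 match the SDI EX04 and SDI EX13 lines reported by wfail, and the AFSR bits match the four decoded AFSR lines.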

This error occurred four times before POST deconfigured the component. The level 64 POST finally "deconfigured" (NOT blacklisted) the suspect component. Note that, depending on the error, a lower POST level might have caught it, or a higher POST level might have been required to fail the component.

Note also that CHS (starting with SMS 1.4.1) could have marked the CPU as faulted, which would prevent the failed ASIC from being tested in subsequent POST runs.

Finally, the listing below shows the POST level changing across the successive POST runs.

NOTE: A "short" POST is a POST execution that dumps ASIC state for capture into the rstop or dstop dump file. A successful dump capture has an exit code of 85. Unsuccessful captures exit with 86 or 87, depending on the failure. A short POST log is generated from this capture of ASIC state regardless of whether an rstop or a dstop occurred. The only difference is that after a dstop the domain is rebooted, in which case a full (long) POST log should also appear.


Therefore, this scenario produces eight POST log files, four of which result from capturing the ASIC state.

post020929.1549.04.log:# pid = 21248 level = 16 verbose_level = 20
--Short post

post020929.1550.13.log:# pid = 21409 level = 16 verbose_level = 20
post020929.1550.13.log:# Cmdline: /opt/SUNWSMS/SMS1.2/bin/hpost -d C -a -Palt_level 16
--Configured in 333 with 8 procs, 32.000 GBytes, 6 IO adapters.

post020929.1558.16.log:# pid = 22516 level = 16 verbose_level = 20
--Short post

post020929.1559.12.log:# pid = 22650 level = 16 verbose_level = 20
post020929.1559.12.log:# Cmdline: /opt/SUNWSMS/SMS1.2/bin/hpost -d C -a -Palt_level 16
--Configured in 333 with 8 procs, 32.000 GBytes, 6 IO adapters.

post020929.1607.21.log:# pid = 23770 level = 16 verbose_level = 20
--Short post

post020929.1608.22.log:# pid = 23913 level = 32 verbose_level = 20
post020929.1608.22.log:# Cmdline: /opt/SUNWSMS/SMS1.2/bin/hpost -d C -a -Palt_level 32
--Configured in 333 with 8 procs, 32.000 GBytes, 6 IO adapters.

post020929.1620.10.log:# pid = 25531 level = 16 verbose_level = 20
--Short post

post020929.1621.05.log:# pid = 25660 level = 64 verbose_level = 20
post020929.1621.05.log:# Cmdline: /opt/SUNWSMS/SMS1.2/bin/hpost -d C -a -Palt_level 64
--Configured in 333 with 7 procs, 28.000 GBytes, 6 IO adapters.
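
The level progression in the listing above can be extracted mechanically. The Python sketch below assumes header lines in exactly the "name:# pid = ... level = ..." form shown, and assumes that a log with a Cmdline line is a full POST run while one without is a short POST; that distinction is inferred from this excerpt only. The names post_levels and EXCERPT are hypothetical.

```python
import re

HEADER = re.compile(r"^(post[\d.]+\.log):# pid = (\d+) level = (\d+) ")
CMDLINE = re.compile(r"^(post[\d.]+\.log):# Cmdline: ")

def post_levels(lines):
    """Return (log name, level, is_full_run) per log, in order of appearance."""
    levels = {}
    full = set()
    for line in lines:
        header = HEADER.match(line)
        if header:
            levels[header.group(1)] = int(header.group(3))
            continue
        cmdline = CMDLINE.match(line)
        if cmdline:
            full.add(cmdline.group(1))
    return [(name, level, name in full) for name, level in levels.items()]

# Last four logs from the listing above.
EXCERPT = """\
post020929.1607.21.log:# pid = 23770 level = 16 verbose_level = 20
post020929.1608.22.log:# pid = 23913 level = 32 verbose_level = 20
post020929.1608.22.log:# Cmdline: /opt/SUNWSMS/SMS1.2/bin/hpost -d C -a -Palt_level 32
post020929.1620.10.log:# pid = 25531 level = 16 verbose_level = 20
post020929.1621.05.log:# pid = 25660 level = 64 verbose_level = 20
post020929.1621.05.log:# Cmdline: /opt/SUNWSMS/SMS1.2/bin/hpost -d C -a -Palt_level 64
""".splitlines()

for name, level, is_full in post_levels(EXCERPT):
    print(name, level, "full" if is_full else "short")
```

Run against all eight logs, the full runs report levels 16, 16, 32 and 64, while every short POST stays at level 16.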


Again, note that proc 418 was missing from showdevices and psrinfo output, but was NOT blacklisted.
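
The processor ID 418 corresponds to the failing Proc SB13/P2 (13.0.2) in the dump. The Python sketch below shows the arithmetic, assuming that a slot 0 processor ID is the expander number times 32 plus the processor number. That mapping is an assumption that agrees with this example; it is not stated in this document, and it does not cover slot 1 boards or additional cores. The function name slot0_cpu_id is hypothetical.

```python
def slot0_cpu_id(expander, proc):
    """Processor ID for a slot 0 system board (assumed mapping: 32 * expander + proc)."""
    if not 0 <= expander <= 17:
        raise ValueError("expander must be in 0..17")
    if not 0 <= proc <= 3:
        raise ValueError("proc must be in 0..3")
    return expander * 32 + proc

print(slot0_cpu_id(13, 2))  # 418, i.e. Proc SB13/P2
```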

Product
Sun Fire 15K Server
Sun Fire E25K Server
Sun Fire E20K Server
Sun Fire 12K Server

Internal Section

Keywords: POST, level, alt_level, 12K, 15K, 20K, 25K

Previously Published As 48395



Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.