Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-71-1020078.1
Update Date:2011-04-05
Keywords:

Solution Type  Technical Instruction Sure

Solution  1020078.1 :   Sun SPARC(R) Enterprise Mx000 (OPL) Servers: How to deal with a hung or unresponsive domain ?  


Related Items
  • Sun SPARC Enterprise M9000-64 Server
  • Sun SPARC Enterprise M9000-32 Server
  • Sun SPARC Enterprise M8000 Server
  • Sun SPARC Enterprise M3000 Server
  • Sun SPARC Enterprise M4000 Server
  • Sun SPARC Enterprise M5000 Server
Related Categories
  • GCS>Sun Microsystems>Servers>OPL Servers

PreviouslyPublishedAs
251786


Applies to:

Sun SPARC Enterprise M9000-64 Server - Version: Not Applicable and later   [Release: N/A and later ]
Sun SPARC Enterprise M3000 Server - Version: Not Applicable and later    [Release: N/A and later]
Sun SPARC Enterprise M4000 Server - Version: Not Applicable and later    [Release: N/A and later]
Sun SPARC Enterprise M5000 Server - Version: Not Applicable and later    [Release: N/A and later]
Sun SPARC Enterprise M8000 Server - Version: Not Applicable and later    [Release: N/A and later]
All Platforms

Goal

The goal of this document is to describe the Alive check mechanism and to provide guidance on how to handle hang-up situations on OPL domains.

A mechanism is in place that monitors the domains and detects hang-up situations : Alive Checking / Monitoring (also known as the Host Watchdog).

It has 2 parts :

  • Alive monitoring performed by the SCF driver (Solaris[TM]) and XSCF,

  • Alive monitoring performed by POST/OBP and XSCF.



Solution

From Table 2-27 of the Sun SPARC Enterprise M3000/M4000/M5000/M8000/M9000 XSCF User's Guide:

Host watchdog  :
Based on communication between XSCF and a domain, the host watchdog function checks
whether the domain is alive (heart beat or alive check). XSCF periodically monitors the
operational status of Solaris OS, to detect the Solaris OS hang-up. When detected the Solaris
OS hang-up, XSCF generates a Solaris OS panic on the relevant domain. To enable or disable
host watchdog, set the configuration file of scfd driver (scfd.conf) that installed in the
Solaris OS of the relevant domain. By enabling host watchdog, XSCF monitors the relevant
domain.

Monitoring the Solaris domain via the SCF driver

It is the responsibility of the SCF driver 'scfd' running on the domain to initiate communication with the 'cmd' process running on the XSCF.
Through this communication, XSCF can determine whether the domain is still alive.

The two functions on the XSCF and the domain are implemented independently.
This is done at a low level, by sending Alive interrupts via the SRAM structure and the SCFI Rx/Tx buffers.

First of all, the Alive checking must be enabled :

  • On the XSCF :

When Secure Mode is set to “on” then the Alive checking function is enabled.
When the keyswitch is in the Locked position then the Alive checking function is enabled.
Note : both of the above conditions must be met for the function to be enabled.
XSCF> setdomainmode -d 0 -m secure=on
Diagnostic Level    :min        -> -
Secure Mode         :off        -> on 
Autoboot            :on         -> -
CPU Mode            :auto       -> -
The specified modes will be changed.
Continue? [y|n] :y
configured.
Diagnostic Level    :min
Secure Mode         :on (host watchdog: available  Break-signal:non-receive)
Autoboot            :on (autoboot:on)
CPU Mode            :auto
  • On the domain :

There is a configuration file for the scfd driver /platform/SUNW,SPARC-Enterprise/kernel/drv/scfd.conf.
This file contains a scf-alive-check-function parameter that must be set to on.

Note : the default is "off"

#  When scf-alive-check-function is set to "on", it starts the Alive check
#  function by XSCF. If XSCF detected abnormality of the domain, OS panic of
#  the domain is executed. The default is "off".
#   "on"  : Starts the Alive check function
#   "off" : Stops the Alive check function
scf-alive-check-function="on";

In order to change the default setting, you have to :

    1. Login as root on the domain,

    2. Edit the /platform/SUNW,SPARC-Enterprise/kernel/drv/scfd.conf file and change the value “off” to “on”.

    3. Reboot the domain.

Note : the mechanism must be enabled on both the XSCF and on the Solaris domain for the Alive check function to be enabled.
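
The combined condition described above (Secure Mode on, keyswitch Locked, and scf-alive-check-function set to "on" in scfd.conf) can be sketched as a small offline check. This is only an illustrative sketch: the function names are hypothetical, the Secure Mode and keyswitch values are assumed to have been read by hand from showdomainmode and showhardconf, and the parser only looks for the single parameter shown in this document.

```python
import re

def alive_check_setting(scfd_conf_text):
    """Return the scf-alive-check-function value, or the documented default "off"."""
    value = "off"
    for line in scfd_conf_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # skip comment lines
        match = re.match(r'scf-alive-check-function\s*=\s*"(on|off)"\s*;', stripped)
        if match:
            value = match.group(1)
    return value

def alive_check_enabled(secure_mode_on, keyswitch_locked, scfd_conf_text):
    """True only when the XSCF-side and the domain-side conditions are all met."""
    return (secure_mode_on and keyswitch_locked
            and alive_check_setting(scfd_conf_text) == "on")

sample = '''
#   "on"  : Starts the Alive check function
#   "off" : Stops the Alive check function
scf-alive-check-function="on";
'''
print(alive_check_enabled(True, True, sample))
```

A missing or commented-out parameter is treated as "off", which matches the default stated in the file comments.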

What happens when a domain hangs while running Solaris ?

If the SCF driver does not respond to the Alive interrupts from XSCF, FMA on the XSCF may take the following actions :

  • msg-fail :

The domain did not respond to a keepalive message. The lack of response is probably due to a software problem on the domain. If this happens, the XSCF will send a message to the domain asking it to panic.

Feb 03 08:41:24.1731 ereport.chassis.domain.keepalive.msg-fail
Feb 03 08:46:29.7211 ereport.chassis.domain.panic   

  • panic-fail :

The domain did not panic in response to the panic request. The lack of a panic is probably due to a software problem, although there may be a hardware problem that caused this.
If the panic does not happen, the XSCF will send an XIR interrupt to the domain.

Feb 03 11:17:37.8046 ereport.chassis.domain.keepalive.msg-fail
Feb 03 11:17:38.6879 ereport.chassis.domain.keepalive.panic-fail
Feb 03 11:31:15.9170 ereport.chassis.domain.panic    

Note : the "ereport.chassis.domain.panic" is due to the sync command issued when dropped to OBP.

  • xir-fail :

The domain does not respond to the XIR interrupt. This is likely to be a hardware problem, although there are software problems that can cause this situation.
If the domain does not respond to the XIR interrupt, the XSCF will attempt to reset the domain.

Feb 03 08:41:24.1731 ereport.chassis.domain.keepalive.xir-fail

  • reset-fail :

The domain reset did not occur. This is certain to be a hardware problem (domain software is not involved with a reset request).
If the reset does not occur, the XSCF powers the domain down.

Feb 03 08:41:24.1731 ereport.chassis.domain.keepalive.
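
The escalation described in the four cases above can be summarised as a lookup from the keepalive ereport suffix to the next action taken by XSCF. The sketch below is a hypothetical helper for reading FMA log lines offline; the table only restates the text above and the function name is an assumption, not part of any XSCF tool.

```python
# Next XSCF action for each keepalive ereport, as described in the text above.
NEXT_ACTION = {
    "msg-fail": "ask the domain to panic",
    "panic-fail": "send an XIR to the domain",
    "xir-fail": "reset the domain",
    "reset-fail": "power the domain down",
}
PREFIX = "ereport.chassis.domain.keepalive."

def keepalive_stage(log_line):
    """Return the keepalive stage named in an FMA log line, or None."""
    for token in log_line.split():
        if token.startswith(PREFIX):
            stage = token[len(PREFIX):]
            return stage if stage in NEXT_ACTION else None
    return None

line = "Feb 03 11:17:38.6879 ereport.chassis.domain.keepalive.panic-fail"
stage = keepalive_stage(line)
print(stage, "->", NEXT_ACTION[stage])
```

Lines that carry a different ereport class, such as ereport.chassis.domain.panic, return None and are simply ignored by this helper.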

Besides the FMA logs, 'showlogs error' would also report something like :

Date: Feb 03 08:41:24 UTC 2009     Code: 60000000-fcff0000-0109000100000000
    Status: Warning                Occurred: Feb 03 08:41:23.888 UTC 2009
    FRU: /DOMAIN#0
    Msg: Domain hang-up detected (level0), DID 0, path 0
Date: Feb 03 08:41:25 UTC 2009     Code: 60000000-fcff0000-0109000200000000
    Status: Warning                Occurred: Feb 03 08:41:23.927 UTC 2009
    FRU: /DOMAIN#0
    Msg: Domain hang-up detected (level1), DID 0, path 0
Date: Feb 03 08:46:29 UTC 2009     Code: 60000000-ffffffff-0109001500000000
    Status: Warning                Occurred: Feb 03 08:44:25.280 UTC 2009
    FRU: /UNSPECIFIED
    Msg: XSCF command: System status change (OS panic) (DID#00, path: 00)


Monitoring panic

At any time, if a panic request occurred and there is no response from the domain, the same course of action as described in the "panic-fail" section above will follow, but only if Secure mode is on and the keyswitch is in the Locked position.

Note : The keyswitch position and Secure mode setting are available in the snapshot.


Oct 22 18:53:08.6733 ereport.chassis.domain.panic
Oct 22 19:18:09.0490 ereport.chassis.domain.keepalive.panic-fail

Date: Oct 22 18:53:08 CDT 2008     Code: 60000000-ffffffff-0109001500000000
    Status: Warning                Occurred: Oct 22 18:48:08.422 CDT 2008
    FRU: /UNSPECIFIED,/UNSPECIFIED
    Msg: XSCF command: System status change (OS panic) (DID#00, path: 00)
    Diagnostic Code:
        00000000 00000000 00000000
        00002140 01000000 00000000 00000000
        00000000 00000000 00000000 00000000
    UUID: 8e53ef5d-10d3-4a6b-bd1d-fa757f247f8d MSG-ID: SCF-8005-PX
Date: Oct 22 19:18:09 CDT 2008     Code: 60000000-fcff0000-0109000700000000
    Status: Warning                Occurred: Oct 22 19:18:08.480 CDT 2008
    FRU: /DOMAIN#0
    Msg: Domain hang-up detected (panic), DID 0
    Diagnostic Code:
        00000000 00000000 00000000
        00000000 00002000 00000000 00000000
        00000000 00000000 00000000 00000000
    UUID: 8be1af8c-313b-4a03-9c22-4215d33ac89c MSG-ID: SCF-8005-US
Date: Oct 22 19:18:24 CDT 2008     Code: 60000500-ffff0000-0300000800030000
    Status: Warning                Occurred: Oct 22 19:18:23.791 CDT 2008
    FRU: /UNSPECIFIED
    Msg: Externally initiated reset occurred
    Diagnostic Code:
        ffffffff ffff0000 00000000
        58495200 00000000 00000000 00000000
        00000000 00000000 00000000 00000000
    UUID: a54ebfc1-a260-4bff-b108-5f65819bffc3 MSG-ID: SCF-8008-3U
    Diagnostic Messages


Monitoring POST/OBP

What happens when a domain hangs while running POST/OBP ?

 

Note : POST/OBP are monitored regardless of the keyswitch position (Service / Locked).

The mechanism is slightly different, as monitoring POST/OBP does not use the Alive interrupts.

If the domain is running POST or OBP, then msg-fail, panic-fail, and xir-fail cannot occur. Instead, if the keepalive fails, the XSCF will immediately perform a domain reset.
If this domain reset fails, then the reset-fail ereport will be issued.

In order to investigate further after a domain has been forced to XIR/panic/reset by XSCF, you must collect a full snapshot, a domain explorer and any existing corefiles.

If the Alive check is not enabled or no action has been taken by the XSCF to recover from the hang-up situation then a manual operation is required from the user.


Steps to Follow
This section describes how to recover from a hang-up situation and provides a step-by-step procedure to deal with such a situation and collect the appropriate information for post-mortem analysis.

First of all, check the setting for the Secure mode (showdomainmode) and the position of the keyswitch (showhardconf). These may influence the result of the above actions.
Assuming that you have not been able to establish any connection to the domain or to find any logs that explain what is happening with the domain (as described in the previous section – How to deal with a “hung” domain ?), you may try the following steps to recover the domain :

1. Send a break to the domain

XSCF> sendbreak -d DID

If the domain drops to OBP then force a panic using the 'sync' command.

# Type  'go' to resume
{18} ok sync
panic[cpu8]/thread=2a1001dfca0: sync initiated
sched: software trap 0x7f
pid=0, pc=0xf005d18c, sp=0x2a1001decb1, tstate=0x4480001403, context=0x0
g1-g7: 1050404, 0, 18b4800, 0, 0, 0, 2a1001dfca0
00000000fdb7bcd0 unix:sync_handler+144 (182e400, 1b, 0, 1, 1, 109bc00)
  %l0-3: 000000000188dc90 00000000018d9aa8 00000000018d9800 000000000000017f
  %l4-7: 00000000018bb000 0000000000000000 00000000018b4800 000000000000001b
00000000fdb7bda0 unix:vx_handler+80 (fdb64000, 183dd10, 1896400, 1, 183de18, f006d515)
  %l0-3: 000000000183de18 0000000000000000 0000000000000001 0000000000000001
  %l4-7: 000000000182ec00 00000000f0000000 0000000001000000 0000000001019a68
00000000fdb7be50 unix:callback_handler+20 (fdb64000, fdc30400, 0, 0, 0, 0)
  %l0-3: 0000000000000016 00000000fdb7b701 0000000000000000 0000000000254030
  %l4-7: 0000000000000106 0000000000000000 0000000000000000 00000000018e5800
syncing file systems... done
dumping to /dev/dsk/c0t0d0s1, offset 108396544, content: kernel


Note : it's also possible to break the domain by using the "CTRL-\" combination.
Note : in order for the sendbreak to break the domain, the “Secure Mode” for the domain must be set to “Off”. This can be confirmed via the 'showdomainmode' command.
Note : panic does check the auto-boot? OBP variable or Autoboot variable values. This is controlled by the "halt_on_panic" /etc/system parameter on the domain.

At this stage, the domain should restart and a coredump is available for postmortem analysis.
Make sure to collect a full snapshot (snapshot -L F) for a proper analysis as well as the domain explorer and corefiles.


2. Try to force the domain to panic via the reset command

XSCF> reset -d 0 panic
DomainID to panic:00
Continue? [y|n] :y
00 :Panicked

Note : the reset command will panic the domain whatever the value of the “Secure Mode”.

panic[cpu17]/thread=2a100975ca0: System Panel Driver: Emergency panic request detected!
000002a1009dddf0 oplpanel:panel_intr+a0 (6002188f9d8, 10, 7bf37800, 16, 0, 188e800)
  %l0-3: 000002a1009ddda8 000002a1009dddd0 0000000000000037 00000000018e3c00
  %l4-7: 0000000000000000 0000000000000001 000000007009bc00 0000000000000011
000002a1009ddea0 pcicmu:pcmu_intr_wrapper+54 (3000282d348, 0, 30003056a48, 30006c30000, 8000, 1)
  %l0-3: 00000000018e2800 0000000000000001 0000060021818770 0000000000000000
  %l4-7: 0000000000000001 00000000018e3d64 000006002188f9d8 000000007bf376b4
000002a1009ddf50 unix:current_thread+164 (1, 600219b8ca8, f0d0f0f, f0d0f0f, 0, 1b)
  %l0-3: 00000000010076c8 000002a100974fe1 000000000000000f 000000007002c580
  %l4-7: ffffffffffffffff 000006002cd9e6a8 0000000000000000 000002a100975890
000002a100975930 unix:cpu_halt+180 (16, 18baf60, 11, 1, 16, 30006c30000)
  %l0-3: 0000000000000000 0000000000000001 0000000000000001 0000000001266800
  %l4-7: 000000000f0f0f0f 0000000000020000 0000000000000001 0000000000000011
000002a1009759e0 unix:idle+128 (1832000, 0, 30006c30000, ffffffffffffffff, a, 1831000)
  %l0-3: 00000600219b8ca8 000000000000001b 0000000000000000 ffffffffffffffff
  %l4-7: 00000000018e0c00 0000000000000000 000000000000042c 000000000103ed8c
syncing file systems... done
dumping to /dev/dsk/c0t0d0s1, offset 108396544, content: kernel

At this stage, the domain should restart and a coredump is available for postmortem analysis.
Make sure to collect a full snapshot (snapshot -L F) for a proper analysis as well as the domain explorer and corefiles.


3. Try to send a XIR to the CPUs for the domain via the reset command

XSCF> reset -d 0 xir  
DomainID to reset:00
Continue? [y|n] :y
00 :Reset

If the domain drops to OBP then force a panic using the 'sync' command.

Note : the reset command will XIR the domain regardless of the value of the “Secure Mode”.

ERROR: Externally Initiated Reset has occurred.

{19} ok sync

panic[cpu25]/thread=2a100b55ca0: sync initiated
sched: trap type = 0x3
pid=0, pc=0x1266894, sp=0x2a100b55131, tstate=0x80001605, context=0x0
g1-g7: 0, 18baf60, 0, 30006c42000, 600219b8b88, 3c, 2a100b55ca0
00000000fdb7bcd0 unix:sync_handler+144 (182e400, 1b, 0, 1, 1, 109bc00)
  %l0-3: 000000000188dc90 00000000018d9aa8 00000000018d9800 0000000000000003
  %l4-7: 00000000018bb000 0000000000000000 00000000018b4800 000000000000001b
00000000fdb7bda0 unix:vx_handler+80 (fdb64000, 183dd10, a00003c3ffbf0066, 0, 183de18, f006d515)
  %l0-3: 000000000183de18 0000000000000000 0000000000000001 0000000000000001
  %l4-7: 000000000182ec00 00000000f0000000 0000000001000000 0000000001019a68
00000000fdb7be50 unix:callback_handler+20 (fdb64000, fdc96400, 0, 0, 0, 0)
  %l0-3: 0000000000000016 00000000fdb7b701 0000000000000002 0000000000000001
  %l4-7: 0000060021a25e20 0000000000000000 0000000000000000 000002a100bbdde8
syncing file systems... done
dumping to /dev/dsk/c0t0d0s1, offset 108396544, content: kernel

At this stage, the domain should restart and a coredump is available for postmortem analysis.
Make sure to collect a full snapshot (snapshot -L F) for a proper analysis as well as the domain explorer and corefiles.


4. Try to power-on-reset the domain via the reset command

XSCF> reset -d 0 por
DomainID to reset:00
Continue? [y|n] :y
00 :Reset

The domain will be reset and POST will be invoked for the domain.

XSCF> reset -d 0 por
DomainID to reset:00
Continue? [y|n] :y
00 :Reset

*Note*
 This command only issues the instruction to reset.
 The result of the instruction can be checked by the "showlogs power".

XSCF> showdomainstatus -a
DID         Domain Status
00          Initialization Phase



Note : the reset command will reset the domain whatever the value of the “Secure Mode”.
At this stage, the domain should restart if POST does not detect any further problem.

Note : if the autoboot XSCF parameter or the auto-boot? OBP parameter is set to off/false the domain will not automatically reboot and will stop at the OK prompt.

'sync' could be invoked to force a core dump when dropped to the OBP.

Note : the keyswitch in the Service position would also abort the boot sequence.

Make sure to collect a full snapshot (snapshot -L F) for a proper analysis as well as the domain explorer.



5. Power cycle the platform

The ultimate action, if none of the previous actions has succeeded, would be to power cycle the platform.
Of course, this will impact all of the running domains in the platform.

Make sure to collect a full snapshot (snapshot -L F) for a proper analysis as well as the domain explorer.
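
The order of steps 1 to 5 above can be written down as a simple escalation list. The sketch below only prints the XSCF commands quoted in this document in the order they should be tried; it does not talk to an XSCF. The function name and the decision to skip sendbreak when Secure Mode is on (because the break would not be received) are illustrative assumptions.

```python
def recovery_plan(did, secure_mode_on):
    """Return the ordered list of recovery actions for domain `did`."""
    steps = []
    if not secure_mode_on:
        # sendbreak only breaks the domain when Secure Mode is off
        steps.append(f"sendbreak -d {did}")
    steps.append(f"reset -d {did} panic")
    steps.append(f"reset -d {did} xir")
    steps.append(f"reset -d {did} por")
    steps.append("power cycle the platform (affects all domains)")
    return steps

for number, action in enumerate(recovery_plan(0, secure_mode_on=True), start=1):
    print(number, action)
```

Each step is only attempted if the previous one did not recover the domain, and a full snapshot should be collected whichever step succeeds.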


6. Post-mortem analysis

6.1 - Data collection

Whatever the procedure used to recover the domain from the hang-up situation, this will require a post-mortem analysis in order to understand what happened to the domain.
The minimum data to be collected is :

 

  • a fresh full snapshot (snapshot -L F) from XSCF

  • a fresh explorer from the domain

If a coredump generation was successful from the previous steps, then the corefile must be collected.
These pieces of information must be provided to the TSC engineer in order to understand what happened.


6.2 - OPL specific information

Besides the "regular" coredump and the regular "explorer" analysis, the full snapshot may provide some useful information in order to investigate the hang-up situation further.

Of course, the FMA logs must be checked as well as the other logs : monitor, error, event, panic.
Also the console logs may contain some useful information.

6.3 - error-reset-recovery

When Solaris is running and a RED_state trap or Watchdog Reset occurs, the domain's reaction depends on the setting of the error-reset-recovery OBP parameter.
When the OBP configuration variable "error-reset-recovery" is "boot" (the default value), OBP reports the watchdog log, directs XSCF to issue a POR, and reboots the domain.
When the variable is "none", OBP reports the log and stops at the OBP prompt.
When the variable is "sync", OBP reports the log and executes the "sync" callback command.

When the RED_state trap or Watchdog Reset occurs while OBP is initializing, the error-reset-recovery setting is ignored and the domain stops at the OBP prompt.

When an XIR is issued, the error-reset-recovery setting is ignored and the domain stops at the OBP prompt.
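
The rules of section 6.3 can be condensed into a small decision function. This is a sketch of the documented behaviour only; the function name, the event labels and the wording of the returned strings are illustrative assumptions.

```python
def obp_reaction(event, error_reset_recovery="boot", obp_initializing=False):
    """Describe what OBP does after a reset trap, per the rules in section 6.3."""
    if event not in ("RED_state", "WDR", "XIR"):
        raise ValueError("unknown event: " + event)
    # An XIR, or a trap taken while OBP is initializing, ignores the variable.
    if event == "XIR" or obp_initializing:
        return "stop at the OBP prompt"
    actions = {
        "boot": "report the log, request a POR from XSCF and reboot the domain",
        "none": "report the log and stop at the OBP prompt",
        "sync": "report the log and run the sync callback",
    }
    return actions[error_reset_recovery]

print(obp_reaction("WDR"))
print(obp_reaction("XIR", "sync"))
```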




Internal Comments

The following information can be useful during internal troubleshooting of a hung or unresponsive domain:

 XSCF will operate based on the id_code value (Monitoring Target ID Code
 indicates the component of monitoring target). This value is updated during the domain poweron sequence.

 /******************* Alive id_code  ******************/
 #define CMEM_ALIVE_ID_POST 0x1     /**< Alive_watch POST */
 #define CMEM_ALIVE_ID_OBP 0x2     /**< Alive_watch OBP */
 #define CMEM_ALIVE_ID_SCFDRV 0x10    /**< Alive_watch SCF Driver*/

 While checking an explorer/snapshot, since the scfd.conf is not collected, it's possible to determine if the Alive check is enabled or not by
 dumping the id_code BDB value for the domain. Use the dbdump tool available in the toolset.

 Examples :
* The domain 0 is running with  scf-alive-check-function="off" or is currently running OBP :
  bash-3.00$ dbdump -l
  cmem.current.current_domain_info[0].id_code = 02

* The domain 1 is running with scf-alive-check-function="on" :
  bash-3.00$ dbdump -l
  cmem.current.current_domain_info[1].id_code = 10
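
The id_code lines in the two examples above can be decoded with the CMEM_ALIVE_ID_* values listed earlier. The sketch below assumes the value printed by dbdump is hexadecimal, as the "10" shown for the SCF driver suggests; the function name is hypothetical.

```python
import re

# Values taken from the CMEM_ALIVE_ID_* definitions above.
ID_CODES = {0x01: "POST", 0x02: "OBP", 0x10: "SCF driver"}
LINE_RE = re.compile(r"current_domain_info\[(\d+)\]\.id_code\s*=\s*([0-9a-fA-F]+)")

def decode_id_code(dbdump_line):
    """Return (domain_id, monitoring_target) or None if the line does not match."""
    match = LINE_RE.search(dbdump_line)
    if match is None:
        return None
    domain = int(match.group(1))
    code = int(match.group(2), 16)  # assumed hexadecimal
    return domain, ID_CODES.get(code, "unknown")

print(decode_id_code("cmem.current.current_domain_info[0].id_code = 02"))
print(decode_id_code("cmem.current.current_domain_info[1].id_code = 10"))
```

As noted in the first example, a result of "OBP" does not by itself distinguish a domain sitting at OBP from a Solaris domain running with scf-alive-check-function="off".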

 Note : The keyswitch position and Secure mode setting are also available in the snapshot.

 When the Alive check is enabled, XSCF is using some timeout parameters to monitor the domains. Those parameters are configurable in the scfd.conf file :

 Note : it may not be appropriate to change this default setting
       * scf-alive-interval-time :   The interval time that the service processor (XSCF) periodically monitors Solaris. 
          Specify this parameter in minutes. The range is 1 - 10 minutes. The default is 2  minutes.

 scf-alive-interval-time=2;

Note: The Interrupt interval scf-alive-interval-time must be less than the monitoring timeout scf-alive-monitor-time.

     * scf-alive-monitor-time :   The time that the service processor (XSCF) detects
        Solaris[TM] hang-up.  The service processor (XSCF) executes OS panic by timeout of this timer. 
        Specify this parameter in minutes. The range is 3 - 30 minutes. The default is 6 minutes.

 scf-alive-monitor-time=6;

Note: The value of scf-alive-monitor-time should be bigger than the scf-alive-interval-time value.
     * scf-alive-panic-time :   the time that the service processor (XSCF) detects OS panic hang-up.
         The service processor (XSCF) executes the system reset (XIR) by timeout of this timer.
          Specify this parameter in minutes. The range is 30 - 360 minutes. The default is 30 minutes.

 scf-alive-panic-time=30;
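
The ranges and the ordering constraint for the three timers can be checked before editing scfd.conf. The following is a minimal sketch using only the ranges and defaults quoted above; the function name and the shape of the input dictionary are illustrative assumptions.

```python
# Allowed ranges and defaults, in minutes, as quoted above.
RANGES = {
    "scf-alive-interval-time": (1, 10),
    "scf-alive-monitor-time": (3, 30),
    "scf-alive-panic-time": (30, 360),
}
DEFAULTS = {
    "scf-alive-interval-time": 2,
    "scf-alive-monitor-time": 6,
    "scf-alive-panic-time": 30,
}

def check_timers(settings):
    """Return a list of problems found; an empty list means the values are consistent."""
    merged = {**DEFAULTS, **settings}
    problems = []
    for name, (low, high) in RANGES.items():
        if not low <= merged[name] <= high:
            problems.append(f"{name}={merged[name]} outside {low}-{high} minutes")
    if merged["scf-alive-interval-time"] >= merged["scf-alive-monitor-time"]:
        problems.append("scf-alive-interval-time must be less than scf-alive-monitor-time")
    return problems

print(check_timers({}))
print(check_timers({"scf-alive-interval-time": 6}))
```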


 More information on which software defects the HCP software can detect is available at
 http://re.west/menus/SW_Projects/Current/OPL-SP/builds/nightly/ppc/testFF_P/col2sun/build/noarch/docs/fm/scf.html/sw.html

 The levels reported in the showlogs error output for a domain hangup, are defined as follows:

 /************************************
 Alive Watch Level
  ***********************************/
 #define  CMEM_ALIVE_PATH_CHANGE      0x00  /**< SCFI path change */
 #define  CMEM_ALIVE_LV_1_PANIC  0x01   /**< Panel Request */
 #define  CMEM_ALIVE_LV_2_XIR  0x02   /**< Xir */
 #define CMEM_ALIVE_LV_3_RESET 0x03   /**< reset */
 #define CMEM_ALIVE_LV_4_FPOFF 0x04   /**< F-POFF */
 #define CMEM_ALIVE_NO_ERROR 0xFF   /**< Alive No error */

 So for each step described above (msg-fail, panic-fail etc...), the system hangup level will be incremented.  This information is also available in the alive_level BDB field.

 bash-3.00$ % dbdump -l cmem.current.current_domain_info[0].alive_level
 cmem.current.current_domain_info[0].alive_level = ff
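
The alive_level value can be translated back to the level names defined above. This is a sketch that assumes the dbdump value is hexadecimal (the "ff" above matches CMEM_ALIVE_NO_ERROR); the labels are copied from the comments in the definitions and the function name is hypothetical.

```python
# Labels copied from the "Alive Watch Level" definitions above.
ALIVE_LEVELS = {
    0x00: "SCFI path change",
    0x01: "Panel Request",
    0x02: "Xir",
    0x03: "reset",
    0x04: "F-POFF",
    0xFF: "Alive No error",
}

def decode_alive_level(dbdump_line):
    """Return the label for the alive_level value at the end of a dbdump line."""
    value = dbdump_line.rsplit("=", 1)[1].strip()
    return ALIVE_LEVELS.get(int(value, 16), "unknown")

print(decode_alive_level("cmem.current.current_domain_info[0].alive_level = ff"))
```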

 This section provides some more internal information for Sun employees to investigate hung domains. The Red log can be very valuable when diagnosing a hang-up situation.

 In some rare situations, going through the SCF Traces may help to understand what happened.

 Note also that, using the Snapshot Analysis Toolset, you may use the off-platform viewer to read the 'showlogs obp' or 'showlogs detail' output.

 Using the Snapshot Analysis Toolset, it is also possible to read the Redlog, which could contain some helpful information in the context of a Solaris[TM] hang.
 The Redlog alone may not be very helpful, but combined with a corefile it might be decisive in determining the root cause of a hang.


 The redlog records detailed information at the time the RED State exception occurred. Reset traps like WDR, XIR and RED cause CPUs to enter RSTV (Reset Vector).

 When a RED_state trap or Watchdog Reset is generated while the OS is operating, OBP records the full content of various internal CPU and TLB registers in SRAM. XSCF keeps only one generation of the log for each CPU chip, so it will be overwritten by the next RED occurrence.

 The information collected is equivalent to the following OBP commands :

     * show-cpu-registers
     * show-regs&stack-all

 The RED log will be saved in the XSCF filesystem.  There is one redlog  file per chip (up to 8 strands) : red_log_xx_yy where :

     * xx: CMU number(0x00-0x0F)
     * yy: CPU Chip number in the CMU (0x00-0x03)

 Those files are collected by snapshot.  Note : Only full snapshots (obtained by snapshot -L F) contain redlog information.
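
The red_log_xx_yy naming scheme can be decoded when walking an extracted snapshot. The sketch below assumes both fields are two hexadecimal digits, as in the ranges given above; the function name is hypothetical.

```python
import re

NAME_RE = re.compile(r"red_log_([0-9a-fA-F]{2})_([0-9a-fA-F]{2})$")

def parse_redlog_name(path):
    """Return (cmu_number, chip_number) for a redlog file path, or None if invalid."""
    match = NAME_RE.search(path)
    if match is None:
        return None
    cmu = int(match.group(1), 16)
    chip = int(match.group(2), 16)
    # Documented ranges: CMU 0x00-0x0F, chip 0x00-0x03
    if cmu > 0x0F or chip > 0x03:
        return None
    return cmu, chip

print(parse_redlog_name("xscf_logs/scf/log/red_log_01_02"))
```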

 Example from a M5000 snapshot :
 bash $ ls xscf_logs/scf/log/red_log_0*
 xscf_logs/scf/log/red_log_00_00
 xscf_logs/scf/log/red_log_00_03
 xscf_logs/scf/log/red_log_01_02
 xscf_logs/scf/log/red_log_00_01
 xscf_logs/scf/log/red_log_01_00
 xscf_logs/scf/log/red_log_01_03
 xscf_logs/scf/log/red_log_00_02
 xscf_logs/scf/log/red_log_01_01

 In each file, XSCF adds the header to the data sent by OBP as below:

     * 0x00-0x07: Timestamp
     * 0x08-0x09: LOG-ID
     * 0x0A-0x0F: Reserve
     * 0x0010- 0x200F: Strand0 RED log data sent by OBP
     * 0x2010- 0x400F: Strand1 RED log data sent by OBP
     * 0x4010- 0x600F: Strand2 RED log data sent by OBP
     * 0x6010- 0x800F: Strand3 RED log data sent by OBP
     * 0x8010- 0xA00F: Strand4 RED log data sent by OBP
     * 0xA010- 0xC00F: Strand5 RED log data sent by OBP
     * 0xC010- 0xE00F: Strand6 RED log data sent by OBP
     * 0xE010-0x1000F: Strand7 RED log data sent by OBP
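
The per-strand offsets listed above follow a regular pattern: a 0x10-byte header followed by eight 0x2000-byte strand areas. A short sketch of that arithmetic, with hypothetical constant and function names:

```python
HEADER_SIZE = 0x10    # timestamp, LOG-ID and reserved bytes
STRAND_SIZE = 0x2000  # size of each strand's RED log data area

def strand_range(strand):
    """Return the (first, last) byte offsets of a strand's data in a redlog file."""
    if not 0 <= strand <= 7:
        raise ValueError("strand must be between 0 and 7")
    start = HEADER_SIZE + strand * STRAND_SIZE
    return start, start + STRAND_SIZE - 1

for strand in range(8):
    first, last = strand_range(strand)
    print(f"Strand{strand}: 0x{first:04X}-0x{last:04X}")
```

The computed ranges reproduce the list above, from 0x0010-0x200F for Strand0 up to 0xE010-0x1000F for Strand7.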

 The log type can be :

     * WDR_LOG : Watchdog Reset
     * RED_LOG : RED state trap
     * XIR_LOG : XIR


 The snapshot analysis toolset (CLI and Web versions, Off-platform/Showlogs/showlogs redlog) provides an off-platform viewer to read these logs.

 bash-3.2$ showlogs
 usage:  showlogs [-t time [-T time]|-p timestamp] [-v|-V|-S] [-r] [-M] error
 [...]
 showlogs redlog [-chip ] [-d] [-c] [-x] [-l] [-t] [-s] [-v] [-nl]  [-nt] [-nx]
 -d (debug) -c (cpuid) -x (expand TLB)
 -l (show local registers) -t (show TLB)
 -s (silent) -v (verbose)
 -nl (no local registers)
 -nt (no TLB informations)
 -nx (no expand TLB)

As an example, consider the information provided for the first 2 strands of chip 01_02.

Chip 01_02 is the 3rd chip on CMU#1 :

 CMU#1 Status:Normal; Ver:0101h; Serial:PP06446534 ;
 + FRU-Part-Number:CA06620-D002 B1 /371-2214-02 ;
 + Memory_Size:64 GB;
 [...]
 CPUM#2-CHIP#0 Status:Normal; Ver:0301h;
 Serial:PP072701K6 ;
 + FRU-Part-Number:CA06620-D024 A1 /371-2216-01 ;
 + Freq:2.400 GHz; Type:16;
 + Core:2; Strand:2;

 Where the XSB is assigned to the LSB 01

 XSB R DID(LSB) Assignment Pwr Conn Conf Test Fault COD
 ---- - -------- ----------- ---- ---- ---- ------- -------- ----
 01-0 00(01) Assigned y y y Passed Normal n

 So the file red_log_01_02 will contain the information from the 4  strands from this CPUM :

 bash-3.2$ showlogs redlog -chip 01_02 | grep CPUID
 ******* shoe_oplredlog version 4.21 *******
 CPUID : 030 (cpu48)
 CPUID : 031 (cpu49)
 CPUID : 032 (cpu50)
 CPUID : 033 (cpu51)

 Let's dump the log for the first strand.

 Note : the SFSR/SFAR values present in the output can be decoded with :
 https://cores2-web.oraclecorp.com/cgi-bin/opltools/oplTools.cgi?SFAR=true
 https://cores2-web.oraclecorp.com/cgi-bin/opltools/oplTools.cgi?SFSR=true
 

 bash-3.2$ showlogs redlog -chip 01_02
 ******* shoe_oplredlog version 4.21 *******
 File : ./xscf_logs/scf/log/red_log_01_02
 File_offset : 0x30
 DATE : Oct 08 12:06:04.675 CEST 2008
 Log-format : RDLJ_F10
 reset-magic : XIR_LOG
 cpu-bitmap : [..127]   f0f0f0f0f0f0f0f0.f0f0f0f000000000
                   : [..255]   0000000000000000.0000000000000000
                   : [..383]   0000000000000000.0000000000000000
                   : [..511]   0000000000000000.0000000000000000
      CPUID : 030 (cpu48)
    HOSTID : 00000000.847c8a8a

                             %tl = 00000000.00000001
                             %tba = 00000000.01000000

         TT             TPC                        TNPC                      TSTATE
 TL1: 03 00000000.01218f74  00000000.01218f78   00000000.80001607
 TL2: d8 00000000.010077b4 00000000.010077b8  00000044.00001500
 TL3: 68 00000000.01005988 00000000.0100598c  00000008.10001504
 TL4: 00 00000000.00000000 00000000.00000000  00000000.00000000
 TL5: 01 00000000.00000000  00000000.00000000 00000000.00000000

                           %ecr[10,4c] = 00000000.00000002 ( WEAK_ED )
                          %isfsr[18,50] = 00000000.00008008 ( No Error )
                        %isfpar[78,50] = 00000000.00000000
                         %dsfsr[18,58] = 00000000.00808007 ( FV OW W TM ASI:80 )
                       %dsfpar[78,58] = 00000000.00000000
                         %dsfar[20,58] = 00000000.ff343f08
                  %dfault-adr[30,58] = 00000601.03214000
                           %afsr[00,4c] = 00000000.00000000 ( No Error)
                         %ugesr[08,4c] = 00000000.00000000 ( No Error)
                      %stchger[18,4c] = 00000000.00000000 ( No Error )
                       %iiu-insttrap[00,60] = 00000000.00000000
                       %dev-serial[00,53] = 00009100.a6f0a105 (f-45598-47-3133-0 )

                    %pstate = 00000000.00000035                      %ccr = 00000000.00000044
                          %asi = 00000000.00000015                       %pil = 00000000.00000000
                            %y = 00000000.00000000                     %fprs = 00000000.00000000
                     %softint = 00000000.00010000                    %cwp = 00000000.00000007
                  %cansave = 00000000.00000005           %canrestore = 00000000.00000001
                 %otherwin = 00000000.00000000                 %wstate = 00000000.0000000e
                 %cleanwin = 00000000.00000007                      %ver = 00040006.92000507
   %int-vector0[40,7f] = 00000000.00000416
   %int-vector2[50,7f] = 00000000.00000000
   %int-vector4[60,7f] = 00000000.00000000
     %jb-config[00,4a] = 00000000.0000a030
     %pcontext[08,58] = 00000000.00000000
      %scontext[18,58] = 00000000.00000000
             %eidr[00,6e] = 00000000.00002030
  %i8k-tsb-ptr[00,51] = 0000034f.e000b840
 %i64k-tsb-ptr[00,52] = 0000034f.e000af00
  %d8k-tsb-ptr[00,59] = 0000034f.e00090a0
%d64k-tsb-ptr[00,5a] = 0000034f.e000b210
           %dcucr[00,45] = 00000000.00000000
 %mcntl[08,45] = 00000000.00002000
 %itagtarget[00,50] = 00000000.000001eb
 %itsb[28,50] = 0000034f.e0008001
 %itsb-pext[48,50] = 7fffffff.fffff00f
 %itsb-next[58,50] = 7fffffff.fffff00f
 %dtagtarget[00,58] = 00000000.0018040c
 %dtsb[28,58] = 0000034f.e0008001
 %dtsb-pext[48,58] = 7fffffff.fffff00f
 %dtsb-next[58,58] = 7fffffff.fffff00f
 %dtsb-sext[50,58] = 7fffffff.fffff00f
 %dtsb-direct[00,5b] = 0000038f.e002ba10
 %va-wptr[38,58] = 00000000.00000000
 %pa-wptr[40,58] = 00000000.00000000
 %l2ctrl[10,6a] = 00000000.00000000
 %asi-scratch0[00,4f] = ffffffff.ffffffff
 %asi-scratch1[08,4f] = 00000000.00000001
 %asi-scratch2[10,4f] = 00000000.00000000
 %asi-scratch3[18,4f] = 80c003ce.dfc6003f
 %asi-scratch4[20,4f] = ffffffff.ffffffff
 %asi-scratch5[28,4f] = 00000000.00000007
 %asi-scratch6[30,4f] = 00000000.01218f78
 %asi-scratch7[38,4f] = 00000000.00000000
      Normal            Alternate         MMU               Vector
 %g0: 00000000.00000000 00000000.00000000 00000000.00000000 00000000.00000000
 %g1: 00000000.00000000 00000000.00000007 000003cf.fb8190a0 00000000.00000040
 %g2: 00000000.018b8dd0 00000000.01218f78 00000601.03214000 00000000.00000006
 [...]
 %g7: 000002a1.01a87ca0 00000000.80001600 000003cf.fe600000 00000000.00000030
 %o0(CWP=7) = 00000600.f561c480 %l0(CWP=7) = 00000000.00000000
 %o1(CWP=7) = 00010000.00000000 %l1(CWP=7) = 00000000.00000014
 [...]
 %o6(CWP=7) = 000002a1.01a87131 %l6(CWP=7) = 00000000.00000030
 %o7(CWP=7) = 00000000.0103d924 %l7(CWP=7) = 00000000.018b8f00
 [...]
 %o0(CWP=0) = 00000a0e.5d30566c %l0(CWP=0) = 00000000.80001607
 %o1(CWP=0) = 00000300.08a045b8 %l1(CWP=0) = 00000000.00000016
 [...]
 %o6(CWP=0) = 000002a1.01a86fe1 %l6(CWP=0) = 00000000.00000000
 %o7(CWP=0) = 00000000.0100ab9c %l7(CWP=0) = 000002a1.01a87890
 fITLB SNI CC
 ### ----TAG--------- -----DATA------- CTX -------VA------- VZFEsw2 ----- PA----- swLPVEPWG
 000 00000000f0000000 e00003cfffc00064: 0000 00000000f0000000 1300000 03cfffc00000 001100100
 001 00000000fe9c205b e000034ff4000020: 005b 00000000fe9c2000 1300000 034ff4000000 000100000
 [...]
 01e 000000010241410a e000034fe5800020: 010a 0000000102414000 1300000 034fe5800000 000100000
 01f 0000000001000000 e00003cffe800064: 0000 0000000001000000 1300000 03cffe800000 001100100
 fDTLB SNI CC
 ### ----TAG--------- -----DATA------- CTX -------VA------- VZFEsw2 ----- PA----- swLPVEPWG
 000 ffffffffffffc000 800003cfffbde066: 0000 ffffffffffffc000 1000000 03cfffbde000 001100110
 001 00000000f0000000 e00003cfffc00066: 0000 00000000f0000000 1300000 03cfffc00000 001100110
 002 00000000fff10000 a00003cfffbf0066: 0000 00000000fff10000 1100000 03cfffbf0000 001100110
 [...]
 01e 0000000001000000 e00003cffe800064: 0000 0000000001000000 1300000 03cffe800000 001100100
 01f 0000000001800000 e00003cffd800066: 0000 0000000001800000 1300000 03cffd800000 001100110
 PC:  FJSV,SPARC64-VI:cpu_halt_cpu+4
 Last leaf: call  FJSV,SPARC64-VI:cpu_halt_cpu from unix:cpu_halt+188  0 w %o0-%o7: (600f561c480 1000000000000 f0e0f0f070f0f0e f0e0f0f070f0f0e 0 1b 2a101a87131 103d924 )
 jmpl unix:cpu_halt from unix:idle+128 1 w %o0-%o7: (16 18b8dd0 16 30 30008a04000 1 2a101a871e1 105b040 )
 jmpl unix:idle from unix:thread_start+4 2 w %o0-%o7: (1832000 030008a04000 ffffffffffffffff 19 1831000 2a101a87291  1046f48 )


 alive, check, monitor, hang, domain, panic, hung, scfd.conf, reset,
 por, xir, unresponsive, Mx000, M3000, M4000, M5000, M8000, M9000




  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.