Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-71-1007746.1
Update Date:2011-05-10

Solution Type  Technical Instruction Sure

Solution  1007746.1 :   SunFire[TM] 12K/15K/E20K/E25K: Expected behavior of domains in different scenarios when the SCs are powered down or rebooted  


Related Items
  • Sun Fire E25K Server
  • Sun Fire E20K Server
  • Sun Fire 12K Server
  • Sun Fire 15K Server
Related Categories
  • GCS>Sun Microsystems>Servers>High-End Servers

PreviouslyPublishedAs
210728


Applies to:

Sun Fire 12K Server
Sun Fire 15K Server
Sun Fire E20K Server
Sun Fire E25K Server
All Platforms

Goal

What happens to my running domains when both System Controllers are powered off or at the ok prompt?

The answer to this question varies depending on the details of the specific scenario. This document will explain those different scenarios and expected outcomes.

Solution

To begin, let's take a look at three important services that the System Controllers (SCs) provide to the domains.
  1. Each SC provides a system clock source to every domain in the platform.
  2. The SCs provide a means of monitoring the hardware environment of the platform.
  3. The SCs monitor the operating system state of the domains.

First, each SC provides a 75 MHz clock source to the entire platform. The clock is generated by hardware on the SC and is present when the SC has power. The two clock sources (one from each SC) are synchronized so that if one fails, the domains can continue to run off the other clock source. If an SC is at the ok prompt, it will still provide a clock source to the domains. 

Note: with SMS 1.4 and higher, you can find out the clock status using 'showboards -c'.

Second, the MAIN SC monitors the environmental status of each component in the platform. This is accomplished by the esmd daemon over the I2C bus. The esmd daemon monitors for high or low temperatures, voltages, and current levels. If esmd detects a dangerous value, it can signal SMS to take the appropriate action to protect the hardware - for example, increasing fan speed or powering off components. Since this is part of the SMS software, it will not run when SMS is stopped.

Third, SMS monitors the operating system on each domain to ensure that it remains up and running. The SC periodically sends a heartbeat signal to the domain. If the SC doesn't receive a timely response, it sends a reset to the domain to recover it. The SC will also restore the state of the domain if the domain panics or crashes due to a hardware stop. If SMS is not running and you lose a domain, the domain will not come back up until SMS is restarted.
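
The heartbeat check described above can be pictured with a small model. This is only a sketch of the idea, not SMS code: the function name, the timestamps, and the 30-second timeout are hypothetical values chosen for illustration, and the real heartbeat interval and recovery logic are not given in this document.

```python
# Hypothetical model of a heartbeat timeout check; not part of SMS.
def domains_needing_reset(last_response, now, timeout):
    """Return the domains whose last heartbeat response is older than `timeout` seconds.

    last_response: dict mapping a domain name to the time (in seconds) of its
                   most recent heartbeat response.
    now:           current time in the same units.
    timeout:       assumed maximum age of a response before a reset is sent.
    """
    return sorted(
        domain
        for domain, responded_at in last_response.items()
        if now - responded_at > timeout
    )


# Domain A answered 5 seconds ago, domain B answered 65 seconds ago.
overdue = domains_needing_reset({"A": 100.0, "B": 40.0}, now=105.0, timeout=30.0)
print(overdue)
```

In this model only domain B would be reset, because its last response is older than the assumed timeout.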

SMS is designed to protect the hardware from getting into a state that can cause permanent damage to the components, like overheating or shorting out. Therefore, if you try to run without any SC monitoring the platform, SMS may take down the domains to prevent possible damage. The following chart describes several different scenarios and what will happen to the domains.

| Example # | MAIN SC State   | SPARE SC State | Domain State      |
|-----------+-----------------+----------------+-------------------|
| 1         | ok prompt       | ok prompt      | stay up           |
| 2         | init 6          | ok prompt      | stay up           |
| 3         | sms stop        | powered off    | graceful shutdown |
| 4         | init 0/shutdown | powered off    | graceful shutdown |
| 5         | init 6          | powered off    | graceful shutdown |
| 6         | halt            | powered off    | stay up           |
| 7         | send break      | powered off    | stay up           |
| 8         | lose power      | powered off    | global stop       |


More details:

  1. If both SCs are at the ok prompt, a clock source is still provided to the domains, but there is no monitoring by esmd. The domains will stay up and the fans will run at high speed. The platform is often put into this state for maintenance procedures, but it is not recommended to leave it in this state for long periods of time.
  2. Like Example 1, both SCs remain powered on and the domains all stay up.
  3. The SMS shutdown script checks whether both SCs are powered on. If the spare is off and SMS is stopped on the main, all the domains will be gracefully shut down, on the assumption that the main SC is about to be shut down as well. This protects the hardware from running without esmd monitoring and also prevents a dstop from clock loss if the main SC is powered off.
  4. When init 0 is run on the main SC, it runs K20sms, which is the SMS shutdown script. At this point, it follows the same action as Example 3 in the table above.
  5. init 6 also runs K20sms, so it's the same as Examples 3 & 4.
  6. When you run 'halt' on the main SC, a graceful shutdown of the SC does not occur. The SC goes directly to the ok prompt without running any of the shutdown scripts that would cause SMS to be shut down, so the domains stay up. Again, it's not recommended to leave the platform in this state for an extended period of time.
  7. If a break signal is sent to the MAIN SC (<cr> <~> <ctrl> B), the SC will drop straight to the ok prompt. Therefore, the domains will stay up because the shutdown scripts are bypassed.
  8. If both SCs are powered off suddenly, all of the running domains will dstop due to the loss of clock. This could occur if someone turns off the breakers (which is never recommended) or there is an unexpected failure. The SMS software will not allow you to power off the MAIN SC from the command line.
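
The eight examples reduce to a simple rule: what matters is whether the SMS shutdown script runs on the MAIN SC and whether the SPARE SC still has power. The following sketch encodes that rule and checks it against the table. It is an illustrative model only, not SMS code; the function and constant names are hypothetical, and combinations that the table does not list (such as the MAIN SC losing power while the SPARE is powered on) are deliberately rejected rather than guessed.

```python
# Illustrative model of the scenario table above; this is not SMS code.
# MAIN SC actions that run the SMS shutdown script.
RUNS_SMS_SHUTDOWN = {"sms stop", "init 0/shutdown", "init 6"}
# MAIN SC states or actions that reach the ok prompt without the shutdown scripts.
BYPASSES_SHUTDOWN = {"ok prompt", "halt", "send break"}


def expected_domain_state(main_action, spare_state):
    """Return the domain outcome the table predicts for one scenario."""
    if spare_state not in ("ok prompt", "powered off"):
        raise ValueError("unknown SPARE SC state: %r" % spare_state)
    if main_action == "lose power":
        if spare_state == "powered off":
            return "global stop"  # no SC is left to supply a clock
        raise ValueError("combination not covered by the table")
    if main_action in BYPASSES_SHUTDOWN:
        return "stay up"  # clock still present, shutdown scripts never run
    if main_action in RUNS_SMS_SHUTDOWN:
        # The shutdown script takes the domains down only when the spare is off.
        return "stay up" if spare_state == "ok prompt" else "graceful shutdown"
    raise ValueError("unknown MAIN SC action: %r" % main_action)


# The eight rows of the table: (MAIN SC state, SPARE SC state, domain state).
TABLE = [
    ("ok prompt", "ok prompt", "stay up"),
    ("init 6", "ok prompt", "stay up"),
    ("sms stop", "powered off", "graceful shutdown"),
    ("init 0/shutdown", "powered off", "graceful shutdown"),
    ("init 6", "powered off", "graceful shutdown"),
    ("halt", "powered off", "stay up"),
    ("send break", "powered off", "stay up"),
    ("lose power", "powered off", "global stop"),
]

for main_action, spare_state, outcome in TABLE:
    assert expected_domain_state(main_action, spare_state) == outcome
```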

NOTE: In all instances where the domains are taken down, SMS will automatically restore them to their previous state when it is brought back up. 

Here is a sample of the output produced when the domains are shut down:

# ./sms stop
sms: SMS is being shutdown on the only present and powered on SC.
Sep 15 12:20:16 sc0 sms-svc: sms: SMS is being shutdown on the only present and powered on SC.
All domains are being shutdown gracefully and all boards are being powered off. . .
Sep 15 12:20:16 sc0 sms-svc: All domains are being shutdown gracefully and all boards are being powered off. . .
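
If you need to confirm after the fact that the domains were taken down by the SMS shutdown script rather than by a failure, the messages above can be searched for in the SC's logged messages. The sketch below is a hypothetical helper, not an SMS tool: it assumes the log lines have already been read into a list, and it matches only the message text shown in the sample output above.

```python
# Hypothetical helper; matches the message text from the sample output above.
SHUTDOWN_MARKER = "All domains are being shutdown gracefully"


def find_graceful_shutdowns(lines):
    """Return the logged (sms-svc) lines that report a graceful domain shutdown."""
    return [
        line.rstrip("\n")
        for line in lines
        if "sms-svc:" in line and SHUTDOWN_MARKER in line
    ]


sample = [
    "sms: SMS is being shutdown on the only present and powered on SC.",
    "Sep 15 12:20:16 sc0 sms-svc: sms: SMS is being shutdown on the only present and powered on SC.",
    "All domains are being shutdown gracefully and all boards are being powered off. . .",
    "Sep 15 12:20:16 sc0 sms-svc: All domains are being shutdown gracefully and all boards are being powered off. . .",
]

for entry in find_graceful_shutdowns(sample):
    print(entry)
```

Only the timestamped sms-svc entry is returned; the console copy of the same message is skipped because it carries no timestamp or host name.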


Product
Sun Fire E25K Server
Sun Fire E20K Server
Sun Fire 15K Server
Sun Fire 12K Server


Internal Section

Keywords: SMS, shutdown, reboot, power, domains

Previously Published As 78150



  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.