Domains on Sun Fire [TM] 12k/15k/E20k/E25k may dstop when SMS is simultaneously started on both SCs

Asset ID:	1-71-1006623.1
Update Date:	2010-07-20
Keywords:

Solution Type Technical Instruction Sure

Solution 1006623.1 : Domains on Sun Fire [TM] 12k/15k/E20k/E25k may dstop when SMS is simultaneously started on both SCs

Related Items


Sun Fire E25K Server
Sun Fire E20K Server
Sun Fire 12K Server
Sun Fire 15K Server

PreviouslyPublishedAs
209238

Description

Domains on Sun Fire [TM] 12k/15k/E20k/E25k may dstop due to clock issues when SMS is simultaneously started on both SCs.

Steps to Follow

The system controller (SC) in Sun Fire high-end systems is a multifunction, CP1500- or CP2140-based printed circuit board (PCB) that provides critical services and resources required for the operation and control of the Sun Fire system. The SCs provide these services through the System Management Services (SMS) software. Among other things, the SCs provide:

- a synchronized hardware clock source.
- redundancy and automated SC failover in dual-SC configurations.

The SC that controls the platform is referred to as the main SC, while the other SC acts as a backup and is called the spare SC.

When SMS is started simultaneously on both SCs, both start their daemons in "main" mode. During the startup period, the main SC reprograms the boards to select its clock as input. Because each SC believes that it is the main SC, both try to reprogram the boards, and the synchronization of the clocks is not guaranteed (the spare SC is supposed to "follow" the main SC's clock).

The SMS software checks that the SCs' clocks are synchronized before reprogramming the boards, but the programming process may take a couple of seconds and the clocks may lose synchronization during this small window. In that case, the domains using the reprogrammed boards may dstop due to clock issues.

After a short period, the SMS software detects that both SCs are trying to become the main SC, and SC0 resets SC1 to prevent what is called a split-brain situation. However, this detection occurs too late to prevent the clocks from losing synchronization while the boards are being reprogrammed.

As a result, we always recommend starting SMS on the previous main SC first, waiting for the SMS startup process to complete, and only then starting SMS on the spare SC.

For example, from the README file of the SMS patches:

Special Install Instructions:
-----------------------------
Follow these steps when installing on the SC:
1. Record which SC is the main SC.
2. Disable failover on MAIN SC (setfailover off).
3. Stop the SMS processes on both SC's simultaneously.
   /etc/init.d/sms stop
4. Install the patch on both SC's.
5. Start the SMS processes on the previous main SC first.
   /etc/init.d/sms start
6. After all the sms processes have started (i.e. you're able to run the
   showenvironment command and get all the system's status),
   start the SMS processes on the Spare SC next.
7. Enable failover on MAIN SC (setfailover on).
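
The ordering in steps 5 to 7 can be sketched as a small script. This is only an illustration under stated assumptions: `run` is a hypothetical callable (for example a wrapper around a remote shell) that executes a command on a named SC and returns its exit status, and treating a zero exit status from showenvironment as "all the sms processes have started" is an assumption rather than documented SMS behavior. Only the commands themselves come from the README above.

```python
import time
from typing import Callable

# Hypothetical transport: run(sc_name, command) -> exit status (0 = success).
Runner = Callable[[str, str], int]


def start_sms_in_order(run: Runner, main_sc: str, spare_sc: str,
                       poll_seconds: float = 10.0, max_polls: int = 60,
                       sleep: Callable[[float], None] = time.sleep) -> None:
    """Start SMS on the previous main SC, wait for it, then start the spare."""
    # Step 5: start SMS on the previous main SC first.
    run(main_sc, "/etc/init.d/sms start")

    # Step 6: wait until showenvironment answers on the main SC.
    # Assumption: exit status 0 means the SMS processes are up.
    for _ in range(max_polls):
        if run(main_sc, "showenvironment") == 0:
            break
        sleep(poll_seconds)
    else:
        # Never start the spare while the main SC is still coming up.
        raise TimeoutError(f"SMS did not come up on {main_sc}; spare not started")

    # Only now start SMS on the spare SC.
    run(spare_sc, "/etc/init.d/sms start")

    # Step 7: re-enable failover on the main SC.
    run(main_sc, "setfailover on")
```

The polling interval and the maximum number of polls are arbitrary defaults. The point of the sketch is that the spare SC is never started before the main SC answers, and never started at all if the main SC does not come up.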

Product
Sun Fire E25K Server
Sun Fire E20K Server
Sun Fire 15K Server
Sun Fire 12K Server

Internal Comments
1) System Clock Background
===========================

Each System Controller (SC) has several clock frequency generators.

During normal operation, the SCs use the 75 MHz voltage-controlled crystal oscillator (VCXO) to source clocking for the entire platform.
The control voltage circuit of the VCXO either uses the output of a phase-frequency comparator or a fixed voltage reference to drive the VCXO.

The selection of the reference is done via a software write to a register (REF_SEL) in the Gchip.

When REF_SEL=1 (this is called 'leader mode'), the SC uses the fixed voltage reference to drive the VCXO.

When REF_SEL=0 (this is called 'follower mode'), the SC compares its clock against the clock coming from the other SC and uses the error signal resulting from this comparison to drive the VCXO. This causes the SC clock to follow the other SC's clock.

By setting the MAIN SC REF_SEL to 1 and the SPARE SC REF_SEL to 0, the two SCs will generate identical clock signals. This is called phase lock, and it is necessary to ensure a safe clock failover.

When the SCs are both in 'leader mode' or both in 'follower mode', the phase lock is not guaranteed.
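
The REF_SEL rules above reduce to a simple check. The sketch below is a toy model, not SMS code: the constant and function names are invented for illustration, and only the meaning of REF_SEL=1 and REF_SEL=0 comes from the text.

```python
LEADER = 1    # REF_SEL=1: VCXO driven by the fixed voltage reference
FOLLOWER = 0  # REF_SEL=0: VCXO driven by the comparison with the other SC's clock


def phase_lock_guaranteed(ref_sel_a: int, ref_sel_b: int) -> bool:
    """True only when exactly one SC leads and the other follows."""
    for value in (ref_sel_a, ref_sel_b):
        if value not in (LEADER, FOLLOWER):
            raise ValueError(f"REF_SEL must be 0 or 1, got {value!r}")
    return ref_sel_a != ref_sel_b


print(phase_lock_guaranteed(LEADER, FOLLOWER))    # normal MAIN/SPARE setup
print(phase_lock_guaranteed(FOLLOWER, FOLLOWER))  # both SCs in follower mode
```

The first call models the normal MAIN/SPARE configuration; the second models both SCs in 'follower mode', where phase lock is not guaranteed.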

2) Failover Management Daemon (fomd) Background
===============================================

When fomd is started, its state is UNKNOWN and its initial task is to determine the role of the SC.

In order to determine its own role, fomd must determine the state of the opposite SC. This is accomplished by checking whether the opposite SC is responding to a Remote Procedure Call (RPC) via the I2 network and/or is generating a hardware heartbeat (by checking the interrupt status register in its SBBC).

If the opposite SC is running as MAIN, the starting fomd becomes SPARE.

If the opposite SC is not producing a heartbeat, the starting fomd will assume the main role. It will change its state to BECOMING_MAIN, will start generating a heartbeat and will instruct the Hardware Access Daemon (hwad) to change its state to MAIN.

When SMS is started simultaneously on both SCs, the check for a heartbeat will fail and both SCs will proceed to BECOMING_MAIN (and both SCs will instruct hwad to become MAIN).

To prevent a split brain (both SCs are in the MAIN role), SC0 will try again to determine the role of SC1 after entering the BECOMING_MAIN state.

If SC1 is not SPARE at that point, SC0 resets SC1 and instructs hwad to restart the MAIN initialization, so that anything the spare did before it was reset is undone.
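
The role selection described above can be modelled in a few lines. This is a simplified sketch, not the fomd implementation: the function names are invented, and the case of a peer that produces a heartbeat without being MAIN is not covered by the text, so the sketch simply stays UNKNOWN there (an assumption).

```python
from enum import Enum


class Role(Enum):
    UNKNOWN = "UNKNOWN"
    BECOMING_MAIN = "BECOMING_MAIN"
    MAIN = "MAIN"
    SPARE = "SPARE"


def role_at_startup(peer_role: Role, peer_heartbeat: bool) -> Role:
    """Role a starting fomd picks from what it observes of the opposite SC."""
    if peer_role is Role.MAIN:
        return Role.SPARE
    if not peer_heartbeat:
        return Role.BECOMING_MAIN
    # Not covered by the text: a heartbeat from a peer that is not MAIN.
    # Assumption for this sketch: stay UNKNOWN and check again later.
    return Role.UNKNOWN


def sc0_must_reset_sc1(sc1_role: Role) -> bool:
    """Split-brain check SC0 runs after entering BECOMING_MAIN."""
    return sc1_role is not Role.SPARE


# Simultaneous start: neither SC sees a heartbeat yet, so both go for MAIN.
sc0 = role_at_startup(Role.UNKNOWN, peer_heartbeat=False)
sc1 = role_at_startup(Role.UNKNOWN, peer_heartbeat=False)
print(sc0.value, sc1.value, sc0_must_reset_sc1(sc1))

# Staggered start: SC0 is already MAIN when SC1 starts, so SC1 becomes SPARE.
sc1 = role_at_startup(Role.MAIN, peer_heartbeat=True)
print(sc1.value, sc0_must_reset_sc1(sc1))
```

In the simultaneous case both SCs end up in BECOMING_MAIN and SC0 has to reset SC1; in the staggered case SC1 becomes SPARE and no reset is needed.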

3) HWAD Background
===================

When hwad is becoming MAIN, it first configures the SC in 'follower mode' (setting REF_SEL to 0), then it reads the device presence registers to decide which boards are present and registers these boards.

If the SC is able to register all boards and if the clocks are phase locked, the SC will program the Smart Phased Lock Loop (SPLL) on each board to select its clock as input.

4) Potential clock issue when SMS is started simultaneously on both SCs
========================================================================

When both SCs are started simultaneously, the check by fomd for a heartbeat will fail and both SCs will instruct hwad to become MAIN.
Both SCs will go into 'follower mode' and the phase lock is not guaranteed anymore.

Hwad will check that the clocks are phase locked before reprogramming the SPLLs on the boards, but it may take a couple of seconds to reprogram the SPLLs.

There is a small chance of losing the phase lock during the reprogramming of the SPLLs. In this case, if the SPLL was previously programmed to use the clock from the other SC, the domain using the reprogrammed board may dstop due to a clock issue.
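
To make the exposure concrete, the following sketch lists which boards are affected when one SC reprograms the SPLLs and the phase lock is lost in that window. It is a toy model of the rule stated above: the function name, the board names, and the SC labels are hypothetical.

```python
from typing import Dict, List


def boards_at_risk(spll_source: Dict[str, str], reprogramming_sc: str,
                   lock_lost: bool) -> List[str]:
    """Boards whose domains may dstop when reprogramming_sc rewrites the SPLLs.

    spll_source maps a board name to the SC whose clock its SPLL currently selects.
    """
    if not lock_lost:
        # Phase lock held for the whole reprogramming window: no exposure.
        return []
    # Only boards switched away from the other SC's clock are exposed.
    return sorted(board for board, sc in spll_source.items()
                  if sc != reprogramming_sc)


# Hypothetical platform: SC0 was MAIN, so every board selects SC0's clock.
current = {"SB0": "SC0", "SB1": "SC0", "IO0": "SC0"}
print(boards_at_risk(current, "SC1", lock_lost=True))
print(boards_at_risk(current, "SC0", lock_lost=True))
```

When the previous spare (SC1 here) does the reprogramming, every board is switched to a different clock source and is exposed; when the previous main does it, no board changes its clock source.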

5) Example
===========
SC0 is MAIN, SC1 is SPARE (hence all boards should use SC0's clock)
--------------------------------------------------------------------
May 21 20:45:47 2005 cmc5asffsc0 fomd[730]: [8576 739306450286 NOTICE FailoverMgr.cc 2354]
SC configured as Main
Jul 9 09:42:26 2005 cmc5asffsc0 fomd[20278]:
[8576 4194538199056810 NOTICE FailoverMgr.cc 3071] SC configured as Main

May 21 20:48:19 2005 cmc5asffsc1 fomd[725]: [8577 74116830583 NOTICE FailoverMgr.cc 3071]
SC configured as Spare
Jul 9 09:51:23 2005 cmc5asffsc1 fomd[735]:
[8577 455804777482 NOTICE FailoverMgr.cc 3074] SC configured as Spare

Around Jul 9 09:30, Failover is deactivated and SMS is stopped on both SCs
---------------------------------------------------------------------------
Jul 9 09:30:28 2005 cmc5asffsc0 fomd[730]: [8519 4193820587477449 NOTICE FailoverMgr.cc 2507]
Failover deactivated
Jul 9 09:30:57 2005 cmc5asffsc0 ssd[671]:
[1302 4193849783186025 WARNING SSDApp.cc 209] SMS soft shutdown, signaling all components:
signal=SIGTERM

Jul 9 09:31:00 2005 cmc5asffsc1 ssd[666]:
[1302 4193069493821787 WARNING SSDApp.cc 209] SMS soft shutdown, signaling all components:
signal=SIGTERM

At Jul 9 09:40, SMS is started at (nearly) the same time on both SCs.
----------------------------------------------------------------------
Jul 9 09:40:08 2005 cmc5asffsc0 ssd[20243]: [0 4194400749612002 NOTICE SSDWorkArea.cc 38]
SMS 1.4.1 start-up initiated
Jul 9 09:40:16 2005 cmc5asffsc1 ssd[11168]:
[0 4193625224077750 NOTICE SSDWorkArea.cc 38] SMS 1.4.1 start-up initiated
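
A near-simultaneous start like the one above can be spotted by comparing the "start-up initiated" timestamps of the two SCs. The sketch below is one possible way to do that, not an SMS tool: it assumes the wrapped layout shown in this document, where an entry begins with a timestamp and may continue on the following line, and the sample data is copied from the entries above.

```python
import re
from datetime import datetime

# An entry starts with "Mon D HH:MM:SS YYYY host daemon[pid]:".
ENTRY_START = re.compile(
    r"^(\w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4}) (\S+) (\w+)\[\d+\]:")


def join_entries(lines):
    """Re-join entries that were wrapped over several lines."""
    entries = []
    for line in lines:
        if ENTRY_START.match(line) or not entries:
            entries.append(line.strip())
        else:
            entries[-1] += " " + line.strip()
    return entries


def startup_times(lines):
    """Map each host to the time of its last 'start-up initiated' entry."""
    times = {}
    for entry in join_entries(lines):
        match = ENTRY_START.match(entry)
        if match and "start-up initiated" in entry:
            times[match.group(2)] = datetime.strptime(
                match.group(1), "%b %d %H:%M:%S %Y")
    return times


def startup_gap_seconds(lines):
    """Seconds between the earliest and latest SMS start-up, or None."""
    stamps = sorted(startup_times(lines).values())
    if len(stamps) < 2:
        return None
    return (stamps[-1] - stamps[0]).total_seconds()


SAMPLE = [
    "Jul 9 09:40:08 2005 cmc5asffsc0 ssd[20243]: [0 4194400749612002 NOTICE SSDWorkArea.cc 38]",
    "SMS 1.4.1 start-up initiated",
    "Jul 9 09:40:16 2005 cmc5asffsc1 ssd[11168]:",
    "[0 4193625224077750 NOTICE SSDWorkArea.cc 38] SMS 1.4.1 start-up initiated",
]
print(startup_gap_seconds(SAMPLE))  # 8.0
```

For the sample, the two SCs started SMS 8 seconds apart, which is well inside the period during which neither SC has finished determining its role. What gap counts as safe is not stated here; the recommendation is simply to wait until SMS has fully started on the main SC.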

SC0 failed to register all objects due to I2c read timeouts; hence SC0 will not try to reprogram
the SPLLs ... but SC1 had no issue and will reprogram the SPLLs to use SC1's clock.
--------------------------------------------------------------------------------
Jul 9 09:40:14 2005 cmc5asffsc0 fomd[20278]: [8599 4194405919439452 NOTICE FMHeartbeat.cc 217]
Checking for SC heartbeat interrupts (can take up to 30 seconds) ...
Jul 9 09:40:33 2005 cmc5asffsc0 hwad[20252]: [1123 4194425041843766 ERR I2cComm.cc 716]
I2c read time out bus: 0, address: 23
Jul 9 09:40:33 2005 cmc5asffsc0 hwad[20252]:
[1123 4194425839684386 ERR I2cComm.cc 716] I2c read time out bus: 1, address: 23
Jul 9 09:40:34 2005 cmc5asffsc0 hwad[20252]: [1123 4194426341118214 ERR I2cComm.cc 716]
I2c read time out bus: 2, address: 23
Jul 9 09:40:34 2005 cmc5asffsc0 hwad[20252]:
[1123 4194426842412128 ERR I2cComm.cc 716] I2c read time out bus: 3, address: 23
Jul 9 09:40:35 2005 cmc5asffsc0 hwad[20252]: [1123 4194427343821898 ERR I2cComm.cc 716]
I2c read time out bus: 4, address: 23
Jul 9 09:40:35 2005 cmc5asffsc0 hwad[20252]: [1123 4194427845350774 ERR I2cComm.cc 716]
I2c read time out bus: 5, address: 23
Jul 9 09:40:36 2005 cmc5asffsc0 hwad[20252]: [1123 4194428346558061 ERR I2cComm.cc 716]
I2c read time out bus: 6, address: 23
Jul 9 09:40:36 2005 cmc5asffsc0 hwad[20252]: [1123 4194428847724543 ERR I2cComm.cc 716]
I2c read time out bus: 7, address: 23
Jul 9 09:40:37 2005 cmc5asffsc0 hwad[20252]: [1123 4194429348887890 ERR I2cComm.cc 716]
I2c read time out bus: 8, address: 23
Jul 9 09:40:37 2005 cmc5asffsc0 hwad[20252]: [1123 4194429850168880 ERR I2cComm.cc 716]
I2c read time out bus: 9, address: 23
Jul 9 09:40:38 2005 cmc5asffsc0 hwad[20252]: [1123 4194430351300531 ERR I2cComm.cc 716]
I2c read time out bus: 10, address: 23
Jul 9 09:40:38 2005 cmc5asffsc0 hwad[20252]: [1123 4194430853306493 ERR I2cComm.cc 716]
I2c read time out bus: 11, address: 23
Jul 9 09:40:39 2005 cmc5asffsc0 hwad[20252]: [1123 4194431354660817 ERR I2cComm.cc 716]
I2c read time out bus: 12, address: 23
Jul 9 09:40:39 2005 cmc5asffsc0
Keywords: dstop, SMS, simultaneously, started, clock
Previously Published As
82352

Change History
Updated by the ESG Knowledge Content Team 4/2010