Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-72-1012044.1
Update Date:2011-05-26
Keywords:

Solution Type  Problem Resolution Sure

Solution  1012044.1 :   Sun Fire[TM] 12K/15K: Recognizing SMS split brain  


Related Items
  • Sun Fire 12K Server
  • Sun Fire 15K Server
Related Categories
  • GCS>Sun Microsystems>Servers>High-End Servers

PreviouslyPublishedAs
216503


Symptoms

- Problem statement:

Under rare circumstances SMS has been found to suffer a split brain condition. When this occurs both system controllers assume the Main role, and multiple (if not all) domains crash.

- Symptoms:

The split brain condition has only been observed when SMS is restarting on one of the SCs. Neither SC is able to detect the other, and both proceed to assume the Main role. The simplest way to find this condition is to run the SMS 'showfailover -r' command on both SCs. In a split brain condition, both SCs report the Main role, as shown below.

xc46-sc1:sms-svc:17> showfailover -r
MAIN
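
The comparison described above can be sketched in Python. This is only an illustration: it assumes the output of 'showfailover -r' has been captured by hand from each SC and passed in as plain strings, and the function name is_split_brain is hypothetical, not part of SMS.

```python
def is_split_brain(sc0_role, sc1_role):
    """Return True when both captured role strings read MAIN.

    sc0_role and sc1_role are assumed to be the text printed by
    'showfailover -r' on each SC, collected manually.
    """
    roles = [sc0_role.strip().upper(), sc1_role.strip().upper()]
    return roles == ["MAIN", "MAIN"]


if __name__ == "__main__":
    # Both SCs report MAIN, as in the example above.
    print(is_split_brain("MAIN\n", "MAIN\n"))
```

Any other combination of roles returns False, so a healthy Main/Spare pair is not flagged.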

It is also possible to see this condition by searching the platform message logs (/var/opt/SUNWSMS/adm/platform/messages) on each SC for the message "SC configured as Main". In the example messages below, both SCs assumed the Main role within about a minute and a half of each other (02:17:03 and 02:18:36), and neither reports that it is "configured as Spare".

SC0---->
Nov 4 02:18:36 2002 ssscpsfp-sc1 fomd[464]: [8576 2729512887571282 NOTICE FailoverMgr.cc 2103] SC configured as Main
SC1---->
Nov 4 02:17:03 2002 ssscpsfp-sc2 fomd[476]: [8576 181482868172 NOTICE FailoverMgr.cc 2885] SC configured as Main
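
The log search described above could be automated along the following lines. This is a minimal sketch, assuming the platform message logs from both SCs have been copied to a machine with Python and that each line follows the layout of the examples above (month, day, time, year, hostname). The function names and the 300-second window are assumptions chosen for illustration, not SMS values. A match is only a hint that should be confirmed with 'showfailover -r'.

```python
from datetime import datetime

MAIN_MARKER = "SC configured as Main"


def main_role_events(log_lines):
    """Return (timestamp, hostname) for each 'SC configured as Main' line."""
    events = []
    for line in log_lines:
        if MAIN_MARKER not in line:
            continue
        fields = line.split()
        # Fields 0-3 are month, day, time, and year; field 4 is the hostname.
        stamp = datetime.strptime(" ".join(fields[:4]), "%b %d %H:%M:%S %Y")
        events.append((stamp, fields[4]))
    return events


def possible_split_brain(sc0_lines, sc1_lines, window_seconds=300):
    """True if both SCs logged the Main role within window_seconds of each other."""
    sc1_events = main_role_events(sc1_lines)
    for t0, _ in main_role_events(sc0_lines):
        for t1, _ in sc1_events:
            if abs((t0 - t1).total_seconds()) <= window_seconds:
                return True
    return False


if __name__ == "__main__":
    sc0 = ["Nov 4 02:18:36 2002 ssscpsfp-sc1 fomd[464]: [8576 2729512887571282 "
           "NOTICE FailoverMgr.cc 2103] SC configured as Main"]
    sc1 = ["Nov 4 02:17:03 2002 ssscpsfp-sc2 fomd[476]: [8576 181482868172 "
           "NOTICE FailoverMgr.cc 2885] SC configured as Main"]
    print(possible_split_brain(sc0, sc1))
```

The two example messages are 93 seconds apart, so they fall inside the assumed window.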

Other messages that may lead up to the assumption of the Main role include:

Nov 4 02:09:13 2002 ssscpsfp-sc1 fomd[464]: [8599 2728949610724982 NOTICE FMHeartbeat.cc 232] Checking for SC heartbeat interrupts (can take up to 15 seconds) ...
Nov 4 02:09:28 2002 ssscpsfp-sc1 fomd[464]: [8582 2728964770234452 NOTICE FMHealthMonitor.cc 184] Not detecting remote SC's heartbeat interrupts

Subsequent messages typically show numerous console bus failures, because both SCs are trying to access and control the same hardware at the same time. The messages might appear as follows:

Nov 4 02:18:36 2002 ssscpsfp-sc1 hwad[435]: [1174 2729513002562554 ERR PciComm.cc 232] console bus device failed to respond correctly at address 213052da
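
To gauge how many console bus failures were logged, the messages can be scanned for the text shown above. The sketch below is hypothetical: it assumes the log lines have already been read into a list, and it matches only the wording of the example message. The function name console_bus_failures is an illustration, not an SMS tool.

```python
import re

# Matches the wording of the example hwad message and captures the address.
CONSOLE_BUS_RE = re.compile(
    r"console bus device failed to respond correctly at address ([0-9a-fA-F]+)"
)


def console_bus_failures(log_lines):
    """Return the addresses named in console bus failure messages, in log order."""
    addresses = []
    for line in log_lines:
        match = CONSOLE_BUS_RE.search(line)
        if match:
            addresses.append(match.group(1))
    return addresses


if __name__ == "__main__":
    sample = ["Nov 4 02:18:36 2002 ssscpsfp-sc1 hwad[435]: [1174 2729513002562554 "
              "ERR PciComm.cc 232] console bus device failed to respond correctly "
              "at address 213052da"]
    print(len(console_bus_failures(sample)))
```

A large count shortly after both SCs log "SC configured as Main" is consistent with the symptoms described here, but it does not confirm a split brain on its own.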
Internal Resolution
There have been various split brain bugs in several versions of SMS.
See the PTS website for a list of the SF12K/15K/E20K/E25K suggested patches.
http://panacea.uk.oracle.com/twiki/bin/view/Products/Last_ProdPatchesFirmwareStarcat
Internal Comments

Previously Published As
50767
Product_uuid
29e4659c-0a18-11d6-9fa1-e67bbc033df8|Sun Fire 15K Server
077fd4c5-df8f-4320-ad69-7d01603a674d|Sun Fire 12K Server

Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.