Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-72-1020429.1
Update Date:2010-10-19
Keywords:

Solution Type  Problem Resolution Sure

Solution  1020429.1 :   Sun SPARC(R) Enterprise Mx000 (OPL) Servers: Domain hang-up detected (POST)  


Related Items
  • Sun SPARC Enterprise M5000 Server
  • Sun SPARC Enterprise M3000 Server
  • Sun SPARC Enterprise M9000-64 Server
  • Sun SPARC Enterprise M9000-32 Server
  • Sun SPARC Enterprise M8000 Server
  • Sun SPARC Enterprise M4000 Server
Related Categories
  • GCS>Sun Microsystems>Servers>OPL Servers

PreviouslyPublishedAs
258248


Applies to:

Sun SPARC Enterprise M8000 Server
Sun SPARC Enterprise M4000 Server - Version: Not Applicable to Not Applicable   [Release: NA to NA]
Sun SPARC Enterprise M3000 Server - Version: Not Applicable and later    [Release: NA and later]
Sun SPARC Enterprise M9000-32 Server - Version: Not Applicable and later    [Release: NA and later]
Sun SPARC Enterprise M9000-64 Server - Version: Not Applicable and later    [Release: NA and later]
All Platforms

Symptoms

The domain fails to start during poweron, with the following signatures:


- the 'poweron' sequence stops with no further indication

- error logs :

XSCF> showlogs error -v

Date: May 06 17:35:12 JST 2009     Code: 6000c000-fcff0000-0109000600000000
    Status: Warning                Occurred: May 06 17:35:11.625 JST 2009
    FRU: /DOMAIN#0
    Msg: Domain hang-up detected (POST), DID 0
    Diagnostic Code:
        00000000 00000000 00000000
        00ffffff 00000002 00000000 02000000
        00000000 00000000 00000000 00000000
    UUID: 5b4c24e1-0a77-488c-a85e-b484df2f8fcd MSG-ID: SCF-8005-RA


- FMA logs :

XSCF> fmdump -v -u 5b4c24e1-0a77-488c-a85e-b484df2f8fcd

May 06 17:35:12.1625 ereport.chassis.domain.post.fe-start-err

May 06 17:35:12.2048 5b4c24e1-0a77-488c-a85e-b484df2f8fcd SCF-8005-RA
  100%  upset.chassis.domain.post

        Problem in: hc:///chassis=0/domain=0
           Affects: -
               FRU: hc://product-id=SPARC Enterprise M5000,chassis-id=BCF082206P,server-id=M5000/component=CHASSIS


- showstatus output :

No failures found in System Initialization.


- the XSB is reported as "Testing" in the showboards output :

XSCF> showboards -v -a
XSB  R DID(LSB) Assignment  Pwr  Conn Conf Test    Fault    COD
---- - -------- ----------- ---- ---- ---- ------- -------- ----
00-0  00(00)   Assigned    y    n    n    Testing Normal   n  
01-0  01(00)   Assigned    y    n    n    Unknown Normal   n  



This problem is mainly reported when the domain is composed of a single XSB, in either uni or quad mode.
In a multi-XSB domain configuration (multiple uni-XSBs or quad-XSBs), the chances that at least one CPU is able to start are higher. That CPU will detect the problem and the XSCF will report the error via FMA.
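
The showboards signature above can be checked mechanically. The following is a minimal Python sketch, not a Sun-provided tool: it assumes the showboards output has been captured as text in the column layout shown in this document, and the names parse_showboards and stuck_single_xsb_domains are hypothetical. It lists XSBs that are the only board of their domain and are still reported as "Testing". Rows in any other layout are silently skipped.

```python
import re
from collections import defaultdict

# One data row of "showboards -v -a", in the layout shown in this document.
# The "R" column is either "*" or blank.
ROW = re.compile(
    r"^(?P<xsb>\d{2}-\d)\s+(?P<r>\*)?\s*"
    r"(?P<did>\d{2})\((?P<lsb>\d{2})\)\s+(?P<assign>\S+)\s+"
    r"(?P<pwr>\S)\s+(?P<conn>\S)\s+(?P<conf>\S)\s+"
    r"(?P<test>\S+)\s+(?P<fault>\S+)\s+(?P<cod>\S)\s*$"
)

def parse_showboards(text):
    """Return one dict per XSB row; header and separator lines are ignored."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            rows.append(match.groupdict())
    return rows

def stuck_single_xsb_domains(rows):
    """Return XSBs that are alone in their domain and still in 'Testing'."""
    by_domain = defaultdict(list)
    for row in rows:
        by_domain[row["did"]].append(row)
    return [
        boards[0]["xsb"]
        for boards in by_domain.values()
        if len(boards) == 1 and boards[0]["test"] == "Testing"
    ]

SAMPLE = """\
XSB  R DID(LSB) Assignment  Pwr  Conn Conf Test    Fault    COD
---- - -------- ----------- ---- ---- ---- ------- -------- ----
00-0  00(00)   Assigned    y    n    n    Testing Normal   n
01-0  01(00)   Assigned    y    n    n    Unknown Normal   n
"""

rows = parse_showboards(SAMPLE)
print(stuck_single_xsb_domains(rows))  # ['00-0']
```

A board that stays in "Testing" is only a hint. The error log and FMA output remain the authoritative signatures.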

Cause

Even when no log clearly indicts hardware, this situation is most likely due to a bad MBU or CMU. However, the issue can also be caused by a firmware problem resulting from an interrupted firmware upgrade.

This error can have several causes:
- the CPU failed to fetch the POST code,
- the POST image is corrupted, or is the wrong version for the XCP revision that is active,
- the active FMEM bank is empty.

Solution

1.  Rule out known firmware issues that can cause this error.
There are two scenarios to check for. The first is that the POST version in use differs from the one that is active on the XSCF. To check for this, compare the outputs of "version -c cmu -v" and "version -c xcp -v".
Here is an example in which the POST/OBP version on the CMU/MBU does not match the version that is active on the XSCF:

XSCF> version -c cmu -v
DomainID  0: 02.03.0000
DomainID 1: 02.03.0000
DomainID 2: 02.03.0000
DomainID 3: 02.03.0000
XSB#00-0: 02.03.0000(Current) 01.30.0000(Reserve) <==========HERE
XSB#00-1: 55.55.5535(Current) 55.55.5535(Reserve)
XSB#00-2: 55.55.5535(Current) 55.55.5535(Reserve)
XSB#00-3: 55.55.5535(Current) 55.55.5535(Reserve)
XSB#01-0: 55.55.5535(Current) 01.30.0000(Reserve)
XSB#01-1: 55.55.5535(Current) 55.55.5535(Reserve)
XSB#01-2: 55.55.5535(Current) 55.55.5535(Reserve)
XSB#01-3: 55.55.5535(Current) 55.55.5535(Reserve)
XSCF> version -c xcp -v

XSCF#0 (Active )
XCP0 (Current): 1060 <==========HERE
OpenBoot PROM : 01.30.0000 <==========HERE
XSCF : 01.06.0001
XCP1 (Reserve): 1060
OpenBoot PROM : 01.30.0000
XSCF : 01.06.0001
OpenBoot PROM BACKUP
#0: 01.30.0000
#1: 02.03.0000
XSCF#0 (Active )
01.06.0001(Current) 01.06.0001(Reserve)
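
The comparison above can be sketched in Python as follows. This is an illustrative sketch only: it assumes both command outputs have been saved as text in the layout shown in this document, and the function names (xsb_banks, active_obp_version, mismatched_xsbs) are hypothetical. Blank banks (55.55.5535) are ignored here because they belong to the second scenario.

```python
import re

BLANK_BANK = "55.55.5535"  # value shown for an empty FMEM bank

XSB_LINE = re.compile(r"XSB#(\d{2}-\d):\s*(\S+)\(Current\)\s+(\S+)\(Reserve\)")

def xsb_banks(cmu_text):
    """Map each XSB to its (current, reserve) POST/OBP versions."""
    return {m.group(1): (m.group(2), m.group(3)) for m in XSB_LINE.finditer(cmu_text)}

def active_obp_version(xcp_text):
    """Return the OpenBoot PROM version listed under the XCP '(Current)' entry."""
    in_current = False
    for line in xcp_text.splitlines():
        if re.search(r"XCP\d\s*\(Current\)", line):
            in_current = True
        elif in_current:
            m = re.search(r"OpenBoot PROM\s*:\s*(\d{2}\.\d{2}\.\d{4})", line)
            if m:
                return m.group(1)
    return None

def mismatched_xsbs(cmu_text, xcp_text):
    """Return XSBs whose non-blank Current version differs from the active XCP."""
    expected = active_obp_version(xcp_text)
    if expected is None:
        raise ValueError("no OpenBoot PROM version found for the current XCP")
    return {
        xsb: current
        for xsb, (current, _reserve) in xsb_banks(cmu_text).items()
        if current != BLANK_BANK and current != expected
    }

CMU_SAMPLE = """\
XSB#00-0: 02.03.0000(Current) 01.30.0000(Reserve)
XSB#00-1: 55.55.5535(Current) 55.55.5535(Reserve)
XSB#01-0: 55.55.5535(Current) 01.30.0000(Reserve)
"""

XCP_SAMPLE = """\
XSCF#0 (Active )
XCP0 (Current): 1060
OpenBoot PROM : 01.30.0000
XSCF : 01.06.0001
XCP1 (Reserve): 1060
OpenBoot PROM : 01.30.0000
"""

print(mismatched_xsbs(CMU_SAMPLE, XCP_SAMPLE))  # {'00-0': '02.03.0000'}
```
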

The second scenario is that the XSCF has the active FMEM on the CMU/MBU set to a bank that is empty (blank). To check for this, review the "version -c cmu -v" output carefully.

Here is an example of a system that has the active (Current) FMEM set to an empty (blank) bank:

XSCF> version -c cmu -v

DomainID  0: 02.07.0000
DomainID  1: 02.07.0000
DomainID  2: 02.07.0000
DomainID  3: 02.07.0000
XSB#00-0:  55.55.5535(Current)     02.07.0000(Reserve)
XSB#00-1:  55.55.5535(Current)     02.08.0000(Reserve)
XSB#00-2:  55.55.5535(Current)     02.08.0000(Reserve)
XSB#00-3:  55.55.5535(Current)     02.08.0000(Reserve)
XSB#01-0:  55.55.5535(Current)     02.08.0000(Reserve)
XSB#01-1:  55.55.5535(Current)     02.08.0000(Reserve)
XSB#01-2:  55.55.5535(Current)     02.08.0000(Reserve)
XSB#01-3:  55.55.5535(Current)     02.08.0000(Reserve)

NOTE: The value "55.55.5535" represents an empty FMEM bank. It is normal for unused XSBs on the system to have this value. It is NOT normal to have this value as the Current (active) FMEM for an XSB that is configured on the system.
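
As a rough illustration of the rule in this note, the Python sketch below flags configured XSBs whose Current bank is blank. It assumes the "version -c cmu -v" output is available as text in the layout shown above, and that the set of configured XSBs is supplied by the caller (for example, taken from showboards). The function name blank_current_banks is hypothetical.

```python
import re

BLANK_BANK = "55.55.5535"  # value this document gives for an empty FMEM bank

XSB_LINE = re.compile(r"XSB#(\d{2}-\d):\s*(\S+)\(Current\)\s+(\S+)\(Reserve\)")

def blank_current_banks(cmu_text, configured_xsbs):
    """Return configured XSBs whose Current (active) FMEM bank is blank.

    configured_xsbs is an assumption of this sketch: the caller decides
    which XSBs are actually configured on the system.
    """
    flagged = []
    for match in XSB_LINE.finditer(cmu_text):
        xsb, current = match.group(1), match.group(2)
        if xsb in configured_xsbs and current == BLANK_BANK:
            flagged.append(xsb)
    return flagged

CMU_SAMPLE = """\
XSB#00-0:  55.55.5535(Current)     02.07.0000(Reserve)
XSB#00-1:  55.55.5535(Current)     02.08.0000(Reserve)
XSB#01-0:  55.55.5535(Current)     02.08.0000(Reserve)
"""

# Only XSB 00-0 is treated as configured in this illustration.
print(blank_current_banks(CMU_SAMPLE, {"00-0"}))  # ['00-0']
```

Unused XSBs are deliberately not reported, since a blank bank is normal for them.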

Solution if either symptom matches your system:

If either of these symptoms is present, upgrade the firmware via "flashupdate" to resolve the issue. This not only causes the bank switch needed to resolve the second scenario, but also resolves the first scenario by loading new POST/OBP into the FMEM on the CMU/MBU.

Note:
Forcing a flashupdate to the same XCP revision will not provide any relief, because it does not cause a bank switch. You must move to a newer revision than the one currently active on the XSCF.

2.  If step 1 shows that neither of these symptoms is present on the system, determine where the HW issue resides.

2a. Run 'testsb' against the suspect PSB.
2b. If testsb reports a problem, the MBU/CMU reported as faulty will need to be replaced.
Example of a failing testsb in this context:

XSCF> testsb 1
Initial diagnosis is about to start, Continue?[y|n] :y
SB#01 power on sequence started.
  0end
Initial diagnosis started. [1800sec]
  0..... 30..... 60..... 90.....120.....150.....180.....210...end
Hardware error occurred by initial diagnosis.
SB power off sequence started. [1200sec]
  0end
SB powered off.
XSB  Test    Fault
---- ------- --------
01-0 Failed  Normal
A hardware error occurred. Please check the error log for details.

If testsb fails, the MBU/CMU is reported as Degraded in the 'showstatus' output.
An HW problem is reported with the following FMA signature: ereport.chassis.domain.post.fe-rti1-tmo
and with the following message in the error logs: Hang-up at testing PSB

Example:

XSCF> showlogs error -v

Date: Mar 05 16:16:25 UTC 2009     Code: 60000000-96020000-0109401100000000
    Status: Warning                Occurred: Mar 05 16:16:23.335 UTC 2009
    FRU: /MBU_B
    Msg: Hang-up at testing PSB#01
    Diagnostic Code:
        00010000 00000000 00000000
        0100ffff 00000002 00000000 01000000
        00000000 00000000 00000000 00000000
    UUID: 83d8b386-58fe-420a-ba2f-222862fb2ce5 MSG-ID: SCF-8007-UQ

And FMA is able to indict HW :

XSCF> fmdump -v -u 83d8b386-58fe-420a-ba2f-222862fb2ce5

Mar 05 16:16:25.5401 83d8b386-58fe-420a-ba2f-222862fb2ce5 SCF-8007-UQ
  100%  fault.chassis.domain.post

        Problem in: hc:///chassis=0/cmu=1/xsb=0
           Affects: hc:///chassis=0/cmu=1/xsb=0
               FRU: hc://product-id=SPARC Enterprise M5000,chassis-id=BCF084901Q,server-id=aphrodite-sc/:serial=BC08340597:part=CF00541-0478 07   /541-0478-07:revision=0201/component=/MBU_B
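
The difference between the two FMA outputs in this document (an "upset" class pointing at CHASSIS versus a "fault" class pointing at a specific FRU) can be summarized with a small Python sketch. It assumes "fmdump -v -u" output captured as text. Treating a class that starts with "fault." as "hardware indicted" is an assumption drawn only from the two examples here, and summarize_fmdump is a hypothetical helper name.

```python
import re

# "100%  fault.chassis.domain.post" style line: certainty and problem class.
CLASS_LINE = re.compile(r"^\s*(\d+)%\s+(\S+)", re.MULTILINE)
# Trailing "component=..." part of the FRU line, with an optional leading slash.
COMPONENT = re.compile(r"component=/?(\S+)")

def summarize_fmdump(text):
    """Extract certainty, problem class and FRU component from fmdump -v text."""
    cls = CLASS_LINE.search(text)
    if cls is None:
        return None
    comp = COMPONENT.search(text)
    problem_class = cls.group(2)
    return {
        "certainty": int(cls.group(1)),
        "class": problem_class,
        "component": comp.group(1) if comp else None,
        # Assumption of this sketch: only "fault." classes indict hardware.
        "hardware_indicted": problem_class.startswith("fault."),
    }

UPSET_SAMPLE = """\
May 06 17:35:12.2048 5b4c24e1-0a77-488c-a85e-b484df2f8fcd SCF-8005-RA
  100%  upset.chassis.domain.post
               FRU: hc://product-id=SPARC Enterprise M5000,chassis-id=BCF082206P,server-id=M5000/component=CHASSIS
"""

FAULT_SAMPLE = """\
Mar 05 16:16:25.5401 83d8b386-58fe-420a-ba2f-222862fb2ce5 SCF-8007-UQ
  100%  fault.chassis.domain.post
               FRU: hc://product-id=SPARC Enterprise M5000,chassis-id=BCF084901Q,server-id=aphrodite-sc/component=/MBU_B
"""

print(summarize_fmdump(UPSET_SAMPLE)["hardware_indicted"])  # False
print(summarize_fmdump(FAULT_SAMPLE)["component"])          # MBU_B
```
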


Workaround


In some cases, forcing a flashupdate to the latest XCP can work around this problem.

The reason is that the root cause may be a CPU that cannot fetch the POST code from one FMEM bank. Switching to the other bank (which happens during a flashupdate) may therefore provide relief. If it does, the domain can be brought up.

In that case, however, the flashupdate only masks the problem: it appears fixed because of the bank switch. A later flashupdate may cause the system to use the same FMEM bank again and lead back to the same failure.

If testsb fails, an MBU/CMU replacement must be considered.

If testsb does not report any HW problem, if the problem persists after the flashupdate, or if the behavior deviates from what is described here, please open a Service Request for further analysis.

Another workaround is to have multiple XSBs in the domain configuration.

Since the problem is more likely to occur in a single-XSB domain, creating a multi-XSB domain configuration may help to start the domain in degraded mode. This can be done by either:
- configuring the suspect PSB in quad mode and assigning all 4 quad-XSBs to the domain, or
- adding a uni-XSB to the domain with a lower LSB number than the one assigned to the suspect XSB.

Again, this is only a way to obtain more confirmation of an HW error, or a workaround to start the domain. If testsb fails or FMA produces a diagnosis, an HW replacement must be considered.

Example where XSB#01-0 was failing to start as LSB#00 in DID#01.
Configure PSB#01 in quad mode and assign the 4 quad-XSBs to the domain. The subsequent poweron may fail and indict some HW: FMA provides a diagnosis and some components are marked as degraded:

XSCF> showboards -v -a
XSB  R DID(LSB) Assignment  Pwr  Conn Conf Test    Fault    COD
---- - -------- ----------- ---- ---- ---- ------- -------- ----
00-0 * 00(00)   Assigned    y    n    n    Unknown Normal   n  
01-0 * 01(00)   Assigned    y    n    n    Testing Faulted  n  
01-1 * 01(01)   Assigned    y    n    n    Testing Faulted  n  
01-2 * 01(02)   Assigned    y    n    n    Testing Faulted  n  
01-3 * 01(03)   Assigned    y    n    n    Passed  Normal   n  

XSCF> showlogs error -v

May 11 13:12:56.8578 ereport.chassis.SPARC-Enterprise.asic.sc.fe-start-cpu-comm-err

Date: May 11 13:12:56 UTC 2009     Code: 60002000-99020000-02000e0100000000
    Status: Warning                Occurred: May 11 13:12:54.746 UTC 2009
    FRU: /MBU_B
    Msg: XSB deconfigured (not running)

TIME                 UUID                                 MSG-ID
May 11 13:12:56.9150 cd89bedf-0007-4662-a54a-1f2f2114a7a5 SCF-8005-CA
  100%  fault.chassis.SPARC-Enterprise.asic.sc.fe

        Problem in: hc:///chassis=0/cmu=1
           Affects: hc:///chassis=0/cmu=1
               FRU: hc://:product-id=SPARC Enterprise M5000 :chassis-id=BCF0715003:server-id=v4u-m5000a-xscf-gmp03:serial=BC08340597:part=CF00541-0478 07   \541-0478-07:revision=0201/component=/MBU_B
          Location: /MBU_B


XSCF> showstatus
*   MBU_B Status:Degraded;


It is now clear that the MBU must be replaced.

Note that in this example it is not possible to boot the domain, since no devices are attached to 01-3.

References

http://monaco.sfbay.sun.com/detail.jsf?cr=6722798

Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.