Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-72-1003436.1
Update Date:2009-09-27
Keywords:

Solution Type  Problem Resolution Sure

Solution  1003436.1 :   Sun Fire[TM] 12K/15K: Post failure on Domain Advanced Tests  


Related Items
  • Sun Fire 12K Server
  • Sun Fire 15K Server
Related Categories
  • GCS>Sun Microsystems>Servers>High-End Servers

PreviouslyPublishedAs
204819


Symptoms
During a high-level HPOST (level 64 or above) on a domain with 72 or more
processors, a 12K/15K CPU may fail HPOST with the following failure:
Proc SB0/P0 timed out on test Domain Advanced Tests id=0x6F. Test Failed.
FAIL Proc SB0/P0: test_seq_cwd(): failed out of config on timeout
Primary service FRU is Slot SB0.
Proc SB0/P1: Not Good and poll_busy
Proc SB0/P0: EpiDomAdvR1_sc_tfunc(): Master failed
Proc SB0/P1: EpiDomAdvR1_sc_tfunc(): Slave failed
Proc SB0/P0: clear_lpost_mastership(): Called for non-good CPU
Proc SB0/P0: summarize_test_state(): flags !(SC_CODE & LASTTEST): 0x0000
EpiDomAdvR1_sc_tfunc(): Failures occurred. Stage repeat required
Repeating all LPOST stages (1)...
Dstop/Recordstop/Timeout recovery (1); rerun starting at: cpu_lpost

HPOST then reruns the tests with the implicated CPU deconfigured, and the
rerun passes. As a result, HPOST marks the implicated CPU (in this case,
cpu0) as failed. If the failed CPU is cpu0, HPOST also "crunches" cpu1,
because the two share a data path.
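
To recognize this signature in a saved HPOST log, a short script can search for the timeout line and decode the test ID. The following is only a sketch: the function and type names are hypothetical, and the pattern assumes the log line has exactly the form shown in the excerpt above.

```python
import re
from typing import NamedTuple, Optional

# Matches lines of the form shown in the Symptoms excerpt, e.g.
#   Proc SB0/P0 timed out on test Domain Advanced Tests id=0x6F. Test Failed.
TIMEOUT_RE = re.compile(
    r"Proc (?P<board>SB\d+)/P(?P<proc>\d+) timed out on test "
    r"(?P<name>.+?) id=(?P<test_id>0x[0-9A-Fa-f]+)\."
)

DOMAIN_ADVANCED_TEST_ID = 0x6F  # 111 decimal, as printed in the excerpt


class PostTimeout(NamedTuple):
    board: str      # e.g. "SB0"
    proc: int       # processor index on the board
    test_name: str  # e.g. "Domain Advanced Tests"
    test_id: int    # numeric test ID


def find_domain_advanced_timeout(log_text: str) -> Optional[PostTimeout]:
    """Return the first timeout on test ID 0x6F found in log_text, or None."""
    for m in TIMEOUT_RE.finditer(log_text):
        hit = PostTimeout(m["board"], int(m["proc"]), m["name"],
                          int(m["test_id"], 16))
        if hit.test_id == DOMAIN_ADVANCED_TEST_ID:
            return hit
    return None


SAMPLE = """\
Proc SB0/P0 timed out on test Domain Advanced Tests id=0x6F. Test Failed.
FAIL Proc SB0/P0: test_seq_cwd(): failed out of config on timeout
Primary service FRU is Slot SB0.
"""
```

Calling find_domain_advanced_timeout(SAMPLE) returns the board, processor, test name, and numeric test ID for the first matching line. A match only shows that the log contains this timeout; it does not by itself distinguish the software bug from a hardware fault.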

NOTE: Similar HPOST behavior has been found on large Jaguar configurations with SMS 1.4.1. Apply the same workaround outlined in this document.



Resolution
Confirm the number of CPUs in the domain. If there are 72 or more, the
cause is most likely Bug ID 4818581.
The bug states the following:

"When running Hpost level 64 or above, the Domain Advanced Tests (stage
cpu_lpost_II, phase 6, test ID 111) execution time period exceeds the
maximum timeout allowed and the tests fail."

Bad hardware is an unlikely cause unless the workaround or the patch is
already in place. If bad hardware were the cause, the test would also be
expected to fail in domain configurations of fewer than 72 CPUs or at HPOST
levels below 64.
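
The triage rule described above can be written as a small predicate. This is an illustrative sketch only: the function name and parameters are hypothetical, and fix_applied stands for "the workaround or the HPOST patch is already in place".

```python
def likely_bug_4818581(cpu_count: int, hpost_level: int,
                       fix_applied: bool) -> bool:
    """Return True when the timeout most likely matches Bug ID 4818581.

    Per this document, the bug applies to domains with 72 or more CPUs
    running HPOST at level 64 or above, where neither the workaround nor
    the patch has been applied. Any other combination points toward a
    possible hardware problem and calls for the single-board test.
    """
    return cpu_count >= 72 and hpost_level >= 64 and not fix_applied
```

For example, likely_bug_4818581(72, 64, False) is True, while the same call with 71 CPUs, or with fix_applied=True, is False and suggests investigating the hardware.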

To check the hardware, place the board that failed this test in its own
single-board domain and rerun the high-level HPOST. A truly bad board
should not time out in a single-board configuration; it should produce a
hard failure instead.

This bug was fixed in SMS 1.3 HPOST patch 114608-02 and should be
integrated in SMS 1.4.

SMS 1.2 does not have the fix, so use the workaround below (although
encouraging an upgrade is preferred).



Relief/Workaround
Extend the test timeout period by adding the following to the
/var/opt/SUNWSMS/SMS/etc/platform/.postrc file:

poll_timeout_mult 4 # Bugid 4818581

Be aware that this setting increases all timeouts by a factor of four.
Every test therefore runs much longer before timing out, which increases
the possible run time of a high-level HPOST.
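
If the change has to be applied on more than one system controller, the edit can be scripted so that the directive is added only once. The sketch below is an assumption-laden illustration, not an SMS tool: the function name is hypothetical, and it assumes that '#' starts a comment in .postrc, as the trailing comment in the line above suggests. Pass it the .postrc path given above, and back the file up first.

```python
from pathlib import Path

DIRECTIVE = "poll_timeout_mult"


def ensure_poll_timeout_mult(postrc: Path, mult: int = 4) -> bool:
    """Append the directive unless an active one already exists.

    Returns True if the file was changed, False if it was left alone.
    """
    text = postrc.read_text() if postrc.exists() else ""
    for line in text.splitlines():
        # Ignore anything after '#', then look at the first word.
        words = line.split("#", 1)[0].split()
        if words and words[0] == DIRECTIVE:
            return False  # already set; do not add a duplicate
    if text and not text.endswith("\n"):
        text += "\n"  # keep the new directive on its own line
    postrc.write_text(text + f"{DIRECTIVE} {mult} # Bugid 4818581\n")
    return True
```

Running the function a second time on the same file leaves it unchanged, so it is safe to repeat. It does not modify an existing poll_timeout_mult line that carries a different value; review that case by hand.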



Product
Sun Fire 15K Server
Sun Fire 12K Server

Internal Comments
Related to this are several bugs:
4454842
4775888
4818581
4851017
6307312

Also see Troubleshooting Article <Document: 1004888.1> for another reason why this workaround may be applicable.
POST, HPOST, CPU, proc, 64, 96, 127, level, jaguar
Previously Published As
72263

Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.