
Asset ID: 1-71-1004845.1
Update Date: 2010-06-18
Keywords:

Solution Type: Technical Instruction

Solution 1004845.1: Sun Fire[TM] Midrange and High-End Servers: Discussion of Component Health Status (CHS)


Related Items
  • Sun Fire 6800 Server
  • Sun Fire 12K Server
  • Sun Fire 4800 Server
  • Sun Fire V1280 Server
  • Sun Fire 15K Server
Related Categories
  • GCS>Sun Microsystems>Servers>Midrange Servers
  • GCS>Sun Microsystems>Servers>High-End Servers

Previously Published As: 206790


Applies to:

Sun Fire 4800 Server
Sun Fire V1280 Server
Sun Fire 6800 Server
Sun Fire 12K Server
Sun Fire 15K Server
All Platforms

Goal

Description
This document describes the advantages of Component Health Status (CHS) on Sun Fire[TM] Midrange and High-End Servers, including a discussion of system behavior before and after CHS.

CHS is only available on the servers listed below:
  • Midrange: Sun Fire[TM] 3800, 4800, 4810, 6800, V1280, Netra 1280, E2900, E4900, & E6900
  • High-End: Sun Fire[TM] 12K, 15K, E20K, & E25K

CHS was first integrated in (a version-check sketch follows this list):

  • System Controller Application (ScApp) 5.15.0 on Sun Fire Midrange Servers.
    • 5.17.0 is the minimum firmware release for the E2900, so it is also the earliest CHS-capable firmware for that platform.
    • 5.16.0 is the minimum firmware release for the E4900/E6900, so it is also the earliest CHS-capable firmware for those platforms.
  • System Management Services (SMS) 1.4 on Sun Fire High-End Servers.
    • SMS 1.5 is the minimum SMS version for the E20K and E25K, so it is also the earliest CHS-capable SMS release for those platforms.
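
To confirm that a system meets these minimums, check the firmware version from the system controller. A minimal sketch for a Midrange SC (output abbreviated; the version strings shown are illustrative, not actual minimums):

6800-sc0:SC> showsc
...
ScApp version: 5.20.0
RTOS version:  46

On High-End servers, the active SMS version can be displayed with the smsversion utility.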

Solution

Steps to Follow

Philosophy of CHS: What does it do?

Auto-Diagnosis and Recovery (ADR or AVL) enhancements are features of ScApp and SMS that improve the platform's Availability: the ability of a system to recover and stay in production following hardware issues. The enhancements include:

  • Diagnostic Engines, which analyze error events for root cause.
  • Event Messaging, which reports the error events.
  • Recovery actions (such as a panic reboot, as well as steps to prevent repeat errors).

Component Health Status is the Auto-Diagnosis and Recovery mechanism used to prevent repeat failures. In very general terms, when the Diagnostic Engine determines that a hardware device is the root cause of an error, that component's "Health Status" is marked "faulty". POST has been enhanced to read the component health status at the very beginning of a POST cycle. Any components marked faulty are excluded from even being tested by POST, and are kept out of the domain configuration entirely, preventing any repeat failures from the same component problem (see the setkeyswitch example in the Examples section below).


What happened prior to CHS?

Before the CHS enhancements, if a component errored while a domain was up (running Solaris[TM]) and the domain rebooted (perhaps a panic reboot), the next POST cycle had to fail that component in order to prevent it from being configured back into the domain.

The problem with requiring POST to fail the component is that POST simply does not detect every fault event, every time, for every device. The reality is that many faulty components "pass" POST upon reboot or similar, and are allowed back into the domain configuration. In some cases, they error and panic again, severely limiting Availability.

Additionally, prior to CHS, a component that failed POST was "removed" from the configuration only until a new POST cycle was executed. If that component happened to "pass" (more accurately, "not fail") the next POST cycle, the component was configured back into the domain, where it could fail again. For these two reasons, the behavior was not perfect and could be improved to truly isolate the offending hardware. CHS was that improvement.


Is the improvement better?

CHS is absolutely an improvement. As stated above, before CHS the limitations were:

1) POST had to fail a device to isolate it from the configuration.

  • POST cannot find all faults all the time, because many faults are transient in nature.
  • Due to the transient nature of errors and the requirement to isolate only on POST faults, prevention of repeat errors was not guaranteed.

2) POST testing may not fail the device consistently.

  • For the same reason as in #1, the transient nature of errors, POST cannot guarantee that it will find every error each time it is executed.
  • If a POST failure is detected in one POST cycle, the implicated device is removed from the configuration only until the next POST cycle. If the next POST cycle does not fail the device, it is in the configuration again, where it may error again and cause a repeat domain failure.

3) The process to completely isolate a faulty device was manual.

  • If a device was to be completely isolated from a configuration to prevent repeat failures, the only option was to manually disable the device through the blacklist mechanism (the disablecomponent command; see the sketch after this list).
  • This means that ScApp and SMS are told to ignore the device because it is listed in a file that they use as a reference. If SB5 is defective and the admin wishes to isolate it, that device is entered in the blacklist file and POST ignores it from then on.
  • If the same board is moved from slot 5 to slot 4, POST will now include the device, because it was told to ignore slot 5, not slot 4. The device may then cause a repeat failure.
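
For illustration, the old manual workflow was roughly the following sketch (slot name illustrative); note that the blacklist entry belongs to the slot, not to the board itself:

6800-sc0:SC> disablecomponent sb5    <--- blacklists slot SB5
6800-sc0:SC> showcomponent sb5       <--- verify the slot now shows disabled
6800-sc0:SC> enablecomponent sb5     <--- later, manually re-enables slot SB5

If the defective board is then moved from SB5 to SB4, the blacklist entry stays behind on slot SB5, and the board is tested and allowed back into the configuration from SB4.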

CHS resolves all of these limitations. It takes immediate action to diagnose the fault, labels the root-cause suspect as faulty, and automatically prevents that device from future configuration in the domain. Because the status is recorded on the hardware itself (in the FRU's seeprom), even if the defective board moves to a different slot or server, it remains disabled by CHS. This increases availability because the error is prevented from repeating.


Safari Port Descriptor (SPD)

A data structure is created and initialized by POST for each Safari port (CPU/IOC). The structure is located in each SBBC's Boot Bus SRAM. Typically there are 2 SPDs per SBBC: a CPU board would have 4 SPDs, and an I/O board would have 2 SPDs. SPDs contain the following information:

1. Port Status Summary   {pass|fail|unknown|blacklisted}
2. Port Type             {cpu|pci|pcic|sbus}
3. Port Attributes       {rated frequency, actual frequency, cache size, version}
4. Bank Status           {pass|fail|unknown|unpopulated|config-error|spare|blacklisted}
5. DIMM JEDEC code       (for each of the 16 DIMMs)
6. Port Component Status {pass|fail|unknown|blacklisted} (for each component)
7. I/O Bus Status        {pass|fail|unknown|blacklisted}
8. I/O Card(s) Status    {pass|fail|unknown|blacklisted}
9. Checksum

The seeprom has 3 records in the Dynamic FRUID section (DFRUID) to support CHS:

  • StatusCurrent - provides the current state of the FRU component.
  • StatusEvents - maintains the history of status changes.
  • StatusProxy - represents the state of components that do not contain their own seeprom (that is, do not contain FRUID).

There are 3 ways to change the Status records:

  • Manually, with the setchs command.
  • By the AD (Auto-Diagnosis) engine on the SC.
  • By POST (Power-On Self-Test).

CHS commands:

  • showchs - displays CHS information
  • setchs - changes CHS information
NOTES: 
  • On Midrange Servers, setchs is available in "normal" mode in 5.20.15 and higher.
    It is only available in service mode in ScApp 5.20.14 or lower (and 5.21.x counts as LOWER).
    • See 1004879.1 for details on resetting the CHS status if needed.
  • On High-End Servers there is no "service mode", so no special password is required.
    • Instead, the CHS commands are "undocumented" (no man page or help page) and can be executed as user sms-svc. They are intended for use on Support Services recommendation only.

Tunables:

  • ad-chs-enabled - If true, the AD (Auto-Diagnosis) engine will record CHS information when an error occurs.
  • error-policy - Can be set to either display or diagnose. If set to diagnose (the default), the AD engine will diagnose the error.
  • post-chs-read - If true, POST will consult CHS; otherwise it will ignore it.
  • post-chs-write - If true, POST will update CHS if it finds an error; otherwise it will not commit the CHS changes.


CHS: Failure to record event.

When the CHS of a component is changed, two records in different segments of DFRUID have to be updated:

  • StatusCurrentR
  • StatusEventR

Because these two records live in different segments, it is possible for a power loss or a ScApp/SMS reboot to occur after the StatusCurrentR record is updated but before the StatusEventR record is updated. This is very rare, but if it does happen, the reason for the status change will not be captured. A quick way to check for this is sketched below.
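
If a lost event record is suspected, one simple check (a sketch using the showchs options shown later in this article; the component name is illustrative) is to compare the current status against the newest entry in the event history:

6800-sc0:SC> showchs -b           <--- current status of components
6800-sc0:SC> showchs -v -c sb2    <--- event history for one component

A current status that is not explained by the most recent history record suggests the StatusEventR update was lost.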


CHS's Three Component "states"

  • OK - The component, and related components, have no reported errors associated with them.
  • SUSPECT - The component is part of a suspect list of components that may have caused a fault.
  • FAULTY - The component has been identified as the sole culprit of a fault.

ScApp/SMS and POST can only "downgrade" the CHS of a component (mark faulty, suspect). A service engineer can manually "upgrade" or "downgrade" the CHS if necessary (mark ok, suspect, or faulty).

There is no relationship between seeprom-based CHS and ScApp/SMS-based blacklisting (enablecomponent/disablecomponent). A change in CHS will not modify the ScApp/SMS blacklists, and a change in the ScApp/SMS blacklist will not modify CHS, as the sketch below illustrates.
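
A short sketch of this independence (component name illustrative); each mechanism must be managed on its own:

6800-sc0:SC> disablecomponent sb2                 <--- updates only the ScApp blacklist
6800-sc0:SC> showchs -b                           <--- CHS still reports its own state
6800-sc0:SC> setchs -s OK -r "verified" -c sb2    <--- updates only CHS
6800-sc0:SC> enablecomponent sb2                  <--- the blacklist must be cleared separately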


Location of CHS Recording

CHS records are in the seeprom chips of the following devices:

  • System Boards
  • I/O Boards
  • Main Memory DIMMs
  • L2SRAM DIMMs
  • Repeater (RP)
  • Power Supply Unit (PSU)
  • Fan Trays

CPU chips do not have their own seeprom and hence do not have their own CHS. Rather, a "proxy record" is kept on the seeprom of the system board.

Fan trays and PSUs provide CHS records only so that the SSE can manually tag a unit as bad, as a note to the repair depot or other service personnel. The system itself ignores the CHS status of PSUs and fan trays. A tagging sketch follows.
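
A minimal tagging sketch (the fan tray name ft0 is illustrative; because the system ignores this status, the record serves only as a note that travels with the FRU):

6800-sc0:SC> setchs -s faulty -r "intermittent fan fault" -c ft0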

The CHS for RPs will only be set to "suspect" automatically by the system; it will never be set to "Faulty" automatically.


Examples

See 1004879.1 for details on resetting the CHS status on Midrange servers, including the setchs procedure both with and without a service password.

This is how to set the status of sb2 to faulty, with the reason "bad sb", on a Midrange server:

6800-sc0:SC> setchs -s faulty -r "bad sb" -c sb2

This is how to display CHS status on the platform:

6800-sc0:SC> showchs -b
Component        Status
---------------  --------
/N0/SB2          Faulty

It is important to remember that CHS information is recorded to a component's seeprom. As long as that component remains in the system, the history of CHS events remains with it as well. Anytime automatic or manual action is taken to change the component CHS status, this history is updated to reflect the "new" event that relates to the component. When/if a component is replaced, the history leaves with the component.

This is how to display a particular component's CHS history (notice that the latest event in this history is the manual "bad sb" event from the setchs command above):

6800-sc0:SC[service]> showchs -v -c sb2
Total # of records: 8
Component  : /N0/SB2
Time Stamp : Fri Jan 30 16:01:32 PST 2005
New Status : Faulty
Old Status : OK
Event Code : Other
Initiator  : Customer
Message    : bad sb

Component  : /N0/SB2
Time Stamp : Tue Jun 24 07:35:59 PDT 2004
New Status : OK
Old Status : Suspect
Event Code : Other
Initiator  : Customer
Message    : test

Component  : /N0/SB2
Time Stamp : Tue May 27 16:36:40 PDT 2004
New Status : Suspect
Old Status : Suspect
Event Code : None
Initiator  : ScApp
Message    : 1.SF6800.FAULT.ASIC.AR.ADR_PERR.10441008.16-0.2.5014953000619

Component  : /N0/SB2
Time Stamp : Tue May 27 15:20:35 PDT 2004
New Status : Suspect
Old Status : Suspect
Event Code : None
Initiator  : ScApp
Message    : 1.SF6800.FAULT.ASIC.AR.CMDV_SYNC_ERR.102410bf.16-0.5.5014953000619.5014404012558.5014404028276.5014362004314
...

Internal Only NOTE:
The last two CHS history records are from the Auto-Diagnosis engine. The message is an event code (SF6800.FAULT.ASIC.xxxxxxxx). This code represents a particular failure type and can be looked up in the Sun System Handbook Auto-Diagnosis and Recovery Fault Tables.

As a result of CHS disabling a component, setkeyswitch operations will report:

6800-sc0:A> setkeyswitch on
Powering boards on ...
Jan 30 16:09:49 6800-sc0 Domain-A.POST: Agent {/N0/SB2/P0} is CHS disabled.
Jan 30 16:09:50 6800-sc0 Domain-A.POST: Agent {/N0/SB2/P1} is CHS disabled.
Jan 30 16:09:51 6800-sc0 Domain-A.POST: Agent {/N0/SB2/P2} is CHS disabled.
Jan 30 16:09:52 6800-sc0 Domain-A.POST: Agent {/N0/SB2/P3} is CHS disabled.
Jan 30 16:09:54 6800-sc0 Domain-A.SC: Excluded unusable, unlicensed, failed or disabled board: /N0/SB2
Testing CPU Boards ...
{/N0/SB0/P0} Running CPU POR and Set Clocks
{/N0/SB0/P1} Running CPU POR and Set Clocks
...

The ScApp command showcomponent also shows the CHS status of the component (all subcomponents on the board are marked disabled as well):

6800-sc0:A> showcomponent sb2
Component         Status    Pending  POST  Description
---------         ------    -------  ----  -----------
/N0/SB2/P0        disabled  -        chs   UltraSPARC-III+, 900MHz, 8M ECache
/N0/SB2/P1        disabled  -        chs   UltraSPARC-III+, 900MHz, 8M ECache
/N0/SB2/P2        disabled  -        chs   UltraSPARC-III+, 900MHz, 8M ECache
/N0/SB2/P3        disabled  -        chs   UltraSPARC-III+, 900MHz, 8M ECache
/N0/SB2/P0/B0/L0  disabled  -        chs   empty
/N0/SB2/P0/B0/L2  disabled  -        chs   empty
/N0/SB2/P0/B1/L1  disabled  -        chs   empty
/N0/SB2/P0/B1/L3  disabled  -        chs   empty
/N0/SB2/P1/B0/L0  disabled  -        chs   empty
/N0/SB2/P1/B0/L2  disabled  -        chs   empty
/N0/SB2/P1/B1/L1  disabled  -        chs   empty
/N0/SB2/P1/B1/L3  disabled  -        chs   empty
/N0/SB2/P2/B0/L0  disabled  -        chs   empty
/N0/SB2/P2/B0/L2  disabled  -        chs   empty
/N0/SB2/P2/B1/L1  disabled  -        chs   empty
/N0/SB2/P2/B1/L3  disabled  -        chs   empty
/N0/SB2/P3/B0/L0  disabled  -        chs   empty
/N0/SB2/P3/B0/L2  disabled  -        chs   empty
/N0/SB2/P3/B1/L1  disabled  -        chs   empty
/N0/SB2/P3/B1/L3  disabled  -        chs   empty

This example shows how to clear CHS after replacing the bad component with a good one (again, see 1004879.1 for the full process):

sc1:SC> setchs -s OK -r "good" -c SB3/p0   <--- Clears CHS
sc1:SC> showchs -v -c SB3
Total # of records: 3
Component  : /N0/SB3/p0
Time Stamp : Mon Jan 19 17:09:41 PST 2004
New Status : OK
Old Status : Faulty
Event Code : None
Initiator  : fieldeng
Message    : good

Component  : /N0/SB3/p0
Time Stamp : Fri Jan 16 14:32:14 PST 2004
New Status : Faulty
Old Status : Faulty
Event Code : None
Initiator  : POST
Message    : 1.SF6800.FAULT.POST.LPOST.61--.15-2.1
...


Attachments
This solution has no attachment