Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-75-1321328.1
Update Date: 2011-05-25
Keywords:

Solution Type: Troubleshooting Sure

Solution  1321328.1 :   Sun Enterprise[TM] 10000: Troubleshooting Arbstop Dumps  


Related Items
  • Sun Enterprise 10000 Server
Related Categories
  • GCS>Sun Microsystems>Servers>High-End Servers




In this Document
  Purpose
  Last Review Date
  Instructions for the Reader
  Troubleshooting Details


Applies to:

Sun Enterprise 10000 Server - Version: Not Applicable to Not Applicable - Release: N/A to N/A
Information in this document applies to any platform.

Purpose

This document contains troubleshooting steps for various types of Arbstop Dumps.

Last Review Date

May 11, 2011

Instructions for the Reader

A Troubleshooting Guide is provided to assist in debugging a specific issue. When possible, diagnostic tools are included in the document to assist in troubleshooting.

Troubleshooting Details

wfail reports Coherent error processor X (example Arbstop)

wfail reports Illegal Coherent condition/access proc X

From the wfail output, we see something like the following:

redxl> wfail
(output omitted)
CIC 6.0 ErrFlags[61:0] = 01100000 00000000 (after mask)
ErrFlag[52]: Coherent error processor 0
ErrFlag[56]: Illegal Coherent condition/access proc 0
FAIL Proc 6.0 in all configs using CIC0: : Arbstop detected by cic
(output omitted)
Check the message files for voltage-related messages around the time of the failure, such as:
Warning: Voltage readings have exceeded the thresholds on system board X
These failure conditions have been seen in relation to power-puck failures on system boards. If such voltage warnings are present, replace the system board.
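A quick way to scan for these warnings is to grep the SSP message logs. This is a hedged sketch only: the paths assume the SSP keeps its platform and per-domain message files under /var/opt/SUNWssp/adm, so verify the locations on the SSP in question.

ssp% grep -i voltage /var/opt/SUNWssp/adm/messages*
ssp% grep -i voltage /var/opt/SUNWssp/adm/<domain_name>/messages*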

wfail reports DTag Parity Error (example Arbstop)

From the wfail output, we see something like the following:

redxl> wfail
(output omitted)
CIC 1.2 ErrFlags[61:0] = 00000000 00000100 (after mask)
ErrFlag[11:0]: DTag Parity Error[11:0] = 100 = 0 4 0 0
FAIL Proc 1.2 in all configs using CIC2: : Arbstop detected by cic
(*** NOTE: Implicated FRU is sysboard 1)
(output omitted)
The system board is the component with the error (indicated by the "Implicated FRU is sysboard" message). A processor is failed because the DTags are divided into four sections, one section for each processor. hpost knows which section of the DTag failed and thus deconfigures only the processor that would use that section.

wfail reports PC CSR address error (example Arbstop)

From the wfail output, we see something like the following:

redxl> wfail
(output omitted)
PC 0.1 ErrFlag[1:0][31:0] = 00800000 00000000 (after mask)
ErrFlag1[23]: PC CSR address error
ErrorData [54:0] = block,write,addr[40:4],bytemask[15:0]
nonblock read, PA[40:0] = 1FF F4000190, bytemask = 0F00
FAIL PC 0.1: Arbstop detected by pc.
(output omitted)
The bad component is generally the PC on the system board, so the implicated FRU is the system board. However, there have been cases of improperly torqued CPU modules causing these errors.

The best course of action:

  • Check the torque on the CPUs
  • Run an hpost -l32 on the hardware
  • If both CPUs attached to the PC fail, or hpost does not fail at all, replace the system board
  • If one CPU fails (assuming both are present), move the failing CPU, run hpost again, and see whether the problem follows the CPU

The above assumes a maintenance window is available in which to execute tests. A sketch of the hpost invocation follows.
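For reference, a hedged sketch of running the level-32 test from the SSP against the affected domain. The domain name is illustrative, and the sequence assumes the usual SSP practice of selecting the target domain with domain_switch (which sets SUNW_HOSTNAME) before invoking hpost; verify against the SSP documentation for the installed release.

ssp% domain_switch <domain_name>
ssp% hpost -l32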

wfail reports PLL Lock Error (example Arbstop)

From the wfail output, we see something like the following:

redxl> wfail
(output omitted)
PC 0.1 ErrFlag[1:0][31:0] = 00400000 00000000 (after mask)
ErrFlag1[22]: PLL Lock Error
FAIL PC 0.1: Arbstop detected by pc.
MC 0 ErrFlags[47:0] = 8000 01010000
ErrFlag[16]: PUP0 Internal error
ErrFlag[24]: PLL Lock Error
ErrFlag[47]: Repeated errors of the same type occurred
(output omitted)
Most likely, multiple ASICs will be reporting this error. Additional errors may also be reported.

PLL errors involve an interruption of the clock signal. Actions that can cause an interruption of clock are:

  • hpost -C performed on the platform (centerplane reconfiguration)
  • Loss of power to the active control board, or to either centerplane support board (for example, an accidentally tripped circuit breaker)
  • An ailing control board or centerplane support board

Rule out the human factors first: check the message logs on the SSP to confirm that an hpost -C was not performed. Then continue.

Clock is driven by the control board and passed through the two centerplane support boards. By examining the GDARBs, we can determine if the error is on one or both centerplane halves.

redxl> shgda -e 0
GDARB 0 Component ID = 14197049
TransgressErr[15:0] = 0000 Target[15:0] = 0000
Sysboard Queue Overflow Error Mask [15:0] = 0000
SysBrd Request Parity Error Mask [15:0] = 00B0
Brd 4 {par,req[5:0]} = 00
Pll Error bit = 1
redxl> shgda -e 1
GDARB 1 Component ID = 14197049
TransgressErr[15:0] = 0000 Target[15:0] = 0000
Sysboard Queue Overflow Error Mask [15:0] = 0000
SysBrd Request Parity Error Mask [15:0] = 00B0
Brd 4 {par,req[5:0]} = 00
Pll Error bit = 1
In this case, both the low (GDARB 0) and high (GDARB 1) halves of the centerplane report the PLL error. This implicates the control board providing clock as the source of the failure. If one GDARB, but not both, reported a PLL Error, it would implicate a centerplane support board. To summarize:

GDARB 0 Pll Error bit = 1, GDARB 1 Pll Error bit = 0:
Implicated FRU is Centerplane Support Board 0, which provides clock to the low half of the centerplane. Replace CSB0.

GDARB 0 Pll Error bit = 0, GDARB 1 Pll Error bit = 1:
Implicated FRU is Centerplane Support Board 1, which provides clock to the high half of the centerplane. Replace CSB1.

GDARB 0 Pll Error bit = 1, GDARB 1 Pll Error bit = 1:
Implicated FRU is the active Control Board, which provides clock to both Centerplane Support Boards. Switch to the alternate control board and replace the original control board. To determine the active control board, examine the /var/opt/SUNWssp/.ssp_private/cb_config file; the active control board is followed by a P.
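To identify the active control board, the cb_config file can be examined directly on the SSP. The path comes from the summary above; the entry followed by "P" marks the currently active (primary) control board:

ssp% cat /var/opt/SUNWssp/.ssp_private/cb_config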

wfail reports Port x UPA fatal error (example Arbstop)

wfail reports Port x UPA parity error

From the wfail output, we see something like the following:

redxl> wfail
(output omitted)
PC 2.1 ErrFlag[1:0][31:0] = 00000000 00000010 (after mask)
ErrFlag0[4]: Port 0 UPA fatal error
FAIL proc 2.2: Arbstop detected by pc.
(output omitted)
A UPA fatal error is defined as the presence of an illegal state detected by the UPA port controller on the system board. On the system board, each CPU lives on a UPA port and each I/O Controller lives on a UPA port. The UPAs are controlled by the Port Controller (PC) ASICs. Each PC controls 2 CPUs or 2 IOCs, one per UPA port:

Port Controller   Port 0   Port 1
      0           CPU 0    CPU 1
      1           CPU 2    CPU 3
      2           IOC 0    IOC 1

Using the PC and UPA port information, redx reports in the FAIL line the CPU (or IOC) controlled by the failing port. In the example above, PC 2.1 port 0 corresponds to CPU 2 on board 2, which is reported as proc 2.2.

The failure can be anywhere along the UPA bus path, which includes the CPU module, the system board, and the PC. In the case of an IOC, the I/O Controllers are involved, as well as the I/O mezzanine (IOCs live on the mezzanine board). Based on failure data involving CPU UPA parity errors, the CPU is typically responsible for the invalid state. Perform the following steps:

  • Follow best practices 
  • Replace the CPU reported in wfail, or in the case of IOCs, the I/O mezzanine on the system board reported in wfail
  • If problems continue to persist, replace the system board

wfail reports PUPx Internal error (example Arbstop)

From the wfail output, we see something like the following:

redxl> wfail
(output omitted)
MC 3 ErrFlags[47:0] = 8000 00040000
ErrFlag[18]: PUP2 Internal error
ErrFlag[47]: Repeated errors of the same type occurred
FAIL Mem group 3.0: : Arbstop detected by mc
FAIL Mem group 3.2: : Arbstop detected by mc
(output omitted)
redxl> shpup 3
Note: Data is displayed from the currently loaded dump file.
PUP 3.0 Component ID = 14339049 Config[11:0] = C62
ErrFlags[34:0] = 3 10000000 Mode = ErrFlags[33:32] = 3 = PUP
Recordstop_mask[3:0] (ErrFlag[31:28]) = 1
PUP 3.1 Component ID = 14339049 Config[11:0] = C62
ErrFlags[34:0] = 3 10000000 Mode = ErrFlags[33:32] = 3 = PUP
Recordstop_mask[3:0] (ErrFlag[31:28]) = 1
PUP 3.2 Component ID = 14339049 Config[11:0] = C62
ErrFlags[34:0] = 7 10000000 Mode = ErrFlags[33:32] = 3 = PUP
ErrFlag[34]: pll lock error
Recordstop_mask[3:0] (ErrFlag[31:28]) = 1
PUP 3.3 Component ID = 14339049 Config[11:0] = C62
ErrFlags[34:0] = 3 10000000 Mode = ErrFlags[33:32] = 3 = PUP
Recordstop_mask[3:0] (ErrFlag[31:28]) = 1
In this case, only PUP 3.2 reports an error. The PUPs reside on the memory mezzanine, which is the first choice for FRU replacement. If all PUPs report problems, there may be a problem with the system board itself.

wfail reports Uncorrectable ECC Error (UE) Processor X Dtags (example Arbstop)

From the wfail output, we see something like the following:

redxl> wfail
(output omitted)
CIC 7.2 ErrFlags[61:0] = 00000001 00000011 (after mask)
ErrFlag[4]: Uncorrectable ECC Error (UE) Processor 0 Dtags
ErrFlag[32]: Repeated Error
FAIL Proc 7.0 in all configs using CIC2: : Arbstop detected by cic
(*** NOTE: Implicated FRU is sysboard 7)
CIC 7.2 ErrFlags[61:0] = 00000001 00000011 (after mask)
ErrFlag[0]: Correctable ECC Error (CE) Processor 0 Dtags
ErrFlag[32]: Repeated Error
Proc 0 Dtag ECCSyn[ 5: 0] = 07: CE: bit 03 Dtag SRAM 7.2.0
(output omitted)
The above example is an Uncorrectable ECC Error in the DTags connected to the CIC (7.2 in this example). For any uncorrectable DTag error which results in an Arbstop such as this, the implicated FRU (System Board 7 in this example) should be replaced immediately.

It should be noted that the CIC will always report both a Correctable and an Uncorrectable Error when an Uncorrectable Error condition exists. This is normal behavior for the CIC, and the Correctable Error can be safely ignored.

NOTE
Blacklisting a CPU (proc 7.0 in example) is an effective short-term workaround until the System Board can be replaced at a convenient time for the customer.
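A hedged sketch of the short-term workaround, assuming the SSP blacklist file resides at /var/opt/SUNWssp/etc/<platform_name>/blacklist and accepts entries of the form "proc board.proc"; both the path and the entry syntax are assumptions to verify against the blacklist(4) man page for the installed SSP release. The blacklist is read by hpost, so the entry takes effect at the next bringup of the domain.

ssp% echo "proc 7.0" >> /var/opt/SUNWssp/etc/<platform_name>/blacklist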

wfail reports unexpected foreign PIO queue p_reply received

From the wfail output, we see something like the following:

redxl> wfail
(output omitted)
PC 1.2 ErrFlag[1:0][31:0] = 00000000 00000200 (after mask)
ErrFlag0[9]: Port 0 unexpected foreign PIO queue p_reply received
ErrorData [12:0] = p_reply[4:0], 4b'0000, SlaveState[3:0]
FAIL IOC 1.0: Arbstop detected by pc.
(output omitted)
Foreign PIO errors are generally reported by sane hardware; the reporting hardware is a victim of some other bad component. The source of the error has no indication that an error occurred, and therefore logs no error information.

The arbstop does not provide any further data to go on, but there are some things to follow up on:

  • Execute a heavy hpost, either -l32 or -l64, on the domain
  • Look for other indications of a failure in the messages file and other logs (a hedged example follows this list)
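A hedged example of scanning the domain's standard Solaris message log for other failure indications; application logs and the SSP-side logs should be reviewed as well.

domain# grep -i error /var/adm/messages*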

wfail reports Timeout waiting for data to match address

wfail reports Timeout waiting for address to match data

wfail reports MC Timeout 

Timeout arbstops are characterized as having one or both of the following error messages within the redx wfail output:

Example 1: MC Timeout: waiting for data to match address

Example 2: MC Timeout: waiting for address to match data

or both on successive lines of wfail output:

Example 3: MC Timeout: waiting for data to match address
MC Timeout: waiting for address to match data


Facts about Timeouts

1. The vast majority of these errors have historically been either the waiting for data to match address instance (1st example) or the instance where both errors appear in the wfail output (3rd example).
Only a single instance of the 2nd example, waiting for address to match data, without a corresponding waiting for data to match address error, is known to have occurred. It was a readily reproducible error during hpost execution, a rare exception for these errors. However, to ensure the highest probability that this problem will be resolved quickly, easily, and with minimal impact to the system, the same steps should be applied for all three instances above.


2. There are several possible root causes for these errors. It is also quite possible that there are more, unidentified root causes which are not accounted for in this document. Consequently, this paper is designed to allow the user to maximize their chance for successfully correcting the problem, by investigating all known root causes, and addressing each in turn. Special care should be taken to complete all steps outlined in the Troubleshooting Procedure section, even if ample evidence suggests that the actual root cause has already been corrected. This will reduce the chances of a problem return, which would otherwise further undermine customer confidence. Once the Troubleshooting Procedure is completed, the only remaining possible root cause for these errors would be unidentified software bugs.


3. This document provides the user with convenient stop and restart points for those customer instances where the error frequency is high enough to ascertain whether a specific action provided the principal resolution.
Historically, these errors have been too intermittent to fall in the "readily reproducible" category.


4. This document strives to identify and describe all known root causes for MC Timeouts. 


5. It is a basic assumption that the machine experiencing these errors is housed in a stable environment. Consistent power, temperature, and cleanliness are important for the stability of any computer. Although no link has ever been established between an environmental problem and MC Timeouts specifically, an unstable environment can cause failures in any conceivable way. Should these errors arise in an unstable environment, the environmental issues must be addressed in parallel with the Troubleshooting Procedure.


6. Although these errors are reported by the Memory Controller (MC) on a System Board in the domain, the MC and the System Board it resides on are generally victims of the error. Therefore, any action which includes replacing and/or removing the reporting System Board should not be taken until all other possible resolutions have been exhaustively investigated.
The wfail redx output will also display the following paragraph:

MC Timeout: The reporting MC is most likely a victim of
a transaction dropped by other hardware in the domain,
which did not detect any error. The MC reporting the
timeout is not likely to be the cause of the problem.


However, this updated message will not be displayed by older versions of SSP software (SSP 3.0, 3.1, 3.1.1, or 3.2) that have not been patched to do so. All SSP patches are considered mandatory for all releases of SSP software.


7. Both software and hardware have been root causes for these errors. This document attempts to describe, and to allow you to eliminate, all possible root causes, hardware and software.


8. The nature of arbstop dump files provides us with a frozen snapshot of the E10000 crossbar. But, since the crossbar is frozen due to a fatal error, no other activities will succeed after the arbstop. This results in an inability to collect data from memory, such as a panic dump. We are also unable to collect data on CPU states, as we do for heartbeat or watchdog failures. The lack of such information makes identifying root causes for arbstops solely dependent on data in the arbstop dump file. For MC Timeouts, the data needed to identify root cause is not captured in the arbstop dump file, principally because of the nature of the MC Timeout failure.
MC Timeouts occur during store operations. Store operations succeed only after an address and data are transferred to the memory subsystem. The MC (Memory Controller) is responsible for coordinating these store operations. The address and data are transferred to the memory subsystem asynchronously: the address is sent on one of four Address Buses, and the data is transferred via the point-to-point Data Crossbar. These errors occur when one transaction, usually the data transaction, does not complete within a certain period of time after the initial transaction, usually the address, is received by the MC. Because this timeout period exceeds the 20-cycle-deep FIFO of the crossbar history recording registers, the information required to identify the initiator of the failed store operation, which is the primary hardware suspect, is not captured in the arbstop dump file.


9. Since these errors depend on sound behavior of the UPA devices (which initiate stores), UPA devices are the principal hardware entities suspected to cause these problems. These include CPUs, I/O Controllers (SysIO/Psycho+), and I/O interface cards (UDWIS, etc.). Furthermore, I/O drivers are suspects for software-induced errors, and I/O card FCode or firmware would be a suspect for any MC Timeout which occurs during the boot process.


10. Other platforms (EX500 servers) do not have the arbstop functionality.
Evidence suggests that errors which propagate as MC Timeout arbstops on an E10000 domain generally propagate as "Data Access Exceptions" on the other UPA-based servers. Consequently, any software known to cause Data Access Exceptions on those servers would be suspected as a possible root cause for MC Timeouts on an E10000 domain. Hence, software patches for any drivers or firmware which correct "data access exceptions" should be considered as possible fixes for these errors on an E10000. However, we have no instance where a specific patch has been categorically proven to fix MC Timeouts.


Troubleshooting Procedure

Step 1: Gather customer data, including:
A. SSP Explorer
B. Domain Explorer
C. Reference recent Radiance cases for the platform.


Step 2: Verify SunVTS
A. Was SunVTS running at the time?
What version of Solaris?
What version of SunVTS?

B. Verify SunVTS version is compatible with Solaris Release.
The version of SunVTS running on a given domain should always be the version shipped with the version of Solaris running on the domain.


Step 3: Check kernel and I/O driver patch levels.
Examine the showrev -p output on the domain.
Compare the list of currently available patches for the Solaris release being run against those installed on the domain. Check for current patches which might solve MC Timeout errors or Data Access Errors on smaller servers. Install appropriate patches based on this comparison.
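A hedged example of checking whether a specific patch is already installed on the domain; <patch-id> is a placeholder for whichever patch the comparison identifies.

domain# showrev -p | grep <patch-id>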


Step 4: Verify all SBus and PCI I/O cards are updated.
Specific to MC Timeouts, upgrade the firmware on the UDWIS card from 1.25 to 1.28.
Executing a "grep -i fcode" against the domain explorer's prtconf-vp.out file will reveal the firmware revisions of the various I/O interface cards.


Step 5: Check for unsupported HW configurations.

A. All Memory Mezzanines should be configured with 2 or 4 banks of memory. Operating with 1 or 3 is relatively safe, but should only be done on a temporary basis while waiting for a maintenance window to swap a bad DIMM. Memory mezzanines should not be configured with 1 or 3 physical banks for any extended period of time. This is not supported. There is circumstantial evidence that 1-way interleaving (which is used when only 1 or 3 banks exist) may increase the frequency of MC Timeouts.

B. All SBs must have at least 1 CPU.
This requirement stems from the need to adequately test the hardware ASICs during hpost execution.

C. Note any 3rd Party I/O Interface cards
    1. Is this a unique card (not in other E10Ks)?
    2. It could also be a bad card (arrange for swap)


Step 6: Identify any other recent domain failures which have occurred.

A. System panics

B. Arbstops which were not MC Timeouts
    1. CIC reported "Illegal Coherent Condition"
    2. CIC reported "DTag Parity error"

C. Powerpuck failures on System Boards
The platform-specific message file contains "power" messages

D. Recordstops
    1. XDB reported syndrome 03 cache parity errors
    2. XDB reported memory ECC errors

A large number of ECC errors within a short period of time have been known to generate an occasional MC Timeout.

Any of the aforementioned errors is considered indicative of a hardware error which, given a different set of circumstances, could generate an MC Timeout instead.
Any UPA device (CPU, I/O interface card, SysIO and Psycho+ IOCs), as well as the System Board itself, is capable of logic errors which may generate an MC Timeout. Hence, 3rd-party cards must be considered candidates for swap.

Anytime an inordinate number of asynchronous interrupts is handled by a CPU, such as during a "storm" of memory CEs, there is a risk that the CPU could place the UPA Port Controller (the PC ASIC in the E10000) into a state that may result in an MC Timeout.
Since this particular phenomenon is not yet completely understood, the investigation continues. It should also be noted here that Solaris has been known to exhibit strange behavior when massive numbers of CEs occur. Solaris operation is more adversely affected when the correctable errors are located within kernel memory.

The examples listed above are primary examples, but the list should not be considered comprehensive or exhaustive. This step is based on the presumption that failing hardware can fail in different ways, based on the hardware state at the time of the failure. If we have an error which is diagnosable, it should be fixed. Assuming that the suspect hardware is a resource within the domain at the time of the MC Timeout, it is not unreasonable to assume the same hardware is the root cause for both errors.


Step 7: Replace indicated failing hardware.
Any UPA device diagnosed as suspicious or likely root cause of a different error can be considered a suspect for causing an MC Timeout. The suspect hardware should be replaced.


Step 8: Execute an "hpost -l64" against the domain.
Hpost execution at level 64 is the most effective diagnostic level which can be run on an E10000 domain. This level of testing provides the best opportunity for identifying suspect hardware.


Step 9: If hpost fails, troubleshoot the problem.

A. If a failure occurs, return to Step 7 and swap the indicated hardware, then run "hpost -l64" again.

B. Steps 7 and 8 should be run until hpost -l64 runs without error.


Step 10: Identify any recent system changes, hardware or software.

A. Any hardware swaps, upgrades, or add-ons must be evaluated.

B. Changes in kernel parameters in /etc/system, application parameters, SGA sizes, etc., should all be scrutinized for possible complicity in the onset of MC Timeout arbstops.


Step 11: Reverse or repeat hardware changes.
Repeat the HW swap if hardware was recently swapped as part of a previous service call. Reverse any recently completed upgrades which might account for the MC Timeout phenomenon. Obviously, this step has some risk, since it could further undermine customer confidence should the error occur again.
Different customers must be evaluated based on their own requirements.
The intent is to suspect any hardware recently installed into the platform which might be a possible root cause for the MC Timeout errors. An "hpost -l64" should be completed after any and all domain HW changes.


Step 12: Reverse software changes.
This step has some risk, and CTE and PDE understand that the probability that a customer will allow the disabling of new software is low. However, we list it here as a potentially viable troubleshooting step.


Step 13: Problem resolved?
Can any completed actions or changes, using ATS and/or SunResolve principles, be applied to reach a reasonable conclusion that the MC timeouts have been addressed? If any of the previous steps identified corrective actions which were carried out and completed, it is reasonable to assume that these completed actions may have addressed the actual root cause for the MC Timeouts.  In this case, stop!  Wait for another MC Timeout. Otherwise, proceed to Step 14.


Step 14: Not done OR Another MC Timeout occurred.
Executing this step suggests that either there has been no significant discovery suggesting that a fix may have been applied, or you have conclusive proof that the MC Timeouts have not been addressed.
Specifically, the customer has experienced another MC Timeout since completing Step 13. If so, proceed to Step 15. Otherwise, return to Step 1 of this flow, and reevaluate.


Step 15: Multiple Domains?
If this platform has multiple domains, proceed to Step 16.
If the platform has only one domain, proceed directly to Step 22.


Step 16: Enable Domain Transgression Errors.

A. This step requires that all domains be shut down in order to reconfigure the Centerplane.

B. Enabling Domain Transgression Error is accomplished by placing the flag "dom_transgress_err_enbl" in the appropriate .postrc file on the SSP (default: /export/home/ssp/.postrc). A hedged sketch of the sequence appears after this list.

C. To minimize inconvenience to the customer, the execution of the "bringup -C" can be accomplished against any domain. It does not have to be on the domain exhibiting the MC Timeouts.

D. Execute "bringup" on the remaining domains, as customer priority requires. The ability to detect a transgressing request is not affected by the order of domain boots. The DTE logic is autonomous.
Note: There is an increased exposure to Global Arbstops when the centerplane is configured with Domain Transgression Error enabled. The exposure is most likely felt in the event a DC-to-DC power converter on a System Board fails. It needs to be made very clear to the customer that a Domain Transgression Error (DTE) is not a Global Arbstop when one occurs.
It will not affect any other domains running at the time of a DTE. Only the domain of the "source" System Board, in a DTE scenario, is affected.
Centerplane logic always inhibits transgressing requests; it just does not report them as errors if/when DTE is not enabled.
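A hedged sketch of the sequence on the SSP, using the default .postrc location given above. The domain name is illustrative, and all domains must be shut down before the "bringup -C" is executed.

ssp% echo "dom_transgress_err_enbl" >> /export/home/ssp/.postrc
ssp% domain_switch <any_domain>
ssp% bringup -C
(then execute "bringup" on the remaining domains, as customer priority requires)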


Step 17: How long do we wait for a DTE to occur?
There is no single answer to this question. We suggest that an appropriate amount of time would be either "2 times the longest interval between MC Timeouts which the customer has experienced" or "4 times the average length of time between MC Timeout errors". The wait time should be established with the customer before it is actually implemented. Customer expectations need to be set in a manner consistent both with their requirements and with the objective of capturing an error which is far easier to troubleshoot. If this wait time is exceeded without a DTE, or if an MC Timeout error recurs with DTE enabled, proceed directly to Step 22.



Step 18: A DTE event occurs: How do we diagnose it?
An example wfail output from redx shows a DTE message:

....
GDARB 0 Domain Transgression Error Mask [15:0] = 0002
FAIL LDPATH 1.0: Arbstop detected by gdarb.
GDARB 1 Domain Transgression Error Mask [15:0] = 0002
FAIL LDPATH 1.1: Arbstop detected by gdarb.
....

The example reflects an error which was "sourced" by SB1, as evidenced by bit 1 being set in the mask (0002). This indicates that SB1, or a UPA device on SB1, is responsible for the error. Note that the target board at which data from SB1 was "aimed" is neither reported nor affected by the failure.


Step 19: Replace indicated hardware.
It is our conclusion that the System Board cannot be identified as a lone likely root cause. Logic within a UPA device, such as a CPU, I/O card, or the I/O Mezzanine, could also be the actual root cause. Therefore, in the event a DTE occurs, it is recommended that the SB, CPUs, I/O Mezzanine, and all I/O Interface cards on the identified SB be swapped.
Once all of these FRUs are replaced, retest and wait for another failure.
Once satisfied that the problem has been addressed, remove the "dom_transgress_err_enbl" flag from the .postrc file (from Step 16) and schedule a maintenance window to run "hpost -C", which will effect the change to the Centerplane configuration. This can be done at a time convenient to the customer, but be mindful of the increased risks in the interim.
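A hedged sketch of backing the change out, again assuming the default .postrc path. The hpost -C must be run in the agreed maintenance window, since centerplane reconfiguration affects the whole platform (see Step 16).

ssp% vi /export/home/ssp/.postrc       (delete the dom_transgress_err_enbl line)
ssp% hpost -C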

Step 20: Close Call.
Once all the indicated hardware has been swapped, wait for another failure.
The wait period should be either "2 times the longest interval between MC Timeouts which the customer has experienced" or "4 times the average length of time between MC Timeout errors", whichever is greater. For example, if the longest interval was 15 days and the average interval was 10 days, the wait would be the greater of 30 and 40 days, i.e. 40 days.




Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.